
Showing papers on "Cache published in 1995"


Patent
13 Oct 1995
TL;DR: In this article, a method and apparatus for reconstructing data in a computer system employing a modified RAID 5 data protection scheme is described; the computer system includes a write-back cache of non-volatile memory that stores outstanding writes together with their metadata.
Abstract: Disclosed is a method and apparatus for reconstructing data in a computer system employing a modified RAID 5 data protection scheme. The computer system includes a write-back cache composed of non-volatile memory for storing (1) writes outstanding to a device and the associated read data, and (2) metadata information. The metadata includes a first field containing the logical block number or address (LBN or LBA) of the data, a second field containing the device ID, and a third field containing the block status. From the metadata information it is determined where the write was intended when the crash occurred. An examination is made to determine whether parity is consistent across the slice; if not, the data in the non-volatile write-back cache is used to reconstruct the write that was occurring when the crash occurred to ensure consistent parity, so that only those blocks affected by the crash have to be reconstructed.
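
A minimal sketch of the recovery idea described above, assuming hypothetical names and a drastically simplified stripe model (single-byte blocks, the device ID used directly as an index into the stripe): the cached metadata names the interrupted write, parity is checked across the slice, and only an inconsistent stripe is repaired from the non-volatile cache.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass
class CacheMetadata:
    lbn: int        # logical block number / address (LBN or LBA) of the cached write
    device_id: int  # device the write was outstanding to
    status: str     # block status, e.g. "write-pending"

def parity_consistent(data_blocks, parity_block):
    """Check the RAID 5 invariant: XOR of the data blocks equals the parity block."""
    return reduce(lambda a, b: a ^ b, data_blocks, 0) == parity_block

def recover_stripe(meta, data_blocks, parity_block, cached_write):
    """After a crash, repair only a stripe whose parity is inconsistent by
    replaying the write recorded in the non-volatile cache."""
    if parity_consistent(data_blocks, parity_block):
        return data_blocks, parity_block              # nothing to repair
    repaired = list(data_blocks)
    repaired[meta.device_id] = cached_write           # replay the interrupted write
    new_parity = reduce(lambda a, b: a ^ b, repaired, 0)
    return repaired, new_parity

# Device 1's write was interrupted mid-stripe; the write-back cache still holds it.
meta = CacheMetadata(lbn=42, device_id=1, status="write-pending")
print(recover_stripe(meta, [0x0F, 0x00, 0x33], parity_block=0x55, cached_write=0xAA))
```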

1,069 citations


Proceedings ArticleDOI
03 Dec 1995
TL;DR: This paper shows how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses.
Abstract: The underutilization of disk parallelism and file cache buffers by traditional file systems induces I/O stall time that degrades the performance of modern microprocessor-based systems. In this paper, we present aggressive mechanisms that tailor file system resource management to the needs of I/O-intensive applications. In particular, we show how to use application-disclosed access patterns (hints) to expose and exploit I/O parallelism and to dynamically allocate file buffers among three competing demands: prefetching hinted blocks, caching hinted blocks for reuse, and caching recently used data for unhinted accesses. Our approach estimates the impact of alternative buffer allocations on application execution time and applies a cost-benefit analysis to allocate buffers where they will have the greatest impact. We implemented informed prefetching and caching in DEC's OSF/1 operating system and measured its performance on a 150 MHz Alpha equipped with 15 disks running a range of applications including text search, 3D scientific visualization, relational database queries, speech recognition, and computational chemistry. Informed prefetching reduces the execution time of the first four of these applications by 20% to 87%. Informed caching reduces the execution time of the fifth application by up to 30%.
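
A toy sketch of the cost-benefit allocation idea, not the paper's actual estimator: each competing demand reports an assumed marginal benefit per buffer, and buffers are handed out greedily to whichever use currently promises the largest reduction in stall time. The demand names and benefit curves below are purely illustrative.

```python
def allocate_buffers(total_buffers, demands):
    """demands: name -> function(buffers_already_held) -> estimated stall time
    saved by granting one more buffer to that use."""
    held = {name: 0 for name in demands}
    for _ in range(total_buffers):
        # Give the next buffer to the use with the largest marginal benefit.
        best = max(demands, key=lambda name: demands[name](held[name]))
        held[best] += 1
    return held

demands = {
    # Prefetching hinted blocks: large benefit until the prefetch depth
    # covers roughly one disk latency, then diminishing returns.
    "prefetch_hinted": lambda n: 10.0 / (n + 1),
    # Caching hinted blocks for later reuse: moderate, slowly diminishing benefit.
    "cache_hinted":    lambda n: 4.0 / (n + 1) ** 0.5,
    # LRU caching for unhinted accesses: benefit tied to expected reuse.
    "lru_unhinted":    lambda n: 6.0 / (n + 1),
}

print(allocate_buffers(16, demands))   # split determined by the assumed benefit curves
```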

770 citations


Patent
18 Aug 1995
TL;DR: In this paper, access rights of users of a computer network with respect to data entities are specified by a relational database stored on one or more security servers, and an access rights cache on each application server caches the access rights lists of the users that are connected to the respective application server, so that user access rights to specific data entities can rapidly be determined.
Abstract: Access rights of users of a computer network with respect to data entities are specified by a relational database stored on one or more security servers. Application servers on the network that provide user access to the data entities generate queries to the relational database in order to obtain access rights lists of specific users. An access rights cache on each application server caches the access rights lists of the users that are connected to the respective application server, so that user access rights to specific data entities can rapidly be determined. Each user-specific access rights list includes a series of category identifiers plus a series of access rights values. The category identifiers specify categories of data entities to which the user has access, and the access rights values specify privilege levels of the users with respect to the corresponding data entity categories. The privilege levels are converted into specific access capabilities by application programs running on the application servers.
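
A minimal sketch, with hypothetical names, of the per-application-server cache the patent describes: a user's category identifiers and access rights values are fetched from the security server once on a miss and then answered locally.

```python
class AccessRightsCache:
    def __init__(self, fetch_from_security_server):
        self._fetch = fetch_from_security_server    # query issued on a cache miss
        self._cache = {}                            # user_id -> {category: privilege level}

    def privilege(self, user_id, category):
        if user_id not in self._cache:              # miss: query the security server once
            categories, levels = self._fetch(user_id)
            self._cache[user_id] = dict(zip(categories, levels))
        return self._cache[user_id].get(category, 0)   # 0 = no access to that category

    def invalidate(self, user_id):
        """Drop a user's entry, e.g. when rights change or the user disconnects."""
        self._cache.pop(user_id, None)

# Stand-in for the relational-database query on the security server.
def fake_query(user_id):
    return ["invoices", "reports"], [3, 1]          # category ids, privilege levels

cache = AccessRightsCache(fake_query)
print(cache.privilege("alice", "invoices"))   # 3, fetched from the server then cached
print(cache.privilege("alice", "payroll"))    # 0, no access rights for that category
```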

548 citations


Journal ArticleDOI
TL;DR: The results show that the three hardware prefetching schemes all yield significant reductions in the data access penalty when compared with regular caches, the benefits are greater when the hardware assist augments small on-chip caches, and the lookahead scheme is the preferred one cost-performance wise.
Abstract: Memory latency and bandwidth are progressing at a much slower pace than processor performance. In this paper, we describe and evaluate the performance of three variations of a hardware function unit whose goal is to assist a data cache in prefetching data accesses so that memory latency is hidden as often as possible. The basic idea of the prefetching scheme is to keep track of data access patterns in a reference prediction table (RPT) organized as an instruction cache. The three designs differ mostly on the timing of the prefetching. In the simplest scheme (basic), prefetches can be generated one iteration ahead of actual use. The lookahead variation takes advantage of a lookahead program counter that ideally stays one memory latency time ahead of the real program counter and that is used as the control mechanism to generate the prefetches. Finally, the correlated scheme uses a more sophisticated design to detect patterns across loop levels. These designs are evaluated by simulating the ten SPEC benchmarks on a cycle-by-cycle basis. The results show that 1) the three hardware prefetching schemes all yield significant reductions in the data access penalty when compared with regular caches, 2) the benefits are greater when the hardware assist augments small on-chip caches, and 3) the lookahead scheme is the preferred one cost-performance-wise.
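
A simplified software model of the RPT idea, covering only the "basic" scheme: entries are indexed by the load's instruction address and record the last data address and observed stride; once the same stride repeats, the next address is predicted and could be prefetched one iteration ahead. The per-entry state machine, the finite cache-like table, and the lookahead/correlated variants are omitted here.

```python
class ReferencePredictionTable:
    def __init__(self):
        # In hardware the RPT is a finite table indexed like an instruction cache;
        # a dict keyed by the load's PC stands in for it here.
        self.table = {}   # pc -> (last_addr, stride, confirmed)

    def access(self, pc, addr):
        """Record a data access; return an address worth prefetching, or None."""
        if pc not in self.table:
            self.table[pc] = (addr, 0, False)
            return None
        last_addr, stride, confirmed = self.table[pc]
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            self.table[pc] = (addr, stride, True)
            return addr + stride          # stride confirmed: predict the next reference
        self.table[pc] = (addr, new_stride, False)
        return None

rpt = ReferencePredictionTable()
for i in range(4):                         # a[i] accessed with an 8-byte stride
    addr = 0x1000 + 8 * i
    prefetch = rpt.access(pc=0x400, addr=addr)
    print(hex(addr), "->", hex(prefetch) if prefetch else None)
```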

543 citations


18 Jul 1995
TL;DR: This work assesses the potential of proxy servers to cache documents retrieved with the HTTP protocol, and finds that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users.
Abstract: As the number of World-Wide Web users grows, so does the number of connections made to servers. This increases both network load and server load. Caching can reduce both loads by migrating copies of server files closer to the clients that use those files. Caching can either be done at a client or in the network (by a proxy server or gateway). We assess the potential of proxy servers to cache documents retrieved with the HTTP protocol. We monitored traffic corresponding to three types of educational workloads over a one-semester period, and used this as input to a cache simulation. Our main findings are (1) that with our workloads a proxy has a 30-50% maximum possible hit rate no matter how it is designed; (2) that when the cache is full and a document is replaced, least recently used (LRU) is a poor policy, but simple variations can dramatically improve hit rate and reduce cache size; (3) that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users; and (4) that certain tuning configuration parameters for a cache may have little benefit.
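
A toy trace-driven proxy-cache simulator in the spirit of this study: plain LRU versus one simple size-aware variation (not caching documents above a size threshold). The threshold policy is only an example of the kind of "simple variation" mentioned, not necessarily the paper's exact one, and the trace is invented.

```python
from collections import OrderedDict

def simulate(trace, capacity_bytes, max_doc_size=None):
    cache, used, hits = OrderedDict(), 0, 0          # url -> size, kept in LRU order
    for url, size in trace:
        if url in cache:
            hits += 1
            cache.move_to_end(url)                   # refresh LRU position
            continue
        if max_doc_size is not None and size > max_doc_size:
            continue                                 # variation: don't cache huge documents
        while used + size > capacity_bytes and cache:
            _, evicted_size = cache.popitem(last=False)   # evict least recently used
            used -= evicted_size
        if size <= capacity_bytes:
            cache[url] = size
            used += size
    return hits / len(trace)

trace = [("/a", 10), ("/b", 20), ("/big", 90)] * 2 + [("/a", 10), ("/b", 20)]
print("plain LRU      hit rate:", simulate(trace, capacity_bytes=100))                   # 0.0: the big doc thrashes the cache
print("size-threshold hit rate:", simulate(trace, capacity_bytes=100, max_doc_size=50))  # 0.5: small hot documents survive
```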

495 citations


Proceedings ArticleDOI
01 Jun 1995
TL;DR: This paper presents a new algorithm for choosing problem-size-dependent tile sizes based on the cache size and cache line size for a direct-mapped cache that eliminates both capacity and self-interference misses and reduces cross-interference misses.
Abstract: When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algorithm eliminates both capacity and self-interference misses and reduces cross-interference misses. We measured simulated miss rates and execution times for our algorithm and two others on a variety of problem sizes and cache organizations. At higher set associativity, our algorithm does not always achieve the best performance. However on direct-mapped caches, our algorithm improves simulated miss rates and measured execution times when compared with previous work.
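
An illustrative brute-force tile-height chooser for a direct-mapped cache, not the paper's (much cheaper) algorithm: it picks the tallest tile of full-cache-line width whose memory lines all map to distinct cache sets, i.e. a tile free of self-interference. All cache parameters below are assumptions.

```python
def tile_is_conflict_free(tile_rows, tile_cols, n, elem_size, cache_size, line_size):
    """True if no two distinct memory lines of the tile map to the same cache set."""
    num_sets = cache_size // line_size
    sets_used = {}
    for i in range(tile_rows):
        for j in range(tile_cols):
            addr = (i * n + j) * elem_size           # row-major offset of a[i][j]
            line = addr // line_size
            s = line % num_sets
            if sets_used.setdefault(s, line) != line:
                return False                         # two different lines, same set
    return True

def choose_tile(n, elem_size=8, cache_size=16 * 1024, line_size=32):
    tile_cols = line_size // elem_size               # one full cache line per tile row
    for tile_rows in range(n, 0, -1):                # try the tallest tile first
        if tile_rows * tile_cols * elem_size <= cache_size and \
           tile_is_conflict_free(tile_rows, tile_cols, n, elem_size, cache_size, line_size):
            return tile_rows, tile_cols
    return 1, tile_cols

print(choose_tile(n=300))    # rows of a 300-wide array map to distinct sets: tall tile
print(choose_tile(n=1024))   # power-of-two leading dimension: severe self-interference
```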

434 citations


Patent
Stevan Charles Allen1
05 Dec 1995
TL;DR: In this paper, a data set preference/requirement parameter hierarchy is established for each data set, listing each parameter from a "most important" parameter to a "least important" one.
Abstract: A method and system for automatically allocating space within a data storage system for multiple data sets which may include units of data, databases, files or objects. Each data set preferably includes a group of associated preference/requirement parameters which are arranged in a hierarchical order and then compared to corresponding data storage system characteristics for available devices. The data set preference/requirement parameters may include performance, size, availability, location, portability, share status and other attributes which affect data storage system selection. Data storage systems may include solid-state memory, disk drives, tape drives, and other peripheral storage systems. Data storage system characteristics may thus represent available space, cache, performance, portability, volatility, location, cost, fragmentation, and other characteristics which address user needs. The data set preference/requirement parameter hierarchy is established for each data set, listing each parameter from a "most important" parameter to a "least important" parameter. Each attempted storage of a data set will result in an analysis of all available data storage systems and the creation of a linked chain of available data storage systems representing an ordered sequence of preferred data storage systems. Data storage system selection is then performed utilizing this preference chain, which includes all candidate storage systems.
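
A compact sketch, with hypothetical attribute names, of the preference-chain idea: preferences are listed from most to least important, each candidate storage system is scored attribute by attribute with importance-weighted credit, and the result is an ordered chain of candidates to try.

```python
def build_preference_chain(data_set_prefs, devices):
    """data_set_prefs: list of (attribute, wanted_value), ordered most -> least important.
    devices: list of dicts describing available storage systems.
    Returns the devices ordered into a preference chain."""
    def score(device):
        # Earlier (more important) preferences carry exponentially larger weight.
        return sum(2 ** (len(data_set_prefs) - i)
                   for i, (attr, wanted) in enumerate(data_set_prefs)
                   if device.get(attr) == wanted)
    return sorted(devices, key=score, reverse=True)

devices = [
    {"name": "ssd-pool", "performance": "high", "cache": True,  "portable": False},
    {"name": "tape-lib", "performance": "low",  "cache": False, "portable": True},
    {"name": "raid-5",   "performance": "high", "cache": True,  "portable": False},
]
prefs = [("performance", "high"), ("cache", True), ("portable", True)]
chain = build_preference_chain(prefs, devices)
print([d["name"] for d in chain])   # ordered sequence of preferred storage systems
```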

433 citations


Proceedings ArticleDOI
04 May 1995
TL;DR: This work presents an architecture that allows a Web server to autonomously replicate HTML pages and proposes geographical push-caching as a way of bringing the server back into the loop.
Abstract: Most wide-area caching schemes are client initiated. Decisions on when and where to cache information are made without the benefit of the server's global knowledge of the situation. We believe that the server should play a role in making these caching decisions, and we propose geographical push-caching as a way of bringing the server back into the loop. The World Wide Web is an excellent example of a wide-area system that will benefit from geographical push-caching, and we present an architecture that allows a Web server to autonomously replicate HTML pages.

371 citations


Book
01 Jan 1995
TL;DR: A new technique for evaluating cache coherent, shared-memory computers and the Wisconsin Wind Tunnel (WWT) is developed, which correctly interleaves target machine events and calculates target program execution time.
Abstract: We have developed a new technique for evaluating cache coherent, shared-memory computers. The Wisconsin Wind Tunnel (WWT) runs a parallel shared-memory program on a parallel computer (CM-5) and uses execution-driven, distributed, discrete-event simulation to accurately calculate program execution time. WWT is a virtual prototype that exploits similarities between the system under design (the target) and an existing evaluation platform (the host). The host directly executes all target program instructions and memory references that hit in the target cache. WWT's shared memory uses the CM-5 memory's error-correcting code (ECC) as valid bits for a fine-grained extension of shared virtual memory. Only memory references that miss in the target cache trap to WWT, which simulates a cache-coherence protocol. WWT correctly interleaves target machine events and calculates target program execution time. WWT runs on parallel computers with greater speed and memory capacity than uniprocessors. WWT's simulation time decreases as target system size increases for fixed-size problems and holds roughly constant as the target system and problem scale.

338 citations


Proceedings ArticleDOI
23 Apr 1995
TL;DR: This paper examines performance and power trade-offs in cache designs and the effectiveness of energy reduction for several novel cache design techniques targeted for low power.
Abstract: Caches consume a significant amount of energy in modern microprocessors. To design an energy-efficient microprocessor, it is important to optimize cache energy consumption. This paper examines performance and power trade-offs in cache designs and the effectiveness of energy reduction for several novel cache design techniques targeted for low power.

335 citations



Proceedings Article
01 Jan 1995
TL;DR: The results show that tight WCET bounds can be obtained by using the revised timing schema approach, which accurately accounts for the timing effects of pipelined execution and cache memory not only within but also across program constructs.
Abstract: An accurate and safe estimation of a task's worst case execution time (WCET) is crucial for reasoning about the timing properties of real-time systems. In RISC processors, the execution time of a program construct (e.g., a statement) is affected by various factors such as cache hits/misses and pipeline hazards, and these factors impose serious problems in analyzing the WCETs of tasks. To analyze the timing effects of RISC's pipelined execution and cache memory, we propose extensions to the original timing schema where the timing information associated with each program construct is a simple time-bound. In our approach, associated with each program construct is a worst case timing abstraction (WCTA), which contains detailed timing information of every execution path that might be the worst case execution path of the program construct. This extension leads to a revised timing schema that is similar to the original timing schema except that concatenation and pruning operations on WCTAs are newly defined to replace the add and max operations on time-bounds in the original timing schema. Our revised timing schema accurately accounts for the timing effects of pipelined execution and cache memory not only within but also across program constructs. This paper also reports on preliminary results of WCET analysis for a RISC processor. Our results show that tight WCET bounds (within a maximum of about 30% overestimation) can be obtained by using the revised timing schema approach.
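
An illustrative toy model of the revised timing schema: each construct carries a set of candidate worst-case path timings (a WCTA), sequencing constructs concatenates path timings pairwise, and pruning discards paths that can no longer become the worst case. Real WCTAs also carry pipeline and cache state at path boundaries, which this sketch ignores, so here only the largest bound truly matters.

```python
def concatenate(wcta_a, wcta_b):
    """Combine every path timing of construct A with every path timing of B."""
    return [a + b for a in wcta_a for b in wcta_b]

def prune(wcta, keep=4):
    """Keep only paths that might still become the worst case. Without pipeline
    and cache state only the single largest bound matters; a small set is kept
    here to mirror the structure of the real analysis."""
    return sorted(wcta, reverse=True)[:keep]

# Two successive constructs, each with several candidate worst-case paths (in cycles).
stmt1 = [12, 9, 15]        # e.g. an if-statement with three feasible paths
stmt2 = [20, 22]           # e.g. a small loop body with two feasible paths
combined = prune(concatenate(stmt1, stmt2))
print(combined)            # [37, 35, 34, 32] -> WCET bound for the sequence is 37
```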

01 Jan 1995
TL;DR: The technology described, known as HP AutoRAID, automatically and transparently manages migration of data blocks between these two levels as access patterns change, resulting in a fully redundant storage system that is extremely easy to use, is suitable for a wide variety of workloads, and is largely insensitive to dynamic workload changes.
Abstract: Configuring redundant disk arrays is a black art. To properly configure an array, a system administrator must understand the details of both the array and the workload it will support; incorrect understanding of either, or changes in the workload over time, can lead to poor performance. We present a solution to this problem: a two-level storage hierarchy implemented inside a single disk-array controller. In the upper level of this hierarchy, two copies of active data are stored to provide full redundancy and excellent performance. In the lower level, RAID 5 parity protection is used to provide excellent storage cost for inactive data, at somewhat lower performance. The technology we describe in this paper, known as HP AutoRAID, automatically and transparently manages migration of data blocks between these two levels as access patterns change. The result is a fully-redundant storage system that is extremely easy to use, suitable for a wide variety of workloads, largely insensitive to dynamic workload changes, and that performs much better than disk arrays with comparable numbers of spindles and much larger amounts of front-end cache. Because the implementation of the HP AutoRAID technology is almost entirely in embedded software, the additional hardware cost for these benefits is very small. We describe the HP AutoRAID technology in detail, and provide performance data for an embodiment of it in a prototype storage array, together with the results of simulation studies used to choose algorithms used in the array.
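
A toy model of the two-level migration idea (all capacities and policies here are hypothetical): recently written blocks stay mirrored for performance, and the coldest blocks are demoted to RAID 5 to save space, transparently to readers. The real array migrates larger units and acts under space pressure rather than on a fixed block count.

```python
class TwoLevelArray:
    def __init__(self, hot_capacity=4):
        self.mirrored = {}          # block -> data: full redundancy, fast writes
        self.raid5 = {}             # block -> data: parity-protected, cheap space
        self.hot_capacity = hot_capacity
        self.lru = []               # least recently written blocks at the front

    def write(self, block, data):
        if block in self.raid5:                        # promote a block on update
            del self.raid5[block]
        self.mirrored[block] = data
        if block in self.lru:
            self.lru.remove(block)
        self.lru.append(block)
        while len(self.mirrored) > self.hot_capacity:  # demote the coldest block
            cold = self.lru.pop(0)
            self.raid5[cold] = self.mirrored.pop(cold)

    def read(self, block):
        return self.mirrored.get(block, self.raid5.get(block))

array = TwoLevelArray()
for b in range(6):
    array.write(b, f"data{b}")
array.write(0, "data0-v2")          # block 0 becomes hot again and is promoted
print(sorted(array.mirrored), sorted(array.raid5))
```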

Patent
14 Jul 1995
TL;DR: In this article, an on-line multiple media viewer system provides an interactive presentation at a client viewing station of multiple media content retrieved over a remote connection from a server at which the content resides using a set of client-initiated and server-driven remote services for anticipatory caching of media content.
Abstract: An on-line multiple media viewer system provides a responsive interactive presentation at a client viewing station of multiple media content retrieved over a remote connection from a server at which the content resides using a set of client-initiated and server-driven remote services for anticipatory caching of media content. In response to an initial request for an item of media content from the server, the remote services predict additional items of media content likely to be requested and transmit these items in advance of their request. Transmitted items are cached by services at the client viewing station in a cache storage. The client checks the cache storage before making additional requests for transfer over the remote connection. The items are transmitted in multi-channel asynchronous operations over the remote connection.

Proceedings ArticleDOI
A. Goldberg1, J. Trotter1
02 Oct 1995
TL;DR: This paper describes how to combine simple hardware support and sampling techniques to obtain empirical data on memory system behavior without appreciably perturbing system performance.
Abstract: Fueled by higher clock rates and superscalar technologies, growth in processor speed continues to outpace improvement in memory system performance. Reflecting this trend, architects are developing increasingly complex memory hierarchies to mask the speed gap, compiler writers are adding locality enhancing transformations to better utilize complex memory hierarchies, and applications programmers are recoding their algorithms to exploit memory systems. All of these groups need empirical data on memory system behavior to guide their optimizations. This paper describes how to combine simple hardware support and sampling techniques to obtain such data without appreciably perturbing system performance. The idea is implemented in the Mprof prototype that profiles data stall cycles, first level cache misses, and second level misses on the Sun Sparc 10/41.

Book
01 Apr 1995
TL;DR: The book discusses Cluster Hardware Structures, Symmetric Multiprocessors, "NUMA," and Clusters, and explains why the concept of Cluster is so important.
Abstract: I. WHAT ARE CLUSTERS, AND WHY USE THEM?
1. Introduction. Working Harder. Working Smarter. Getting Help. The Road to Lowly Parallel Processing. A Neglected Paradigm. What is to Come.
2. Examples. Beer & Subpoenas. Serving the Web. The Farm. Fermilab. Other Compute Clusters. Full System Clusters. Cluster Software Products. Basic (Availability) Clusters. Not the End.
3. Why Clusters? The Standard Litany. Why Now? Why Not Now? Commercial Node Performance. The Need for High Availability.
4. Definition, Distinctions, and Initial Comparisons. Definition. Distinction from Parallel Systems. Distinctions from Distributed Systems. Concerning "Single System Image." Other Comparisons. Reactions.
II. HARDWARE.
5. A Cluster Bestiary. Exposed vs. Enclosed. "Glass-House" vs. "Campus-Wide" Cluster. Cluster Hardware Structures. Communication Requirements. Cluster Acceleration Techniques.
6. Symmetric Multiprocessors. What is an SMP? What is a Cache, and Why Is It Necessary? Memory Contention. Cache Coherence. Sequential and Other Consistencies. Input/Output. Summary.
7. NUMA and Friends. UMA, NUMA, NORMA, and CC-NUMA. How CC-NUMA Works. The "N" in CC-NUMA. Software Implications. Other CC-NUMA Implications. Is "NUMA" Inevitable? Great Big CC-NUMA. Simple COMA.
III. SOFTWARE.
8. Workloads. Why Discuss Workloads? Serial: Throughput. Parallel. Amdahl's Law. The Point of All This.
9. Basic Programming Models and Issues. What is a Programming Model? The Sample Problem. Uniprocessor. Shared Memory. Message-Passing. CC-NUMA. SIMD and All That. Importance.
10. Commercial Programming Models. Small N vs. Large N. Small N Programming Models. Large-N I/O Programming Models. Large-N Processor-Memory Models. Shared Disk or not Shared Disk?
11. Single System Image. Single System Image Boundaries. Single System Image Levels. The Application and Subsystem Levels. The Operating System Kernel Levels. Hardware Levels. SSI and System Management.
IV. SYSTEMS.
12. High Availability. What Does "High Availability" Mean? The Basic Idea: Failover. Resources. Failing Over Data. Failing Over Communications. Towards Instant Failover. Failover to Where? Lock Data Reconstruction. Heartbeats, Events, and Failover Processing. System Structure. Related Issues.
13. Symmetric Multiprocessors, "NUMA," and Clusters. Preliminaries. Performance. Cost. High Availability. Other Issues. Partitioning. Conclusion.
14. Why We Need the Concept of Cluster. Benchmarks. Development Directions. Confusion of Issues. The Lure of Large Numbers.
15. Conclusion. Cluster Operating Systems. Exploitation. Standards. Software Pricing. What About 2010?
Coda: The End of Parallel Computer Architecture. Annotated Bibliography. Index. About the Author.

Patent
Hans Hurvig1
07 Jun 1995
TL;DR: In this article, a file allocation and management system for a multi-user network environment is described, where at least one server and two or more clients are disposed along the network and communicate via a request/response transfer protocol.
Abstract: A file allocation and management system for a multi-user network environment is disclosed. At least one server and two or more clients are disposed along the network and communicate via a request/response transfer protocol. Files directed for shared usage among the clients along the network are stored at the server. Each client is adapted to communicate with the server through a plurality of identifier sockets, wherein a first identifier socket is configured for bi-directional communication and a second identifier socket is configured for uni-directional communications initiated by the server. Files normally stored at the server may, under appropriate circumstances, be temporarily stored in an internal cache or other memory at each client location when the file is in use.

Proceedings ArticleDOI
01 Dec 1995
TL;DR: The bare minimum amount of local memories that programs require to run without delay is measured by using the Value Reuse Profile, which contains the dynamic value reuse information of a program's execution, and by assuming the existence of efficient memory systems.
Abstract: As processor performance continues to improve, more emphasis must be placed on the performance of the memory system. In this paper, a detailed characterization of data cache behavior for individual load instructions is given. We show that by selectively applying cache line allocation according to the characteristics of individual load instructions, overall performance can be improved for both the data cache and the memory system. This approach can improve some aspects of memory performance by as much as 60 percent on existing executables.
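
A toy cache model of the selective-allocation idea: loads judged to have little reuse are marked "no-allocate", so their misses return data without installing a line, leaving room for lines that are reused. The marking policy here (a fixed set of load PCs) is purely illustrative.

```python
class SelectiveAllocCache:
    def __init__(self, capacity, no_allocate_pcs):
        self.capacity = capacity
        self.lines = []                       # cached block addresses, LRU at the front
        self.no_allocate_pcs = no_allocate_pcs
        self.hits = self.misses = 0

    def load(self, pc, block):
        if block in self.lines:
            self.hits += 1
            self.lines.remove(block)
            self.lines.append(block)          # move to the MRU position
            return
        self.misses += 1
        if pc in self.no_allocate_pcs:
            return                            # bypass: don't displace useful lines
        if len(self.lines) == self.capacity:
            self.lines.pop(0)                 # evict the LRU line
        self.lines.append(block)

cache = SelectiveAllocCache(capacity=2, no_allocate_pcs={0x500})
for block in (1, 2, 1, 2):                    # reusable working set from load pc 0x400
    cache.load(0x400, block)
cache.load(0x500, 99)                         # streaming load: miss, but no allocation
cache.load(0x400, 1)                          # still a hit thanks to the bypass
print(cache.hits, cache.misses)               # 3 hits, 3 misses; block 99 displaced nothing
```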

Proceedings ArticleDOI
03 Dec 1995
TL;DR: The paper presents a detailed decomposition of execution time for important kernel services in the three workloads, finding that disk I/O is the first-order bottleneck for workloads such as program development and transaction processing and its importance continues to grow over time.
Abstract: Computer systems are rapidly changing. Over the next few years, we will see wide-scale deployment of dynamically-scheduled processors that can issue multiple instructions every clock cycle, execute instructions out of order, and overlap computation and cache misses. We also expect clock-rates to increase, caches to grow, and multiprocessors to replace uniprocessors. Using SimOS, a complete machine simulation environment, this paper explores the impact of the above architectural trends on operating system performance. We present results based on the execution of large and realistic workloads (program development, transaction processing, and engineering compute-server) running on the IRIX 5.3 operating system from Silicon Graphics Inc. Looking at uniprocessor trends, we find that disk I/O is the first-order bottleneck for workloads such as program development and transaction processing. Its importance continues to grow over time. Ignoring I/O, we find that the memory system is the key bottleneck, stalling the CPU for over 50% of the execution time. Surprisingly, however, our results show that this stall fraction is unlikely to increase on future machines due to increased cache sizes and new latency hiding techniques in processors. We also find that the benefits of these architectural trends spread broadly across a majority of the important services provided by the operating system. We find the situation to be much worse for multiprocessors. Most operating systems services consume 30-70% more time than their uniprocessor counterparts. A large fraction of the stalls are due to coherence misses caused by communication between processors. Because larger caches do not reduce coherence misses, the performance gap between uniprocessor and multiprocessor performance will increase unless operating system developers focus on kernel restructuring to reduce unnecessary communication. The paper presents a detailed decomposition of execution time (e.g., instruction execution time, memory stall time separately for instructions and data, synchronization time) for important kernel services in the three workloads.

Book
01 Mar 1995
TL;DR: Non-blocking caches and prefetching caches are evaluated as two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses, and a hybrid design based on the combination of these two hardware-based schemes is proposed.
Abstract: Non-blocking caches and prefetching caches are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data in the cache before it is actually needed, thus allowing overlap with pre-miss computations. In this paper, we evaluate the effectiveness of these two hardware-based schemes. We propose a hybrid design based on the combination of these approaches. We also consider compiler-based optimization to enhance the effectiveness of non-blocking caches. Results from instruction level simulations on the SPEC benchmarks show that the hardware prefetching caches generally outperform non-blocking caches. Also, the relative effectiveness of non-blocking caches is more adversely affected by an increase in memory latency than that of prefetching caches. However, the performance of non-blocking caches can be improved substantially by compiler optimizations such as instruction scheduling and register renaming. The hybrid design can be very effective in reducing the memory latency penalty for many applications.

Proceedings ArticleDOI
01 May 1995
TL;DR: The results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%.
Abstract: This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path. DSI is applicable to software, hardware, and hybrid coherence schemes. In this paper we evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks---which eliminate both invalidation and acknowledgment messages---for a total reduction in messages of up to 26%.
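
A highly simplified model of the self-invalidation mechanism: when a block is handed out with a flag predicting that another processor will soon write it, the caching processor drops that copy at its next synchronization point on its own, so no invalidation message is ever sent for it. The prediction heuristic and protocol details below are placeholders.

```python
class DSICache:
    def __init__(self):
        self.blocks = {}            # addr -> (data, self_invalidate_flag)

    def fill(self, addr, data, self_invalidate):
        """Install a block; the directory sets self_invalidate when it predicts
        a conflicting write by another processor."""
        self.blocks[addr] = (data, self_invalidate)

    def at_sync_point(self):
        """Invalidate flagged blocks locally; no coherence messages are exchanged."""
        dropped = [a for a, (_, flag) in self.blocks.items() if flag]
        for a in dropped:
            del self.blocks[a]
        return dropped

cache = DSICache()
cache.fill(0x100, "shared, read-mostly data", self_invalidate=False)
cache.fill(0x200, "migratory data written by others", self_invalidate=True)
print("dropped at barrier:", [hex(a) for a in cache.at_sync_point()])
print("still cached:", [hex(a) for a in cache.blocks])
```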

Journal ArticleDOI
01 Oct 1995
TL;DR: A taxonomy of different cache invalidation strategies is proposed, and the impact of clients' disconnection times on their performance is studied to improve further the efficiency of the invalidation techniques described.
Abstract: In the mobile wireless computing environment of the future, a large number of users, equipped with low-powered palmtop machines, will query databases over wireless communication channels. Palmtop-based units will often be disconnected for prolonged periods of time, due to battery power saving measures; palmtops also will frequently relocate between different cells, and will connect to different data servers at different times. Caching of frequently accessed data items will be an important technique that will reduce contention on the narrow-bandwidth wireless channel. However, cache invalidation strategies will be severely affected by the disconnection and mobility of the clients. The server may no longer know which clients are currently residing under its cell, and which of them are currently on. We propose a taxonomy of different cache invalidation strategies, and study the impact of clients' disconnection times on their performance. We study ways to improve further the efficiency of the invalidation techniques described. We also describe how our techniques can be implemented over different network environments.
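
A small sketch of one broadcast-invalidation strategy of the kind this taxonomy covers: the server periodically broadcasts the identifiers of items updated during a fixed window, and a client that has been disconnected longer than the window can no longer trust its cache and drops it entirely. The window length and item names are illustrative only.

```python
WINDOW = 10   # an invalidation report covers updates from the last 10 time units

def apply_invalidation_report(cache, last_heard, now, updated_ids):
    """cache: dict item_id -> value held by the mobile client."""
    if now - last_heard > WINDOW:
        cache.clear()                 # slept too long: reports may have been missed
        return "cache dropped"
    for item in updated_ids:          # normal case: invalidate only the changed items
        cache.pop(item, None)
    return "cache filtered"

client_cache = {"stock:IBM": 101.5, "stock:HP": 42.0}
print(apply_invalidation_report(client_cache, last_heard=95, now=100,
                                updated_ids={"stock:HP"}), client_cache)
print(apply_invalidation_report(client_cache, last_heard=80, now=100,
                                updated_ids={"stock:IBM"}), client_cache)
```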

Proceedings ArticleDOI
03 Dec 1995
TL;DR: In this paper, the authors describe how the Coda File System has evolved to exploit weak connectivity, in the form of intermittent, low-bandwidth, or expensive networks, in mobile computing.
Abstract: Weak connectivity, in the form of intermittent, low-bandwidth, or expensive networks is a fact of life in mobile computing. In this paper, we describe how the Coda File System has evolved to exploit such networks. The underlying theme of this evolution has been the systematic introduction of adaptivity to eliminate hidden assumptions about strong connectivity. Many aspects of the system, including communication, cache validation, update propagation and cache miss handling have been modified. As a result, Coda is able to provide good performance even when network bandwidth varies over four orders of magnitude - from modem speeds to LAN speeds.

Journal ArticleDOI
TL;DR: In this paper, the authors propose an extension to the original timing schema where the timing information associated with each program construct is a simple time bound, and they show that tight WCET bounds can be obtained by using the revised timing schema approach.
Abstract: An accurate and safe estimation of a task's worst case execution time (WCET) is crucial for reasoning about the timing properties of real-time systems. In RISC processors, the execution time of a program construct (e.g., a statement) is affected by various factors such as cache hits/misses and pipeline hazards, and these factors impose serious problems in analyzing the WCETs of tasks. To analyze the timing effects of RISC's pipelined execution and cache memory, we propose extensions to the original timing schema where the timing information associated with each program construct is a simple time bound. In our approach, associated with each program construct is a worst case timing abstraction (WCTA), which contains detailed timing information of every execution path that might be the worst case execution path of the program construct. This extension leads to a revised timing schema that is similar to the original timing schema except that concatenation and pruning operations on WCTAs are newly defined to replace the add and max operations on time bounds in the original timing schema. Our revised timing schema accurately accounts for the timing effects of pipelined execution and cache memory not only within but also across program constructs. The paper also reports on preliminary results of WCET analysis for a RISC processor. Our results show that tight WCET bounds (within a maximum of about 30% overestimation) can be obtained by using the revised timing schema approach.

Patent
10 Jul 1995
TL;DR: In this article, the superscalar microprocessor is presented, which includes an integer functional unit and a floating-point functional unit that share a high performance main data processing bus.
Abstract: A superscalar microprocessor is provided which includes an integer functional unit and a floating point functional unit that share a high performance main data processing bus. The integer unit and the floating point unit also share a common reorder buffer, register file, branch prediction unit and load/store unit which all reside on the same main data processing bus. Instruction and data caches are coupled to a main memory via an internal address data bus which handles communications therebetween. An instruction decoder is coupled to the instruction cache and is capable of decoding multiple instructions per microprocessor cycle. Instructions are dispatched from the decoder in speculative order, issued out-of-order and completed out-of-order. Instructions are retired from the reorder buffer to the register file in-order. The functional units of the microprocessor desirably accommodate operands exhibiting multiple data widths. High performance and efficient use of the microprocessor die size are achieved by the sharing architecture of the disclosed superscalar microprocessor.

Patent
23 May 1995
TL;DR: An apparatus and method are presented that enable a host computer to adjust the caching strategy used for writing its write request data to storage media during execution of various software applications: a cache-flushing parameter is transferred from the host computer to a controller having a cache memory, and write request data is written from the cache memory to a storage medium in accordance with that parameter.
Abstract: An apparatus and method are disclosed which enable a host computer to adjust the caching strategy used for writing its write request data to storage media during execution of various software applications. The method includes the step of generating a cache-flushing parameter in the host computer. The cache-flushing parameter is then transferred from the host computer to a controller which has a cache memory. Thereafter, a quantity of write request data is written from the cache memory to a storage medium in accordance with the cache-flushing parameter.

Proceedings ArticleDOI
05 Dec 1995
TL;DR: This paper describes an approach for bounding the worst-case performance of large code segments on machines that exploit both pipelining and instruction caching, and a graphical user interface is invoked that allows a user to request timing predictions on portions of the program.
Abstract: Recently designed machines contain pipelines and caches. While both features provide significant performance advantages, they also pose problems for predicting execution time of code segments in real-time systems. Pipeline hazards may result in multicycle delays. Instruction or data memory references may not be found in cache and these misses typically require several cycles to resolve. Whether an instruction will stall due to a pipeline hazard or a cache miss depends on the dynamic sequence of previous instructions executed and memory references performed. Furthermore, these penalties are not independent since delays due to pipeline stalls and cache miss penalties may overlap. This paper describes an approach for bounding the worst-case performance of large code segments on machines that exploit both pipelining and instruction caching. First, a method is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. Next, these categorizations are used in the pipeline analysis of sequences of instructions representing paths within the program. A timing analyzer uses the pipeline path analysis to estimate the worst-case execution performance of each loop and function in the program. Finally, a graphical user interface is invoked that allows a user to request timing predictions on portions of the program.
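
A small illustration of how static instruction-cache categorizations can feed a worst-case loop bound (the categories and the miss penalty are assumptions, and pipeline overlap between instructions is ignored): "always-hit" instructions cost their base cycles, "always-miss" instructions pay the miss penalty every iteration, and "first-miss" instructions pay it only on the first iteration.

```python
MISS_PENALTY = 10   # assumed instruction-cache miss penalty, in cycles

def loop_wcet(instructions, iterations):
    """instructions: list of (base_cycles, category) for one loop iteration."""
    per_iteration = 0
    first_iteration_extra = 0
    for cycles, category in instructions:
        per_iteration += cycles
        if category == "always-miss":
            per_iteration += MISS_PENALTY          # paid on every iteration
        elif category == "first-miss":
            first_iteration_extra += MISS_PENALTY  # paid once, then the line stays cached
    return iterations * per_iteration + first_iteration_extra

body = [(1, "first-miss"), (1, "always-hit"), (2, "always-miss"), (1, "always-hit")]
print(loop_wcet(body, iterations=100))   # 100 * (5 + 10) + 10 = 1510 cycles
```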

Patent
Clark French1, Peter W. White1
11 Dec 1995
TL;DR: In this paper, the authors describe a client/server database system with improved methods for performing database queries, particularly DSS-type queries, which includes one or more Clients (e.g., Terminals or PCs) connected via a Network to a Server.
Abstract: A Client/Server Database System with improved methods for performing database queries, particularly DSS-type queries, is described. The system includes one or more Clients (e.g., Terminals or PCs) connected via a Network to a Server. In general operation, Clients store data in and retrieve data from one or more database tables resident on the Server by submitting SQL commands, some of which specify "queries"--criteria for selecting particular records of a table. The system implements methods for storing data vertically (i.e., by column), instead of horizontally (i.e., by row) as is traditionally done. Each column comprises a plurality of "cells" (i.e., column value for a record), which are arranged on a data page in a contiguous fashion. By storing data in a column-wise basis, the system can process a DSS query by bringing in only those columns of data which are of interest. Instead of retrieving row-based data pages consisting of information which is largely not of interest to a query, column-based pages can be retrieved consisting of information which is mostly, if not completely, of interest to the query. The retrieval itself can be done using more-efficient large block I/O transfers. The system includes data compression which is provided at the level of Cache or Buffer Managers, thus providing on-the-fly data compression in a manner which is transparent to each object. Since vertical storage of data leads to high repetition on a given data page, the system provides improved compression/decompression.
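
A minimal illustration, with a hypothetical schema, of why the column-wise layout helps DSS queries: a query touching two of five columns reads only those columns' pages, and the repetition within a column compresses well. Run-length encoding stands in here for the buffer-manager-level compression the patent describes.

```python
orders = {                                    # each column stored contiguously ("cells")
    "region":  ["east", "east", "east", "west", "west", "west"],
    "status":  ["open", "open", "closed", "open", "closed", "closed"],
    "amount":  [120, 80, 200, 50, 75, 60],
    "item":    ["a", "b", "c", "d", "e", "f"],
    "comment": ["", "", "rush", "", "", ""],
}

def rle(column):
    """Run-length encode one column page as (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# SELECT SUM(amount) WHERE region = 'east' touches only two of the five columns.
total = sum(a for r, a in zip(orders["region"], orders["amount"]) if r == "east")
print("sum(amount) for east:", total)                 # 400
print("region column compressed:", rle(orders["region"]))
```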

Proceedings ArticleDOI
05 Jun 1995
TL;DR: The results suggest that distinguishing between documents produced locally and those produced remotely can provide useful leverage in designing caching policies, because of differences in the potential for sharing these two document types among multiple users.
Abstract: With the increasing demand for document transfer services such as the World Wide Web comes a need for better resource management to reduce the latency of documents in these systems. To address this need, we analyze the potential for document caching at the application level in document transfer services. We have collected traces of actual executions of Mosaic, reflecting over half a million user requests for WWW documents. Using those traces, we study the tradeoffs between caching at three levels in the system, and the potential for use of application-level information in the caching system. Our traces show that while a high hit rate in terms of URLs is achievable, a much lower hit rate is possible in terms of bytes, because most profitably-cached documents are small. We consider the performance of caching when applied at the level of individual user sessions, at the level of individual hosts, and at the level of a collection of hosts on a single LAN. We show that the performance gain achievable by caching at the session level (which is straightforward to implement) is nearly all of that achievable at the LAN level (where caching is more difficult to implement). However, when resource requirements are considered, LAN level caching becomes much more desirable, since it can achieve a given level of caching performance using a much smaller amount of cache space. Finally, we consider the use of organizational boundary information as an example of the potential for use of application-level information in caching. Our results suggest that distinguishing between documents produced locally and those produced remotely can provide useful leverage in designing caching policies, because of differences in the potential for sharing these two document types among multiple users.

Journal ArticleDOI
TL;DR: Simulations of this adaptive scheme show reductions in the number of read misses, the read penalty, and the execution time by up to 78%, 58%, and 25%, respectively.
Abstract: To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions in the number of read misses, the read penalty, and the execution time by up to 78%, 58%, and 25%, respectively.
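
A sketch of the adaptive idea: the prefetch degree (number of consecutive blocks fetched on a miss) is periodically raised or lowered according to a measured usefulness ratio of recent prefetches. The thresholds, interval, and degree limits below are illustrative, not the paper's values.

```python
class AdaptiveSequentialPrefetcher:
    def __init__(self, degree=1, interval=4):
        self.degree = degree        # consecutive blocks prefetched on each miss
        self.interval = interval    # prefetches issued between adjustments
        self.issued = 0
        self.useful = 0             # prefetched blocks later referenced by the processor

    def on_miss(self, block):
        self._maybe_adjust()
        prefetches = [block + k for k in range(1, self.degree + 1)]
        self.issued += len(prefetches)
        return prefetches

    def on_useful_prefetch(self):
        self.useful += 1

    def _maybe_adjust(self):
        if self.issued < self.interval:
            return
        ratio = self.useful / self.issued
        if ratio > 0.75 and self.degree < 8:
            self.degree += 1        # prefetches are being used: be more aggressive
        elif ratio < 0.25 and self.degree > 1:
            self.degree -= 1        # mostly wasted bandwidth: back off
        self.issued = self.useful = 0

pf = AdaptiveSequentialPrefetcher()
for miss in (100, 200, 300, 400, 500):
    for _ in pf.on_miss(miss):      # pretend every prefetched block is later used
        pf.on_useful_prefetch()
print("prefetch degree after a useful phase:", pf.degree)   # grows from 1 to 2
```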