Showing papers on "Cache algorithms published in 1996"

PDF

Open Access

Report•DOI•

[...]

Anawat Chankhunthod¹, Peter B. Danzig¹, Chuck Neerdaels¹, Michael F. Schwartz², Kurt J. Worrell² - Show less +1 more•Institutions (2)

University of Southern California¹, University of Colorado Boulder²

22 Jan 1996

TL;DR: The design and performance of a hierarchical proxy-cache designed to make Internet information systems scale better are discussed, and performance measurements indicate that hierarchy does not measurably increase access latency.

...read moreread less

Abstract: This paper discusses the design and performance of a hierarchical proxy-cache designed to make Internet information systems scale better. The design was motivated by our earlier trace-driven simulation study of Internet traffic. We challenge the conventional wisdom that the benefits of hierarchical file caching do not merit the costs, and believe the issue merits reconsideration in the Internet environment. The cache implementation supports a highly concurrent stream of requests. We present performance measurements that show that our cache outperforms other popular Internet cache implementations by an order of magnitude under concurrent load. These measurements indicate that hierarchy does not measurably increase access latency. Our software can also be configured as a Web-server accelerator; we present data that our httpd-accelerator is ten times faster than Netscape's Netsite and NCSA 1.4 servers. Finally, we relate our experience fitting the cache into the increasingly complex and operational world of Internet information systems, including issues related to security, transparency to cache-unaware clients, and the role of file systems in support of ubiquitous wide-area information systems.

...read moreread less

853 citations

Proceedings Article•DOI•

Trace cache: a low latency approach to high bandwidth instruction fetching

[...]

Eric Rotenberg¹, Steve Bennett², James E. Smith¹•Institutions (2)

University of Wisconsin-Madison¹, Intel²

02 Dec 1996

TL;DR: It is shown that the trace cache's efficient, low latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.

...read moreread less

Abstract: As the issue width of superscalar processors is increased, instruction fetch bandwidth requirements will also increase. It will become necessary to fetch multiple basic blocks per cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. We propose supplementing the conventional instruction cache with a trace cache. This structure caches traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. For the Instruction Benchmark Suite (IBS) and SPEC92 integer benchmarks, a 4 kilobyte trace cache improves performance on average by 28% over conventional sequential fetching. Further, it is shown that the trace cache's efficient, low latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.

...read moreread less

637 citations

Proceedings Article•

World-wide web cache consistency

[...]

James S. Gwertzman¹, Margo Seltzer²•Institutions (2)

Microsoft¹, Harvard University²

22 Jan 1996

TL;DR: Using trace-driven simulation, it is shown that a weak cache consistency protocol (the one used in the Alex ftp cache) reduces network bandwidth consumption and server load more than either time-to-live fields or an invalidation protocol and can be tuned to return stale data less than 5% of the time.

...read moreread less

Abstract: The bandwidth demands of the World Wide Web continue to grow at a hyper-exponential rate. Given this rocketing growth, caching of web objects as a means to reduce network bandwidth consumption is likely to be a necessity in the very near future. Unfortunately, many Web caches do not satisfactorily maintain cache consistency. This paper presents a survey of contemporary cache consistency mechanisms in use on the Internet today and examines recent research in Web cache consistency. Using trace-driven simulation, we show that a weak cache consistency protocol (the one used in the Alex ftp cache) reduces network bandwidth consumption and server load more than either time-to-live fields or an invalidation protocol and can be tuned to return stale data less than 5% of the time.

...read moreread less

342 citations

Proceedings Article•DOI•

Predictive sequential associative cache

[...]

Brad Calder¹, Dirk Grunwald¹, Joel Emer•Institutions (1)

University of Colorado Boulder¹

03 Feb 1996

TL;DR: A cache design that provides the same miss rate as a two-way set associative cache, but with an access time closer to a direct-mapped cache, and is easier to implement than previous designs.

...read moreread less

Abstract: In this paper we propose a cache design that provides the same miss rate as a two-way set associative cache, but with an access time closer to a direct-mapped cache As with other designs, a traditional direct-mapped cache is conceptually partitioned into multiple banks, and the blocks in each set are probed, or examined, sequentially Other designs either probe the set in a fixed order or add extra delay in the access path for all accesses We use prediction sources to guide the cache examination, reducing the amount of searching and thus the average access latency A variety of accurate prediction sources are considered, with some being available in early pipeline stages We feel that our design offers the same or better performance and is easier to implement than previous designs

...read moreread less

233 citations

Proceedings Article•DOI•

Adding instruction cache effect to schedulability analysis of preemptive real-time systems

[...]

J.V. Busquets-Mataix, Juan José Serrano, R. Ors, Pedro Gil, Andy Wellings - Show less +1 more

10 Jun 1996

TL;DR: The paper describes how to incorporate the effect of instruction cache to the Response Time schedulability Analysis (RTA), an efficient analysis for preemptive fixed priority schedulers and compares the results of such an approach to both cache partitioning and CRMA.

...read moreread less

Abstract: Cache memories are commonly avoided in real time systems because of their unpredictable behavior. Recently, some research has been done to obtain tighter bounds on the worst case execution time (WCET) of cached programs. These techniques usually assume a non preemptive underlying system. However, some techniques can be applied to allow the use of caches in preemptive systems. The paper describes how to incorporate the effect of instruction cache to the Response Time schedulability Analysis (RTA). RTA is an efficient analysis for preemptive fixed priority schedulers. We also compare through simulations the results of such an approach to both cache partitioning (increase of the cache predictability by assigning private cache partitions to tasks) and CRMA (Cached RMA: cache effect is incorporated in the utilization based rate monotonic schedulability analysis). The results show that the cached version of RTA (CRTA) clearly outperforms CRMA, however the partitioning scheme may be better depending on the system configuration. The obtained results bound the applicability domain for each method for a variety of hardware and workload configurations. The results can be used as design guidelines.

...read moreread less

182 citations

Patent•

Method and apparatus for computer disk cache management

[...]

Mark Ish, Federico Giovannetti

19 Apr 1996

TL;DR: In this article, a hash function takes as its input a block number and outputs a hash index into a hash table of pointers, each pointer in the hash table points to a doubly-linked list of headers, with each header having a bit map wherein the bits contained in the map identify whether a particular block of data is contained within the cache.

...read moreread less

Abstract: A computer disk cache management method and apparatus which employs a least-recently-used with aging method to determine a best candidate for replacement as a result of a cache miss. A hash function takes as its input a block number and outputs a hash index into a hash table of pointers. Each pointer in the hash table points to a doubly-linked list of headers, with each header having a bit map wherein the bits contained in the map identify whether a particular block of data is contained within the cache. An ordered binary tree (heap) identifies candidates for replacement such that the best candidate for replacement is located at the root of the heap. After every access to a cache line, the heap is locally reorganized based upon a frequency of use and an age of the cache line, such that the least-frequently-used and/or oldest cache line is at the root of the heap.

...read moreread less

169 citations

Proceedings Article•

WATCHMAN: A Data Warehouse Intelligent Cache Manager

[...]

Peter Scheuermann¹, Junho Shim¹, Radek Vingralek¹•Institutions (1)

Northwestern University¹

03 Sep 1996

TL;DR: The design of an intelligent cache manager for sets retrieved by queries called WATCHMAN, which is particularly well suited for data warehousing environment, and achieves a substantial performance improvement in a decision support environment when compared to a traditional LRU replacement algorithm.

...read moreread less

Abstract: Data warehouses store large volumes of data which are used frequently by decision support applications. Such applications involve complex queries. Query performance in such an environment is critical because decision support applications often require interactive query response time. Because data warehouses are updated infrequently, it becomes possible to improve query performance by caching sets retrieved by queries in addition to query execution plans. In this paper we report on the design of an intelligent cache manager for sets retrieved by queries called WATCHMAN, which is particularly well suited for data warehousing environment. Our cache manager employs two novel, complementary algorithms for cache replacement and for cache admission. WATCHMAN aims at minimizing query response time and its cache replacement policy swaps out entire retrieved sets of queries instead of individual pages. The cache replacement and admission algorithms make use of a profit metric, which considers for each retrieved set its average rate of reference, its size, and execution cost of the associated query. We report on a performance evaluation based on the TPC-D and Set Query benchmarks. These experiments show that WATCHMAN achieves a substantial performance improvement in a decision support environment when compared to a traditional LRU replacement algorithm.

...read moreread less

165 citations

Book Chapter•DOI•

Cache Behavior Prediction by Abstract Interpretation

[...]

Martin Alt¹, Christian Ferdinand¹, Florian Martin¹, Reinhard Wilhelm¹•Institutions (1)

Saarland University¹

24 Sep 1996

TL;DR: Abstract Interpretation is a technique for the static analysis of dynamic properties of programs that computes approximative properties of the semantics of programs and replaces commonly used ad hoc techniques by systematic, provable ones.

...read moreread less

Abstract: Interpretation is a technique for the static analysis of dynamic properties of programs. It is semantics based, that is, it computes approximative properties of the semantics of programs. On this basis, it allows for correctness proofs of analyzes. It thus replaces commonly used ad hoc techniques by systematic, provable ones, and it allows the automatic generation of analyzers from specifications as in the Program Analyzer Generator, PAG.

...read moreread less

162 citations

Patent•

Database development system with methods for java-string reference lookups of column names

[...]

Steven T. Shaughnessy

16 Dec 1996

TL;DR: In this paper, a Java-based rapid application development (RAD) environment for creating applications providing named-based programmatic access to information from columns in databases is described, which provides methodology for rapid lookups of column names using a reference cache storing 32-bit references to immutable strings (e.g., Java strings).

...read moreread less

Abstract: A Java-based rapid application development (RAD) environment for creating applications providing named-based programmatic access to information from columns in databases is described. For increasing the efficiency by which named-based references to database columns are processed by application programs, the system provides methodology for rapid lookups of column names, using a reference cache storing 32-bit references to immutable strings (e.g., Java strings). The reference cache is preferably constructed as a least-recently allocated cache, thereby allowing allocation to occur in a round-robin fashion, with the oldest item allocated being the first item bumped from cache when the cache overflows. Each cache entry stores a reference (e.g., four-byte pointer or handle to a string) and an ordinal entry (i.e. the corresponding database ordinal). As a reference to a particular database column occurs during execution of a program, the reference cache fills with a reference to that column name as well as the corresponding column ordinal. Accordingly, program execution proceeds with comparison of existing items in the cache, using a sequence of rapid, in-line comparisons involving simple data types (e.g., 32-bit references for the column name string). This approach minimizes the need to perform hash lookups or string comparison operations.

...read moreread less

157 citations

Patent•

Computer system with private and shared partitions in cache

[...]

David Brian Kirk¹•Institutions (1)

IBM¹

18 Mar 1996

Abstract: The traditional computer system is modified by providing, in addition to a processor unit, a main memory and a cache memory buffer, remapping logic for remapping the cache memory buffer, and a plurality of registers for containing remapping information. With this environment the cache memory buffer is divided into segments, and the segments are one or more cache lines allocated to a task to form a partition, so as to make available (if a size is set above zero) of a shared partition and a group of private partitions. Registers include the functions of count registers which contain count information for the number of cache segments in a specific partition, a flag register, and two register which act as cache identification number registers. The flag register has bits acting as a flag, which bits include a non-real time flag which allows operation without the partition system, a private partition permitted flag, and a private partition selected flag. With this system a traditional computer system can be changed to operate without impediments of interrupts and other prior impediments to a real-time task to perform. By providing cache partition areas, and causing an active task to always have a pointer to a private partition, and a size register to specify how many segments can be used by the task, real time systems can take advantage of a cache. Thus each task can make use of a shared partition, and know how many segments can be used by the task. The system cache provides a high speed access path to memory data, so that during execution of a task the logic means and registers provide any necessary cache partitioning to assure a preempted task that it's cache contents will not be destroyed by a preempting task. This permits use of a software controlled partitioning system which allows segments of a cache to be statically allocated on a priority I benefit basis without hardware modification to said system. The cache allocation provided by the logic gives consideration of the scheduling requirements of tasks of the system in deciding the size of each cache partition. Accordingly, the cache can make use of a for dynamic programming implementation of an allocation algorithm which can determine an optimal cache allocation in polynomial time.

...read moreread less

155 citations

Proceedings Article•DOI•

An analysis of dag-consistent distributed shared-memory algorithms

[...]

Robert D. Blumofe¹, Matteo Frigo², Christopher F. Joerg², Charles E. Leiserson², Keith H. Randall² - Show less +1 more•Institutions (2)

University of Texas at Austin¹, Massachusetts Institute of Technology²

24 Jun 1996

TL;DR: It is proved that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time TP(C) of a “fully strict” multithreaded computation on P processors, each with a LRU cache of C pages, is O(T1(C)=P+mCT∞).

...read moreread less

Abstract: In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a workstealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time TP(C) of a “fully strict” multithreaded computation on P processors, each with a LRU cache of C pages, is O(T1(C)=P+mCT∞), where T1(C) is the total work of the computation including page faults, T∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. As a corollary to this theorem, we show that the expected number FP(C) of page faults incurred by a computation executed on P processors can be related to the number F1(C) of serial page faults by the formula FP(C) F1(C)+O(CPT∞). Finally, we give simple bounds on the number of page faults and the space requirements for “regular” divide-and-conquer algorithms. We use these bounds to analyze parallel multithreaded algorithms for matrix multiplication and LU-decomposition.

...read moreread less

Proceedings Article•DOI•

Reducing conflicts in direct-mapped caches with a temporality-based design

[...]

Jude A. Rivers¹, Edward S. Davidson¹•Institutions (1)

University of Michigan¹

12 Aug 1996

TL;DR: This paper presents a simple but efficient novel hardware design called the non-temporal streaming (NTS) cache that supplements the conventional direct-mapped cache with a parallel fully associative buffer.

...read moreread less

Abstract: Direct-mapped caches are often plagued by conflict misses because they lack the associativity to store more than one memory block in each set. However, some blocks that have no temporal locality actually cause program execution degradation by displacing blocks that do manifest temporal behavior. In this paper, we present a simple but efficient novel hardware design called the non-temporal streaming (NTS) cache that supplements the conventional direct-mapped cache with a parallel fully associative buffer. Every cache block loaded into the main cache is monitored for temporal behavior by a hardware detection unit. Cache blocks identified as nontemporal are allocated to the buffer on subsequent requests. Our simulations show that the NTS Cache not only provides a performance improvement over the conventional direct-mapped cache, but can also save on-chip area. For some numerical programs like FFTPDE, APPSP and APPBT from the NAS benchmark suite, an integral NTS Cache of size 9 KB (i.e., 8 KB direct-mapped cache plus 1 KB NT buffer) performs as well as a 16 KB conventional direct-mapped cache.

...read moreread less

Proceedings Article•DOI•

Wrong-path instruction prefetching

[...]

Jim Pierce¹, Trevor Mudge²•Institutions (2)

Intel¹, University of Michigan²

02 Dec 1996

TL;DR: wrong-path prefetching performs better than the other prefetch algorithms studied in all of the cache configurations examined while requiring little additional hardware and is applicable to both multi-issue and long L1 miss latency machines.

...read moreread less

Abstract: Instruction cache misses can severely limit the performance of both superscalar processors and high speed sequential machines. Instruction prefetch algorithms attempt to reduce the performance degradation by bringing lines into the instruction cache before they are needed by the CPU fetch unit. There have been several algorithms proposed to do this, most notably next line prefetching and target prefetching. We propose a new scheme called wrong-path prefetching which combines next-line prefetching with the prefetching of all control instruction targets regardless of the predicted direction of conditional branches. The algorithm substantially reduces the cycles lost to instruction cache misses while somewhat increasing the amount of memory traffic. Wrong-path prefetching performs better than the other prefetch algorithms studied in all of the cache configurations examined while requiring little additional hardware. For example, the best wrong-path prefetch algorithm can result in a speed up of 16% when using an 8 K instruction cache. In fact, an 8 K wrong-path prefetched instruction cache is shown to achieve the same miss rate as a 32 K non-prefetch cache. Finally, it is shown that wrong-path prefetching is applicable to both multi-issue and long L1 miss latency machines.

...read moreread less

Patent•

Method and product for enchancing performance of computer networks including shared storage objects

[...]

Jagdeep Singh, Chandrashekhar W. Bhide¹•Institutions (1)

Intel¹

26 Apr 1996

TL;DR: In this article, an installable performance accelerator for computer network distributed file systems is provided, where a cache subsystem is added onto, or plugged into, an existing distributed file system with no source code modifications to the operating system.

...read moreread less

Abstract: An installable performance accelerator for computer network distributed file systems is provided. A cache subsystem is added onto, or plugged into, an existing distributed file system with no source code modifications to the operating system. The cache subsystem manages a cache on the client computer side which traps or intercepts file system calls to cached files in order to obtain an immediate and substantial performance increase in distributed file system performance. Additionally, a refresh agent may be installed on the server side to further speed up cache accesses.

...read moreread less

Patent•

Affinity scheduling of data within multi-processor computer systems

[...]

Vernon K. Boland¹•Institutions (1)

NCR Corporation¹

17 Dec 1996

TL;DR: In this article, an improved affinity scheduling system for assigning processes to processors within a multiprocessor computer system which includes a plurality of processors and cache memories associated with each processor is presented.

...read moreread less

Abstract: An improved affinity scheduling system for assigning processes to processors within a multiprocessor computer system which includes a plurality of processors and cache memories associated with each processor. The affinity scheduler affinitizes processes to processors so that processes which frequently modify the same data are affined to the same local processor—the processor whose cache memory includes the data being modified by the processes. The scheduler monitors the scheduling and execution of processes to identify processes which frequently modify data residing in the cache memory of a non-local processor. When a process is identified which requires access to data residing in the cache memory of a non-local processor with greater frequency than the process requires access to data residing in the cache memory of its affined local processor, the affinity of the process is changed to the non-local processor.

...read moreread less

Patent•

Quick recovery of write cache in a fault tolerant I/O system

[...]

Matthew C. Dewey¹, Ellen F. Jones¹•Institutions (1)

EMC Corporation¹

29 Mar 1996

TL;DR: In this paper, a method for recovering data from a cache memory of a second storage controller by access to a cache of a first storage controller is presented. But this method requires the storage controllers are coupled by a private common data path, which may take a relatively long time.

...read moreread less

Abstract: A method for recovering data from a cache memory of a second storage controller by access to a cache memory of a first storage controller is presented. The storage controllers are coupled by a private common data path. The method includes copying metadata corresponding to the data stored in the cache memory of the second storage controller to the cache memory of the first storage controller through the private common data path. The metadata may include pointers to and the size of the data. After copying the metadata pointers, the data in the cache memory of the second storage controller is established in the cache memory of the first storage controller. As a result, the entire set of data does not need to be totally recovered to the hard disk before resuming host communications in a recovery operation, which may take a relatively long time. Instead, if a controller fails, only a portion of the data in the cache of the failed controller, the data describing the recovery information, needs to be incorporated into the "dirty" cache of the remaining controller before communications with the host are resumed.

...read moreread less

Patent•

Integrated processing and L2 DRAM cache

[...]

William Todd Boyd¹, Thomas J. Heller¹, Michael Ignatowski¹, Richard E. Matick¹, Stanley E. Schuster¹ - Show less +1 more•Institutions (1)

IBM¹

13 Nov 1996

TL;DR: In this article, an integrated processor and level two (L2) dynamic random access memory (DRAM) are fabricated on a single chip, and the L2 DRAM cache is placed on the same chip as the processor to reduce the time needed for two chip-to-chip crossings.

...read moreread less

Abstract: An integrated processor and level two (L2) dynamic random access memory (DRAM) are fabricated on a single chip. As an extension of this basic structure, the invention also contemplates multiprocessor "node" chips in which multiple processors are integrated on a single chip with L2 cache. By integrating the processor and L2 DRAM cache on a single chip, high on-chip bandwidth, reduced latency and higher performance are achieved. A multiprocessor system can be realized in which a plurality of processors with integrated L2 DRAM cache are connected in a loosely coupled multiprocessor system. Alternatively, the single chip technology can be used to implement a plurality of processors integrated on a single chip with an L2 DRAM cache which may be either private or shared. This approach overcomes a number of issues which limit the performance and cost of a memory hierarchy. When the L2 DRAM cache is placed on the same chip as the processor, the time needed for two chip-to-chip crossings is eliminated. Since these crossings require off-chip drivers and receivers and must be synchronized with the system clock, the time involved is substantial. This means that with the integrated L2 DRAM cache, latency is reduced.

...read moreread less

Patent•

System and method for maintaining a shared cache look-up table

[...]

Douglas B. Boyle¹•Institutions (1)

LSI Corporation¹

05 Jan 1996

TL;DR: Group cache look-up tables minimize requests for data items outside the groups and greatly minimize the service load on servers having popular data items as discussed by the authors, where each client in the group has access to the group cache lookup table, and any client or group can cache any data item.

...read moreread less

Abstract: An information system and method for reducing workload load on servers in an information system network. The system defines a group of interconnected clients which have associated cache memories. The system maintains a shared group cache look-up table for the group having entries which identify data items cached by the clients within the group and identify the clients at which the data items are cached. Each client in the group has access to the group cache look-up table, and any client or group can cache any data item. The system can include a hierarchy of groups, with each group having a group cache look-up table. The group cache look-up tables minimize requests for data items outside the groups and greatly minimize the service load on servers having popular data items.

...read moreread less

Proceedings Article•DOI•

Instruction fetch mechanisms for VLIW architectures with compressed encodings

[...]

Thomas M. Conte¹, Sanjeev Banerjia¹, Sergei Y. Larin¹, Kishore N. Menezes¹, Sumedh W. Sathaye¹ - Show less +1 more•Institutions (1)

North Carolina State University¹

02 Dec 1996

TL;DR: This report uses the TINKER experimental testbed to examine instruction fetch and instruction cache mechanisms for VLIWs and a new i-fetch mechanism using a silo cache is found to have the best performance.

...read moreread less

Abstract: VLIW architectures use very wide instruction words in conjunction with high bandwidth to the instruction cache to achieve multiple instruction issue. This report uses the TINKER experimental testbed to examine instruction fetch and instruction cache mechanisms for VLIWs. A compressed instruction encoding for VLIWs is defined and a classification scheme for i-fetch hardware for such an encoding is introduced. Several interesting cache and i-fetch organizations are described and evaluated through trace-driven simulations. A new i-fetch mechanism using a silo cache is found to have the best performance.

...read moreread less

Proceedings Article•

Predicting file system actions from prior events

[...]

Thomas M. Kroeger¹, Darrell D. E. Long¹•Institutions (1)

University of California, Santa Cruz¹

22 Jan 1996

TL;DR: A multi-order context modeling technique used in the data compression method Prediction by Partial Match is adapted to track sequences of file access events and transformed an LRU cache into a predictive cache that in the authors' simulations averages 15% more cache hits than LRU.

...read moreread less

Abstract: We have adapted a multi-order context modeling technique used in the data compression method Prediction by Partial Match (PPM) to track sequences of file access events. From this model, we are able to determine file system accesses that have a high probability of occurring as the next event. By prefetching the data for these events, we have transformed an LRU cache into a predictive cache that in our simulations averages 15% more cache hits than LRU. In fact, on average our four-megabyte predictive cache has a higher cache hit rate than a 90 megabyte LRU cache.

...read moreread less

Journal Article•DOI•

The influence of caches on the performance of heaps

[...]

Anthony LaMarca, Richard E. Ladner

01 Jan 1996-ACM Journal of Experimental Algorithms

TL;DR: This paper investigates the cache performance of implicit heaps and presents an analytical model called collective analysis that allows cache performance to be predicted as a function of both cacheconfiguration and algorithm configuration.

...read moreread less

Abstract: As memory access times grow larger relative to processor cycletimes, the cache performance of algorithms has an increasinglylarge impact on overall performance. Unfortunately, most commonlyused algorithms were not designed with cache performance in mind.This paper investigates the cache performance of implicit heaps. Wepresent optimizations which significantly reduce the cache missesthat heaps incur and improve their overall performance. We presentan analytical model called collective analysis that allows cacheperformance to be predicted as a function of both cacheconfiguration and algorithm configuration. As part of ourinvestigation, we perform an approximate analysis of the cacheperformance of both traditional heaps and our improved heaps in ourmodel. In addition empirical data is given for five architecturesto show the impact our optimizations have on overall performance.We also revisit a priority queue study originally performed byJones [25]. Due to the increases in cache miss penalties, therelative performance results we obtain on today's machines differgreatly from the machines of only ten years ago. We compare theperformance of implicit heaps, skew heaps and splay trees anddiscuss the difference between our results and Jones's.

...read moreread less

Proceedings Article•DOI•

Thread scheduling for cache locality

[...]

James Philbin¹, Jan Sterling Edler¹, Otto J. Anshus², Craig C. Douglas³, Kai Li¹ - Show less +1 more•Institutions (3)

Princeton University¹, University of Tromsø², Yale University³

01 Sep 1996

TL;DR: Experiments with several application programs show that the thread scheduling method can improve program performance by reducing second-level cache misses.

...read moreread less

Abstract: This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache misses. This technique may be particularly valuable when compiler-directed tiling is not feasible. Experiments with several application programs, on two systems with different cache structures, show that our thread scheduling method can improve program performance by reducing second-level cache misses.

...read moreread less

Patent•

Scoreboard for cached multi-thread processes

[...]

E. David Neufeld

31 Dec 1996

TL;DR: In this article, the scheduler algorithm of the computer operating system determines the most advantageous order for the process threads to run and which of the processors in a multi-processor system should execute these process threads.

...read moreread less

Abstract: A computer system comprising at least one processor and associated cache memory, and a plurality of registers to keep track of the number of cache memory lines associated with each process thread running in the computer system. Each process thread is assigned to one of the plurality of registers of each level of cache that is being monitored. The number of cache memory lines associated with each process thread in a particular level of the cache is stored as a number value in the assigned register and will increment as more cache memory lines are used for the process thread and will decrement as less cache memory lines are used. The number value in the register is defined as the "process thread temperature." Larger number values indicate warmer process thread temperature and smaller number values indicate cooler process thread temperature. Process thread temperatures are relative and indicate the cache memory line usage by the process threads running in the computer system at a particular level of cache. By keeping track or "score" of the number values (temperatures) in each of these registers called "scoreboard registers," the scheduler algorithm of the computer operating system may objectively determine the most advantageous order for the process threads to run and which of the processors in a multi-processor system should execute these process threads. A scoreboard register may be reassigned to a new process thread when its associated process thread has been discontinued.

...read moreread less

Patent•

Non-volatile cache for providing data integrity in operation with a volatile demand paging cache in a data storage system

[...]

Andras Sarkozy

12 Jun 1996

TL;DR: In this paper, a non-volatile cache mechanism connected to a bus connected for conducting write addresses and data from a host computer to mass storage devices and to a volatile cache wherein each write operation includes a write address and at least one data word.

...read moreread less

Abstract: A non-volatile cache mechanism connected to a bus connected for conducting write addresses and data from a host computer to mass storage devices and to a volatile cache wherein each write operation includes a write address and at least one data word. The non-volatile cache mechanism includes a non-volatile memory constructed of a plurality of sub-memories having overlapping read/write cycles for storing the data words, a cache control responsive to the write operations for writing the data words into the nonvolatile memory in parallel with receipt of the data words into the volatile cache, and a cache index for storing index entries relating write addresses of write operations on the bus with corresponding storage addresses of the data words in the non-volatile memory. The cache control is responsive to a write operation for reading the index entries to identify and select at least one available storage address in the non-volatile memory, generating at least one index entry relating the write address of the current write operation and the selected storage addresses in the non-volatile memory, and writing the data words into the non-volatile memory. The cache control is responsive to flush addresses to the volatile cache for indexing the cache index to identify cache entries corresponding to the flush addresses and invalidating the corresponding cache entries.

...read moreread less

Patent•

A fast, dual ported cache controller for data processors in a packet switched cache coherent multiprocessor system

[...]

Zahir Ebrahim¹, Kevin Normoyle¹, Satyanarayana Nishtala¹, William C. Van Loo¹•Institutions (1)

Sun Microsystems¹

28 Mar 1996

TL;DR: In this article, the cache controller has two modes of operation, including a first standard mode of operation in which read/write access to the cache memory is preceded by generation of the hit/miss signal by the comparator, and a second accelerated mode of operating without waiting for the comparators to process the access request's address value.

...read moreread less

Abstract: A multiprocessor computer system has data processors and a main memory coupled to a system controller. Each data processor has a cache memory. Each cache memory has a cache controller with two ports for receiving access requests. A first port receives access requests from the associated data processor and a second port receives access requests from the system controller. All cache memory access requests include an address value; access requests from the system controller also include a mode flag. A comparator in the cache controller processes the address value in each access request and generates a hit/miss signal indicating whether the data block corresponding to the address value is stored in the cache memory. The cache controller has two modes of operation, including a first standard mode of operation in which read/write access to the cache memory is preceded by generation of the hit/miss signal by the comparator, and a second accelerated mode of operation in which read/write access to the cache memory is initiated without waiting for the comparator to process the access request's address value. The first mode of operation is used for all access requests by the data processor and for system controller access requests when the mode flag has a first value. The second mode of operation is used for the system controller access requests when the mode flag has a second value distinct from the first value.

...read moreread less

Proceedings Article•DOI•

Evaluation of Design Alternatives for a Multiprocessor Microprocessor

[...]

Basem A. Nayfeh¹, Lance Hammond¹, Kunle Olukotun¹•Institutions (1)

Stanford University¹

01 May 1996

TL;DR: This paper evaluates three architectures: shared-primary cache, shared-secondary cache, and shared-memory using a complete system simulation environment which models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and run a commercial operating system.

...read moreread less

Abstract: In the future, advanced integrated circuit processing and packaging technology will allow for several design options for multiprocessor microprocessors. In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and shared-memory. We evaluate these three architectures using a complete system simulation environment which models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and run a commercial operating system. Within our simulation environment, we measure performance using representative hand and compiler generated parallel applications, and a multiprogramming workload. Our results show that when applications exhibit fine-grained sharing, both shared-primary and shared-secondary architectures perform similarly when the full costs of sharing the primary cache are included.

...read moreread less

Patent•

Method and apparatus for managing coherency in object and page caches

[...]

Yoshiki Watanabe¹, Hayata Hiroshi¹•Institutions (1)

Fuji Xerox¹

15 Apr 1996

TL;DR: In this paper, the authors propose a separate region conversion system that is capable of maintaining the hit rate of the cache at high level by simplifying the cache status and to improve the execution efficiency of the application program.

...read moreread less

Abstract: To reduce process time by simplifying the cache status and to improve the execution efficiency of the application program in the separate region conversion system that is capable of maintaining the hit rate of the cache at high level. When an access request is made to the object and if the object is not stored in the object cache, the page containing the object is read from the database and is stored in the page cache, and the object is read from the page and stored in the cache. The status of the page cache describing the status of the page stored in the page cache is stored in the page status storage device and at the same time the status of the object cache describing the status of the object stored in the object cache is stored in the object status storage device. By establishing a relationship between the status of the page cache and the status of the object, if the status of the page cache and the corresponding status of the object cache are not consistent, the status synchronizing device executes a synchronization process to make these status consistent.

...read moreread less

Patent•

Cache control unit with a cache request transaction-oriented protocol

[...]

Yet-Ping Pai¹, Le T. Nguyen¹•Institutions (1)

Samsung¹

15 Nov 1996

TL;DR: In this paper, a cache control unit and a method of controlling a cache is coupled to a cache accessing device, and a request identification information is assigned to the first cache request and provided to the requesting device.

...read moreread less

Abstract: A cache control unit and a method of controlling a cache. The cache is coupled to a cache accessing device. A first cache request is received from the device. A request identification information is assigned to the first cache request and provided to the requesting device. The first cache request may begin to be processed. A second cache request is received from the cache accessing device. The second cache request is assigned to the first cache request and provided to the requesting device. The first and second cache requests are finally fully serviced.

...read moreread less

Patent•

Compressed data cache storage system

[...]

David John Craft¹, Richard Greenberg¹•Institutions (1)

IBM¹

24 Jul 1996

TL;DR: In this article, the authors propose a random access cache storage between a processor accessing data at high speed and in small block units and a mass storage medium holding data in large transfer units.

...read moreread less

Abstract: A system and related architecture for providing random access cache storage between a processor accessing data at high speed and in small block units and a mass storage medium holding data in large transfer units. Lossless data compression is applied to large transfer units of data before storage in a DRAM. Cache address space is assigned in allocation units which are assigned without a prespecified pattern within the DRAM but linked through chains. The chain lengths are adjusted to match the compressibility characteristics of transfer units and include resources for scavenging residuals. Logical blocks materially smaller than the transfer units are accessed and decompressed during readout from the DRAM. The system architecture provides resources for accessing the individual logical blocks through an index. The invention is particularly suited for a disk drive cache system having a small cache DRAM in conjunction with a magnetic or optical disk mass storage system reading highly compressible data.

...read moreread less

Patent•

Compression architecture for system memory application

[...]

William Paul Hovis¹, Kent Harold Haselhorst¹, Steven Wayne Kerchberger¹, Jeffrey Douglas Brown¹, David A. Luick¹ - Show less +1 more•Institutions (1)

IBM¹

20 Dec 1996

TL;DR: In this article, the authors present a memory architecture and method of partitioning a computer memory, which includes a cache section, a setup table, and a compressed storage, all of which are partitioned from the computer memory.

...read moreread less

Abstract: A memory architecture and method of partitioning a computer memory. The architecture includes a cache section, a setup table, and a compressed storage, all of which are partitioned from a computer memory. The cache section is used for storing uncompressed data and is a fast access memory for data which is frequently referenced. The compressed storage is used for storing compressed data. The setup table is used for specifying locations of compressed data stored within the compressed storage. A high speed uncompressed cache directory is coupled to the memory for determining if data is stored in the cache section or compressed storage and for locating data in the cache.

...read moreread less

Collapse