Showing papers on "Cache pollution published in 2000"

PDF

Open Access

Journal Article•DOI•

Summary cache: a scalable wide-area web cache sharing protocol

[...]

Li Fan¹, Pei Cao², Jussara M. Almeida¹, Andrei Z. Broder•Institutions (2)

University of Wisconsin-Madison¹, Cisco Systems, Inc.²

01 Jun 2000-IEEE ACM Transactions on Networking

TL;DR: This paper demonstrates the benefits of cache sharing, measures the overhead of the existing protocols, and proposes a new protocol called "summary cache", which reduces the number of intercache protocol messages, reduces the bandwidth consumption, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP.

...read moreread less

Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless it is not widely deployed due to the overhead of existing protocols. In this paper we demonstrate the benefits of cache sharing, measure the overhead of the existing protocols, and propose a new protocol called "summary cache". In this new protocol, each proxy keeps a summary of the cache directory of each participating proxy, and checks these summaries for potential hits before sending any queries. Two factors contribute to our protocol's low overhead: the summaries are updated only periodically, and the directory representations are very economical, as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that, compared to existing protocols such as the Internet cache protocol (ICP), summary cache reduces the number of intercache protocol messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP. Hence summary cache scales to a large number of proxies. (This paper is a revision of Fan et al. 1998; we add more data and analysis in this version.).

...read moreread less

2,174 citations

Proceedings Article•DOI•

Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures

[...]

Rajeev Balasubramonian¹, David H. Albonesi¹, Alper Buyuktosunoglu¹, Sandhya Dwarkadas¹•Institutions (1)

University of Rochester¹

01 Dec 2000

TL;DR: This paper proposes a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis and demonstrates that a configurable L2/L3 cache hierarchy coupled with a conventional LI results in an average 43% reduction in memory hierarchy energy in addition to improved performance.

...read moreread less

Abstract: Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis. A novel configuration management algorithm dynamically detects phase changes and reacts to an application's hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration. When applied to a two-level cache and TLB hierarchy at 0.1 /spl mu/m technology, the result is an average 15% reduction in cycles per instruction (CPI), corresponding to an average 27% reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-.1 /spl mu/m technology design considerations that call for a three-level conventional cache hierarchy for performance reasons, we demonstrate that a configurable L2/L3 cache hierarchy coupled with a conventional LI results in an average 43% reduction in memory hierarchy energy in addition to improved performance.

...read moreread less

425 citations

Patent•

Dynamic page generation acceleration using component-level caching

[...]

Anindya Datta¹•Institutions (1)

Wilmington University¹

24 Nov 2000

TL;DR: In this paper, a preloader uses a cache replacement manager to manage requests for retrievals, insertions, and removal of web page components in a component cache, and a profile server predicts a user's next content request.

...read moreread less

Abstract: A preloader works in conjunction with a web/app server and optionally a profile server to cache web page content elements or components for faster on-demand and anticipatory dynamic web page delivery. The preloader uses a cache manager to manage requests for retrievals, insertions, and removal of web page components in a component cache. The preloader uses a cache replacement manager to manage the replacement of components in the cache. While the cache replacement manager may utilize any cache replacement policy, a particularly effective replacement policy utilizes predictive information to make replacement decisions. Such a policy uses a profile server, which predicts a user's next content request. The components that can be cached are identified by tagging them within the dynamic scripts that generate them. The preloader caches components that are likely to be accessed next, thus improving a web site's scalability.

...read moreread less

366 citations

Journal Article•DOI•

Data prefetch mechanisms

[...]

Steven P. Vanderwiel¹, David J. Lilja²•Institutions (2)

University of Rochester¹, University of Minnesota²

01 Jun 2000-ACM Computing Surveys

TL;DR: To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead, and secondary effects such as cache pollution and increased memory bandwidth requirements must be taken into consideration.

...read moreread less

Abstract: The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching strategies are diverse, and no single strategy has yet been proposed that provides optimal performance. The following survey examines several alternative approaches, and discusses the design tradeoffs involved when implementing a data prefetch strategy.

...read moreread less

316 citations

Proceedings Article•DOI•

A low power unified cache architecture providing power and performance flexibility (poster session)

[...]

Afzal M. Malik¹, Bill Moyer¹, Dan Cermak¹•Institutions (1)

Motorola¹

01 Aug 2000

TL;DR: This paper focuses on the features of the M340 cache sub-system and illustrates the effect on power and performance through benchmark analysis and actual silicon measurements.

...read moreread less

Abstract: Advances in technology have allowed portable electronic devices to become smaller and more complex, placing stringent power and performance requirements on the devices components. The M7CORE M3 architecture was developed specifically for these embedded applications. To address the growing need for longer battery life and higher performance, an 8-Kbyte, 4-way set-associative, unified (instruction and data) cache with pro-grammable features was added to the M3 core. These features allow the architecture to be optimized based on the applications requirements. In this paper, we focus on the features of the M340 cache sub-system and illustrate the effect on power and perfor-mance through benchmark analysis and actual silicon measure-ments.

...read moreread less

253 citations

Proceedings Article•DOI•

A fully associative software-managed cache design

[...]

Erik G. Hallnor¹, Steven K. Reinhardt¹•Institutions (1)

University of Michigan¹

01 May 2000

TL;DR: A practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement is presented.

...read moreread less

Abstract: As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features—full associativity and software management—have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches.This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.

...read moreread less

224 citations

Patent•

Preloading data in a cache memory according to user-specified preload criteria

[...]

Deniz Teoman, John M. Neil

25 Jan 2000

TL;DR: In this paper, a relatively high-speed, intermediate-volume storage device is operated as a user-configurable cache, where data is preloaded and responsively cached in the cache memory based on user preferences.

...read moreread less

Abstract: An apparatus and method for caching data in a storage device (26) of a computer system (10). A relatively high-speed, intermediate-volume storage device (25) is operated as a user-configurable cache. Requests to access a mass storage device (46) such as a disk or tape (26, 28) are intercepted by a device driver (32) that compares the access request against a directory (51) of the contents of the user-configurable cache (25). If the user-configurable cache contains the data sought to be accessed, the access request is carried out in the user-configurable cache instead of being forwarded to the device driver for the target mass storage device (46). Because the user-cache is implemented using memory having a dramatically shorter access time than most mechanical mass storage devices, the access request is fulfilled much more quickly than if the originally intended mass storage device was accessed. Data is preloaded and responsively cached in the user-configurable cache memory based on user preferences.

...read moreread less

205 citations

Proceedings Article•DOI•

Frequent value compression in data caches

[...]

Jun Yang¹, Youtao Zhang¹, Rajiv Gupta¹•Institutions (1)

University of Arizona¹

01 Dec 2000

TL;DR: The design and evaluation of the compression cache (CC) is presented which is a first level cache that has been designed so that each cache line can either hold one uncompressed line or two cache lines which have been compressed to at least half their lengths.

...read moreread less

Abstract: Since the area occupied by cache memories on processor chips continues to grow, an increasing percentage of power is consumed by memory. We present the design and evaluation of the compression cache (CC) which is a first level cache that has been designed so that each cache line can either hold one uncompressed line or two cache lines which have been compressed to at least half their lengths. We use a novel data compression scheme based upon encoding of a small number of valves that appear frequently during memory accesses. This compression scheme preserves the ability to randomly access individual data items. We observed that the contents of 40%, 52% and 51% of the memory blocks of size 4, 8, and 16 words respectively in SPECint95 benchmarks can be compressed to at least half their sizes by encoding the top 2, 4, and 8 frequent valves respectively. Compression allows greater amounts of data to be stored leading to substantial reductions in miss rates (0-36.4%), off-chip traffic (3.9-48.1%), and energy consumed (1-27%). Traffic and energy reductions are in part derived by transferring data over external buses in compressed form.

...read moreread less

195 citations

Proceedings Article•DOI•

Dynamic zero compression for cache energy reduction

[...]

Luis Villa¹, Michael Zhang¹, Krste Asanovic¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Dec 2000

TL;DR: Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte and an instruction recoding technique is described that increases instruction cache energy savings to 18%.

...read moreread less

Abstract: Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte. This energy-conscious compression is invisible to software and is handled with additional circuitry embedded inside the cache RAM arrays and the CPU. The additional circuitry imposes a cache area overhead of 9% and a read latency overhead of around two F04 gate delays. Simulation results show that we can reduce total data cache energy by around 26% and instruction cache energy by around 10% for SPECint95 and MediaBench benchmarks. We also describe the use of an instruction recoding technique that increases instruction cache energy savings to 18%.

...read moreread less

181 citations

Patent•

Cache control system and method having hardware-based tag record allocation

[...]

Tsun Y. Ng¹, Ralph H. Castro¹, Virgil V. Wilkins¹•Institutions (1)

Western Digital¹

19 Apr 2000

TL;DR: In this article, a disk drive consisting of a cache memory and a cache control system with a tag memory having a plurality of tag records, and means for allocating a tag record for responding to a host command is described.

...read moreread less

Abstract: The present invention relates to a disk drive 10 comprising a cache memory 14 and a cache control system having a tag memory having a plurality of tag records, and means for allocating a tag record for responding to a host command. The cache memory has a plurality of sequentially-ordered memory clusters 46 for caching disk data stored in sectors (not shown) on disks of a disk assembly 38 . Conventionally the disk sectors are identified by logical block addresses (LBAs). The cache control system 12 along with the tag memory 22 and means for allocating tag records are embedded within the cache control system 12 and thereby configured only for use in defining variable length segments of the memory clusters 46 . The segments are defined without regard to the sequential order of the memory clusters 46.

...read moreread less

149 citations

Proceedings Article•DOI•

Code compression for low power embedded system design

[...]

Haris Lekatsas¹, Jorg Henkel², Wayne Wolf¹•Institutions (2)

Princeton University¹, NEC²

01 Jun 2000

TL;DR: This approach is the first one to measure and optimize the power consumption of a complete SOC comprising a CPU, instruction cache, data cache, main memory, data buses and address bus through code compression.

...read moreread less

Abstract: We propose instruction code compression as an efficient method for reducing power on an embedded system. Our approach is the first one to measure and optimize the power consumption of a complete SOC (System--On--a--Chip) comprising a CPU, instruction cache, data cache, main memory, data buses and address bus through code compression. We compare the pre-cache architecture (decompressor between main memory and cache) to a novel post-cache architecture (decompressor between cache and CPU). Our simulations and synthesis results show that our methodology results in large energy savings between 22% and 82% compared to the same system without code compression. Furthermore, we demonstrate that power savings come with reduced chip area and the same or even improved performance.

...read moreread less

Proceedings Article•DOI•

Modified LRU policies for improving second-level cache behavior

[...]

W.A. Wong¹, J.-L. Baer¹•Institutions (1)

University of Washington¹

08 Jan 2000

TL;DR: This work shows that an opportunity exists to close part of the gap between the OPT and the LRU algorithms, and presents a replacement algorithm based on the detection of temporal locality in lines residing in the L2 cache that improves on the second-level cache miss rate.

...read moreread less

Abstract: Main memory accesses continue to be a significant bottleneck for applications whose working sets do not fit in second-level caches With the trend of greater associativity in second-level caches, implementing effective replacement algorithms might become more important than reducing conflict misses After showing that an opportunity exists to close part of the gap between the OPT and the LRU algorithms, we present a replacement algorithm based on the detection of temporal locality in lines residing in the L2 cache Rather than always replacing the LRU line, the victim is chosen by considering both its priority in the LRU stack and whether it exhibits temporal locality or not We consider two strategies which use this replacement algorithm: a profile-based scheme where temporal locality is detected by processing a trace from a training set of the application and an on-line scheme, where temporal locality is detected with the assistance of a small locality table Both schemes improve on the second-level cache miss rate over a pure LRU algorithm, by as much as 12% in the profiling case and 20% in the dynamic case

...read moreread less

Patent•

Cache control method and system for mixed streaming and non-streaming data

[...]

Sharon H. Yu¹, Gary K. Laatsch¹, Quoc N. Dang¹•Institutions (1)

Western Digital¹

31 Jul 2000

TL;DR: In this paper, the authors present a cache control method for caching disk data in a disk drive configured to receive commands for both streaming and non-streaming data from a host, where a lossy state record is provided for memory segments in a cache memory.

...read moreread less

Abstract: The present invention may be embodied in a cache control method for caching disk data in a disk drive configured to receive commands for both streaming and non-streaming data from a host. A lossy state record is provided for memory segments in a cache memory. The lossy state record allows hosts commands to be mixed for streaming and non-streaming data without flushing of cache data for a command mode change.

...read moreread less

Proceedings Article•DOI•

Using time skewing to eliminate idle time due to memory bandwidth and network limitations

[...]

David Wonnacott¹•Institutions (1)

Haverford College¹

01 May 2000

TL;DR: A generalization of time skewing for multiprocessor architectures is given, and techniques for using multilevel caches reduce the LI cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.

...read moreread less

Abstract: Time skewing is a compile-time optimization that can provide arbitrarily high cache hit rates for a class of iterative calculations, given a sufficient number of time steps and sufficient cache memory. Thus, it can eliminate processor idle time caused by inadequate main memory bandwidth. In this article, we give a generalization of time skewing for multiprocessor architectures, and discuss time skewing for multilevel caches. Our generalization for multiprocessors lets us eliminate processor idle time caused by any combination of inadequate main memory bandwidth, limited network bandwidth, and high network latency, given a sufficiently large problem and sufficient cache. As in the uniprocessor case, the cache requirement grows with the machine balance rather than the problem size. Our techniques for using multilevel caches reduce the LI cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.

...read moreread less

Proceedings Article•DOI•

Application-specific memory management for embedded systems using software-controlled caches

[...]

Derek Chiou¹, Prabhat Jain¹, Larry Rudolph¹, Srinivas Devadas¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jun 2000

TL;DR: A way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis via a novel hardware mechanism, called column caching.

...read moreread less

Abstract: We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of cache can be made to act like scratch-pad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software, by mapping data regions to a specified sets of cache “columns” or “ways.” When a region of memory is exclusively mapped to an equivalent sized partition of cache, column caching provides the same functionality and predictability as a dedicated scratchpad memory for time-critical parts of a real-time application. The ratio between scratchpad size and cache size can be easily and quickly varied for each application, or each task within an application. Thus, software has much finer software control of on-chip memory, providing the ability to dynamically tradeoff performance for on-chip memory.

...read moreread less

Journal Article•DOI•

Performance evaluation of Web proxy cache replacement policies

[...]

Martin Arlitt¹, Rich Friedrich¹, Tai Jin¹•Institutions (1)

Hewlett-Packard¹

01 Feb 2000-Performance Evaluation

TL;DR: The importance of different Web proxy workload characteristics in making good cache replacement decisions is analyzed, and results indicate that higher cache hit rates are achieved using size-based replacement policies.

...read moreread less

Patent•

Cache including a prefetch way for storing cache lines and configured to move a prefetched cache line to a non-prefetch way upon access to the prefetched cache line

[...]

Brian D. McMinn¹•Institutions (1)

Advanced Micro Devices¹

26 Apr 2000

TL;DR: In this paper, the cache employs one or more prefetch ways for storing prefetch cache lines and accessed cache lines, while cache lines fetched in response to cache misses for requests initiated by a microprocessor connected to the cache are stored into non-prefetch ways.

...read moreread less

Abstract: A cache employs one or more prefetch ways for storing prefetch cache lines and one or more ways for storing accessed cache lines. Prefetch cache lines are stored into the prefetch way, while cache lines fetched in response to cache misses for requests initiated by a microprocessor connected to the cache are stored into the non-prefetch ways. Accessed cache lines are thereby maintained within the cache separately from prefetch cache lines. When a prefetch cache line is presented to the cache for storage, the prefetch cache line may displace another prefetch cache line but does not displace an accessed cache line. A cache hit in either the prefetch way or the non-prefetch ways causes the cache line to be delivered to the requesting microprocessor in a cache hit fashion. The cache is further configured to move prefetch cache lines from the prefetch way to the non-prefetch way if the prefetch cache lines are requested (i.e. they become accessed cache lines). Instruction cache lines may be moved immediately upon access, while data cache line accesses may be counted and a number of accesses greater than a predetermined threshold value may occur prior to moving the data cache line from the prefetch way to the non-prefetch way. Additionally, movement of an accessed cache line from the prefetch way to the non-prefetch way may be delayed until the accessed cache line is to be replaced by a prefetch cache line.

...read moreread less

Patent•

Cache management for a multi-threaded processor

[...]

Robert Bruce Aglietti¹, Rajiv Gupta¹•Institutions (1)

Hewlett-Packard¹

07 Dec 2000

TL;DR: In this paper, the cache memory is partitioned among a set of threads of a multi-threaded processor, and when a cache miss occurs, a replacement line is selected in a partition of the cache space which is allocated to the particular thread from which the access causing the cache miss originated, thereby preventing pollution to partitions belonging to other threads.

...read moreread less

Abstract: A method and apparatus which provides a cache management policy for use with a cache memory for a multi-threaded processor. The cache memory is partitioned among a set of threads of the multi-threaded processor. When a cache miss occurs, a replacement line is selected in a partition of the cache memory which is allocated to the particular thread from which the access causing the cache miss originated, thereby preventing pollution to partitions belonging to other threads.

...read moreread less

Patent•

Associative cache structure for lookups and updates of flow records in a network monitor

[...]

Haig A. Sarkissian, Russell S. Dietz

30 Jun 2000

TL;DR: In this article, a cache system for looking up one or more elements of an external memory includes a set of cache memory elements coupled to the external memory, a cache cache cache memory cells (CAMs) containing an address and a pointer to a cache memory element, and a matching circuit having an input such that the CAM asserts a match output when the input is the same as the address in the CAM cell.

...read moreread less

Abstract: A cache system for looking up one or more elements of an external memory includes a set of cache memory elements coupled to the external memory, a set of content addressable memory cells (CAMs) containing an address and a pointer to one of the cache memory elements, and a matching circuit having an input such that the CAM asserts a match output when the input is the same as the address in the CAM cell. The cache memory element which a particular CAM points to changes over time. In the preferred implementation, the CAMs are connected in an order from top to bottom, and the bottom CAM points to the least recently used cache memory element.

...read moreread less

Patent•

Method and apparatus for write cache flush and fill mechanisms

[...]

Zohar Bogin¹, Steven J. Clohset¹•Institutions (1)

Intel¹

21 Sep 2000

Abstract: A write cache that reduces the number of memory accesses required to write data to main memory. When a memory write request is executed, the request not only updates the relevant location in cache memory, but the request is also directed to updating the corresponding location in main memory. A separate write cache is dedicated to temporarily holding multiple write requests so that they can be organized for more efficient transmission to memory in burst transfers. In one embodiment, all writes within a predefined range of addresses can be written to memory as a group. In another embodiment, entries are held in the write cache until a minimum number of entries are available for writing to memory, and a least-recently-used mechanism can be used to decide which entries to transmit first. In yet another embodiment, partial writes are merged into a single cache line, to be written to memory in a single burst transmission.

...read moreread less

Patent•

Caching system and method for a network storage system

[...]

Lih-Sheng Chiou, Michael Witkowski, Hawkins Yao, Cheh-Suei Yang, Sompong Paul Olarig - Show less +1 more

14 Dec 2000

TL;DR: In this paper, a cache system and method in accordance with the invention includes a cache near the target devices and another cache at the requesting host side so that the data traffic across the computer network is reduced.

...read moreread less

Abstract: A cache system and method in accordance with the invention includes a cache near the target devices and another cache at the requesting host side so that the data traffic across the computer network is reduced. A cache updating and invalidation method are described.

...read moreread less

Patent•

Multiple-thread processor with single-thread interface shared among threads

[...]

William N. Joy¹, Marc Tremblay¹, Gary Lauterbach¹, Joseph I. Chamdani¹•Institutions (1)

Sun Microsystems¹

09 May 2000

TL;DR: In this article, a technique for cache segregation utilizes logic for storing and communicating thread identification (TID) bits, which can be inserted at the most significant bits of the cache index.

...read moreread less

Abstract: A processor includes logic (612) for tagging a thread identifier (TID) for usage with processor blocks that are not stalled. Pertinent non-stalling blocks include caches, translation look-aside buffers (TLB) (1258, 1220), a load buffer asynchronous interface, an external memory management unit (MMU) interface (320, 330), and others. A processor (300) includes a cache that is segregated into a plurality of N cache parts. Cache segregation avoids interference, 'pollution', or 'cross-talk' between threads. One technique for cache segregation utilizes logic for storing and communicating thread identification (TID) bits. The cache utilizes cache indexing logic. For example, the TID bits can be inserted at the most significant bits of the cache index.

...read moreread less

Journal Article•DOI•

Location consistency-a new memory model and cache consistency protocol

[...]

Guang R. Gao¹, Vivek Sarkar²•Institutions (2)

University of Delaware¹, IBM²

01 Aug 2000-IEEE Transactions on Computers

TL;DR: This paper defines a new memory consistency model, called Location Consistency (LC), in which the state of a memory location is modeled as a partially ordered multiset (pomset) of write and synchronization operations.

...read moreread less

Abstract: Existing memory models and cache consistency protocols assume the memory coherence property which requires that all processors observe the same ordering of write operations to the same location In this paper, we address the problem of defining a memory model that does not rely on the memory coherence assumption and also the problem of designing a cache consistency protocol based on such a memory model We define a new memory consistency model, called Location Consistency (LC), in which the state of a memory location is modeled as a partially ordered multiset (pomset) of write and synchronization operations We prove that LC is strictly weaker than existing memory models, but is still equivalent to stronger models for the common case of parallel programs that have no data races We also describe a new multiprocessor cache consistency protocol based on the LC memory model We prove that this LC protocol obeys the LC memory model The LC protocol does not need to enforce single write ownership of memory blocks As a result, the LC protocol is simpler and more scalable than existing snooping and directory-based cache consistency protocols

...read moreread less

Journal Article•DOI•

Filtering memory references to increase energy efficiency

[...]

Johnson Kin¹, M. Gupta², William H. Mangione-Smith¹•Institutions (2)

University of California, Los Angeles¹, International Rectifier²

01 Jan 2000-IEEE Transactions on Computers

TL;DR: This work proposes sacrificing some performance in exchange for energy efficiency by filtering cache references through an unusually small first level cache, which results in a 51 percent reduction in the energy-delay product when compared to a conventional design.

...read moreread less

Abstract: Most modern microprocessors employ one or two levels of on-chip caches in order to improve performance. Caches typically are implemented with static RAM cells and often occupy a large portion of the chip area. Not surprisingly, these caches can consume a significant amount of power. In many applications, such as portable devices, energy efficiency is more important than performance. We propose sacrificing some performance in exchange for energy efficiency by filtering cache references through an unusually small first level cache. We refer to this structure as the filter cache. A second level cache, similar in size and structure to a conventional first level cache, is positioned behind the filter cache and serves to mitigate the performance loss. Extensive experiments indicate that a small filter cache still can achieve a high hit rate and good performance. This approach allows the second level cache to be in a low power mode most of the time, thus resulting in power savings. The filter cache is particularly attractive in low power applications, such as the embedded processors used for communication and multimedia applications. For example, experimental results across a wide range of embedded applications show that a direct mapped 255-byte filter cache achieves a 58 percent power reduction while reducing performance by 21 percent. This trade-off results in a 51 percent reduction in the energy-delay product when compared to a conventional design.

...read moreread less

Proceedings Article•DOI•

Memory system behavior of Java programs: methodology and analysis

[...]

Jin-Soo Kim¹, Yarsun Hsu²•Institutions (2)

Electronics and Telecommunications Research Institute¹, National Tsing Hua University²

01 Jun 2000

TL;DR: This paper studies the memory system behavior of Java programs by analyzing memory reference traces of several SPECjvm98 applications running with a Just-In-Time (JIT) compiler and finds that the overall cache miss ratio is increased due to garbage collection, which suffers from higher cache misses compared to the application.

...read moreread less

Abstract: This paper studies the memory system behavior of Java programs by analyzing memory reference traces of several SPECjvm98 applications running with a Just-In-Time (JIT) compiler. Trace information is collected by an exception-based tracing tool called JTRACE, without any instrumentation to the Java programs or the JIT compiler.First, we find that the overall cache miss ratio is increased due to garbage collection, which suffers from higher cache misses compared to the application. We also note that going beyond 2-way cache associativity improves the cache miss ratio marginally. Second, we observe that Java programs generate a substantial amount of short-lived objects. However, the size of frequently-referenced long-lived objects is more important to the cache performance, because it tends to determine the application's working set size. Finally, we note that the default heap configuration which starts from a small initial heap size is very inefficient since it invokes a garbage collector frequently. Although the direct costs of garbage collection decrease as we increase the available heap size, there exists an optimal heap size which minimizes the total execution time due to the interaction with the virtual memory performance.

...read moreread less

Patent•

Method and apparatus for reducing memory latency in a cache coherent multi-node architecture

[...]

Manoj Khare¹, Faye Briggs¹, Akhilesh Kumar¹, Lily Pao Looi¹, Kai Cheng¹ - Show less +1 more•Institutions (1)

Intel¹

28 Dec 2000

TL;DR: In this article, a speculative read request is issued to a home node before results of a cache coherence protocol are determined, and the home node initiates a read to memory to complete the read request.

...read moreread less

Abstract: A method for reducing memory latency in a multi-node architecture In one embodiment, a speculative read request is issued to a home node before results of a cache coherence protocol are determined The home node initiates a read to memory to complete the speculative read request Results of a cache coherence protocol may be determined by a coherence agent to resolve cache coherency after the speculative read request is issued

...read moreread less

Journal Article•DOI•

Fast priority queues for cached memory

[...]

Peter Sanders¹•Institutions (1)

Max Planck Society¹

31 Dec 2000-ACM Journal of Experimental Algorithms

TL;DR: In this article, a fast priority queue for external memory and cached memory that is based on k-way merging is proposed, which is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.

...read moreread less

Abstract: The cache hierarchy prevalent in todays high performance processors has to be taken into account in order to design algorithms that perform well in practice. This paper advocates the adaption of external memory algorithms to this purpose. This idea and the practical issues involved are exemplified by engineering a fast priority queue suited to external memory and cached memory that is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.

...read moreread less

Patent•

Multiple virtual machine system with efficient cache memory design

[...]

David W. Jensen¹, Steven E. Koenck¹•Institutions (1)

Rockwell Collins¹

31 Mar 2000

TL;DR: In this article, a multiple virtual machine system with a microprocessor for executing instructions and issuing memory reads and writes to a current partition of a plurality of partitions is described, and a next partition cache is selected by the partition management unit when the next partition becomes active on the microprocessor to receive next partition memory read hits and misses.

...read moreread less

Abstract: A multiple virtual machine system with a microprocessor for executing instructions and issuing memory reads and writes to a current partition of a plurality of partitions. The multiple virtual machine may contain a partition management unit for receiving the memory reads and writes from the microprocessor for the current partition. A current partition cache is selected by the partition management unit for the current partition to receive the current partition memory reads and writes from the partition management unit. The current partition cache resolves memory read hits and misses. A next partition cache is selected by the partition management unit when the next partition becomes active on the microprocessor to receive next partition memory reads and writes from the microprocessor. An external memory containing data organized in frame blocks and containing cache block addresses for the plurality of partitions provides data to the microprocessor when the current partition cache resolves the memory miss, and provides appropriate frame block data to the next partition cache to restore the next partition cache to a previous state.

...read moreread less

Patent•

Hardware mechanism for managing cache structures in a data storage system

[...]

Rodeny Augusta DeKoning¹, Dennis E. Gates¹, John R. Kloeppner¹•Institutions (1)

LSI Corporation¹

03 Jul 2000

TL;DR: In this article, a cache management processor is coupled with a cache cache management memory by a second link to manipulate the cache management structure in a hash table with linked lists at each hash queue element in accordance with cache management command and search key.

...read moreread less

Abstract: A system and method for managing data stored in a cache block in a cache memory includes a cache block is located at a cache block address in the cache memory, and the data in the cache block corresponds to a storage location in a storage array identified by a storage location identifier. A storage processor accesses the cache block in the cache memory and provides a cache management command to a command processor. A processor memory coupled to the storage processor stores a search key based on the storage location identifier corresponding to the cache block. A command processor coupled to the storage processor receives a cache management command specified by the storage processor and transfers the storage location identifier from the processor memory. A cache management memory stores a cache management structure including the cache block address and the search key. A cache management processor is coupled to the cache management memory by a second link to manipulate the cache management structure in a hash table with linked lists at each hash queue element within the cache management memory in accordance with the cache management command and the search key.

...read moreread less

Patent•

Method and system for servicing cache line in response to partial cache line request

[...]

Zohar Bogin, David J. Harriman, Zdzislaw A. Wirkus, Satish B. Acharya

29 Dec 2000

TL;DR: In this paper, the authors describe a system for servicing a full cache line in response to a partial cache line request, which includes a storage to store at least one cache line, a hit/miss detector, and a data mover.

...read moreread less

Abstract: A system is described for servicing a full cache line in response to a partial cache line request. The system includes a storage to store at least one cache line, a hit/miss detector, and a data mover. The hit/miss detector receives a partial cache line read request from a requesting agent and dispatches a fetch request to a memory device to fetch a full cache line data that contains data requested in the partial cache line read request from the requesting agent. The data mover loads the storage with the full cache line data returned from the memory device and forwards a portion of the full cache line data requested by the requesting agent. If data specified in a subsequent partial cache line request from the requesting agent is contained within the full cache line data specified in the previously dispatched fetch request, the hit/miss detector will send a command to the data mover to forward another portion of the full cache line data stored in the storage to the requesting agent. In one embodiment, the system also includes a write combining logic to combine two or more consecutive write requests that meet defined conditions into a single write request.

...read moreread less

Collapse