
Showing papers on "Cache pollution published in 2001"


Proceedings ArticleDOI
01 May 2001
TL;DR: This paper discusses policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused, and proposes adaptive policies that effectively reduce L1 cache leakage energy by 5x for the SPEC2000 with only negligible degradations in performance.
Abstract: Power dissipation is increasingly important in CPUs ranging from those intended for mobile use, all the way up to high-performance processors for high-end servers. While the bulk of the power dissipated is dynamic switching power, leakage power is also beginning to be a concern. Chipmakers expect that in future chip generations, leakage's proportion of total chip power will increase significantly. This paper examines methods for reducing leakage power within the cache memories of the CPU. Because caches comprise much of a CPU chip's area and transistor counts, they are reasonable targets for attacking leakage. We discuss policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused. In particular, our approach is targeted at the generational nature of cache line usage. That is, cache lines typically have a flurry of frequent use when first brought into the cache, and then have a period of “dead time” before they are evicted. By devising effective, low-power ways of deducing dead time, our results show that in many cases we can reduce L1 cache leakage energy by 4x in SPEC2000 applications without impacting performance. Because our decay-based techniques have notions of competitive on-line algorithms at their roots, their energy usage can be theoretically bounded at within a factor of two of the optimal oracle-based policy. We also examine adaptive decay-based policies that make energy-minimizing policy choices on a per-application basis by choosing appropriate decay intervals individually for each cache line. Our proposed adaptive policies effectively reduce L1 cache leakage energy by 5x for the SPEC2000 with only negligible degradations in performance.
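A minimal software sketch of the decay mechanism described above, assuming a single fixed decay interval rather than the adaptive per-line intervals the paper proposes; the class and field names (DecayCache, idle_ticks) and the direct-mapped, 64-byte-line geometry are illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of cache-line decay: every access resets a per-line idle counter;
// lines whose counter exceeds the decay interval are invalidated so their
// storage can notionally be gated off to save leakage. Direct-mapped, 64-byte
// lines, single global decay interval -- all illustrative simplifications.
struct DecayLine {
    std::uint64_t tag = 0;
    bool valid = false;
    std::uint32_t idle_ticks = 0;   // driven by a coarse global tick
};

class DecayCache {
public:
    DecayCache(std::size_t num_lines, std::uint32_t decay_interval)
        : lines_(num_lines), decay_interval_(decay_interval) {}

    bool access(std::uint64_t addr) {
        DecayLine& l = lines_[index(addr)];
        bool hit = l.valid && l.tag == tagOf(addr);
        l.tag = tagOf(addr);
        l.valid = true;
        l.idle_ticks = 0;           // any touch resets the decay counter
        return hit;
    }

    // Called periodically (e.g., every few thousand cycles) to decay idle lines.
    void tick() {
        for (DecayLine& l : lines_) {
            if (l.valid && ++l.idle_ticks > decay_interval_) {
                l.valid = false;    // "turned off": the next access will miss
                l.idle_ticks = 0;
            }
        }
    }

private:
    std::size_t index(std::uint64_t addr) const { return (addr / 64) % lines_.size(); }
    std::uint64_t tagOf(std::uint64_t addr) const { return addr / (64 * lines_.size()); }

    std::vector<DecayLine> lines_;
    std::uint32_t decay_interval_;
};
```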

725 citations


Proceedings Article
11 Sep 2001
TL;DR: This paper proposes a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page, and demonstrates that in-page data placement is the key to high cache performance.
Abstract: Relational database systems have traditionally optimized for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance are becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), which significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results, when compared to NSM (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM’s stall time due to data cache accesses, (b) range selection queries and updates on memory-resident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster.
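The layout difference is easy to show in miniature: the sketch below contrasts an NSM-style page of whole records with a PAX-style page that groups each attribute's values, so a scan over one attribute touches contiguous memory. The record fields and page capacity are invented for illustration.

```cpp
#include <array>
#include <cstdint>

constexpr int kRecordsPerPage = 256;   // illustrative page capacity

// NSM-style layout: whole records stored one after another. Scanning a single
// attribute strides over unrelated fields and drags them into the cache.
struct NsmRecord {
    std::int32_t id;
    std::int32_t quantity;
    double price;
};
using NsmPage = std::array<NsmRecord, kRecordsPerPage>;

// PAX-style layout: the same records, but each attribute's values are grouped
// into their own "minipage" inside the page, so a single-attribute scan reads
// contiguous, densely packed values.
struct PaxPage {
    std::array<std::int32_t, kRecordsPerPage> id;
    std::array<std::int32_t, kRecordsPerPage> quantity;
    std::array<double, kRecordsPerPage> price;
};

// Same logical work, very different cache behavior on the price column.
double sumPricesNsm(const NsmPage& p) {
    double s = 0;
    for (const NsmRecord& r : p) s += r.price;   // strided access
    return s;
}

double sumPricesPax(const PaxPage& p) {
    double s = 0;
    for (double v : p.price) s += v;             // contiguous access
    return s;
}
```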

428 citations


Proceedings ArticleDOI
01 Dec 2001
TL;DR: Two previously-proposed techniques, way-prediction and selective direct-mapping, are applied to reducing L1 cache dynamic energy while maintaining high performance, and caches achieve the energy-delay of sequential access while maintaining the performance of parallel access.
Abstract: Set-associative caches achieve low miss rates for typical applications but result in significant energy dissipation. Set-associative caches minimize access time by probing all the data ways in parallel with the tag lookup, although the output of only the matching way is used. The energy spent accessing the other ways is wasted. Eliminating the wasted energy by performing the data lookup sequentially following the tag lookup substantially increases cache access time, and is unacceptable for high-performance L1 caches. In this paper, we apply two previously-proposed techniques, way-prediction and selective direct-mapping, to reducing L1 cache dynamic energy while maintaining high performance. The techniques predict the matching way and probe only the predicted way and not all the ways, achieving energy savings. While these techniques were originally proposed to improve set-associative cache access times, this is the first paper to apply them to reducing cache energy. We evaluate the effectiveness of these techniques in reducing L1 d-cache, L1 i-cache, and overall processor energy. Using these techniques, our caches achieve the energy-delay of sequential access while maintaining the performance of parallel access. Relative to parallel access L1 i- and d-caches, the techniques achieve overall processor energy-delay reduction of 8%, while perfect way-prediction with no performance degradation achieves 10% reduction. The performance degradation of the techniques is less than 3%, compared to an aggressive, 1-cycle, 4-way, parallel access cache.
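In pseudocode terms, the access path is: probe only the predicted way first and fall back to the remaining ways on a misprediction. The sketch below uses a trivial per-set "last matching way" predictor, which is only a stand-in for the paper's way-prediction and selective direct-mapping mechanisms; sizes and names are illustrative.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of way-prediction on a 4-way set-associative cache: probe only the
// predicted way first and pay extra probes (time and energy) only on a
// misprediction.
constexpr int kWays = 4;
constexpr int kSets = 64;

struct Way { std::uint64_t tag = 0; bool valid = false; };

struct Set {
    std::array<Way, kWays> ways;
    int predicted_way = 0;          // trivially trained: last way that matched
};

class WayPredictedCache {
public:
    // Returns the number of ways probed (1 when the prediction is correct).
    int access(std::uint64_t addr, bool& hit) {
        Set& s = sets_[setIndex(addr)];
        const std::uint64_t tag = tagOf(addr);

        int probes = 1;                       // first probe: predicted way only
        Way& pred = s.ways[s.predicted_way];
        if (pred.valid && pred.tag == tag) { hit = true; return probes; }

        for (int w = 0; w < kWays; ++w) {     // mispredict: probe the rest
            if (w == s.predicted_way) continue;
            ++probes;
            if (s.ways[w].valid && s.ways[w].tag == tag) {
                s.predicted_way = w;          // retrain the predictor
                hit = true;
                return probes;
            }
        }
        hit = false;                          // miss: fill the predicted way
        s.ways[s.predicted_way] = Way{tag, true};
        return probes;
    }

private:
    std::size_t setIndex(std::uint64_t addr) const { return (addr / 64) % kSets; }
    std::uint64_t tagOf(std::uint64_t addr) const { return addr / (64 * kSets); }
    std::array<Set, kSets> sets_;
};
```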

310 citations


Patent
08 Jun 2001
TL;DR: The PIRANHA system as discussed by the authors is a chip-multiprocessing system with a scalable architecture, including on a single chip: a plurality of processor cores; a two-level cache hierarchy; an intra-chip switch; one or more memory controllers; a cache coherence protocol; and an interconnect subsystem.
Abstract: A chip-multiprocessing system with scalable architecture, including on a single chip: a plurality of processor cores; a two-level cache hierarchy; an intra-chip switch; one or more memory controllers; a cache coherence protocol; one or more coherence protocol engines; and an interconnect subsystem. The two-level cache hierarchy includes first-level and second-level caches. In particular, the first-level caches include a pair of instruction and data caches for, and private to, each processor core. The second-level cache has a relaxed inclusion property, the second-level cache being logically shared by the plurality of processor cores. Each of the plurality of processor cores is capable of executing an instruction set of the ALPHA™ processing core. The scalable architecture of the chip-multiprocessing system is targeted at parallel commercial workloads. A showcase example of the chip-multiprocessing system, called the PIRANHA™ system, is a highly integrated processing node with eight simpler ALPHA™ processor cores. A method for scalable chip-multiprocessing is also provided.

294 citations


Patent
26 Apr 2001
TL;DR: In this article, the authors present information object repository selection procedures for determining which of a number of information object repositories should service a request for the information object, including a direct cache selection process, a redirect cache selection process, a remote DNS cache selection process, or a local DNS cache selection process.
Abstract: Various information object repository selection procedures for determining which of a number of information object repositories should service a request for the information object include a direct cache selection process, a redirect cache selection process, a remote DNS cache selection process, or a local DNS cache selection process. Different combinations of these procedures may also be used. For example, different combinations may be used depending on the type of content being requested. The direct cache selection process may be used for information objects that will be immediately loaded without user action, while any of the redirect cache selection process, the remote DNS cache selection process and/or the local DNS cache selection process may be used for information objects that will be loaded only after some user action.

231 citations


Patent
13 Aug 2001
TL;DR: In this article, a content analysis engine determines which of the caches a data item should be stored in, based on an analysis of data requests or data items served in response to the requests, guidelines set by a system administrator, etc.
Abstract: A multi-tier caching system and method of operating the same. The system comprises a first cache implemented in operating system or kernel space (e.g., in memory managed by or allocated to an operating system) and a second cache implemented in application or user space (e.g., in memory managed by or allocated to an application program). Data requests requiring little processing to identify responsive data may be served from the first cache, while those requiring further processing are served from the second. The first cache may therefore store frequently requested data items or items that can be served in response to requests having different forms, qualifiers or other indicia. A content analysis engine determines which of the caches a data item should be stored in, based on an analysis of data requests or data items served in response to the requests, guidelines set by a system administrator, etc.
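A hedged sketch of the two-tier lookup described above, with both tiers modeled as in-memory maps; the normalize() step and the formStable flag stand in for the patent's content analysis engine and are assumptions for illustration, not its actual interfaces.

```cpp
#include <cctype>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative two-tier cache: tier 1 stands in for the kernel-space cache
// that serves exact-match requests with no interpretation; tier 2 stands in
// for the application-space cache that can serve requests needing further
// processing. normalize() and formStable are assumptions, not the patent's API.
class TwoTierCache {
public:
    std::optional<std::string> lookup(const std::string& request) {
        // Tier 1: exact match on the raw request, cheapest path.
        if (auto it = tier1_.find(request); it != tier1_.end()) return it->second;

        // Tier 2: requests that must be normalized (qualifiers stripped,
        // parameters reordered, ...) before a cache key can be formed.
        if (auto it = tier2_.find(normalize(request)); it != tier2_.end())
            return it->second;

        return std::nullopt;   // miss in both tiers
    }

    // Content-analysis decision: items whose requests are stable in form go to
    // tier 1; items reachable under many request forms go to tier 2.
    void store(const std::string& request, const std::string& data, bool formStable) {
        if (formStable) tier1_[request] = data;
        else            tier2_[normalize(request)] = data;
    }

private:
    static std::string normalize(const std::string& request) {
        std::string key = request;       // placeholder canonicalization
        for (char& c : key)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return key;
    }

    std::unordered_map<std::string, std::string> tier1_;  // "kernel-space" tier
    std::unordered_map<std::string, std::string> tier2_;  // "user-space" tier
};
```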

220 citations


Proceedings ArticleDOI
19 Jan 2001
TL;DR: It is shown that even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half of its time stalling for L2 misses.
Abstract: In this paper we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half of its time stalling for L2 misses. Large cache blocks can improve performance, but only when coupled with wide memory channels. DRAM address mappings also affect performance significantly. We evaluate an aggressive prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 43% speedup across 10 of the 26 SPEC2000 benchmarks, without degrading performance on the others. With eight Rambus channels, these ten benchmarks improve to within 10% of the performance of a perfect L2 cache.

213 citations


Journal ArticleDOI
TL;DR: The proposed Speculative Versioning Cache uses distributed caches to eliminate the latency and bandwidth problems of the ARB and conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches.
Abstract: Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location. Program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the Address Resolution Buffer (ARB) uses a centralized buffer to support speculative versions. Our proposal, called the Speculative Versioning Cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. Our evaluation for the Multiscalar architecture shows that hit latency is an important factor affecting performance and that private cache solutions trade off hit rate for hit latency.

167 citations


Patent
01 Mar 2001
TL;DR: In this article, a garbage collector that uses an LRU algorithm and a node table to free memory from an XML DOM tree active in an application cache is described, removing least recently used nodes until the cache falls below a memory threshold.
Abstract: The present invention relates to a garbage collector that uses an LRU algorithm to free memory from an XML DOM tree active in an application cache. According to one or more embodiments of the present invention, a threshold for the amount of memory permitted to reside in an application cache is set. Then, a garbage collector removes entries from the cache until it falls below the threshold. In one or more embodiments, a node table is used. When nodes are added to the XML DOM tree in the application cache, the node table is updated. When the threshold for the amount of memory permitted to reside in the application cache is exceeded, the garbage collector applies an LRU algorithm that uses the node table to determine which nodes to remove from the application cache. In one embodiment, the LRU algorithm scans the node table to determine the least recently used node in the table by examining time stamp entries in the table. Then, the algorithm removes that node and repeats the process until the XML DOM tree uses less memory in the cache than the threshold.
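The eviction loop is simple enough to sketch: a node table maps each cached node to a last-access timestamp and an approximate size, and collection removes the least recently used node until memory use drops under the threshold. Identifiers and the size accounting below are illustrative, not the patent's.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Sketch of the LRU garbage collector: the "node table" maps a node id to its
// last-access timestamp and approximate memory footprint; collect() evicts
// least-recently-used nodes until usage falls under the threshold.
class DomCacheGC {
public:
    explicit DomCacheGC(std::size_t threshold_bytes) : threshold_(threshold_bytes) {}

    void onNodeAdded(std::uint64_t node_id, std::size_t bytes) {
        table_[node_id] = {++clock_, bytes};
        used_ += bytes;
        if (used_ > threshold_) collect();
    }

    void onNodeAccessed(std::uint64_t node_id) {
        auto it = table_.find(node_id);
        if (it != table_.end()) it->second.last_access = ++clock_;
    }

private:
    struct Entry { std::uint64_t last_access; std::size_t bytes; };

    void collect() {
        // Scan for the oldest timestamp and evict; repeat until the cache is
        // back under the threshold (linear scan, as in a simple timestamp LRU).
        while (used_ > threshold_ && !table_.empty()) {
            auto victim = table_.begin();
            for (auto it = table_.begin(); it != table_.end(); ++it)
                if (it->second.last_access < victim->second.last_access)
                    victim = it;
            used_ -= victim->second.bytes;
            // A real implementation would detach the node from the DOM tree here.
            table_.erase(victim);
        }
    }

    std::unordered_map<std::uint64_t, Entry> table_;
    std::size_t used_ = 0;
    std::size_t threshold_;
    std::uint64_t clock_ = 0;
};
```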

154 citations


Proceedings ArticleDOI
17 Jun 2001
TL;DR: In this paper, an analytical cache model for time-shared systems is presented, which estimates the overall cache miss-rate of a multiprocessing system with any cache size and time quanta.
Abstract: An accurate, tractable, analytic cache model for time-shared systems is presented, which estimates the overall cache miss-rate of a multiprocessing system with any cache size and time quanta. The input to the model consists of the isolated miss-rate curves for each process, the time quanta for each of the executing processes, and the total cache size. The output is the overall miss-rate. Trace-driven simulations demonstrate that the estimated miss-rate is very accurate. Since the model provides a fast and accurate way to estimate the effect of context switching, it is useful for both understanding the effect of context switching on caches and optimizing cache performance for time-shared systems. A cache partitioning mechanism is also presented and is shown to improve the cache miss-rate up to 25% over the normal LRU replacement policy.
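As a rough illustration only (the paper derives the per-process cache share analytically; the proportional split below is a placeholder assumption), the model's inputs can be combined into an overall miss-rate estimate along these lines:

```cpp
#include <cstddef>
#include <vector>

// Deliberately simplified stand-in for the analytical model: each process is
// assigned a cache share (here, proportional to its time quantum), its miss
// rate is read off its isolated miss-rate curve at that share, and the overall
// miss rate is the reference-weighted average. The paper's model estimates the
// per-process share far more carefully; this only shows how the inputs combine.
struct Process {
    std::vector<double> miss_rate_curve;  // isolated miss rate vs. cache size (lines)
    double time_quantum = 0;              // scheduling quantum of the process
    double references = 0;                // memory references issued per quantum
};

double estimateOverallMissRate(const std::vector<Process>& procs,
                               std::size_t total_cache_lines) {
    double total_quanta = 0;
    for (const Process& p : procs) total_quanta += p.time_quantum;
    if (total_quanta == 0) return 0.0;

    double weighted_misses = 0, total_refs = 0;
    for (const Process& p : procs) {
        if (p.miss_rate_curve.empty()) continue;
        // Naive share of the cache: proportional to the process's time quantum.
        std::size_t share = static_cast<std::size_t>(
            total_cache_lines * (p.time_quantum / total_quanta));
        if (share >= p.miss_rate_curve.size())
            share = p.miss_rate_curve.size() - 1;
        weighted_misses += p.references * p.miss_rate_curve[share];
        total_refs += p.references;
    }
    return total_refs > 0 ? weighted_misses / total_refs : 0.0;
}
```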

130 citations


Patent
08 Jun 2001
TL;DR: In this article, a method and system for exclusive two-level caching in a chip-multiprocessor is presented to maximize the effective use of on-chip cache.
Abstract: To maximize the effective use of on-chip cache, a method and system for exclusive two-level caching in a chip-multiprocessor are provided. The exclusive two-level caching in accordance with the present invention involves relaxing the inclusion requirement in a two-level cache system in order to form an exclusive cache hierarchy. Additionally, the exclusive two-level caching involves providing a first-level tag-state structure in a first-level cache of the two-level cache system. The first tag-state structure has state information. The exclusive two-level caching also involves maintaining in a second-level cache of the two-level cache system a duplicate of the first-level tag-state structure and extending the state information in the duplicate of the first tag-state structure, but not in the first-level tag-state structure itself, to include an owner indication. The exclusive two-level caching further involves providing in the second-level cache a second tag-state structure so that a simultaneous lookup at the duplicate of the first tag-state structure and the second tag-state structure is possible. Moreover, the exclusive two-level caching involves associating a single owner with a cache line at any given time of its lifetime in the chip-multiprocessor.

Patent
27 Jun 2001
TL;DR: In this article, a system and method to reduce the time for system initializations is described, where data accessed during a system initialization is loaded into a non-volatile cache and is pinned to prevent eviction.
Abstract: A system and method to reduce the time for system initializations is disclosed. In accordance with the invention, data accessed during a system initialization is loaded into a non-volatile cache and is pinned to prevent eviction. By pinning data into the cache, the data required for system initialization is pre-loaded into the cache on a system reboot, thereby eliminating the need to access a disk.

Patent
25 Jan 2001
TL;DR: In this paper, a system for adaptively bypassing one or more higher cache levels following a miss in a lower level of a cache hierarchy is described, where each cache level preferably includes a tag store containing address and state information for each cache line resident in the respective cache.
Abstract: A system for adaptively bypassing one or more higher cache levels following a miss in a lower level of a cache hierarchy is described. Each cache level preferably includes a tag store containing address and state information for each cache line resident in the respective cache. When an invalidate request is received at a given cache hierarchy, each cache level is searched for the address specified by the invalidate request. When an address match is detected, the state of the respective cache line is changed to the invalid state, although the address of the cache line is left in the tag store. Thereafter, if the processor or entity associated with this cache hierarchy issues its own request for this same cache line, the cache hierarchy begins searching the tag store of each level starting with the lowest cache level. Since the address of the invalidated cache line was left in the respective tag store, a match will be detected at one of the cache levels, although the corresponding state of this cache line is invalid. This condition is specifically detected and is considered to be an “inval_miss” occurrence. In response to an inval_miss, the cache hierarchy calls off searching any higher levels, and instead, issues a memory reference request for the desired cache line. In a further embodiment, the entity that sourced an invalidate request is stored, and a subsequent memory reference request for the same cache line is sent directly to the source entity.
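The key case is an address match whose state is invalid, the "inval_miss": it signals a recent invalidation, so searching higher levels is pointless and a memory request is issued instead. A simplified sketch of that search, with each tag store reduced to a map; the identifiers are illustrative.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of the adaptive-bypass lookup: search tag stores from the lowest
// cache level upward; a tag match in the Invalid state ("inval_miss") means
// the line was invalidated by another agent, so the hierarchy stops searching
// higher levels and issues a memory request directly.
enum class State { Invalid, Shared, Exclusive, Modified };

struct CacheLevel {
    // Tag store: line address -> coherence state (the address is kept in the
    // tag store even after the line has been invalidated).
    std::unordered_map<std::uint64_t, State> tags;
};

enum class LookupResult { Hit, InvalMiss, TrueMiss };

LookupResult lookupHierarchy(std::vector<CacheLevel>& levels, std::uint64_t line_addr) {
    for (CacheLevel& level : levels) {                // lowest level first
        auto it = level.tags.find(line_addr);
        if (it == level.tags.end()) continue;         // no address match: go up a level
        if (it->second != State::Invalid) return LookupResult::Hit;
        // Address present but invalid: the line was invalidated recently, so
        // higher levels are bypassed and a memory reference is issued instead.
        return LookupResult::InvalMiss;
    }
    return LookupResult::TrueMiss;                    // missed every level
}
```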

Patent
08 Aug 2001
TL;DR: In this paper, the cache system determines that an object, such as an image file, is missing from the cache memory, locates sufficient components from cache memory and/or external storage, and constructs the object from the located components.
Abstract: Methods and apparatus for constructing objects within a cache system thereby allowing the cache system to respond to requested objects that are not initially available within the cache system. One embodiment of the invention caches image files, where the images are divided into components and stored in a format that allows identification and access to the components. The cache system determines that an object, such as an image file, is missing from the cache memory, locates sufficient components from the cache memory and/or external storage, and constructs the object from the located components.

Journal ArticleDOI
TL;DR: This study shows that the two proposed schemes are not only effective in salvaging the cache content but consume significantly less energy than their counterparts.
Abstract: Caching can reduce the bandwidth requirement in a wireless computing environment as well as minimize the energy consumption of wireless portable computers. To facilitate mobile clients in ascertaining the validity of their cache content, servers periodically broadcast cache invalidation reports that contain information of data that has been updated. However, as mobile clients may operate in a doze or even totally disconnected mode (to conserve energy), it is possible that some reports may be missed and the clients are forced to discard the entire cache content. In this paper, we reexamine the issue of designing cache invalidation strategies. We identify the basic issues in designing cache invalidation strategies. From the solutions to these issues, a large set of cache invalidation schemes can be constructed. We evaluate the performance of four representative algorithms-two of which are known algorithms (i.e., Dual-Report Cache Invalidation and Bit-Sequences) while the other two are their counterparts that exploit selective tuning (namely, Selective Dual-Report Cache Invalidation and Bit-Sequences with Bit Count). Our study shows that the two proposed schemes are not only effective in salvaging the cache content but consume significantly less energy than their counterparts. While the Selective Dual-Report Cache Invalidation scheme performs best in most cases, it is inferior to the Bit-Sequences with the Bit-Count scheme under high update rates.

Patent
07 Jun 2001
TL;DR: In this article, a proxy partition cache (PPC) architecture and a technique for address-partitioning a proxy cache consisting of a grouping of discrete, cooperating caches (servers) is provided.
Abstract: A proxy partition cache (PPC) architecture and a technique for address-partitioning a proxy cache consisting of a grouping of discrete, cooperating caches (servers) is provided. Client requests for objects (files) of a given size are redirected or reassigned to a single cache in the grouping, notwithstanding the cache to which the request is made by the load-balancing mechanism (such as a Layer 4 switch) based upon load-balancing considerations. The file is then returned to the switch via the switch-designated cache for vending to the requesting client. The redirection/reassignment occurs according to a function within the cache to which the request is directed so that the switch remains freed from additional tasks that can compromise speed.

Patent
31 Oct 2001
TL;DR: In this article, a cache memory system can determine that an entry is stale if the entry has not been accessed or modified for a predetermined time, and the predetermined time is made dynamically variable.
Abstract: A cache memory system can determine that an entry is stale if the entry has not been accessed or modified for a predetermined time. If an entry is stale, the entry may be preemptively evicted. The predetermined time is made dynamically variable. A computer system can adjust the time to optimize a measure of performance. In a specific example, evicted lines are temporarily stored in an eviction queue. The time is adjusted to be as short as possible without substantially increasing the number of lines that must be recalled from the eviction queue.
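A sketch of the feedback loop this describes: lines idle past a staleness threshold are preemptively evicted into a small eviction queue, and the threshold is tuned so that recalls from the queue stay rare. The specific grow/shrink rule below is an assumption for illustration, not the patent's algorithm.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

// Sketch of preemptive eviction of stale lines with a dynamically tuned
// staleness threshold. Evicted lines park in an eviction queue; if they are
// recalled too often, the threshold grows, otherwise it slowly shrinks.
class StaleEvictingCache {
public:
    void access(std::uint64_t line, std::uint64_t now) {
        if (cache_.count(line)) { cache_[line] = now; return; }      // hit
        // Check the eviction queue: a recall means we evicted too eagerly.
        for (auto it = queue_.begin(); it != queue_.end(); ++it) {
            if (*it == line) {
                queue_.erase(it);
                ++recalls_;
                threshold_ *= 2;                     // be less aggressive
                break;
            }
        }
        cache_[line] = now;                                          // (re)fill
    }

    // Periodic sweep: evict lines idle longer than the current threshold.
    void sweep(std::uint64_t now) {
        for (auto it = cache_.begin(); it != cache_.end();) {
            if (now - it->second > threshold_) {
                queue_.push_back(it->first);          // park, don't discard yet
                if (queue_.size() > kQueueCap) queue_.pop_front();
                it = cache_.erase(it);
            } else {
                ++it;
            }
        }
        // Gentle pressure toward a shorter threshold when recalls are rare.
        if (recalls_ == 0 && threshold_ > kMinThreshold) threshold_ -= 1;
        recalls_ = 0;
    }

private:
    static constexpr std::size_t kQueueCap = 64;
    static constexpr std::uint64_t kMinThreshold = 16;

    std::unordered_map<std::uint64_t, std::uint64_t> cache_;  // line -> last access
    std::deque<std::uint64_t> queue_;                         // recently evicted lines
    std::uint64_t threshold_ = 1024;                          // staleness limit (ticks)
    unsigned recalls_ = 0;
};
```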

Patent
11 Jun 2001
TL;DR: In this article, the authors propose a cache coherence protocol for a plurality of processor nodes and input/output nodes, where each processor node includes a multiplicity of processor cores, an interface to a local memory system and a protocol engine.
Abstract: A computer system has a plurality of processor nodes and a plurality of input/output nodes. Each processor node includes a multiplicity of processor cores, an interface to a local memory system and a protocol engine implementing a predefined cache coherence protocol. Each processor core has an associated memory cache for caching memory lines of information. Each input/output node includes no processor cores, an input/output interface for interfacing to an input/output bus or input/output device, a memory cache for caching memory lines of information and an interface to a local memory subsystem. The local memory subsystem of each processor node and input/output node stores a multiplicity of memory lines of information. The protocol engine of each processor node and input/output node implements the same predefined cache coherence protocol.

Patent
Masayoshi Kobayashi
25 Jul 2001
TL;DR: In this article, a path calculating section obtains a path suitable for carrying out an automatic cache updating operation, a link prefetching operation, and a cache server cooperating operation, based on QoS path information that includes network path information and path load information obtained by a path information obtaining section.
Abstract: A path calculating section obtains a path suitable for carrying out an automatic cache updating operation, a link prefetching operation, and a cache server cooperating operation, based on QoS path information that includes network path information and path load information obtained by a QoS path information obtaining section. An automatic cache updating section, a link prefetching control section, and a cache server cooperating section carry out respective ones of the automatic cache updating operation, the link prefetching operation, and the cache server cooperating operation, by utilizing the path obtained. For example, the path calculating section obtains a maximum remaining bandwidth path as the path.

Patent
27 Aug 2001
TL;DR: In this article, a cache directory is also provided to track cache lines in the write cache and the at least one read cache, which provides a low-latency copy of data that is most likely to be used.
Abstract: A caching input/output hub includes a host interface to connect with a host. At least one input/output interface is provided to connect with an input/output device. A write cache manages memory writes initiated by the input/output device. At least one read cache, separate from the write cache, provides a low-latency copy of data that is most likely to be used. The at least one read cache is in communication with the write cache. A cache directory is also provided to track cache lines in the write cache and the at least one read cache. The cache directory is in communication with the write cache and the at least one read cache.

Proceedings ArticleDOI
08 Sep 2001
TL;DR: The r-a cache is proposed, which provides flexible associativity by placing most blocks in direct-mapped positions and reactively displacing only conflicting blocks to set-associative positions, and using a novel PC-based way-prediction to achieve high accuracy.
Abstract: While set-associative caches typically incur fewer misses than direct-mapped caches, set-associative caches have slower hit times. We propose the reactive-associative cache (r-a cache), which provides flexible associativity by placing most blocks in direct-mapped positions and reactively displacing only conflicting blocks to set-associative positions. The r-a cache uses way-prediction (like the predictive associative cache, PSA) to access displaced blocks on the initial probe. Unlike PSA, however, the r-a cache employs a novel feedback mechanism to prevent unpredictable blocks from being displaced. Reactive displacement and feedback allow the r-a cache to use a novel PC-based way-prediction and achieve high accuracy, without impractical block swapping as in column-associative and group-associative caches, and without relying on timing-constrained XOR way prediction. A one-port, 4-way r-a cache achieves up to 9% speedup over a direct-mapped cache and performs within 2% of an idealized 2-way set-associative, 1-cycle cache. A 4-way r-a cache achieves up to 13% speedup over a PSA cache, with both r-a and PSA using the PC scheme. CACTI estimates that for sizes larger than 8KB, a 4-way r-a cache is within 1% of direct-mapped hit times, and 24% faster than a 2-way set-associative cache.

Patent
Terry L. Kendall
28 Mar 2001
TL;DR: A small cache memory can be incorporated with a main memory, such as a flash memory, on an integrated circuit to improve average access times between a processor and the main memory as discussed by the authors, which can also allow a suspended transfer with minimal latency when the transfer is resumed.
Abstract: A small cache memory can be incorporated with a main memory, such as a flash memory, on an integrated circuit to improve average access times between a processor and the main memory. To minimize cost and complexity, the cache memory may contain only a few words of data. The cache can also allow a suspended transfer with minimal latency when the transfer is resumed. Designing the cache memory to interface with the processor over a standard memory bus permits the cache to be implemented in a system that could otherwise have no cache memory unless the processor and/or memory bus were redesigned.

Proceedings ArticleDOI
23 Apr 2001
TL;DR: The potential for addressing bandwidth limitations by increasing global cache reuse, that is, reusing data across the whole program and over the entire data collection, is explored through a two-step global strategy.
Abstract: Reusing data in cache is critical to achieving high performance on modern machines because it reduces the impact of the latency and bandwidth limitations of direct memory access. To date, most studies of software memory hierarchy management have focused on the latency problem. However, today's machines are increasingly limited by insufficient memory bandwidth; on these machines, latency-oriented techniques are inadequate because they do not seek to minimize the total memory traffic over the whole program. This paper explores the potential for addressing bandwidth limitations by increasing global cache reuse, that is, reusing data across the whole program and over the entire data collection. To this end, the paper explores a two-step global strategy. The first step fuses computations on the same data to enable the caching of repeated accesses. The second step groups data used by the same computation to bring about contiguous access to memory. While the first step reduces the frequency of memory accesses, the second step improves their efficiency. The paper demonstrates the effectiveness of this strategy and shows how to automate it in a production compiler.
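Both steps have direct source-level analogues, sketched below: fusing two traversals of the same array so each element is reused while still cached, and grouping the fields one computation needs so its traversal is contiguous. The arrays and field names are invented for illustration.

```cpp
#include <vector>

// Step 1, computation fusion: two passes over the same array become one, so
// each element is reused while it is still resident in the cache.
void separatePasses(std::vector<double>& a, double s, double t) {
    for (double& x : a) x *= s;          // pass 1: a streams through the cache
    for (double& x : a) x += t;          // pass 2: a streams through again
}

void fusedPass(std::vector<double>& a, double s, double t) {
    for (double& x : a) x = x * s + t;   // one pass: each element loaded once
}

// Step 2, data grouping: fields used by the same computation are stored
// together, so the traversal reads contiguous memory instead of skipping
// over fields it never touches.
struct ParticleMixed {                   // position update drags along unused fields
    double x, y, z;
    double mass, charge, radius;
};

struct Position { double x, y, z; };     // only what the position update needs

void updateMixed(std::vector<ParticleMixed>& ps, double dx) {
    for (ParticleMixed& p : ps) { p.x += dx; p.y += dx; p.z += dx; }
}

void updateGrouped(std::vector<Position>& ps, double dx) {
    for (Position& p : ps) { p.x += dx; p.y += dx; p.z += dx; }
}
```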

Patent
16 Oct 2001
TL;DR: In this paper, the coherency protocol of a multiprocessor data processing system is described, which includes processing logic that returns to coherent operations with other processing units responsive to an occurrence of a pre-determined condition.
Abstract: A multiprocessor data processing system comprising a plurality of processing units, a plurality of caches, each of which is affiliated with one of the processing units, and processing logic that, responsive to a receipt of a first system bus response to a coherency operation, causes the requesting processor to execute operations utilizing super-coherent data. The data processing system further includes logic for eventually returning to coherent operations with other processing units responsive to an occurrence of a pre-determined condition. The coherency protocol of the data processing system includes a first coherency state that indicates that modification of data within a shared cache line of a second cache of a second processor has been snooped on a system bus of the data processing system. When the cache line is in the first coherency state, subsequent requests for the cache line are issued as a Z1 read on a system bus and one of two responses is received. If the response to the Z1 read indicates that the first processor should utilize local data currently available within the cache line, the first coherency state is changed to a second coherency state that indicates to the first processor that subsequent requests for the cache line should utilize the data within the local cache and not be issued to the system interconnect. Coherency state transitions to the second coherency state are completed via the coherency protocol of the data processing system. Super-coherent data is provided to the processor from the cache line of the local cache whenever the second coherency state is set for the cache line and a request is received.

Proceedings ArticleDOI
01 May 2001
TL;DR: This paper proposes two index structures, pkT-trees and pkB-tree, which significantly reduce cache misses by storing partial-key information in the index, and shows that a small, fixed amount of key information allows most cache misses to be avoided, allowing for a simple node structure and efficient implementation.
Abstract: The performance of main-memory index structures is increasingly determined by the number of CPU cache misses incurred when traversing the index. When keys are stored indirectly, as is standard in main-memory databases, the cost of key retrieval in terms of cache misses can dominate the cost of an index traversal. Yet it is inefficient in both time and space to store even moderate sized keys directly in index nodes. In this paper, we investigate the performance of tree structures suitable for OLTP workloads in the face of expensive cache misses and non-trivial key sizes. We propose two index structures, pkT-trees and pkB-trees, which significantly reduce cache misses by storing partial-key information in the index. We show that a small, fixed amount of key information allows most cache misses to be avoided, allowing for a simple node structure and efficient implementation. Finally, we study the performance and cache behavior of partial-key trees by comparing them with other main-memory tree structures for a wide variety of key sizes and key value distributions.
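The core idea, keeping a few key bytes inline in the index entry so most comparisons never follow the pointer to the full key, can be sketched as follows; the partial-key length and structure names are illustrative rather than the paper's exact pkB-tree layout.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstring>
#include <string>

// Sketch of partial-key comparison: each index entry stores the first few key
// bytes inline plus a pointer to the full, indirectly stored key. Comparisons
// decided by the inline prefix never dereference the pointer and so avoid the
// extra cache miss.
constexpr std::size_t kPartialBytes = 4;

struct IndexEntry {
    std::array<char, kPartialBytes> partial{};  // zero-padded key prefix
    const std::string* full_key = nullptr;      // indirect full key (cold memory)
};

// The caller must keep `key` alive for as long as the entry is used.
IndexEntry makeEntry(const std::string& key) {
    IndexEntry e;
    std::memcpy(e.partial.data(), key.data(), std::min(key.size(), kPartialBytes));
    e.full_key = &key;
    return e;
}

// Returns <0, 0, >0 like strcmp; only prefix ties touch the full key.
int compareToSearchKey(const IndexEntry& e, const std::string& search) {
    std::array<char, kPartialBytes> probe{};
    std::memcpy(probe.data(), search.data(), std::min(search.size(), kPartialBytes));
    if (int c = std::memcmp(e.partial.data(), probe.data(), kPartialBytes); c != 0)
        return c;                               // decided without the pointer
    return e.full_key->compare(search);         // rare slow path: fetch full key
}
```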

Proceedings ArticleDOI
01 Dec 2001
TL;DR: Support for tag-unchecked loads and stores, which save the energy of a tag check when the compiler can guarantee an access will be to the same line as an earlier access, is added to C and Java compilers.
Abstract: A direct addressed cache is a hardware-software design for an energy-efficient microprocessor data cache. Direct addressing allows software to access cache data without a hardware cache tag check. These tag-unchecked loads and stores save the energy of a tag check when the compiler can guarantee an access will be to the same line as an earlier access. We have added support for tag-unchecked loads and stores to C and Java compilers. For Mediabench C programs, the compiler eliminates 16-76% of data cache tag accesses, with half of the benchmarks avoiding over 40% of the data tag checks. For SPECjvm98 Java programs, the compiler eliminates 18-63% of data cache tag checks. These tag check reductions translate into data cache energy savings of 9-40%, and overall processor and cache energy savings of 2-8%.

Journal ArticleDOI
TL;DR: This paper presents a cache architecture to convert a cache into a computing unit for either of the following two structured computations: finite impulse response and discrete/inverse discrete cosine transform, and includes additional logic to embed multibit output lookup tables into the cache structure.
Abstract: A considerable portion of a microprocessor chip is dedicated to cache memory. However, not all applications need all the cache storage all the time, especially the computing bandwidth-limited applications. In addition, some applications have large embedded computations with a regular structure. Such applications may be able to use additional computing resources. If the unused portion of the cache could serve these computation needs, the on-chip resources would be utilized more efficiently. This presents an opportunity to explore the reconfiguration of a part of the cache memory for computing. Thus, we propose adaptive balanced computing (ABC): dynamic resource configuration on demand from the application, between memory and computing resources. In this paper, we present a cache architecture to convert a cache into a computing unit for either of the following two structured computations: finite impulse response and discrete/inverse discrete cosine transform. In order to convert a cache memory to a function unit, we include additional logic to embed multibit output lookup tables into the cache structure. The experimental results show that the reconfigurable module improves the execution time of applications with a large number of data elements by factors as high as 50 and 60.

Patent
Yuanlong Wang, Zong Yu, Xiaofan Wei, Earl T. Cohen, Brian R. Baird, Daniel Fu
10 Aug 2001
TL;DR: In this paper, the authors present the Transaction Bus of a symmetric multiprocessor system, which is implemented using segmented buses, distributed muxes, point-to-point wiring, and supports transaction processing at a rate of one transaction per clock cycle.
Abstract: A preferred embodiment of a symmetric multiprocessor system includes a switched fabric (switch matrix) for data transfers that provides multiple concurrent buses that enable greatly increased bandwidth between processors and shared memory. A Transaction Controller, Transaction Bus, and Transaction Status Bus are used for serialization, centralized cache control, and highly pipelined address transfers. The shared Transaction Controller serializes transaction requests from Initiator devices that can include CPU/Cache modules and Peripheral Bus modules. The Transaction Bus of an illustrative embodiment is implemented using segmented buses, distributed muxes, and point-to-point wiring, and supports transaction processing at a rate of one transaction per clock cycle. The Transaction Controller monitors the Transaction Bus, maintains a set of duplicate cache-tags for all CPU/Cache modules, maps addresses to Target devices, performs centralized cache control for all CPU/Cache modules, filters unnecessary cache transactions, and routes necessary transactions to Target devices over the Transaction Status Bus. The Transaction Status Bus includes both bus-based and point-to-point control of the target devices. A modified rotating priority scheme is used to provide starvation-free support for locked buses and memory resources via backoff operations. Speculative memory operations are supported to further enhance performance.

Patent
17 Aug 2001
TL;DR: In this article, the cache is reconfigured in response to an operation command (1314), such that each tag in the array of tags that contains a specified qualifier value is modified in accordance with the operation command.
Abstract: A digital system is provided with several processors (1302), a shared level-two (L2) cache (1300) having several segments per entry with associated tags, and a level-three (L3) physical memory. Each tag entry includes a task-ID qualifier field and a resource-ID qualifier field. Data is loaded into various lines in the cache in response to cache access requests when a given cache access request misses. After loading data into the cache in response to a miss, a tag associated with the data line is set to a valid state. In addition to setting a tag to a valid state, qualifier values are stored in qualifier fields in the tag. Each qualifier value specifies a usage characteristic of data stored in an associated data line of the cache, such as a task ID. A miss counter (532) counts each miss and a monitoring task (1311) determines a miss rate for memory requests. If a selected miss rate threshold value is exceeded, the digital system is reconfigured in order to reduce the miss rate. The cache is reconfigured in response to an operation command (1314), such that each tag in the array of tags that contains a specified qualifier value is modified in accordance with the operation command. Other types of reconfiguration can be performed, such as remapping a selected program portion to operate in a different address range, locking a portion of the data entries within the cache, or defining addresses corresponding to a selected program task as uncacheable, for example.
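One of the operation commands described above, invalidating every line whose tag carries a given task-ID qualifier, reduces to a walk over the tag array. A simplified sketch, with field widths and names chosen for illustration rather than taken from the patent:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a qualifier-aware tag array: each tag carries a task-ID and a
// resource-ID qualifier alongside the usual address tag and valid bit, so an
// operation command can invalidate (or flush) all lines belonging to one task.
struct QualifiedTag {
    std::uint64_t addr_tag = 0;
    std::uint16_t task_id = 0;
    std::uint8_t resource_id = 0;
    bool valid = false;
    bool dirty = false;
};

class QualifiedTagArray {
public:
    explicit QualifiedTagArray(std::size_t lines) : tags_(lines) {}

    // Operation command: invalidate every line tagged with task_id.
    // Dirty lines would be written back first in a real controller.
    std::size_t invalidateTask(std::uint16_t task_id) {
        std::size_t count = 0;
        for (QualifiedTag& t : tags_) {
            if (t.valid && t.task_id == task_id) {
                // writeback(t) would go here when t.dirty is set
                t.valid = false;
                ++count;
            }
        }
        return count;   // e.g., reported back to the monitoring task
    }

private:
    std::vector<QualifiedTag> tags_;
};
```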

Journal ArticleDOI
TL;DR: This paper considers how to effectively use a bank-exposed memory system comprised of small, decentralized cache banks for sequential programs and demonstrates that using bank disambiguation improves performance, by a factor of 3 to 5 over using ILP alone.
Abstract: Technological trends require that future scalable microprocessors be decentralized. Applying these trends toward memory systems shows that the size of the cache accessible in a single cycle will decrease in a future generation of chips. Thus, a bank-exposed memory system comprised of small, decentralized cache banks must eventually replace that of a monolithic cache. This paper considers how to effectively use such a memory system for sequential programs. This paper presents Maps, the software technology central to bank-exposed architectures, which are architectures with bank-exposed memory systems. Maps solves the problem of bank disambiguation-that of determining at compile-time which bank a memory reference is accessing. Bank disambiguation is important because it enables the compile-time optimization for data locality, where data can be placed close to the computation that requires it. Two methods for bank disambiguation are presented: equivalence-class unification and modulo unrolling. Experimental results are presented using a compiler for the MIT Raw machine, a bank-exposed architecture that relies on the compiler to 1) manage its memory and 2) orchestrate its instruction level parallelism and communication. Results on Raw using sequential codes demonstrate that using bank disambiguation improves performance, by a factor of 3 to 5 over using ILP alone.
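Modulo unrolling has a simple source-level illustration: if an array is low-order interleaved across the banks, unrolling a loop by the bank count makes each static reference in the body land in one statically known bank, which is exactly the disambiguation the compiler needs. The bank count and loop below are illustrative, not taken from the Raw compiler.

```cpp
// Illustration of modulo unrolling for bank disambiguation. Assume a[] is
// low-order interleaved across 4 banks, i.e. element i lives in bank i % 4.
constexpr int kNumBanks = 4;

// Before: which bank a[i] touches depends on the runtime value of i, so the
// compiler cannot tell at compile time which bank each reference accesses.
void scaleOriginal(float* a, int n, float s) {
    for (int i = 0; i < n; ++i)
        a[i] *= s;
}

// After modulo unrolling by the number of banks: within the unrolled body,
// a[i+0] is always in bank 0, a[i+1] in bank 1, and so on, so every static
// memory reference is disambiguated to a single bank.
void scaleModuloUnrolled(float* a, int n, float s) {
    int i = 0;
    for (; i + kNumBanks <= n; i += kNumBanks) {
        a[i + 0] *= s;   // bank 0 (i stays a multiple of kNumBanks here)
        a[i + 1] *= s;   // bank 1
        a[i + 2] *= s;   // bank 2
        a[i + 3] *= s;   // bank 3
    }
    for (; i < n; ++i)   // epilogue for the leftover elements
        a[i] *= s;
}
```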