
Showing papers on "Cache pollution published in 2001"


Proceedings ArticleDOI
01 May 2001
TL;DR: This paper discusses policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused, and proposes adaptive policies that effectively reduce L1 cache leakage energy by 5x for the SPEC2000 with only negligible degradations in performance.
Abstract: Power dissipation is increasingly important in CPUs ranging from those intended for mobile use, all the way up to high-performance processors for high-end servers. While the bulk of the power dissipated is dynamic switching power, leakage power is also beginning to be a concern. Chipmakers expect that in future chip generations, leakage's proportion of total chip power will increase significantly. This paper examines methods for reducing leakage power within the cache memories of the CPU. Because caches comprise much of a CPU chip's area and transistor counts, they are reasonable targets for attacking leakage. We discuss policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused. In particular, our approach is targeted at the generational nature of cache line usage. That is, cache lines typically have a flurry of frequent use when first brought into the cache, and then have a period of “dead time” before they are evicted. By devising effective, low-power ways of deducing dead time, our results show that in many cases we can reduce L1 cache leakage energy by 4x in SPEC2000 applications without impacting performance. Because our decay-based techniques have notions of competitive on-line algorithms at their roots, their energy usage can be theoretically bounded at within a factor of two of the optimal oracle-based policy. We also examine adaptive decay-based policies that make energy-minimizing policy choices on a per-application basis by choosing appropriate decay intervals individually for each cache line. Our proposed adaptive policies effectively reduce L1 cache leakage energy by 5x for the SPEC2000 with only negligible degradations in performance.
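A minimal software sketch of the decay mechanism described above, assuming a single fixed decay interval rather than the adaptive per-line intervals the paper proposes; the class and field names (DecayCache, idle_ticks) and the direct-mapped, 64-byte-line geometry are illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of cache-line decay: every access resets a per-line idle counter;
// lines whose counter exceeds the decay interval are invalidated so their
// storage can notionally be gated off to save leakage. Direct-mapped, 64-byte
// lines, single global decay interval -- all illustrative simplifications.
struct DecayLine {
    std::uint64_t tag = 0;
    bool valid = false;
    std::uint32_t idle_ticks = 0;   // driven by a coarse global tick
};

class DecayCache {
public:
    DecayCache(std::size_t num_lines, std::uint32_t decay_interval)
        : lines_(num_lines), decay_interval_(decay_interval) {}

    bool access(std::uint64_t addr) {
        DecayLine& l = lines_[index(addr)];
        bool hit = l.valid && l.tag == tagOf(addr);
        l.tag = tagOf(addr);
        l.valid = true;
        l.idle_ticks = 0;           // any touch resets the decay counter
        return hit;
    }

    // Called periodically (e.g., every few thousand cycles) to decay idle lines.
    void tick() {
        for (DecayLine& l : lines_) {
            if (l.valid && ++l.idle_ticks > decay_interval_) {
                l.valid = false;    // "turned off": the next access will miss
                l.idle_ticks = 0;
            }
        }
    }

private:
    std::size_t index(std::uint64_t addr) const { return (addr / 64) % lines_.size(); }
    std::uint64_t tagOf(std::uint64_t addr) const { return addr / (64 * lines_.size()); }

    std::vector<DecayLine> lines_;
    std::uint32_t decay_interval_;
};
```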

725 citations


Proceedings Article
11 Sep 2001
TL;DR: This paper proposes a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page, and demonstrates that in-page data placement is the key to high cache performance.
Abstract: Relational database systems have traditionally optimized for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance are becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), which significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results, when compared to NSM (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM’s stall time due to data cache accesses, (b) range selection queries and updates on memory-resident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster.
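The layout difference is easy to show in miniature: the sketch below contrasts an NSM-style page of whole records with a PAX-style page that groups each attribute's values, so a scan over one attribute touches contiguous memory. The record fields and page capacity are invented for illustration.

```cpp
#include <array>
#include <cstdint>

constexpr int kRecordsPerPage = 256;   // illustrative page capacity

// NSM-style layout: whole records stored one after another. Scanning a single
// attribute strides over unrelated fields and drags them into the cache.
struct NsmRecord {
    std::int32_t id;
    std::int32_t quantity;
    double price;
};
using NsmPage = std::array<NsmRecord, kRecordsPerPage>;

// PAX-style layout: the same records, but each attribute's values are grouped
// into their own "minipage" inside the page, so a single-attribute scan reads
// contiguous, densely packed values.
struct PaxPage {
    std::array<std::int32_t, kRecordsPerPage> id;
    std::array<std::int32_t, kRecordsPerPage> quantity;
    std::array<double, kRecordsPerPage> price;
};

// Same logical work, very different cache behavior on the price column.
double sumPricesNsm(const NsmPage& p) {
    double s = 0;
    for (const NsmRecord& r : p) s += r.price;   // strided access
    return s;
}

double sumPricesPax(const PaxPage& p) {
    double s = 0;
    for (double v : p.price) s += v;             // contiguous access
    return s;
}
```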

428 citations


Proceedings ArticleDOI
01 Dec 2001
TL;DR: Two previously-proposed techniques, way-prediction and selective direct-mapping, are applied to reducing L1 cache dynamic energy while maintaining high performance, and caches achieve the energy-delay of sequential access while maintaining the performance of parallel access.
Abstract: Set-associative caches achieve low miss rates for typical applications but result in significant energy dissipation. Set-associative caches minimize access time by probing all the data ways in parallel with the tag lookup, although the output of only the matching way is used. The energy spent accessing the other ways is wasted. Eliminating the wasted energy by performing the data lookup sequentially following the tag lookup substantially increases cache access time, and is unacceptable for high-performance L1 caches. In this paper, we apply two previously-proposed techniques, way-prediction and selective direct-mapping, to reducing L1 cache dynamic energy while maintaining high performance. The techniques predict the matching way and probe only the predicted way and not all the ways, achieving energy savings. While these techniques were originally proposed to improve set-associative cache access times, this is the first paper to apply them to reducing cache energy. We evaluate the effectiveness of these techniques in reducing L1 d-cache, L1 i-cache, and overall processor energy. Using these techniques, our caches achieve the energy-delay of sequential access while maintaining the performance of parallel access. Relative to parallel access L1 i- and d-caches, the techniques achieve overall processor energy-delay reduction of 8%, while perfect way-prediction with no performance degradation achieves 10% reduction. The performance degradation of the techniques is less than 3%, compared to an aggressive, 1-cycle, 4-way, parallel access cache.
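In pseudocode terms, the access path is: probe only the predicted way first and fall back to the remaining ways on a misprediction. The sketch below uses a trivial per-set "last matching way" predictor, which is only a stand-in for the paper's way-prediction and selective direct-mapping mechanisms; sizes and names are illustrative.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of way-prediction on a 4-way set-associative cache: probe only the
// predicted way first and pay extra probes (time and energy) only on a
// misprediction.
constexpr int kWays = 4;
constexpr int kSets = 64;

struct Way { std::uint64_t tag = 0; bool valid = false; };

struct Set {
    std::array<Way, kWays> ways;
    int predicted_way = 0;          // trivially trained: last way that matched
};

class WayPredictedCache {
public:
    // Returns the number of ways probed (1 when the prediction is correct).
    int access(std::uint64_t addr, bool& hit) {
        Set& s = sets_[setIndex(addr)];
        const std::uint64_t tag = tagOf(addr);

        int probes = 1;                       // first probe: predicted way only
        Way& pred = s.ways[s.predicted_way];
        if (pred.valid && pred.tag == tag) { hit = true; return probes; }

        for (int w = 0; w < kWays; ++w) {     // mispredict: probe the rest
            if (w == s.predicted_way) continue;
            ++probes;
            if (s.ways[w].valid && s.ways[w].tag == tag) {
                s.predicted_way = w;          // retrain the predictor
                hit = true;
                return probes;
            }
        }
        hit = false;                          // miss: fill the predicted way
        s.ways[s.predicted_way] = Way{tag, true};
        return probes;
    }

private:
    std::size_t setIndex(std::uint64_t addr) const { return (addr / 64) % kSets; }
    std::uint64_t tagOf(std::uint64_t addr) const { return addr / (64 * kSets); }
    std::array<Set, kSets> sets_;
};
```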

310 citations


Patent
08 Jun 2001
TL;DR: The PIRANHA system as discussed by the authors is a chip-multiprocessing system with a scalable architecture, including on a single chip: a plurality of processor cores; a two-level cache hierarchy; an intra-chip switch; one or more memory controllers; a cache coherence protocol; and an interconnect subsystem.
Abstract: A chip-multiprocessing system with scalable architecture, including on a single chip: a plurality of processor cores; a two-level cache hierarchy; an intra-chip switch; one or more memory controllers; a cache coherence protocol; one or more coherence protocol engines; and an interconnect subsystem. The two-level cache hierarchy includes first-level and second-level caches. In particular, the first-level caches include a pair of instruction and data caches for, and private to, each processor core. The second-level cache has a relaxed inclusion property, the second-level cache being logically shared by the plurality of processor cores. Each of the plurality of processor cores is capable of executing an instruction set of the ALPHA™ processing core. The scalable architecture of the chip-multiprocessing system is targeted at parallel commercial workloads. A showcase example of the chip-multiprocessing system, called the PIRANHA™ system, is a highly integrated processing node with eight simpler ALPHA™ processor cores. A method for scalable chip-multiprocessing is also provided.

294 citations


Patent
26 Apr 2001
TL;DR: In this article, the authors present information object repository selection procedures for determining which of a number of information object repositories should service a request for the information object, including a direct cache selection process, a redirect cache selection process, a remote DNS cache selection process, or a local DNS cache selection process.
Abstract: Various information object repository selection procedures for determining which of a number of information object repositories should service a request for the information object include a direct cache selection process, a redirect cache selection process, a remote DNS cache selection process, or a local DNS cache selection process. Different combinations of these procedures may also be used. For example, different combinations may be used depending on the type of content being requested. The direct cache selection process may be used for information objects that will be immediately loaded without user action, while any of the redirect cache selection process, the remote DNS cache selection process and/or the local DNS cache selection process may be used for information objects that will be loaded only after some user action.

231 citations


Patent
13 Aug 2001
TL;DR: In this article, a content analysis engine determines which of the caches a data item should be stored in, based on an analysis of data requests or data items served in response to the requests, guidelines set by a system administrator, etc.
Abstract: A multi-tier caching system and method of operating the same. The system comprises a first cache implemented in operating system or kernel space (e.g., in memory managed by or allocated to an operating system) and a second cache implemented in application or user space (e.g., in memory managed by or allocated to an application program). Data requests requiring little processing to identify responsive data may be served from the first cache, while those requiring further processing are served from the second. The first cache may therefore store frequently requested data items or items that can be served in response to requests having different forms, qualifiers or other indicia. A content analysis engine determines which of the caches a data item should be stored in, based on an analysis of data requests or data items served in response to the requests, guidelines set by a system administrator, etc.
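A hedged sketch of the two-tier lookup described above, with both tiers modeled as in-memory maps; the normalize() step and the formStable flag stand in for the patent's content analysis engine and are assumptions for illustration, not its actual interfaces.

```cpp
#include <cctype>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative two-tier cache: tier 1 stands in for the kernel-space cache
// that serves exact-match requests with no interpretation; tier 2 stands in
// for the application-space cache that can serve requests needing further
// processing. normalize() and formStable are assumptions, not the patent's API.
class TwoTierCache {
public:
    std::optional<std::string> lookup(const std::string& request) {
        // Tier 1: exact match on the raw request, cheapest path.
        if (auto it = tier1_.find(request); it != tier1_.end()) return it->second;

        // Tier 2: requests that must be normalized (qualifiers stripped,
        // parameters reordered, ...) before a cache key can be formed.
        if (auto it = tier2_.find(normalize(request)); it != tier2_.end())
            return it->second;

        return std::nullopt;   // miss in both tiers
    }

    // Content-analysis decision: items whose requests are stable in form go to
    // tier 1; items reachable under many request forms go to tier 2.
    void store(const std::string& request, const std::string& data, bool formStable) {
        if (formStable) tier1_[request] = data;
        else            tier2_[normalize(request)] = data;
    }

private:
    static std::string normalize(const std::string& request) {
        std::string key = request;       // placeholder canonicalization
        for (char& c : key)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return key;
    }

    std::unordered_map<std::string, std::string> tier1_;  // "kernel-space" tier
    std::unordered_map<std::string, std::string> tier2_;  // "user-space" tier
};
```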

220 citations


Proceedings ArticleDOI
19 Jan 2001
TL;DR: It is shown that even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half of its time stalling for L2 misses.
Abstract: In this paper we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half of its time stalling for L2 misses. Large cache blocks can improve performance, but only when coupled with wide memory channels. DRAM address mappings also affect performance significantly. We evaluate an aggressive prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 43% speedup across 10 of the 26 SPEC2000 benchmarks, without degrading performance on the others. With eight Rambus channels, these ten benchmarks improve to within 10% of the performance of a perfect L2 cache.

213 citations


Journal ArticleDOI
TL;DR: The proposed Speculative Versioning Cache uses distributed caches to eliminate the latency and bandwidth problems of the ARB and conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches.
Abstract: Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple speculative stores to a memory location create multiple speculative versions of the location. Program order among the speculative versions must be tracked to maintain sequential semantics. A previously proposed approach, the Address Resolution Buffer (ARB) uses a centralized buffer to support speculative versions. Our proposal, called the Speculative Versioning Cache (SVC), uses distributed caches to eliminate the latency and bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and speculative versioning by using an organization similar to snooping bus-based coherent caches. Our evaluation for the Multiscalar architecture shows that hit latency is an important factor affecting performance and that private cache solutions trade off hit rate for hit latency.

167 citations


Patent
01 Mar 2001
TL;DR: In this article, a garbage collector that uses an LRU algorithm and a node table to free memory from an XML DOM tree active in an application cache is described, removing least recently used nodes until the cache falls below a memory threshold.
Abstract: The present invention relates to a garbage collector that uses an LRU algorithm to free memory from an XML DOM tree active in an application cache. According to one or more embodiments of the present invention, a threshold for the amount of memory permitted to reside in an application cache is set. Then, a garbage collector removes entries from the cache until it falls below the threshold. In one or more embodiments, a node table is used. When nodes are added to the XML DOM tree in the application cache, the node table is updated. When the threshold for the amount of memory permitted to reside in the application cache is exceeded, the garbage collector applies an LRU algorithm that uses the node table to determine which nodes to remove from the application cache. In one embodiment, the LRU algorithm scans the node table to determine the least recently used node in the table by examining time stamp entries in the table. Then, the algorithm removes that node and repeats the process until the XML DOM tree uses less memory in the cache than the threshold.
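The eviction loop is simple enough to sketch: a node table maps each cached node to a last-access timestamp and an approximate size, and collection removes the least recently used node until memory use drops under the threshold. Identifiers and the size accounting below are illustrative, not the patent's.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Sketch of the LRU garbage collector: the "node table" maps a node id to its
// last-access timestamp and approximate memory footprint; collect() evicts
// least-recently-used nodes until usage falls under the threshold.
class DomCacheGC {
public:
    explicit DomCacheGC(std::size_t threshold_bytes) : threshold_(threshold_bytes) {}

    void onNodeAdded(std::uint64_t node_id, std::size_t bytes) {
        table_[node_id] = {++clock_, bytes};
        used_ += bytes;
        if (used_ > threshold_) collect();
    }

    void onNodeAccessed(std::uint64_t node_id) {
        auto it = table_.find(node_id);
        if (it != table_.end()) it->second.last_access = ++clock_;
    }

private:
    struct Entry { std::uint64_t last_access; std::size_t bytes; };

    void collect() {
        // Scan for the oldest timestamp and evict; repeat until the cache is
        // back under the threshold (linear scan, as in a simple timestamp LRU).
        while (used_ > threshold_ && !table_.empty()) {
            auto victim = table_.begin();
            for (auto it = table_.begin(); it != table_.end(); ++it)
                if (it->second.last_access < victim->second.last_access)
                    victim = it;
            used_ -= victim->second.bytes;
            // A real implementation would detach the node from the DOM tree here.
            table_.erase(victim);
        }
    }

    std::unordered_map<std::uint64_t, Entry> table_;
    std::size_t used_ = 0;
    std::size_t threshold_;
    std::uint64_t clock_ = 0;
};
```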

154 citations


Proceedings ArticleDOI
17 Jun 2001
TL;DR: In this paper, an analytical cache model for time-shared systems is presented, which estimates the overall cache miss-rate of a multiprocessing system with any cache size and time quanta.
Abstract: An accurate, tractable, analytic cache model for time-shared systems is presented, which estimates the overall cache miss-rate of a multiprocessing system with any cache size and time quanta. The input to the model consists of the isolated miss-rate curves for each process, the time quanta for each of the executing processes, and the total cache size. The output is the overall miss-rate. Trace-driven simulations demonstrate that the estimated miss-rate is very accurate. Since the model provides a fast and accurate way to estimate the effect of context switching, it is useful for both understanding the effect of context switching on caches and optimizing cache performance for time-shared systems. A cache partitioning mechanism is also presented and is shown to improve the cache miss-rate up to 25% over the normal LRU replacement policy.
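As a rough illustration only (the paper derives the per-process cache share analytically; the proportional split below is a placeholder assumption), the model's inputs can be combined into an overall miss-rate estimate along these lines:

```cpp
#include <cstddef>
#include <vector>

// Deliberately simplified stand-in for the analytical model: each process is
// assigned a cache share (here, proportional to its time quantum), its miss
// rate is read off its isolated miss-rate curve at that share, and the overall
// miss rate is the reference-weighted average. The paper's model estimates the
// per-process share far more carefully; this only shows how the inputs combine.
struct Process {
    std::vector<double> miss_rate_curve;  // isolated miss rate vs. cache size (lines)
    double time_quantum = 0;              // scheduling quantum of the process
    double references = 0;                // memory references issued per quantum
};

double estimateOverallMissRate(const std::vector<Process>& procs,
                               std::size_t total_cache_lines) {
    double total_quanta = 0;
    for (const Process& p : procs) total_quanta += p.time_quantum;
    if (total_quanta == 0) return 0.0;

    double weighted_misses = 0, total_refs = 0;
    for (const Process& p : procs) {
        if (p.miss_rate_curve.empty()) continue;
        // Naive share of the cache: proportional to the process's time quantum.
        std::size_t share = static_cast<std::size_t>(
            total_cache_lines * (p.time_quantum / total_quanta));
        if (share >= p.miss_rate_curve.size())
            share = p.miss_rate_curve.size() - 1;
        weighted_misses += p.references * p.miss_rate_curve[share];
        total_refs += p.references;
    }
    return total_refs > 0 ? weighted_misses / total_refs : 0.0;
}
```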

130 citations


Patent
08 Jun 2001
TL;DR: In this article, a method and system for exclusive two-level caching in a chip-multiprocessor is presented to maximize the effective use of on-chip cache.
Abstract: To maximize the effective use of on-chip cache, a method and system for exclusive two-level caching in a chip-multiprocessor are provided. The exclusive two-level caching in accordance with the present invention involves relaxing the inclusion requirement in a two-level cache system in order to form an exclusive cache hierarchy. Additionally, the exclusive two-level caching involves providing a first-level tag-state structure in a first-level cache of the two-level cache system. The first tag-state structure has state information. The exclusive two-level caching also involves maintaining in a second-level cache of the two-level cache system a duplicate of the first-level tag-state structure and extending the state information in the duplicate of the first tag-state structure, but not in the first-level tag-state structure itself, to include an owner indication. The exclusive two-level caching further involves providing in the second-level cache a second tag-state structure so that a simultaneous lookup at the duplicate of the first tag-state structure and the second tag-state structure is possible. Moreover, the exclusive two-level caching involves associating a single owner with a cache line at any given time of its lifetime in the chip-multiprocessor.

Patent
27 Jun 2001
TL;DR: In this article, a system and method to reduce the time for system initializations is described, where data accessed during a system initialization is loaded into a non-volatile cache and is pinned to prevent eviction.
Abstract: A system and method to reduce the time for system initializations is disclosed. In accordance with the invention, data accessed during a system initialization is loaded into a non-volatile cache and is pinned to prevent eviction. By pinning data into the cache, the data required for system initialization is pre-loaded into the cache on a system reboot, thereby eliminating the need to access a disk.

Patent
25 Jan 2001
TL;DR: In this paper, a system for adaptively bypassing one or more higher cache levels following a miss in a lower level of a cache hierarchy is described, where each cache level preferably includes a tag store containing address and state information for each cache line resident in the respective cache.
Abstract: A system for adaptively bypassing one or more higher cache levels following a miss in a lower level of a cache hierarchy is described. Each cache level preferably includes a tag store containing address and state information for each cache line resident in the respective cache. When an invalidate request is received at a given cache hierarchy, each cache level is searched for the address specified by the invalidate request. When an address match is detected, the state of the respective cache line is changed to the invalid state, although the address of the cache line is left in the tag store. Thereafter, if the processor or entity associated with this cache hierarchy issues its own request for this same cache line, the cache hierarchy begins searching the tag store of each level starting with the lowest cache level. Since the address of the invalidated cache line was left in the respective tag store, a match will be detected at one of the cache levels, although the corresponding state of this cache line is invalid. This condition is specifically detected and is considered to be an “inval_miss” occurrence. In response to an inval_miss, the cache hierarchy calls off searching any higher levels, and instead, issues a memory reference request for the desired cache line. In a further embodiment, the entity that sourced an invalidate request is stored, and a subsequent memory reference request for the same cache line is sent directly to the source entity.
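The key case is an address match whose state is invalid, the "inval_miss": it signals a recent invalidation, so searching higher levels is pointless and a memory request is issued instead. A simplified sketch of that search, with each tag store reduced to a map; the identifiers are illustrative.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of the adaptive-bypass lookup: search tag stores from the lowest
// cache level upward; a tag match in the Invalid state ("inval_miss") means
// the line was invalidated by another agent, so the hierarchy stops searching
// higher levels and issues a memory request directly.
enum class State { Invalid, Shared, Exclusive, Modified };

struct CacheLevel {
    // Tag store: line address -> coherence state (the address is kept in the
    // tag store even after the line has been invalidated).
    std::unordered_map<std::uint64_t, State> tags;
};

enum class LookupResult { Hit, InvalMiss, TrueMiss };

LookupResult lookupHierarchy(std::vector<CacheLevel>& levels, std::uint64_t line_addr) {
    for (CacheLevel& level : levels) {                // lowest level first
        auto it = level.tags.find(line_addr);
        if (it == level.tags.end()) continue;         // no address match: go up a level
        if (it->second != State::Invalid) return LookupResult::Hit;
        // Address present but invalid: the line was invalidated recently, so
        // higher levels are bypassed and a memory reference is issued instead.
        return LookupResult::InvalMiss;
    }
    return LookupResult::TrueMiss;                    // missed every level
}
```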

Patent
08 Aug 2001
TL;DR: In this paper, the cache system determines that an object, such as an image file, is missing from the cache memory, locates sufficient components from cache memory and/or external storage, and constructs the object from the located components.
Abstract: Methods and apparatus for constructing objects within a cache system thereby allowing the cache system to respond to requested objects that are not initially available within the cache system. One embodiment of the invention caches image files, where the images are divided into components and stored in a format that allows identification and access to the components. The cache system determines that an object, such as an image file, is missing from the cache memory, locates sufficient components from the cache memory and/or external storage, and constructs the object from the located components.

Journal ArticleDOI
TL;DR: This study shows that the two proposed schemes are not only effective in salvaging the cache content but consume significantly less energy than their counterparts.
Abstract: Caching can reduce the bandwidth requirement in a wireless computing environment as well as minimize the energy consumption of wireless portable computers. To facilitate mobile clients in ascertaining the validity of their cache content, servers periodically broadcast cache invalidation reports that contain information of data that has been updated. However, as mobile clients may operate in a doze or even totally disconnected mode (to conserve energy), it is possible that some reports may be missed and the clients are forced to discard the entire cache content. In this paper, we reexamine the issue of designing cache invalidation strategies. We identify the basic issues in designing cache invalidation strategies. From the solutions to these issues, a large set of cache invalidation schemes can be constructed. We evaluate the performance of four representative algorithms-two of which are known algorithms (i.e., Dual-Report Cache Invalidation and Bit-Sequences) while the other two are their counterparts that exploit selective tuning (namely, Selective Dual-Report Cache Invalidation and Bit-Sequences with Bit Count). Our study shows that the two proposed schemes are not only effective in salvaging the cache content but consume significantly less energy than their counterparts. While the Selective Dual-Report Cache Invalidation scheme performs best in most cases, it is inferior to the Bit-Sequences with the Bit-Count scheme under high update rates.

Patent
07 Jun 2001
TL;DR: In this article, a proxy partition cache (PPC) architecture and a technique for address-partitioning a proxy cache consisting of a grouping of discrete, cooperating caches (servers) is provided.
Abstract: A proxy partition cache (PPC) architecture and a technique for address-partitioning a proxy cache consisting of a grouping of discrete, cooperating caches (servers) is provided. Client requests for objects (files) of a given size are redirected or reassigned to a single cache in the grouping, notwithstanding the cache to which the request is made by the load-balancing mechanism (such as a Layer 4 switch) based upon load-balancing considerations. The file is then returned to the switch via the switch-designated cache for vending to the requesting client. The redirection/reassignment occurs according to a function within the cache to which the request is directed so that the switch remains freed from additional tasks that can compromise speed.

Patent
31 Oct 2001
TL;DR: In this article, a cache memory system can determine that an entry is stale if the entry has not been accessed or modified for a predetermined time, and the predetermined time is made dynamically variable.
Abstract: A cache memory system can determine that an entry is stale if the entry has not been accessed or modified for a predetermined time. If an entry is stale, the entry may be preemptively evicted. The predetermined time is made dynamically variable. A computer system can adjust the time to optimize a measure of performance. In a specific example, evicted lines are temporarily stored in an eviction queue. The time is adjusted to be as short as possible without substantially increasing the number of lines that must be recalled from the eviction queue.
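A sketch of the feedback loop this describes: lines idle past a staleness threshold are preemptively evicted into a small eviction queue, and the threshold is tuned so that recalls from the queue stay rare. The specific grow/shrink rule below is an assumption for illustration, not the patent's algorithm.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

// Sketch of preemptive eviction of stale lines with a dynamically tuned
// staleness threshold. Evicted lines park in an eviction queue; if they are
// recalled too often, the threshold grows, otherwise it slowly shrinks.
class StaleEvictingCache {
public:
    void access(std::uint64_t line, std::uint64_t now) {
        if (cache_.count(line)) { cache_[line] = now; return; }      // hit
        // Check the eviction queue: a recall means we evicted too eagerly.
        for (auto it = queue_.begin(); it != queue_.end(); ++it) {
            if (*it == line) {
                queue_.erase(it);
                ++recalls_;
                threshold_ *= 2;                     // be less aggressive
                break;
            }
        }
        cache_[line] = now;                                          // (re)fill
    }

    // Periodic sweep: evict lines idle longer than the current threshold.
    void sweep(std::uint64_t now) {
        for (auto it = cache_.begin(); it != cache_.end();) {
            if (now - it->second > threshold_) {
                queue_.push_back(it->first);          // park, don't discard yet
                if (queue_.size() > kQueueCap) queue_.pop_front();
                it = cache_.erase(it);
            } else {
                ++it;
            }
        }
        // Gentle pressure toward a shorter threshold when recalls are rare.
        if (recalls_ == 0 && threshold_ > kMinThreshold) threshold_ -= 1;
        recalls_ = 0;
    }

private:
    static constexpr std::size_t kQueueCap = 64;
    static constexpr std::uint64_t kMinThreshold = 16;

    std::unordered_map<std::uint64_t, std::uint64_t> cache_;  // line -> last access
    std::deque<std::uint64_t> queue_;                         // recently evicted lines
    std::uint64_t threshold_ = 1024;                          // staleness limit (ticks)
    unsigned recalls_ = 0;
};
```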

Patent
11 Jun 2001
TL;DR: In this article, the authors propose a cache coherence protocol for a plurality of processor nodes and input/output nodes, where each processor node includes a multiplicity of processor cores, an interface to a local memory system and a protocol engine.
Abstract: A computer system has a plurality of processor nodes and a plurality of input/output nodes. Each processor node includes a multiplicity of processor cores, an interface to a local memory system and a protocol engine implementing a predefined cache coherence protocol. Each processor core has an associated memory cache for caching memory lines of information. Each input/output node includes no processor cores, an input/output interface for interfacing to an input/output bus or input/output device, a memory cache for caching memory lines of information and an interface to a local memory subsystem. The local memory subsystem of each processor node and input/output node stores a multiplicity of memory lines of information. The protocol engine of each processor node and input/output node implements the same predefined cache coherence protocol.

Patent
Masayoshi Kobayashi
25 Jul 2001
TL;DR: In this article, a path calculating section obtains a path suitable for carrying out an automatic cache updating operation, a link prefetching operation, and a cache server cooperating operation, based on QoS path information that includes network path information and path load information obtained by a path information obtaining section.
Abstract: A path calculating section obtains a path suitable for carrying out an automatic cache updating operation, a link prefetching operation, and a cache server cooperating operation, based on QoS path information that includes network path information and path load information obtained by a QoS path information obtaining section. An automatic cache updating section, a link prefetching control section, and a cache server cooperating section carry out respective ones of the automatic cache updating operation, the link prefetching operation, and the cache server cooperating operation, by utilizing the path obtained. For example, the path calculating section obtains a maximum remaining bandwidth path as the path.

Patent
27 Aug 2001
TL;DR: In this article, a cache directory is also provided to track cache lines in the write cache and the at least one read cache, which provides a low-latency copy of data that is most likely to be used.
Abstract: A caching input/output hub includes a host interface to connect with a host. At least one input/output interface is provided to connect with an input/output device. A write cache manages memory writes initiated by the input/output device. At least one read cache, separate from the write cache, provides a low-latency copy of data that is most likely to be used. The at least one read cache is in communication with the write cache. A cache directory is also provided to track cache lines in the write cache and the at least one read cache. The cache directory is in communication with the write cache and the at least one read cache.

Proceedings ArticleDOI
08 Sep 2001
TL;DR: The r-a cache is proposed, which provides flexible associativity by placing most blocks in direct-mapped positions and reactively displacing only conflicting blocks to set-associative positions, and using a novel PC-based way-prediction to achieve high accuracy.
Abstract: While set-associative caches typically incur fewer misses than direct-mapped caches, set-associative caches have slower hit times. We propose the reactive-associative cache (r-a cache), which provides flexible associativity by placing most blocks in direct-mapped positions and reactively displacing only conflicting blocks to set-associative positions. The r-a cache uses way-prediction (like the predictive associative cache, PSA) to access displaced blocks on the initial probe. Unlike PSA, however, the r-a cache employs a novel feedback mechanism to prevent unpredictable blocks from being displaced. Reactive displacement and feedback allow the r-a cache to use a novel PC-based way-prediction and achieve high accuracy, without impractical block swapping as in column-associative and group-associative caches, and without relying on timing-constrained XOR way prediction. A one-port, 4-way r-a cache achieves up to 9% speedup over a direct-mapped cache and performs within 2% of an idealized 2-way set-associative, 1-cycle cache. A 4-way r-a cache achieves up to 13% speedup over a PSA cache, with both r-a and PSA using the PC scheme. CACTI estimates that for sizes larger than 8KB, a 4-way r-a cache is within 1% of direct-mapped hit times, and 24% faster than a 2-way set-associative cache.

Patent
Terry L. Kendall
28 Mar 2001
TL;DR: A small cache memory can be incorporated with a main memory, such as a flash memory, on an integrated circuit to improve average access times between a processor and the main memory as discussed by the authors, which can also allow a suspended transfer with minimal latency when the transfer is resumed.
Abstract: A small cache memory can be incorporated with a main memory, such as a flash memory, on an integrated circuit to improve average access times between a processor and the main memory. To minimize cost and complexity, the cache memory may contain only a few words of data. The cache can also allow a suspended transfer with minimal latency when the transfer is resumed. Designing the cache memory to interface with the processor over a standard memory bus permits the cache to be implemented in a system that could otherwise have no cache memory unless the processor and/or memory bus were redesigned.

Proceedings ArticleDOI
23 Apr 2001
TL;DR: The potential for addressing bandwidth limitations by increasing global cache reuse, that is, reusing data across the whole program and over the entire data collection, is explored through a two-step global strategy.
Abstract: Reusing data in cache is critical to achieving high performance on modern machines because it reduces the impact of the latency and bandwidth limitations of direct memory access. To date, most studies of software memory hierarchy management have focused on the latency problem. However, today's machines are increasingly limited by insufficient memory bandwidth; on these machines, latency-oriented techniques are inadequate because they do not seek to minimize the total memory traffic over the whole program. This paper explores the potential for addressing bandwidth limitations by increasing global cache reuse, that is, reusing data across the whole program and over the entire data collection. To this end, the paper explores a two-step global strategy. The first step fuses computations on the same data to enable the caching of repeated accesses. The second step groups data used by the same computation to bring about contiguous access to memory. While the first step reduces the frequency of memory accesses, the second step improves their efficiency. The paper demonstrates the effectiveness of this strategy and shows how to automate it in a production compiler.
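Both steps have direct source-level analogues, sketched below: fusing two traversals of the same array so each element is reused while still cached, and grouping the fields one computation needs so its traversal is contiguous. The arrays and field names are invented for illustration.

```cpp
#include <vector>

// Step 1, computation fusion: two passes over the same array become one, so
// each element is reused while it is still resident in the cache.
void separatePasses(std::vector<double>& a, double s, double t) {
    for (double& x : a) x *= s;          // pass 1: a streams through the cache
    for (double& x : a) x += t;          // pass 2: a streams through again
}

void fusedPass(std::vector<double>& a, double s, double t) {
    for (double& x : a) x = x * s + t;   // one pass: each element loaded once
}

// Step 2, data grouping: fields used by the same computation are stored
// together, so the traversal reads contiguous memory instead of skipping
// over fields it never touches.
struct ParticleMixed {                   // position update drags along unused fields
    double x, y, z;
    double mass, charge, radius;
};

struct Position { double x, y, z; };     // only what the position update needs

void updateMixed(std::vector<ParticleMixed>& ps, double dx) {
    for (ParticleMixed& p : ps) { p.x += dx; p.y += dx; p.z += dx; }
}

void updateGrouped(std::vector<Position>& ps, double dx) {
    for (Position& p : ps) { p.x += dx; p.y += dx; p.z += dx; }
}
```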

Patent
16 Oct 2001
TL;DR: In this paper, the coherency protocol of a multiprocessor data processing system is described, which includes processing logic that returns to coherent operations with other processing units responsive to an occurrence of a pre-determined condition.
Abstract: A multiprocessor data processing system comprising a plurality of processing units, a plurality of caches, each of which is affiliated with one of the processing units, and processing logic that, responsive to a receipt of a first system bus response to a coherency operation, causes the requesting processor to execute operations utilizing super-coherent data. The data processing system further includes logic for eventually returning to coherent operations with other processing units responsive to an occurrence of a pre-determined condition. The coherency protocol of the data processing system includes a first coherency state that indicates that modification of data within a shared cache line of a second cache of a second processor has been snooped on a system bus of the data processing system. When the cache line is in the first coherency state, subsequent requests for the cache line are issued as a Z1 read on a system bus and one of two responses is received. If the response to the Z1 read indicates that the first processor should utilize local data currently available within the cache line, the first coherency state is changed to a second coherency state that indicates to the first processor that subsequent requests for the cache line should utilize the data within the local cache and not be issued to the system interconnect. Coherency state transitions to the second coherency state are completed via the coherency protocol of the data processing system. Super-coherent data is provided to the processor from the cache line of the local cache whenever the second coherency state is set for the cache line and a request is received.

Proceedings ArticleDOI
01 May 2001
TL;DR: This paper proposes two index structures, pkT-trees and pkB-tree, which significantly reduce cache misses by storing partial-key information in the index, and shows that a small, fixed amount of key information allows most cache misses to be avoided, allowing for a simple node structure and efficient implementation.
Abstract: The performance of main-memory index structures is increasingly determined by the number of CPU cache misses incurred when traversing the index. When keys are stored indirectly, as is standard in main-memory databases, the cost of key retrieval in terms of cache misses can dominate the cost of an index traversal. Yet it is inefficient in both time and space to store even moderate sized keys directly in index nodes. In this paper, we investigate the performance of tree structures suitable for OLTP workloads in the face of expensive cache misses and non-trivial key sizes. We propose two index structures, pkT-trees and pkB-trees, which significantly reduce cache misses by storing partial-key information in the index. We show that a small, fixed amount of key information allows most cache misses to be avoided, allowing for a simple node structure and efficient implementation. Finally, we study the performance and cache behavior of partial-key trees by comparing them with other main-memory tree structures for a wide variety of key sizes and key value distributions.
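The core idea, keeping a few key bytes inline in the index entry so most comparisons never follow the pointer to the full key, can be sketched as follows; the partial-key length and structure names are illustrative rather than the paper's exact pkB-tree layout.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstring>
#include <string>

// Sketch of partial-key comparison: each index entry stores the first few key
// bytes inline plus a pointer to the full, indirectly stored key. Comparisons
// decided by the inline prefix never dereference the pointer and so avoid the
// extra cache miss.
constexpr std::size_t kPartialBytes = 4;

struct IndexEntry {
    std::array<char, kPartialBytes> partial{};  // zero-padded key prefix
    const std::string* full_key = nullptr;      // indirect full key (cold memory)
};

// The caller must keep `key` alive for as long as the entry is used.
IndexEntry makeEntry(const std::string& key) {
    IndexEntry e;
    std::memcpy(e.partial.data(), key.data(), std::min(key.size(), kPartialBytes));
    e.full_key = &key;
    return e;
}

// Returns <0, 0, >0 like strcmp; only prefix ties touch the full key.
int compareToSearchKey(const IndexEntry& e, const std::string& search) {
    std::array<char, kPartialBytes> probe{};
    std::memcpy(probe.data(), search.data(), std::min(search.size(), kPartialBytes));
    if (int c = std::memcmp(e.partial.data(), probe.data(), kPartialBytes); c != 0)
        return c;                               // decided without the pointer
    return e.full_key->compare(search);         // rare slow path: fetch full key
}
```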

Proceedings ArticleDOI
01 Dec 2001
TL;DR: Support for tag-unchecked loads and stores, which save the energy of a tag check when the compiler can guarantee an access will be to the same line as an earlier access, is added to C and Java compilers.
Abstract: A direct addressed cache is a hardware-software design for an energy-efficient microprocessor data cache. Direct addressing allows software to access cache data without a hardware cache tag check. These tag-unchecked loads and stores save the energy of a tag check when the compiler can guarantee an access will be to the same line as an earlier access. We have added support for tag-unchecked loads and stores to C and Java compilers. For Mediabench C programs, the compiler eliminates 16-76% of data cache tag accesses, with half of the benchmarks avoiding over 40% of the data tag checks. For SPECjvm98 Java programs, the compiler eliminates 18-63% of data cache tag checks. These tag check reductions translate into data cache energy savings of 9-40%, and overall processor and cache energy savings of 2-8%.

Journal ArticleDOI
TL;DR: This paper presents a cache architecture to convert a cache into a computing unit for either of the following two structured computations: finite impulse response and discrete/inverse discrete cosine transform, and includes additional logic to embed multibit output lookup tables into the cache structure.
Abstract: A considerable portion of a microprocessor chip is dedicated to cache memory. However, not all applications need all the cache storage all the time, especially the computing bandwidth-limited applications. In addition, some applications have large embedded computations with a regular structure. Such applications may be able to use additional computing resources. If the unused portion of the cache could serve these computation needs, the on-chip resources would be utilized more efficiently. This presents an opportunity to explore the reconfiguration of a part of the cache memory for computing. Thus, we propose adaptive balanced computing (ABC): dynamic resource configuration on demand from the application, between memory and computing resources. In this paper, we present a cache architecture to convert a cache into a computing unit for either of the following two structured computations: finite impulse response and discrete/inverse discrete cosine transform. In order to convert a cache memory to a function unit, we include additional logic to embed multibit output lookup tables into the cache structure. The experimental results show that the reconfigurable module improves the execution time of applications with a large number of data elements by factors as high as 50 and 60.

Patent
Yuanlong Wang, Zong Yu, Xiaofan Wei, Earl T. Cohen, Brian R. Baird, Daniel Fu
10 Aug 2001
TL;DR: In this paper, the authors present the Transaction Bus of a symmetric multiprocessor system, which is implemented using segmented buses, distributed muxes, point-to-point wiring, and supports transaction processing at a rate of one transaction per clock cycle.
Abstract: A preferred embodiment of a symmetric multiprocessor system includes a switched fabric (switch matrix) for data transfers that provides multiple concurrent buses that enable greatly increased bandwidth between processors and shared memory. A Transaction Controller, Transaction Bus, and Transaction Status Bus are used for serialization, centralized cache control, and highly pipelined address transfers. The shared Transaction Controller serializes transaction requests from Initiator devices that can include CPU/Cache modules and Peripheral Bus modules. The Transaction Bus of an illustrative embodiment is implemented using segmented buses, distributed muxes, and point-to-point wiring, and supports transaction processing at a rate of one transaction per clock cycle. The Transaction Controller monitors the Transaction Bus, maintains a set of duplicate cache-tags for all CPU/Cache modules, maps addresses to Target devices, performs centralized cache control for all CPU/Cache modules, filters unnecessary cache transactions, and routes necessary transactions to Target devices over the Transaction Status Bus. The Transaction Status Bus includes both bus-based and point-to-point control of the target devices. A modified rotating priority scheme is used to provide starvation-free support for locked buses and memory resources via backoff operations. Speculative memory operations are supported to further enhance performance.

Patent
17 Aug 2001
TL;DR: In this article, the cache is reconfigured in response to an operation command (1314), such that each tag in the array of tags that contains a specified qualifier value is modified in accordance with the operation command.
Abstract: A digital system is provided with several processors (1302), a shared level-two (L2) cache (1300) having several segments per entry with associated tags, and a level-three (L3) physical memory. Each tag entry includes a task-ID qualifier field and a resource-ID qualifier field. Data is loaded into various lines in the cache in response to cache access requests when a given cache access request misses. After loading data into the cache in response to a miss, a tag associated with the data line is set to a valid state. In addition to setting a tag to a valid state, qualifier values are stored in qualifier fields in the tag. Each qualifier value specifies a usage characteristic of data stored in an associated data line of the cache, such as a task ID. A miss counter (532) counts each miss and a monitoring task (1311) determines a miss rate for memory requests. If a selected miss rate threshold value is exceeded, the digital system is reconfigured in order to reduce the miss rate. The cache is reconfigured in response to an operation command (1314), such that each tag in the array of tags that contains a specified qualifier value is modified in accordance with the operation command. Other types of reconfiguration can be performed, such as remapping a selected program portion to operate in a different address range, locking a portion of the data entries within the cache, or defining addresses corresponding to a selected program task as uncacheable, for example.
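One of the operation commands described above, invalidating every line whose tag carries a given task-ID qualifier, reduces to a walk over the tag array. A simplified sketch, with field widths and names chosen for illustration rather than taken from the patent:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a qualifier-aware tag array: each tag carries a task-ID and a
// resource-ID qualifier alongside the usual address tag and valid bit, so an
// operation command can invalidate (or flush) all lines belonging to one task.
struct QualifiedTag {
    std::uint64_t addr_tag = 0;
    std::uint16_t task_id = 0;
    std::uint8_t resource_id = 0;
    bool valid = false;
    bool dirty = false;
};

class QualifiedTagArray {
public:
    explicit QualifiedTagArray(std::size_t lines) : tags_(lines) {}

    // Operation command: invalidate every line tagged with task_id.
    // Dirty lines would be written back first in a real controller.
    std::size_t invalidateTask(std::uint16_t task_id) {
        std::size_t count = 0;
        for (QualifiedTag& t : tags_) {
            if (t.valid && t.task_id == task_id) {
                // writeback(t) would go here when t.dirty is set
                t.valid = false;
                ++count;
            }
        }
        return count;   // e.g., reported back to the monitoring task
    }

private:
    std::vector<QualifiedTag> tags_;
};
```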

Journal ArticleDOI
TL;DR: This paper considers how to effectively use a bank-exposed memory system comprised of small, decentralized cache banks for sequential programs and demonstrates that using bank disambiguation improves performance, by a factor of 3 to 5 over using ILP alone.
Abstract: Technological trends require that future scalable microprocessors be decentralized. Applying these trends toward memory systems shows that the size of the cache accessible in a single cycle will decrease in a future generation of chips. Thus, a bank-exposed memory system comprised of small, decentralized cache banks must eventually replace that of a monolithic cache. This paper considers how to effectively use such a memory system for sequential programs. This paper presents Maps, the software technology central to bank-exposed architectures, which are architectures with bank-exposed memory systems. Maps solves the problem of bank disambiguation-that of determining at compile-time which bank a memory reference is accessing. Bank disambiguation is important because it enables the compile-time optimization for data locality, where data can be placed close to the computation that requires it. Two methods for bank disambiguation are presented: equivalence-class unification and modulo unrolling. Experimental results are presented using a compiler for the MIT Raw machine, a bank-exposed architecture that relies on the compiler to 1) manage its memory and 2) orchestrate its instruction level parallelism and communication. Results on Raw using sequential codes demonstrate that using bank disambiguation improves performance, by a factor of 3 to 5 over using ILP alone.
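Modulo unrolling has a simple source-level illustration: if an array is low-order interleaved across the banks, unrolling a loop by the bank count makes each static reference in the body land in one statically known bank, which is exactly the disambiguation the compiler needs. The bank count and loop below are illustrative, not taken from the Raw compiler.

```cpp
// Illustration of modulo unrolling for bank disambiguation. Assume a[] is
// low-order interleaved across 4 banks, i.e. element i lives in bank i % 4.
constexpr int kNumBanks = 4;

// Before: which bank a[i] touches depends on the runtime value of i, so the
// compiler cannot tell at compile time which bank each reference accesses.
void scaleOriginal(float* a, int n, float s) {
    for (int i = 0; i < n; ++i)
        a[i] *= s;
}

// After modulo unrolling by the number of banks: within the unrolled body,
// a[i+0] is always in bank 0, a[i+1] in bank 1, and so on, so every static
// memory reference is disambiguated to a single bank.
void scaleModuloUnrolled(float* a, int n, float s) {
    int i = 0;
    for (; i + kNumBanks <= n; i += kNumBanks) {
        a[i + 0] *= s;   // bank 0 (i stays a multiple of kNumBanks here)
        a[i + 1] *= s;   // bank 1
        a[i + 2] *= s;   // bank 2
        a[i + 3] *= s;   // bank 3
    }
    for (; i < n; ++i)   // epilogue for the leftover elements
        a[i] *= s;
}
```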