Showing papers on "Smart Cache published in 2001"

PDF

Open Access

Proceedings Article•DOI•

Cache decay: exploiting generational behavior to reduce cache leakage power

[...]

Stefanos Kaxiras¹, Zhigang Hu², Margaret Martonosi²•Institutions (2)

01 May 2001

TL;DR: This paper discusses policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused, and proposes adaptive policies that effectively reduce LI cache leakage energy by 5x for the SPEC2000 with only negligible degradations in performance.

...read moreread less

Abstract: Power dissipation is increasingly important in CPUs ranging from those intended for mobile use, all the way up to high-performance processors for high-end servers. While the bulk of the power dissipated is dynamic switching power, leakage power is also beginning to be a concern. Chipmakers expect that in future chip generations, leakage's proportion of total chip power will increase significantly.This paper examines methods for reducing leakage power within the cache memories of the CPU. Because caches comprise much of a CPU chip's area and transistor counts, they are reasonable targets for attacking leakage. We discuss policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused. In particular, our approach is targeted at the generational nature of cache line usage. That is, cache lines typically have a flurry of frequent use when first brought into the cache, and then have a period of “dead time” before they are evicted. By devising effective, low-power ways of deducing dead time, our results show that in many cases we can reduce LI cache leakage energy by 4x in SPEC2000 applications without impacting performance. Because our decay-based techniques have notions of competitive on-line algorithms at their roots, their energy usage can be theoretically bounded at within a factor of two of the optimal oracle-based policy. We also examine adaptive decay-based policies that make energy-minimizing policy choices on a per-application basis by choosing appropriate decay intervals individually for each cache line. Our proposed adaptive policies effectively reduce LI cache leakage energy by 5x for the SPEC2000 with only negligible degradations in performance.

...read moreread less

725 citations

Proceedings Article•

Weaving Relations for Cache Performance

[...]

Anastassia Ailamaki¹, David J. DeWitt², Mark D. Hill², Marios Skounakis²•Institutions (2)

Carnegie Mellon University¹, University of Wisconsin-Madison²

11 Sep 2001

TL;DR: This paper proposes a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page, and demonstrates that in-page data placement is the key to high cache performance.

...read moreread less

Abstract: Relational database systems have traditionally optimzed for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance is becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results, when compared to NSM (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM’s stall time due to data cache accesses, (b) range selection queries and updates on memoryresident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster.

...read moreread less

428 citations

Proceedings Article•DOI•

Reducing set-associative cache energy via way-prediction and selective direct-mapping

[...]

Michael D. Powell¹, Amit Agarwal¹, T. N. Vijaykumar¹, Babak Falsafi², Kaushik Roy¹ - Show less +1 more•Institutions (2)

Purdue University¹, Carnegie Mellon University²

01 Dec 2001

TL;DR: Two previously-proposed techniques, way-prediction and selective direct-mapping, are applied to reducing L1 cache dynamic energy while maintaining high performance, and caches achieve the energy-delay of sequential access while maintaining the performance of parallel access.

...read moreread less

Abstract: Set-associative caches achieve low miss rates for typical applications but result in significant energy dissipation. Set-associative caches minimize access time by probing all the data ways in parallel with the tag lookup, although the output of only the matching way is used. The energy spent accessing the other ways is wasted Eliminating the wasted energy by performing the data lookup sequentially following the tag lookup substantially increases cache access time, and is unacceptable for high-performance L1 caches. In this paper, we apply two previously-proposed techniques, way-prediction and selective direct-mapping, to reducing L1 cache dynamic energy while maintaining high performance. The techniques predict the matching way and probe only the predicted way and not all the ways, achieving energy savings. While these techniques were originally proposed to improve set-associative cache access times, this is the first paper to apply them to reducing cache energy. We evaluate the effectiveness of these techniques in reducing L1 d-cache, L1 i-cache, and overall processor energy. Using these techniques, our caches achieve the energy-delay of sequential access while maintaining the performance of parallel access. Relative to parallel access L1 i- and d-caches, the techniques achieve overall processor energy-delay reduction of 8%, while perfect way-prediction with no performance degradation achieves 10% reduction. The performance degradation of the techniques is less than 3%, compared to an aggressive,.1-cycle, 4-way, parallel access cache.

...read moreread less

310 citations

Patent•

Scalable architecture based on single-chip multiprocessing

[...]

Luiz Andre Barroso, Kourosh Gharachorloo, Andreas Nowatzyk

08 Jun 2001

TL;DR: The PIRANHA system as discussed by the authors is a scalable chip-multiprocessing system with scalable architecture, including on a single chip: a plurality of processor cores; a two-level cache hierarchy; an intra-chip switch; one or more memory controllers; a cache coherence protocol; and an interconnect subsystem.

...read moreread less

Abstract: A chip-multiprocessing system with scalable architecture, including on a single chip: a plurality of processor cores; a two-level cache hierarchy; an intra-chip switch; one or more memory controllers; a cache coherence protocol; one or more coherence protocol engines; and an interconnect subsystem The two-level cache hierarchy includes first level and second level caches In particular, the first level caches include a pair of instruction and data caches for, and private to, each processor core The second level cache has a relaxed inclusion property, the second-level cache being logically shared by the plurality of processor cores Each of the plurality of processor cores is capable of executing an instruction set of the ALPHA™ processing core The scalable architecture of the chip-multiprocessing system is targeted at parallel commercial workloads A showcase example of the chip-multiprocessing system, called the PIRANHA™ system, is a highly integrated processing node with eight simpler ALPHA™ processor cores A method for scalable chip-multiprocessing is also provided

...read moreread less

294 citations

Journal Article•DOI•

Analysis of web caching architectures: hierarchical and distributed caching

[...]

Pablo Rodriguez¹, Christian Spanner¹, Ernst W. Biersack¹•Institutions (1)

Institut Eurécom¹

01 Aug 2001-IEEE ACM Transactions on Networking

TL;DR: This paper considers hierarchical and distributed caching, a hybrid caching architecture that combines hierarchical caching with distributed caching at every level of a caching hierarchy, and determines the optimal number of caches that should cooperate at each caching level to minimize client's retrieval latency.

...read moreread less

Abstract: Cache cooperation improves the performance of isolated caches, especially for caches with small cache populations. To make caches cooperate on a large scale and effectively increase the cache population, several caches are usually federated in caching architectures. In this paper, we discuss and compare the performance of different caching architectures. In particular, we consider hierarchical and distributed caching. We derive analytical models to study important performance parameters of hierarchical and distributed caching, i.e., client's perceived latency, bandwidth usage, load in the caches, and disk space usage. Additionally, we consider a hybrid caching architecture that combines hierarchical caching with distributed caching at every level of a caching hierarchy. We evaluate the performance of a hybrid scheme and determine the optimal number of caches that should cooperate at each caching level to minimize client's retrieval latency.

...read moreread less

232 citations

Patent•

Application caching system and method

[...]

Ashok K. Chandra, Neil Latarche¹, Jianchang Mao¹, Prabhakar Raghavan¹•Institutions (1)

Hewlett-Packard¹

08 May 2001

TL;DR: In this paper, an application caching system and method are provided wherein one or more applications may be cached throughout a distributed computer network (24), where the system may include a central cache directory server, a distributed master application server, and a distributed application cache server.

...read moreread less

Abstract: An application caching system and method are provided wherein one or more applications may be cached throughout a distributed computer network (24). The system may include a central cache directory server (30), one or more distributed master application servers (28) and one or more distributed application cache servers (26). The system may permit a service, such as a search, to be provided to the user more quickly.

...read moreread less

131 citations

Proceedings Article•DOI•

Analytical cache models with applications to cache partitioning

[...]

G. Edward Suh¹, Srinivas Devadas¹, Larry Rudolph¹•Institutions (1)

Massachusetts Institute of Technology¹

17 Jun 2001

TL;DR: In this paper, an analytical cache model for time-shared systems is presented, which estimates the overall cache miss-rate of a multiprocessing system with any cache size and time quanta.

...read moreread less

Abstract: An accurate, tractable, analytic cache model for time-shared systems is presented, which estimates the overall cache miss-rate of a multiprocessing system with any cache size and time quanta. The input to the model consists of the isolated miss-rate curves for each process, the time quanta for each of the executing processes, and the total cache size. The output is the overall miss-rate. Trace-driven simulations demonstrate that the estimated miss-rate is very accurate. Since the model provides a fast and accurate way to estimate the effect of context switching, it is useful for both understanding the effect of context switching on caches and optimizing cache performance for time-shared systems. A cache partitioning mechanism is also presented and is shown to improve the cache miss-rate up to 25% over the normal LRU replacement policy.

...read moreread less

130 citations

Proceedings Article•DOI•

Analysis and design of hierarchical Web caching systems

[...]

Hao Che¹, Z. Wang¹, Ye Tung¹•Institutions (1)

Pennsylvania State University¹

22 Apr 2001

TL;DR: An analytical modeling technique is developed to characterize an uncooperative two-level hierarchical caching system where the least recently used (LRU) algorithm is locally run at each cache, and a cooperative hierarchical Web caching architecture is proposed based on these principles.

...read moreread less

Abstract: This paper aims at finding fundamental design principles for hierarchical Web caching. An analytical modeling technique is developed to characterize an uncooperative two-level hierarchical caching system where the least recently used (LRU) algorithm is locally run at each cache. With this modeling technique, we are able to identify a characteristic time for each cache, which plays a fundamental role in understanding the caching processes. In particular, a cache can be viewed roughly as a lowpass filter with its cutoff frequency equal to the inverse of the characteristic time. Documents with access frequencies lower than this cutoff frequency will have good chances to pass through the cache without cache hits. This viewpoint enables us to take any branch of the cache tree as a tandem of lowpass filters at different cutoff frequencies, which further results in the finding of two fundamental design principles. Finally, to demonstrate how to use the principles to guide the caching algorithm design, we propose a cooperative hierarchical Web caching architecture based on these principles. The simulation study shows that the proposed cooperative architecture results in 50% saving of the cache resource compared with the traditional uncooperative hierarchical caching architecture.

...read moreread less

124 citations

Patent•

Method and system for exclusive two-level caching in a chip-multiprocessor

[...]

Luiz Andre Barroso, Kourosh Gharachorloo, Andreas Nowatzyk

08 Jun 2001

TL;DR: In this article, a method and system for exclusive two-level caching in a chip-multiprocessor is presented to maximize the effective use of on-chip cache.

...read moreread less

Abstract: To maximize the effective use of on-chip cache, a method and system for exclusive two-level caching in a chip-multiprocessor are provided. The exclusive two-level caching in accordance with the present invention involves method relaxing the inclusion requirement in a two-level cache system in order to form an exclusive cache hierarchy. Additionally, the exclusive two-level caching involves providing a first-level tag-state structure in a first-level cache of the two-level cache system. The first tag-state structure has state information. The exclusive two-level caching also involves maintaining in a second-level cache of the two-level cache system a duplicate of the first-level tag-state structure and extending the state information in the duplicate of the first tag-state structure, but not in the first-level tag-state structure itself, to include an owner indication. The exclusive two-level caching further involves providing in the second-level cache a second tag-state structure so that a simultaneous lookup at the duplicate of the first tag-state structure and the second tag-state structure is possible. Moreover, the exclusive two-level caching involves associating a single owner with a cache line at any given time of its lifetime in the chip-multiprocessor.

...read moreread less

121 citations

Patent•

System and method for network caching

[...]

Mark Vange, Marc Plumb, Marco Clementoni

16 Apr 2001

TL;DR: In this article, the authors propose a system and method for caching network resources in an intermediary server topologically located between a client and a server in a network, where the intermediate server includes a cache and methods for loading content into the cache as according to rules specified by a site owner.

...read moreread less

Abstract: A system and method for caching network resources in an intermediary server topologically located between a client and a server in a network. The intermediate server preferably caches at both a back-end location and a front-end location. Intermediary server includes a cache and methods for loading content into the cache as according to rules specified by a site owner. Optionally, content can be proactively loaded into the cache to include content not yet requested. In another option, requests can be held at the cache when a prior request for similar content is pending.

...read moreread less

120 citations

Patent•

Multiple microprocessors with a shared cache

[...]

Chauvel Gerard¹, Maija Kuusela¹, Dominique D'Inverno¹, Lasserre Serge¹•Institutions (1)

Texas Instruments¹

17 Aug 2001

TL;DR: In this article, a shared L2 cache architecture with 4-way associativity, four segments per entry and four valid and dirty bits is presented, and a shared translation look-aside buffer (TLB) is provided for L2 accesses, while a private TLB associated with each processor.

...read moreread less

Abstract: A digital system is provided with a several processors, a private level one (L1) cache associated with each processor, a shared level two (L2) cache having several segments per entry, and a level three (L3) physical memory. The shared L2 cache architecture is embodied with 4-way associativity, four segments per entry and four valid and dirty bits. When the L2-cache misses, the penalty to access to data within the L3 memory is high. The system supports miss under miss to let a second miss interrupt a segment prefetch being done in response to a first miss. Thus, an interruptible SDRAM to L2-cache prefetch system with miss under miss support is provided. A shared translation look-aside buffer (TLB) is provided for L2 accesses, while a private TLB is associated with each processor. A micro TLB (μTLB) is associated with each resource that can initiate a memory transfer. The L2 cache, along with all of the TLBs and μTLBs have resource ID fields and task ID fields associated with each entry to allow flushing and cleaning based on resource or task. Configuration circuitry is provided to allow the digital system to be configured on a task by task basis in order to reduce power consumption.

...read moreread less

Patent•

Distributed multicast caching technique

[...]

Julian Satran¹, Gidon Gershinsky¹•Institutions (1)

IBM¹

26 Jan 2001

TL;DR: In this paper, the first cache forms the root of a multilevel hierarchical tree and transmits the group directory to a plurality of subsidiary caches, and the subsidiary caches may reorganize the group directories and relay it to a lower level of subsidiary cache.

...read moreread less

Abstract: A caching arrangement for the content of multicast transmission across a data network utilizes a first cache which receives content from one or more content providers. Using the REMADE protocol, the first cache constructs a group directory. The first cache forms the root of a multilevel hierarchical tree. In accordance with configuration parameters, the first cache transmits the group directory to a plurality of subsidiary caches. The subsidiary caches may reorganize the group directory, and relay it to a lower level of subsidiary caches. The process is recursive, until a multicast group of end-user clients is reached. Requests for content by the end-user clients are received by the lowest level cache, and forwarded as necessary to higher levels in the hierarchy. The content is then returned to the requesters. Various levels of caches retain the group directory and content according to configuration options, which can be adaptive to changing conditions such as demand, loading, and the like. The behavior of the caches may optionally be modified by the policies of the content providers.

...read moreread less

Patent•

Method and apparatus for adaptively bypassing one or more levels of a cache hierarchy

[...]

Jr. Simon C. Steely

25 Jan 2001

TL;DR: In this paper, a system for adaptive bypassing one or more higher cache levels following a miss in a lower level of a cache hierarchy is described, where each cache level preferably includes a tag store containing address and state information for each cache line resident in the respective cache.

...read moreread less

Abstract: A system for adaptively bypassing one or more higher cache levels following a miss in a lower level of a cache hierarchy is described. Each cache level preferably includes a tag store containing address and state information for each cache line resident in the respective cache. When an invalidate request is received at a given cache hierarchy, each cache level is searched for the address specified by the invalidate request. When an address match is detected, the state of the respective cache line is changed to the invalid state, although the address of the cache line is left in the tag store. Thereafter, if the processor or entity associated with this cache hierarchy issues its own request for this same cache line, the cache hierarchy begins searching the tag store of each level starting with the lowest cache level. Since the address of the invalidated cache line was left in the respective tag store, a match will be detected at one of the cache levels, although the corresponding state of this cache line is invalid. This condition is specifically detected and is considered to be an “inval_miss” occurrence. In response, to an inval_miss, the cache hierarchy calls off searching any higher levels, and instead, issues a memory reference request for the desired cache line. In a further embodiment, the entity that sourced an invalidate request is stored, and a subsequent memory reference request for the same cache line is sent directly to the source entity.

...read moreread less

Journal Article•DOI•

Web caching and content distribution: a view from the interior

[...]

Syam Gadde¹, Jeffrey S. Chase¹, Michael Rabinovich²•Institutions (2)

Duke University¹, AT&T Labs²

01 Feb 2001-Computer Communications

TL;DR: This paper shows how the Wolman model is applied to large-scale caching systems in which the interior nodes belong to third-party content distribution services and correlates the model's predictions of interior cache behavior with empirical observations from the root caches of the NLANR cache hierarchy.

...read moreread less

Journal Article•DOI•

An evaluation of cache invalidation strategies in wireless environments

[...]

Kian-Lee Tian¹, Jun Cai¹, Beng Chin Ooi¹•Institutions (1)

National University of Singapore¹

01 Aug 2001-IEEE Transactions on Parallel and Distributed Systems

TL;DR: This study shows that the two proposed schemes are not only effective in salvaging the cache content but consume significantly less energy than their counterparts.

...read moreread less

Abstract: Caching can reduce the bandwidth requirement in a wireless computing environment as well as minimize the energy consumption of wireless portable computers. To facilitate mobile clients in ascertaining the validity of their cache content, servers periodically broadcast cache invalidation reports that contain information of data that has been updated. However, as mobile clients may operate in a doze or even totally disconnected mode (to conserve energy), it is possible that some reports may be missed and the clients are forced to discard the entire cache content. In this paper, we reexamine the issue of designing cache invalidation strategies. We identify the basic issues in designing cache invalidation strategies. From the solutions to these issues, a large set of cache invalidation schemes can be constructed. We evaluate the performance of four representative algorithms-two of which are known algorithms (i.e., Dual-Report Cache Invalidation and Bit-Sequences) while the other two are their counterparts that exploit selective tuning (namely, Selective Dual-Report Cache Invalidation and Bit-Sequences with Bit Count). Our study shows that the two proposed schemes are not only effective in salvaging the cache content but consume significantly less energy than their counterparts. While the Selective Dual-Report Cache Invalidation scheme performs best in most cases, it is inferior to the Bit-Sequences with the Bit-Count scheme under high update rates.

...read moreread less

Patent•

System and method for partitioning address space in a proxy cache server cluster

[...]

Robert Drew Major, Stephen R. Carter, Howard Davis, Brent Ray Christensen

07 Jun 2001

TL;DR: In this article, a proxy partition cache (PPC) architecture and a technique for address-partitioning a proxy cache consisting of a grouping of discrete, cooperating caches (servers) is provided.

...read moreread less

Abstract: A proxy partition cache (PPC) architecture and a technique for address-partitioning a proxy cache consisting of a grouping of discrete, cooperating caches (servers) is provided. Client requests for objects (files) of a given size are redirected or reassigned to a single cache in the grouping, notwithstanding the cache to which the request is made by the load-balancing mechanism (such as a Layer 4 switch) based upon load-balancing considerations. The file is then returned to the switch via the switch-designated cache for vending to the requesting client. The redirection/reassignment occurs according to a function within the cache to which the request is directed so that the switch remains freed from additional tasks that can compromise speed.

...read moreread less

Patent•

Technique for enhancing effectiveness of cache server

[...]

Masayoshi Kobayashi¹•Institutions (1)

NEC¹

25 Jul 2001

TL;DR: In this article, a path calculating section obtains a path suitable for carrying out an automatic cache updating operation, a link prefetching operation, and a cache server cooperating operation, based on QoS path information that includes network path information and path load information obtained by a path information obtaining section.

...read moreread less

Abstract: A path calculating section obtains a path suitable for carrying out an automatic cache updating operation, a link prefetching operation, and a cache server cooperating operation, based on QoS path information that includes network path information and path load information obtained by a QoS path information obtaining section. An automatic cache updating section, a link prefetching control section, and a cache server cooperating section carry out respective ones of the automatic cache updating operation, the link prefetching operation, and the cache server cooperating operation, by utilizing the path obtained. For example, the path calculating section obtains a maximum remaining bandwidth path as the path.

...read moreread less

Patent•

Distributed read and write caching implementation for optimized input/output applications

[...]

Kenneth C. Creta¹, D. Bell¹, Robert George¹, Bradford Congdon¹, Robert G. Blankenship¹, Duane January¹ - Show less +2 more•Institutions (1)

Intel¹

27 Aug 2001

TL;DR: In this article, a cache directory is also provided to track cache lines in the write cache and the at least one read cache, which provides a low-latency copy of data that is most likely to be used.

...read moreread less

Abstract: A caching input/output hub includes a host interface to connect with a host. At least one input/output interface is provided to connect with an input/output device. A write cache manages memory writes initiated by the input/output device. At least one read cache, separate from the write cache, provides a low-latency copy of data that is most likely to be used. The at least one read cache is in communication with the write cache. A cache directory is also provided to track cache lines in the write cache and the at least one read cache. The cache directory is in communication with the write cache and the at least one read cache.

...read moreread less

Proceedings Article•DOI•

Reactive-Associative Caches

[...]

Brannon Batson, T. N. Vijaykumar

08 Sep 2001

TL;DR: The r-a cache is proposed, which provides flexible associativity by placing most blocks in direct-mapped positions and reactively displacing only conflicting blocks to set-associative positions, and using a novel PC-based way-prediction to achieve high accuracy.

...read moreread less

Abstract: While set-associative caches typically incur fewer misses than direct-mapped caches, set-associative caches have slower hit times. We propose the reactive-associative cache (r-a cache), which provides flexible associativity by placing most blocks in direct-mapped positions and reactively displacing only conflicting blocks to set-associative positions. The r-a cache uses way-prediction (like the predictive associative cache, PSA) to access displaced blocks on the initial probe. Unlike PSA, however, the r-a cache employs a novel feedback mechanism to prevent unpredictable blocks from being displaced. Reactive displacement and feedback allow the r-a cache to use a novel PC-based way-prediction and achieve high accuracy; without impractical block swapping as in column associative and group associative, and without relying on timing-constrained XOR way prediction. A one-port, 4-way r-a cache achieves up to 9% speedup over a direct-mapped cache and performs within 2% of an idealized 2-way set-associative, 1-cycle cache. A 4-way r-a cache achieves up to 13% speedup over a PSA cache, with both r-a and PSA using the PC scheme. CACTI estimates that for sizes larger than 8KB, a 4-way r-a cache is within 1% of direct-mapped hit times, and 24% faster than a 2-way set-associative cache.

...read moreread less

Patent•

Method and apparatus for web caching

[...]

Michael Zhu

22 May 2001

TL;DR: In this article, a method and apparatus for web caching is described, which can be implemented in hardware, software, or firmware, and can be used in either hardware or software.

...read moreread less

Abstract: A method and apparatus for web caching is disclosed. The method and apparatus may be implemented in hardware, software or firmware. Complementary cache management modules, a coherency module and a cache module(s) are installed complementary gateways for data and for clients respectively. The coherency management module monitors data access requests and or response and determines for each: the uniform resource locator (URL) of the requested web page, the URL of the requestor and a signature. The signature is computed using cryptographic techniques and in particular a hash function for which the input is the corresponding web page for which a signature is to be generated. The coherency management module caches these signatures and the corresponding URL and uses the signatures to determine when a page has been updated. When, on the basis of signature comparisons it is determined that a page has been updated the coherency management module sends a notification to all complementary cache modules. Each cache module caches web pages requested by the associated client(s) to which it is coupled. The notification from the cache management module results in the cache module(s) which are the recipient of a given notice updating their tag table with a stale bit for the associated web page. The cache module(s) use this information in the associated tag tables to determine which pages they need to update. The cache modules initiate this update during intervals of reduced activity in the servers, gateways, routers, or switches of which they are a part. All clients requesting data through the system of which each cache module is a part are provided by the associated cache module with cached copies of requested web pages.

...read moreread less

Proceedings Article•DOI•

Main-memory index structures with fixed-size partial keys

[...]

Philip L. Bohannon¹, Peter Mcllroy¹, Rajeev Rastogi¹•Institutions (1)

Alcatel-Lucent¹

01 May 2001

TL;DR: This paper proposes two index structures, pkT-trees and pkB-tree, which significantly reduce cache misses by storing partial-key information in the index, and shows that a small, fixed amount of key information allows most cache misses to be avoided, allowing for a simple node structure and efficient implementation.

...read moreread less

Abstract: The performance of main-memory index structures is increasingly determined by the number of CPU cache misses incurred when traversing the index. When keys are stored indirectly, as is standard in main-memory databases, the cost of key retrieval in terms of cache misses can dominate the cost of an index traversal. Yet it is inefficient in both time and space to store even moderate sized keys directly in index nodes. In this paper, we investigate the performance of tree structures suitable for OLTP workloads in the face of expensive cache misses and non-trivial key sizes. We propose two index structures, pkT-trees and pkB-trees, which significantly reduce cache misses by storing partial-key information in the index. We show that a small, fixed amount of key information allows most cache misses to be avoided, allowing for a simple node structure and efficient implementation. Finally, we study the performance and cache behavior of partial-key trees by comparing them with other main-memory tree structures for a wide variety of key sizes and key value distributions.

...read moreread less

Proceedings Article•DOI•

Direct addressed caches for reduced power consumption

[...]

Emmett Witchel¹, Samuel Larsen¹, C.S. Ananian¹, Krste Asanovic¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Dec 2001

TL;DR: Support for tag-unchecked loads and stores to C and Java compilers that save the energy of a tag check when the compiler can guarantee an access will be to the same line as an earlier access is added.

...read moreread less

Abstract: A direct addressed cache is a hardware-software design for an energy-efficient microprocessor data cache. Direct addressing allows software to access cache data without a hardware cache tag check. These tag-unchecked loads and stores save the energy of a tag check when the compiler can guarantee an access will be to the same line as an earlier access. We have added support for tag-unchecked loads and stores to C and Java compilers. For Mediabench C programs, the compiler eliminates 16-76% of data cache tag accesses, with half of the benchmarks avoiding over 40% of the data tag checks. For SPECjvm98 Java programs, the compiler eliminates 18-63% of data cache tag checks. These tag check reductions translate into data cache energy savings of 9-40%, and overall processor and cache energy savings of 2-8%.

...read moreread less

Journal Article•DOI•

A reconfigurable multifunction computing cache architecture

[...]

Huesung Kim¹, A.K. Somani, Akhilesh Tyagi•Institutions (1)

Iowa State University¹

01 Aug 2001-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: This paper presents a cache architecture to convert a cache into a computing unit for either of the following two structured computations: finite impulse response and discrete/inverse discrete cosine transform, and includes additional logic to embed multibit output lookup tables into the cache structure.

...read moreread less

Abstract: A considerable portion of a microprocessor chip is dedicated to cache memory. However, not all applications need all the cache storage all the time, especially the computing bandwidth-limited applications. In addition, some applications have large embedded computations with a regular structure. Such applications may be able to use additional computing resources. If the unused portion of the cache could serve these computation needs, the on-chip resources would be utilized more efficiently. This presents an opportunity to explore the reconfiguration of a part of the cache memory for computing. Thus, we propose adaptive balanced computing (ABC)-dynamic resource configuration on demand from application-between memory and computing resources. In this paper, we present a cache architecture to convert a cache into a computing unit for either of the following two structured computations: finite impulse response and discrete/inverse discrete cosine transform. In order to convert a cache memory to a function unit, we include additional logic to embed multibit output lookup tables into the cache structure. The experimental results show that the reconfigurable module improves the execution time of applications with a large number of data elements by a factor as high as 50 and 60.

...read moreread less

Patent•

Software controlled cache configuration based on average miss rate

[...]

Gerard Chauvel, Dominique D'Inverno, Serge Lasserre

17 Aug 2001

TL;DR: In this article, the cache is reconfigured in response to an operation command (1314 ), such that each tag in the array of tags that contains a specified qualifier value is modified in accordance with the operation command.

...read moreread less

Abstract: A digital system is provided with a several processors ( 1302 ), a shared level two (L2) cache ( 1300 ) having several segments per entry with associated tags, and a level three (L3) physical memory. Each tag entry includes a task-ID qualifier field and a resource ID qualifier field. Data is loaded into various lines in the cache in response to cache access requests when a given cache access request misses. After loading data into the cache in response to a miss, a tag associated with the data line is set to a valid state. In addition to setting a tag to a valid state, qualifier values are stored in qualifier fields in the tag. Each qualifier value specifies a usage characteristic of data stored in an associated data line of the cache, such as a task ID. A miss counter ( 532 ) counts each miss and a monitoring task ( 1311 ) determines a miss rate for memory requests. If a selected miss rate threshold value is exceeded, the digital system is reconfigured in order to reduce the miss rate. The cache is reconfigured in response to an operation command ( 1314 ), such that each tag in the array of tags that contains a specified qualifier value is modified in accordance with the operation command. Other types of reconfiguration can be performed, such as remapping a selected program portion to operate in a different address range, locking a portion of the data entries within the cache, or defining addresses corresponding to a selected program task as uncacheable, for example.

...read moreread less

Patent•

System and method for coordinated hierarchical caching and cache replacement

[...]

James R. H. Challenger¹, Dantzig Paul Michael¹, Daniel M. Dias¹, Arun Iyengar¹, Eric Levy-Abegnoli¹ - Show less +1 more•Institutions (1)

IBM¹

06 Nov 2001

TL;DR: In this paper, the authors present a hierarchy of level-1 caches and level-2 caches for hierarchically caching objects, based on a set of one-or more criteria.

...read moreread less

Abstract: A system and method for hierarchically caching objects includes one or more level 1 nodes, each including at least one level 1 cache; one or more level 2 nodes within which the objects are permanently stored or generated upon request, each level 2 node coupled to at least one of the one or more level 1 nodes and including one or more level 2 caches; and means for storing, in a coordinated manner, one or more objects in at least one level 1 cache and/or at least one level 2 cache, based on a set of one or more criteria. Furthermore, in a system adapted to receive requests for objects from one or more clients, the system having a set of one or more level 1 nodes, each containing at least one level 1 cache, a method for managing a level 1 cache includes the steps of applying, for part of the at least one level 1 cache, a cache replacement policy designed to minimize utilization of a set of one or more resources in the system; and using, for other parts of the at least one level 1 cache, one or more other cache replacement policies designed to minimize utilization of one or more other sets of one or more resources in the system.

...read moreread less

Proceedings Article•DOI•

L1 data cache decomposition for energy efficiency

[...]

Michael C. Huang¹, Jose Renau¹, Seung-Moon Yoo¹, Josep Torrellas¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

06 Aug 2001

TL;DR: In this paper, a new L1 data cache structure that combines a Specialized Stack Cache (SSC) and a Pseudo Set-Associative Cache (PSAC) is proposed.

...read moreread less

Abstract: The L1 data cache is a time-critical module and, at the same time a major consumer of energy To reduce its energy-delay product, we apply two principles of low-power design: specialize part of the cache structure and break the cache down into smaller caches To this end, we propose a new L1 data cache structure that combines a Specialized Stack Cache (SSC) and a Pseudo Set-Associative Cache (PSAC) Individually, our SSC and PSAC designs have a lower energy-delay product than previously-proposed related designs In addition, their combined operation is very effective Relative to a conventional 2-way 32 KB data cache, a design containing a 4-way 32 KB PSAC and a 512 B SSC reduces the energy-delay product of several applications by an average of 44%

...read moreread less

Proceedings Article•DOI•

Cache performance for multimedia applications

[...]

Nathan T. Slingerland¹, Alan Jay Smith²•Institutions (2)

Apple Inc.¹, University of California, Berkeley²

17 Jun 2001

TL;DR: This analysis examines the differences between multimedia and traditional applications in cache behavior and finds that multimedia applications actually exhibit lower instruction miss ratios and comparable data miss ratios when contrasted with other widely studied workloads.

...read moreread less

Abstract: The caching behavior of multimedia applications has been described as having high instruction reference locality within small loops, very large working sets, and poor data cache performance due to non-locality of data references. Despite this, there is no published research deriving or measuring these qualities. Utilizing the previously developed Berkeley Multimedia Workload, we present the results of execution driven cache simulations with the goal of aiding future media processing architecture design. Our analysis examines the differences between multimedia and traditional applications in cache behavior. We find that multimedia applications actually exhibit lower instruction miss ratios and comparable data miss ratios when contrasted with other widely studied workloads. In addition, we find that longer data cache line sizes than are currently used would benefit multimedia processing.

...read moreread less

Patent•

System-on-a-chip with soft cache and systems and methods using the same

[...]

Gregory Allen North

30 Mar 2001

TL;DR: In this paper, a soft cache system compares tag bits of a virtual address with tag fields of a plurality of soft cache register entries, each entry associated with an index to a corresponding cache line in virtual memory.

...read moreread less

Abstract: A soft cache system compares tag bits of a virtual address with tag fields of a plurality of soft cache register entries, each entry associated with an index to a corresponding cache line in virtual memory. A cache line size for the cache line is programmable. When the tag bits of the virtual address match the tag field of one of the soft cache entries, the index from that entry is selected for generating a physical address. The physical address is generated using the selected index as an offset to a corresponding soft cache space in memory.

...read moreread less

Proceedings Article•DOI•

Using proxy cache relocation to accelerate Web browsing in wireless/mobile communications

[...]

Stathes Hadjiefthymiades¹, Lazaros Merakos¹•Institutions (1)

National and Kapodistrian University of Athens¹

01 Apr 2001

TL;DR: A cache management scheme is proposed, which involves the relocation of full caches to the most probable cells but also percentages of the caches to less likelyNeighborhoods, which demonstrates substantial benefits for the end user.

...read moreread less

Abstract: Mobile computing is considered of major importance to the computing industry for the forthcoming years due to the progress in the wireless communications area. A proxy-based architecture for accelerating Web browsing in cellular customer premises networks (CPN) is presented. Proxy caches, maintained in base stations, are constantly relocated to accompany the roaming user. A cache management scheme is proposed, which involves the relocation of full caches to the most probable cells but also percentages of the caches to less likely neighbors. Relocation is performed according to a movement prediction algorithm based on a learning automaton. The simulation of the scheme demonstrates substantial benefits for the end user.

...read moreread less

Patent•

Method and apparatus for software management of on-chip cache

[...]

Yu-Chung C. Liao, Peter A. Sandon, Howard Cheng, Peter Hsu

01 Aug 2001

TL;DR: In this paper, a microprocessor including a control unit and a cache connected with the control unit for storing data to be used by the control, wherein the cache is selectively configurable as either a single cache or as a partitioned cache having a locked cache portion and a normal cache portion.

...read moreread less

Abstract: A microprocessor including a control unit and a cache connected with the control unit for storing data to be used by the control, wherein the cache is selectively configurable as either a single cache or as a partitioned cache having a locked cache portion and a normal cache portion. The normal cache portion is controlled by a hardware implemented automatic replacement process. The locked cache portion is locked so that the automatic replacement process cannot modify the contents of the locked cache. An instruction is provided in the instruction set that enables software to selectively allocate lines in the locked cache portion to correspond to locations in an external memory, thereby enabling the locked cache portion to be completely managed by software.

...read moreread less

Collapse