Showing papers on "Cache coloring published in 2002"

PDF

Open Access

Proceedings Article•DOI•

An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

[...]

Changkyu Kim¹, Doug Burger¹, Stephen W. Keckler¹•Institutions (1)

01 Oct 2002

TL;DR: This paper proposes physical designs for these Non-Uniform Cache Architectures (NUCAs) and extends these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache.

...read moreread less

Abstract: Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This non-uniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provides fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy--while using less silicon area--by 13%, and comes within 13% of an ideal minimal hit latency solution.

...read moreread less

799 citations

Proceedings Article•DOI•

Scratchpad memory: a design alternative for cache on-chip memory in embedded systems

[...]

Rajeshwari M. Banakar¹, Stefan Steinke¹, Bo-Sik Lee¹, Mahesh Balakrishnan¹, Peter Marwedel¹ - Show less +1 more•Institutions (1)

Indian Institute of Technology Delhi¹

06 May 2002

TL;DR: The results clearly establish scratch pad memory as a low power alternative in most situations with an average energy reduction of 40% and the average area-time reduction for the scratchpad memory was 46% of the cache memory.

...read moreread less

Abstract: In this paper we address the problem of on-chip memory selection for computationally intensive applications, by proposing scratch pad memory as an alternative to cache. Area and energy for different scratch pad and cache sizes are computed using the CACTI tool while performance was evaluated using the trace results of the simulator. The target processor chosen for evaluation was AT91M40400. The results clearly establish scratehpad memory as a low power alternative in most situations with an average energy reducation of 40%. Further the average area-time reduction for the seratchpad memory was 46% of the cache memory.

...read moreread less

751 citations

Journal Article•DOI•

Hierarchical Web caching systems: modeling, design and experimental results

[...]

Hao Che, Ye Tung¹, Zhijun Wang²•Institutions (2)

University of South Alabama¹, University of Texas at Arlington²

07 Nov 2002-IEEE Journal on Selected Areas in Communications

TL;DR: Both model-based and real trace simulation studies show that the proposed cooperative architecture results in more than 50% memory saving and substantial central processing unit (CPU) power saving for the management and update of cache entries compared with the traditional uncooperative hierarchical caching architecture.

...read moreread less

Abstract: This paper aims at finding fundamental design principles for hierarchical Web caching. An analytical modeling technique is developed to characterize an uncooperative two-level hierarchical caching system where the least recently used (LRU) algorithm is locally run at each cache. With this modeling technique, we are able to identify a characteristic time for each cache, which plays a fundamental role in understanding the caching processes. In particular, a cache can be viewed roughly as a low-pass filter with its cutoff frequency equal to the inverse of the characteristic time. Documents with access frequencies lower than this cutoff frequency have good chances to pass through the cache without cache hits. This viewpoint enables us to take any branch of the cache tree as a tandem of low-pass filters at different cutoff frequencies, which further results in the finding of two fundamental design principles. Finally, to demonstrate how to use the principles to guide the caching algorithm design, we propose a cooperative hierarchical Web caching architecture based on these principles. Both model-based and real trace simulation studies show that the proposed cooperative architecture results in more than 50% memory saving and substantial central processing unit (CPU) power saving for the management and update of cache entries compared with the traditional uncooperative hierarchical caching architecture.

...read moreread less

512 citations

Proceedings Article•DOI•

A new memory monitoring scheme for memory-aware scheduling and partitioning

[...]

G.E. Suh¹, Srinivas Devadas¹, Larry Rudolph¹•Institutions (1)

Massachusetts Institute of Technology¹

02 Feb 2002

TL;DR: A scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy is described, which can be used to schedule jobs or to partition the cache to minimize the overall miss-rate.

...read moreread less

Abstract: We propose a low overhead, online memory monitoring scheme utilizing a set of novel hardware counters. The counters indicate the marginal gain in cache hits as the size of the cache is increased, which gives the cache miss-rate as a function of cache size. Using the counters, we describe a scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy. This information can be used to schedule jobs or to partition the cache to minimize the overall miss-rate. The data collected by the monitors can also be used by an analytical model of cache and memory behavior to produce a more accurate overall miss-rate for the collection of processes sharing a cache in both time and space. This overall miss-rate can be used to improve scheduling and partitioning schemes.

...read moreread less

325 citations

Report•DOI•

My Cache or Yours? Making Storage More Exclusive

[...]

Theodore M. Wong¹, John Wilkes²•Institutions (2)

Carnegie Mellon University¹, Hewlett-Packard²

10 Jun 2002

TL;DR: In this article, the authors explore the benefits of a simple scheme to achieve exclusive caching, in which a data block is cached at either a client or the disk array, but not both.

...read moreread less

Abstract: Modern high-end disk arrays often have several gigabytes of cache RAM. Unfortunately, most array caches use management policies which duplicate the same data blocks at both the client and array levels of the cache hierarchy: they are inclusive. Thus, the aggregate cache behaves as if it was only as big as the larger of the client and array caches, instead of as large as the sum of the two. Inclusiveness is wasteful: cache RAM is expensive. We explore the benefits of a simple scheme to achieve exclusive caching, in which a data block is cached at either a client or the disk array, but not both. Exclusiveness helps to create the effect of a single, large unified cache. We introduce a DEMOTE operation to transfer data ejected from the client to the array, and explore its effectiveness with simulation studies. We quantify the benefits and overheads of demotions across both synthetic and real-life workloads. The results show that we can obtain useful—sometimes substantial—speedups. During our investigation, we also developed some new cache-insertion algorithms that show promise for multiclient systems, and report on some of their properties.

...read moreread less

285 citations

Posted Content•

Theoretical Use of Cache Memory as a Cryptanalytic Side-Channel.

[...]

Daniel Page¹•Institutions (1)

University of Bristol¹

01 Jan 2002-IACR Cryptology ePrint Archive

TL;DR: In this article, the idea of cache memory being used as a side-channel which leaks information during the run of a cryptographic algorithm has been investigated, and it has been shown that an attacker may be able to reveal or narrow the possible values of secret information held on the target device.

...read moreread less

Abstract: We expand on the idea, proposed by Kelsey et al [?], of cache memory being used as a side-channel which leaks information during the run of a cryptographic algorithm By using this side-channel, an attacker may be able to reveal or narrow the possible values of secret information held on the target device We describe an attack which encrypts 2 chosen plaintexts on the target processor in order to collect cache profiles and then performs around 2 computational steps to recover the key As well as describing and simulating the theoretical attack, we discuss how hardware and algorithmic alterations can be used to defend against such techniques

...read moreread less

260 citations

Journal Article•DOI•

Timekeeping in the memory system: predicting and optimizing memory behavior

[...]

Zhigang Hu¹, Stefanos Kaxiras², Margaret Martonosi¹•Institutions (2)

Princeton University¹, Agere Systems²

01 May 2002

TL;DR: The extent to which detailed timing characteristics of past memory reference events are strongly predictive of future program reference behavior is shown, and a family of time-keeping techniques that optimize behavior based on observations about particular cache time durations, such as the cache access interval or the cache dead time are proposed.

...read moreread less

Abstract: Techniques for analyzing and improving memory referencing behavior continue to be important for achieving good overall program performance due to the ever-increasing performance gap between processors and main memory. This paper offers a fresh perspective on the problem of predicting and optimizing memory behavior. Namely, we show quantitatively the extent to which detailed timing characteristics of past memory reference events are strongly predictive of future program reference behavior. We propose a family of time-keeping techniques that optimize behavior based on observations about particular cache time durations, such as the cache access interval or the cache dead time. Timekeeping techniques can be used to build small simple, and high-accuracy (often 90% or more) predictors for identifying conflict misses, for predicting dead blocks, and even for estimating the time at which the next reference to a cache frame will occur and the address that will be accessed. Based on these predictors, we demonstrate two new and complementary time-based hardware structures: (1) a time-based victim cache that improves performance by only storing conflict miss lines with likely reuse, and (2) a time-based prefetching technique that hones in on the right address to prefetch, and the right time to schedule the prefetch. Our victim cache technique improves performance over previous proposals by better selections of what to place in the victim cache. Our prefetching technique outperforms similar prior hardware prefetching proposals, despite being orders of magnitude smaller. Overall, these techniques improve performance by more than 11% across the SPEC2000 benchmark suite.

...read moreread less

157 citations

Proceedings Article•DOI•

Optimal proxy cache allocation for efficient streaming media distribution

[...]

Bing Wang, Subhabrata Sen¹, Micah Adler², Don Towsley²•Institutions (2)

AT&T¹, University of Massachusetts Amherst²

07 Nov 2002

TL;DR: The problem of efficiently streaming a set of heterogeneous videos from a remote server through a proxy to multiple asynchronous clients so that they can experience playback with low startup delays is addressed.

...read moreread less

Abstract: In this paper, we address the problem of efficiently streaming a set of heterogeneous videos from a remote server through a proxy to multiple asynchronous clients so that they can experience playback with low startup delays. We develop a technique to analytically determine the optimal proxy prefix cache allocation to the videos that minimizes the aggregate network bandwidth cost. We integrate proxy caching with traditional server-based reactive transmission schemes such as batching, patching and stream merging to develop a set of proxy-assisted delivery schemes. We quantitatively explore the impact of the choice of transmission scheme, cache allocation policy, proxy cache size, and availability of unicast versus multicast capability, on the resultant transmission cost.. Our evaluations show that even a relatively small prefix cache (10%-20% of the video repository) is sufficient to realize substantial savings in transmission cost. We find that carefully designed proxy-assisted reactive transmission schemes can produce significant cost savings even in predominantly unicast environments such as the Internet.

...read moreread less

156 citations

Proceedings Article•DOI•

Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay

[...]

Se-Hyun Yang¹, Michael D. Powell¹, Babak Falsafi², T. N. Vijaykumar²•Institutions (2)

Carnegie Mellon University¹, Purdue University²

02 Feb 2002

TL;DR: A hybrid selective-sets-and-ways cache organization is proposed that always offers equal or better resizing granularity than both of previously proposed organizations, and the energy savings from resizing d-cache and i-cache together are investigated.

...read moreread less

Abstract: Cache memories account for a significant fraction of a chip's overall energy dissipation. Recent research advocates using "resizable" caches to exploit cache requirement variability in applications to reduce cache size and eliminate energy dissipation in the cache's unused sections with minimal impact on performance. Current proposals for resizable caches fundamentally vary in two design aspects: (1) cache organization, where one organization, referred to as selective-ways, varies the cache's set-associativity, while the other, referred to as selective-sets, varies the number of cache sets, and (2) resizing strategy, where one proposal statically sets the cache size prior to an application's execution, while the other allows for dynamic resizing both within and across applications. In this paper, we compare and contrast, for the first time, the proposed design choices for resizable caches, and evaluate the effectiveness of cache resizings in reducing the overall energy-delay in deep-submicron processors. In addition, we propose a hybrid selective-sets-and-ways cache organization that always offers equal or better resizing granularity than both of previously proposed organizations. We also investigate the energy savings from resizing d-cache and i-cache together to characterize the interaction between d-cache and i-cache resizings.

...read moreread less

150 citations

Patent•

Log-structured write cache for data storage devices and systems

[...]

Steven Robert Hetzler, Daniel F. Smith

27 Dec 2002

TL;DR: In this article, a log-structured write cache for a data storage system and method for improving the performance of the storage system is described, where cache lines where write data is temporarily accumulated in a nonvolatile state so that it can be sequentially written to the target storage locations at a later time.

...read moreread less

Abstract: A log-structured write cache for a data storage system and method for improving the performance of the storage system are described. The system might be a RAID storage array, a disk drive, an optical disk, or a tape storage system. The write cache is preferably implemented in the main storage medium of the system, but can also be provided in other storage components of the system. The write cache includes cache lines where write data is temporarily accumulated in a non-volatile state so that it can be sequentially written to the target storage locations at a later time, thereby improving the overall performance of the system. Meta-data for each cache line is also maintained in the write cache. The meta-data includes the target sector address for each sector in the line and a sequence number that indicates the order in which data is posted to the cache lines. A buffer table entry is provided for each cache line. A hash table is used to search the buffer table for a sector address that is needed at each data read and write operation.

...read moreread less

140 citations

Journal Article•DOI•

Proactive power-aware cache management for mobile computing systems

[...]

Guohong Cao¹•Institutions (1)

Pennsylvania State University¹

01 Jun 2002-IEEE Transactions on Computers

TL;DR: This paper proposes a proactive cache management scheme that not only improves the cache hit ratio, the throughput, and the bandwidth utilization, but also reduces the query delay and the power consumption.

...read moreread less

Abstract: Recent work has shown that invalidation report (IR)-based cache management is an attractive approach for mobile environments. However, the IR-based cache invalidation solution has some limitations, such as long query delay, low bandwidth utilization, and it is not suitable for applications where data change frequently. In this paper, we propose a proactive cache management scheme to address these issues. Instead of passively waiting, the clients intelligently prefetch the data that are most likely used in the future. Based on a novel prefetch-access ratio concept, the proposed scheme can dynamically optimize performance or power based on the available resources and the performance requirements. To deal with frequently updated data, different techniques (indexing and caching) are applied to handle different components of the data based on their update frequency. Detailed simulation experiments are carried out to evaluate the proposed methodology. Compared to previous schemes, our solution not only improves the cache hit ratio, the throughput, and the bandwidth utilization, but also reduces the query delay and the power consumption.

...read moreread less

Patent•

Disk drive maintaining a cache link attribute for each of a plurality of allocation states

[...]

Gregory B. Thelin¹, Ming Y. Wang¹•Institutions (1)

Western Digital¹

30 Sep 2002

TL;DR: In this paper, a disk drive is disclosed comprising a cache buffer for caching data written to the disk and data read from the disk, the cache buffer comprising a plurality of cache segments linked together to form a plurality cache links.

...read moreread less

Abstract: A disk drive is disclosed comprising a cache buffer for caching data written to the disk and data read from the disk, the cache buffer comprising a plurality of cache segments linked together to form a plurality of cache links. At least one segment attribute is associated with each cache segment, including an allocation state, and at least one link attribute is associated with the segment attributes within each cache link. When a host command is received from a host computer, the link attributes are evaluated to allocate cache segments for a cache link associated with the host command.

...read moreread less

Patent•

Centralized bounded domain caching control system for network edge servers

[...]

Stephen McHenry, David Veach, Paul Czarnik, Carl Schroeder, David Zink, Dan Koren, Neal Caldecott, Shari Trumbo-McHenry - Show less +4 more

06 Aug 2002

TL;DR: In this article, the cache content storage and replacement policies for a distributed plurality of network edge caches are centrally determined by a content selection server that executes a first process over a bounded content domain against a predefined set of domain content identifiers.

...read moreread less

Abstract: A network edge cache management system centrally determines cache content storage and replacement policies for a distributed plurality of network edge caches. The management system includes a content selection server that executes a first process over a bounded content domain against a predefined set of domain content identifiers to produce a meta-content description of the bounded content domain, a second process against the meta-content description to define a plurality of content groups representing respective content sub-sets of the bounded content domain, a third process to associate respective sets of predetermined cache management attributes with the plurality of content groups, and a fourth process to generate a plurality of cache control rule bases selectively storing identifications of the plurality of content groups and corresponding associated sets of the predetermined cache management attributes. The cache control rule bases are distributed to the plurality of network edge cache servers.

...read moreread less

Journal Article•DOI•

Data page layouts for relational databases on deep memory hierarchies

[...]

Anastassia Ailamaki¹, David J. DeWitt², Mark D. Hill²•Institutions (2)

Carnegie Mellon University¹, University of Wisconsin-Madison²

11 Nov 2002

TL;DR: This paper proposes a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page, and shows that PAX performs well across different memory system designs.

...read moreread less

Abstract: Relational database systems have traditionally optimized for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance is becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results (which were obtained without using any indices on the participating relations), when compared to NSM: (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM's stall time due to data cache accesses; (b) range selection queries and updates on memory-resident relations execute 1725% faster; and (c) TPC-H queries involving I/O execute 1148% faster. Finally, we show that PAX performs well across different memory system designs.

...read moreread less

Proceedings Article•DOI•

Fractal prefetching B+-Trees: optimizing both cache and disk performance

[...]

Shimin Chen¹, Phillip B. Gibbons², Todd C. Mowry¹, Gary Valentin³•Institutions (3)

Carnegie Mellon University¹, Bell Labs², IBM³

03 Jun 2002

TL;DR: Fractal prefetching B+-Trees (fpB+Trees) as discussed by the authors embeds cache-optimized trees within disk optimized trees, in order to optimize both cache and I/O performance.

...read moreread less

Abstract: B+-Trees have been traditionally optimized for I/O performance with disk pages as tree nodes. Recently, researchers have proposed new types of B+-Trees optimized for CPU cache performance in main memory environments, where the tree node sizes are one or a few cache lines. Unfortunately, due primarily to this large discrepancy in optimal node sizes, existing disk-optimized B+-Trees suffer from poor cache performance while cache-optimized B+-Trees exhibit poor disk performance. In this paper, we propose fractal prefetching B+-Trees (fpB+-Trees), which embed "cache-optimized" trees within "disk-optimized" trees, in order to optimize both cache and I/O performance. We design and evaluate two approaches to breaking disk pages into cache-optimized nodes: disk-first and cache-first. These approaches are somewhat biased in favor of maximizing disk and cache performance, respectively, as demonstrated by our results. Both implementations of fpB+-Trees achieve dramatically better cache performance than disk-optimized B+-Trees: a factor of 1.1-1.8 improvement for search, up to a factor of 4.2 improvement for range scans, and up to a 20-fold improvement for updates, all without significant degradation of I/O performance. In addition, fpB+-Trees accelerate I/O performance for range scans by using jump-pointer arrays to prefetch leaf pages, thereby achieving a speed-up of 2.5-5 on IBM's DB2 Universal Database.

...read moreread less

Proceedings Article•DOI•

Pointer cache assisted prefetching

[...]

Jamison D. Collins¹, Suleyman Sair¹, Brad Calder¹, Dean M. Tullsen¹•Institutions (1)

University of California, San Diego¹

18 Nov 2002

TL;DR: This paper proposes the use of a pointer cache, which tracks pointer transitions, to aid prefetching, and examines using the pointer cache in a wide issue superscalar processor as a value predictor and to aidPrefetching when a chain of pointers is being traversed.

...read moreread less

Abstract: Data prefetching effectively reduces the negative effects of long load latencies on the performance of modern processors. Hardware prefetchers employ hardware structures to predict future memory addresses based on previous patterns. Thread-based prefetchers use portions of the actual program code to determine future load addresses for prefetching. This paper proposes the use of a pointer cache, which tracks pointer transitions, to aid prefetching. The pointer cache provides, for a given pointer's effective address, the base address of the object pointed to by the pointer. We examine using the pointer cache in a wide issue superscalar processor as a value predictor and to aid prefetching when a chain of pointers is being traversed. When a load misses in the L1 cache, but hits in the pointer cache, the first two cache blocks of the pointed to object are prefetched. In addition, the load's dependencies are broken by using the pointer cache hit as a value prediction. We also examine using the pointer cache to allow speculative precomputation to run farther ahead of the main thread of execution than in prior studies. Previously proposed thread-based prefetchers are limited in how far they can run ahead of the main thread when traversing a chain of recurrent dependent loads. When combined with the pointer cache, a speculative thread can make better progress ahead of the main thread, rapidly traversing data structures in the face of cache misses caused by pointer transitions.

...read moreread less

Patent•

Method and system for automatically updating content stored on servers connected by a network

[...]

Thomas E. Kee, Ryan C. Kearny, Donald Joseph DeCaprio, Christian D. Saether

25 Jan 2002

TL;DR: In this article, a system and computer implementable method for updating content on servers coupled to a network is described, which includes updating an origin server with a version of files used to provide content, retrieving data that indicates an action to be performed on one or more cache servers in conjunction with updating the origin server, and performing the action to update entries in the one/more cache servers.

...read moreread less

Abstract: A system and computer implementable method for updating content on servers coupled to a network. The method includes updating an origin server with a version of files used to provide content, retrieving data that indicates an action to be performed on one or more cache servers in conjunction with updating the origin server, and performing the action to update entries in the one or more cache servers. Each entry in each cache server is associated with a subset of the content on the origin server and may include an expiration field and/or a time to live field. An example of a subset of content to which a cache entry may be associated is a Web page. Cache servers are not required to poll origin servers to determine whether new content is available. Cache servers may be pre-populated using push or pull techniques.

...read moreread less

Patent•

Sharing a second tier cache memory in a multi-processor

[...]

Fred Gruner, David Hass, Ramesh Panwar, Nazar A. Zaidi

25 Mar 2002

TL;DR: In this article, the second tier cache memory includes a data ring interface and a snoop ring interface, which are coupled to the first-tier cache memory in a set of caches.

...read moreread less

Abstract: A set of cache memory includes a set of first tier cache memory and a second tier cache memory. In the set of first tier cache memory each first tier cache memory is coupled to a compute engine in a set of compute engines. The second tier cache memory is coupled to each first tier cache memory in the set of first tier cache memory. The second tier cache memory includes a data ring interface and a snoop ring interface.

...read moreread less

Patent•

Method and apparatus for shared cache coherency for a chip multiprocessor or multiprocessor system

[...]

Vladimir Pentkovski, Vivek Garg, Narayanan S. Iyer, Jagannath Keshava

23 Aug 2002

TL;DR: In this paper, the authors present a method and apparatus for shared cache coherency for a chip multiprocessor or a multi-core system. But they do not specify the cache lines themselves.

...read moreread less

Abstract: A method and apparatus for shared cache coherency for a chip multiprocessor or a multiprocessor system. In one embodiment, a multicore processor includes a plurality of processor cores, each having a private cache, and a shared cache. An internal snoop bus is coupled to each private cache and the shared cache to communicate data from each private cache to other private caches and the shared cache. In another embodiment, an apparatus includes a plurality of processor cores and a plurality of caches. One of the plurality of caches maintains cache lines in two different modified states. The first modified state indicates a most recent copy of a modified cache line, and the second modified state indicates a stale copy of the modified cache line.

...read moreread less

Patent•

Cache management system and method

[...]

David D'Orto, Neil Kenig, Peter Petersen, Gregory Pavlik

03 Dec 2002

TL;DR: In this article, a cache management system comprises a cache adapted store data corresponding to a data source and a cache manager adapted to access a set of rules to determine a frequency for automatically updating the data in the cache.

...read moreread less

Abstract: A cache management system comprises a cache adapted store data corresponding to a data source. The cache management system also comprises a cache manager adapted to access a set of rules to determine a frequency for automatically updating the data in the cache. The cache manager is also adapted to automatically communicate with the data source to update the data in the cache corresponding to the determined frequency.

...read moreread less

Proceedings Article•DOI•

The hardness of cache conscious data placement

[...]

Erez Petrank¹, Dror Rawitz¹•Institutions (1)

Technion – Israel Institute of Technology¹

01 Jan 2002

TL;DR: This work investigates the complexity of finding the optimal placement of objects (or code) in the memory, in the sense that this placement reduces the cache misses to the minimum, and shows that this problem is one of the toughest amongst the interesting algorithmic problems in computer science.

...read moreread less

Abstract: The growing gap between the speed of memory access and cache access has made cache misses an influential factor in program efficiency. Much effort has been spent recently on reducing the number of cache misses during program run. This effort includes wise rearranging of program code, cache-conscious data placement, and algorithmic modifications that improve the program cache behavior. In this work we investigate the complexity of finding the optimal placement of objects (or code) in the memory, in the sense that this placement reduces the cache misses to the minimum. We show that this problem is one of the toughest amongst the interesting algorithmic problems in computer science. In particular, suppose one is given a sequence of memory accesses and one has to place the data in the memory so as to minimize the number of cache misses for this sequence. We show that if P ≠ NP, then one cannot efficiently approximate the optimal solution even up to a very liberal approximation ratio. Thus, this problem joins the small family of extremely inapproximable optimization problems. The other two famous members in this family are minimum coloring and maximum clique.

...read moreread less

Patent•

Systems and methods for efficient memory allocation for streaming of multimedia files

[...]

Thomas Pinckney, Garry Kenneth Kessler, Christopher Provenzano, Benjamin Thomas

19 Apr 2002

TL;DR: In this article, a streaming delivery accelerator (SDA) receives content from a content provider, caches at least part of the content, forming a cache file, and streams the cache file to a user.

...read moreread less

Abstract: Systems and methods for streaming of multimedia files over a network are described. A streaming delivery accelerator (SDA) receives content from a content provider, caches at least part of the content, forming a cache file, and streams the cache file to a user. The described systems and methods are directed to separate (shred) the content into contiguous cache files suitable for streaming. The shredded cache files may have different transmission bit rates and/or different content, such as audio, text, etc. Checksums can migrate from the content file to the shredded cache files and between different network protocols without the need for recomputing the checksums.

...read moreread less

Proceedings Article•DOI•

Energy efficient Frequent Value data Cache design

[...]

Jun Yang¹, Rajiv Gupta²•Institutions (2)

University of California, Riverside¹, University of Arizona²

18 Nov 2002

TL;DR: This paper proposes the design of the Frequent Value Cache (FVC), a cache in which storing a frequent value requires few bits as they are stored in encoded form while all other values are storage in unencoded form using 32 bits.

...read moreread less

Abstract: Recent work has shown that a small number of distinct frequently occurring values often account for a large portion of memory accesses. In this paper we demonstrate how this frequent value phenomenon can be exploited in designing a cache that trades off performance with energy efficiency. We propose the design of the Frequent Value Cache (FVC) in which storing a frequent value requires few bits as they are stored in encoded form while all other values are stored in unencoded form using 32 bits. The data array is partitioned into two arrays such that if a frequent value is accessed only the first data array is accessed; otherwise an additional cycle is needed to access the second data array. Experiments with some of the SPEC95 benchmarks show that on an average a 64 Kb/64-value FVC provides 28.8% reduction in Ll cache energy and 3.38% increase in execution time delay over a conventional 64 Kb cache.

...read moreread less

Proceedings Article•DOI•

Leakage energy management in cache hierarchies

[...]

Lin Li¹, Ismail Kadayif¹, Yuh-Fang Tsai¹, Narayanan Vijaykrishnan¹, Mahmut Kandemir¹, Mary Jane Irwin¹, Anand Sivasubramaniam¹ - Show less +3 more•Institutions (1)

Pennsylvania State University¹

22 Sep 2002

TL;DR: This work presents several architectural techniques that exploit the data duplication across the different levels of cache hierarchy, and employs both state-preserving and state-destroying leakage control mechanisms to L2 subblocks when their data also exist in L1.

...read moreread less

Abstract: Energy management is important for a spectrum of systems ranging from high-performance architectures to low-end mobile and embedded devices. With the increasing number of transistors, smaller feature sizes, lower supply and threshold voltages, the focus on energy optimization is shifting from dynamic to leakage energy. Leakage energy is of particular concern in dense cache memories that form a major portion of the transistor budget. In this work, we present several architectural techniques that exploit the data duplication across the different levels of cache hierarchy. Specifically, we employ both state-preserving (data-retaining) and state-destroying leakage control mechanisms to L2 subblocks when their data also exist in L1. Using a set of media and array-dominated applications, we demonstrate the effectiveness of the proposed techniques through cycle-accurate simulation. We also compare our schemes with the previously proposed cache decay policy. This comparison indicates that one of our schemes generates competitive results with cache decay.

...read moreread less

Proceedings Article•DOI•

Bloom filtering cache misses for accurate data speculation and prefetching

[...]

Jih-Kwon Peir¹, Shih-Chang Lai², Shih-Lien Lu³, Jared Stark³, Konrad K. Lai³ - Show less +1 more•Institutions (3)

University of Florida¹, Oregon State University², Intel³

22 Jun 2002

TL;DR: This paper introduces a new hit/miss predictor that uses a Bloom Filter to identify cache misses early in the pipeline, which allows the processor to more accurately schedule instructions that are dependent on loads and to more precisely prefetch data into the cache.

...read moreread less

Abstract: A processor must know a load instruction's latency to schedule the load's dependent instructions at the correct time. Unfortunately, modern processors do not know this latency until well after the dependent instructions should have been scheduled to avoid pipeline bubbles between themselves and the load. One solution to this problem is to predict the load's latency, by predicting whether the load will hit or miss in the data cache. Existing cache hit/miss predictors, however, can only correctly predict about 50% of cache misses.This paper introduces a new hit/miss predictor that uses a Bloom Filter to identify cache misses early in the pipeline. This early identification of cache misses allows the processor to more accurately schedule instructions that are dependent on loads and to more precisely prefetch data into the cache. Simulations using a modified SimpleScalar model show that the proposed Bloom Filter is nearly perfect, with a prediction accuracy greater than 99% for the SPECint2000 benchmarks. IPC (Instructions Per Cycle) performance improved by 19% over a processor that delayed the scheduling of instructions dependent on a load until the load latency was known, and by 6% and 7% over a processor that always predicted a load would hit the cache and with a counter-based hit/miss predictor respectively. This IPC reaches 99.7% of the IPC of a processor with perfect scheduling.

...read moreread less

Journal Article•DOI•

On filter effects in web caching hierarchies

[...]

Carey Williamson¹•Institutions (1)

University of Calgary¹

01 Feb 2002-ACM Transactions on Internet Technology

TL;DR: The simulation results demonstrate that size-based partitioning and heterogeneous cache replacement policies each offer improvements in overall caching performance, and considers novel cache management techniques that can better exploit the changing workload characteristics across a multilevel Web proxy caching hierarchy.

...read moreread less

Abstract: This article studies the "filter effects" that occur in Web proxy caching hierarchies due to the presence of multiple levels of caches. That is, the presence of one level of cache changes the structural characteristics of the workload presented to the next level of cache, since only the requests that miss in one cache are forwarded to the next cache.Trace-driven simulations, with empirical and synthetic traces, are used to demonstrate the presence and magnitude of the filter effects in a multilevel Web proxy caching hierarchy. Experiments focus on the effects of cache size, cache replacement policy, Zipf slope, and the depth of the Web proxy caching hierarchy.Finally, the article considers novel cache management techniques that can better exploit the changing workload characteristics across a multilevel Web proxy caching hierarchy. Trace-driven simulations are used to evaluate the performance of these approaches. The simulation results demonstrate that size-based partitioning and heterogeneous cache replacement policies each offer improvements in overall caching performance. The sensitivity of the results to the degree of workload overlap among child-level proxy caches is also studied.

...read moreread less

Patent•

Tagging packets with a lookup key to facilitate usage of a unified packet forwarding cache

[...]

Kjeld Egevang¹, Niels Beier, Jacob Christensen¹•Institutions (1)

Intel¹

27 Sep 2002

TL;DR: In this article, the authors present an approach and methods for a Network Address Translation (NAT)-aware unified cache, which allows multiple packet processing applications distributed among one or more processors of a network device to share a unified cache without requiring a cache synchronization protocol.

...read moreread less

Abstract: Apparatus and methods are provided for a Network Address Translation (NAT)-aware unified cache. According to one embodiment, multiple packet-processing applications distributed among one or more processors of a network device share one or more unified caches without requiring a cache synchronization protocol. When a packet is received at the network device, a first packet-processing application, such as NAT or another application that modifies part of the packet header upon which a cache lookup key is based, tags the packet with a cache lookup key based upon the original contents of the packet header. Then, other packet-processing applications attempting to access the cache entry from the unified cache subsequent to the tagging by the first packet-processing application use the tag (the cache lookup key generated by the first packet-processing application) rather than determining the cache lookup key based upon the current contents of the packet header.

...read moreread less

Patent•

Scsi-to-ip cache storage device and method

[...]

Qing Yang, Xubin He

15 Aug 2002

TL;DR: A SCSI-to-IP cache storage system interconnects a host computing device or a storage unit to a switched packet network as mentioned in this paper, which includes a SCSI interface that facilitates system communications with the host computing devices or the storage unit, and an Ethernet interface that allows the system to receive data from and send data to the Internet.

...read moreread less

Abstract: A SCSI-to-IP cache storage system interconnects a host computing device or a storage unit to a switched packet network. The cache storage system includes a SCSI interface (40) that facilitates system communications with the host computing device or the storage unit, and an Ethernet interface (42) that allows the system to receive data from and send data to the Internet. The cache storage system further comprises a processing unit (44) that includes a processor (46), a memory (48) and a log disk (52) configured as a sequential access device. The log disk (52) caches data along with the memory (48) resident in the processing unit (44), wherein the log disk (52) and the memory (48) are configured as a two-level hierarchical cache.

...read moreread less

Patent•

Computer apparatus and method for caching results of a database query

[...]

Jeremy A. Arnold¹, Eric L. Barsness¹, Richard D. Dettinger¹, John M. Santosuosso¹•Institutions (1)

IBM¹

18 Apr 2002

TL;DR: In this paper, a query processor caches data retrieved from executing prepared statements, and uses the cached data for subsequent accesses to the data, if certain conditions for using cached data are met.

...read moreread less

Abstract: A query processor caches data retrieved from executing prepared statements, and uses the cached data for subsequent accesses to the data, if certain conditions for using the cached data are met. The preferred embodiments also include a data staleness handler that takes care of issues that arise from data that may have changed in the database but is not reflected in the cache. One way to handle data staleness in the cache is to specifically enable or disable caching in a query. If caching is disabled, the query processor will access the data in the database. Another way to handle data staleness in the cache is to provide a timer that causes the cache to be invalidated when the timer times out. Yet another way to handle data staleness in the cache is to provide specified conditions that must be met for caching to occur, such as time or date limitations. Still another way to handle data staleness in the cache is to provide an update trigger for the data in the database that corresponds to the cached data. When the data in the database is updated, the update trigger fires, causing the cache to be invalidated. Note that invalidating the cache could also be followed by automatically updating the cache. By caching the results of processing a prepared statement, other queries that use the same prepared statement may be able to access data in the cache instead of going to the database.

...read moreread less

Proceedings Article•DOI•

Let's study whole-program cache behaviour analytically

[...]

Xavier Vera¹, Jingling Xue²•Institutions (2)

Mälardalen University College¹, University of New South Wales²

02 Feb 2002

TL;DR: Based on a new characterisation of data reuse across multiple loop nests, a method is preset, a prototyping implementation and some experimental results for analysing the cache behaviour of whole programs with regular computations and can be used to guide compiler locality optimisations and improve cache simulation performance.

...read moreread less

Abstract: Based on a new characterisation of data reuse across multiple loop nests, we preset a method, a prototyping implementation and some experimental results for analysing the cache behaviour of whole programs with regular computations. Validation against cache simulation using real codes shows the efficiency and accuracy of our method. The largest program, we have analysed, Applu from SPECfP95, has 3868 lines, 16 subroutines and 2565 references. In the case of a 32KB cache with a 32B line size, our method obtains the miss ratio with an absolute error of about 0.80% in about 128 seconds while the simulator used runs for nearly 5 hours on a 933MHz Pentium. III PC. Our method can be used to guide compiler locality optimisations and improve cache simulation performance.

...read moreread less

Collapse