
Showing papers on "Cache algorithms" published in 2005


Proceedings ArticleDOI
12 Feb 2005
TL;DR: Three performance models are proposed that predict the impact of cache sharing on co-scheduled threads and the most accurate model, the inductive probability model, achieves an average error of only 3.9%.
Abstract: This paper studies the impact of L2 cache sharing on threads that simultaneously share the cache, on a chip multi-processor (CMP) architecture. Cache sharing impacts threads nonuniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is the isolated L2 cache stack distance or circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses for each thread due to cache sharing. The models differ by their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation that implements a dual-core CMP architecture, on fourteen pairs of mostly SPEC benchmarks. The most accurate model, the inductive probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact is presented.
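The models take each thread's isolated stack distance profile as input. As a rough illustration of how such a profile can be turned into an extra-miss estimate (deliberately not the paper's models), the sketch below assumes the co-runner simply dilutes each reuse distance in proportion to the two threads' access rates; the profile format and dilution rule are invented for the example.

```python
# Illustrative sketch (not the paper's models): estimate a thread's extra
# shared-cache misses from its isolated stack distance profile, assuming the
# co-runner's interleaved accesses push each reuse further down the LRU stack
# in proportion to the two threads' access rates.

def isolated_misses(profile, assoc):
    """profile[d] = accesses with stack distance d; profile['inf'] = cold
    accesses. An access misses if its distance is >= the associativity."""
    return profile.get('inf', 0) + sum(n for d, n in profile.items()
                                       if d != 'inf' and d >= assoc)

def shared_misses_estimate(profile, assoc, own_rate, other_rate):
    """Assume the co-runner inserts other_rate/own_rate lines between each
    reuse, i.e. scale every reuse distance by (own_rate + other_rate)/own_rate
    (a simplistic dilution rule, used here only for illustration)."""
    scale = (own_rate + other_rate) / own_rate
    misses = profile.get('inf', 0)
    for d, n in profile.items():
        if d != 'inf' and d * scale >= assoc:
            misses += n
    return misses

# Example: a thread whose reuses mostly fit within 8 ways when run alone.
profile = {0: 500, 2: 300, 5: 150, 9: 40, 'inf': 10}
alone = isolated_misses(profile, assoc=8)
shared = shared_misses_estimate(profile, assoc=8, own_rate=1.0, other_rate=1.0)
print(alone, shared, shared - alone)   # 50 200 -> 150 extra misses
```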

543 citations


Patent
31 Oct 2005
TL;DR: In this paper, the edge DNS cache servers are published as the authoritative servers for customer domains instead of the origin server, and when a request for a DNS record results in a cache miss, the edge cache servers get the information from the origin servers and cache it for use in response to future requests.
Abstract: A distributed DNS network includes a central origin server that actually controls the zone, and edge DNS cache servers configured to cache the DNS content of the origin server. The edge DNS cache servers are published as the authoritative servers for customer domains instead of the origin server. When a request for a DNS record results in a cache miss, the edge DNS cache servers get the information from the origin server and cache it for use in response to future requests. Multiple edge DNS cache servers can be deployed at multiple locations. Since an unlimited number of edge DNS cache servers can be deployed, the system is highly scalable. The disclosed techniques protect against DoS attacks, as DNS requests are not made to the origin server directly.

370 citations


Journal ArticleDOI
01 May 2005
TL;DR: This paper presents a new cache management policy, victim replication, which combines the advantages of private and shared schemes, and shows that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks.
Abstract: In this paper, we consider tiled chip multiprocessors (CMP) where each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic schemes: 1) each slice is treated as a private L2 cache for the tile, or 2) all slices are treated as a single large L2 cache shared by all tiles. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, as each tile creates local copies of any line it touches. A shared L2 cache increases the effective cache capacity for shared data, but incurs long hit latencies when L2 data is on a remote tile. We present a new cache management policy, victim replication, which combines the advantages of private and shared schemes. Victim replication is a variant of the shared scheme which attempts to keep copies of local primary cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of both single-threaded and multi-threaded benchmarks running on an 8-processor tiled CMP. We show that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks, providing better overall performance than either private or shared schemes.
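For intuition, here is a small hedged sketch of the replication decision described above: on a primary-cache eviction, the local L2 slice tries to keep the victim by overwriting, in order of preference, an invalid entry, a global line with no sharers, or an existing replica, and otherwise drops it. The state encoding and priority order paraphrase the scheme; they are not the paper's implementation.

```python
# Minimal sketch of a victim-replication-style decision, assuming each L2
# slice tracks, per line, whether the entry is invalid, a replica made for
# some tile, or "global" (home) data with or without remote sharers.

INVALID, REPLICA, GLOBAL_UNSHARED, GLOBAL_SHARED = range(4)

def choose_replica_slot(local_slice_set):
    """local_slice_set: list of line states in one set of the local L2 slice.
    Return the index to overwrite with the L1 victim's replica, or None if
    the victim should simply be dropped (plain shared-cache behaviour)."""
    for wanted in (INVALID, GLOBAL_UNSHARED, REPLICA):
        for i, state in enumerate(local_slice_set):
            if state == wanted:
                return i
    return None  # only actively shared global lines here: do not replicate

# Example: prefer the unshared global line over an existing replica.
print(choose_replica_slot([GLOBAL_SHARED, GLOBAL_UNSHARED, REPLICA]))  # 1
print(choose_replica_slot([GLOBAL_SHARED, GLOBAL_SHARED]))             # None
```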

331 citations


Patent
07 Mar 2005
TL;DR: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies as mentioned in this paper, such as write-through, write-back, and read-look-ahead.
Abstract: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies. Cache policies include write-through, write-back, and read-look-ahead. Write-through and write-back policies may improve speed. Read-look-ahead cache allows more efficient use of the bus between the buffer cache and non-volatile memory. A session command allows data to be maintained in volatile memory by guaranteeing against power loss.

256 citations


Patent
13 Jun 2005
TL;DR: A peer-to-peer name resolution protocol (PNRP) is proposed in this paper, which allows resolution of names which are mapped onto the circular number space through a hash function.
Abstract: A serverless name resolution protocol ensures convergence despite the size of the network, without requiring an ever-increasing cache and with a reasonable number of hops. This convergence is ensured through a multi-level cache and a proactive cache initialization strategy. The multi-level cache is built based on a circular number space. Each level contains information from different levels of slivers of the circular space. A mechanism is included to add a level to the multi-level cache when the node determines that the last level is full. A peer-to-peer name resolution protocol (PNRP) includes a mechanism to allow resolution of names which are mapped onto the circular number space through a hash function. Further, the PNRP may also operate with the domain name system by providing each node with an identification consisting of a domain name service (DNS) component and a unique number.

228 citations


Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is demonstrated that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
Abstract: We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the spectrum of degrees of sharing: unshared, in which each processor has a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of two or four works best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.

218 citations


Journal ArticleDOI
C.W. Slayman1
TL;DR: In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets, and the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability are covered.
Abstract: As the size of the SRAM cache and DRAM memory grows in servers and workstations, cosmic-ray errors are becoming a major concern for systems designers and end users. Several techniques exist to detect and mitigate the occurrence of cosmic-ray upset, such as error detection, error correction, cache scrubbing, and array interleaving. This paper covers the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability. In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets.

205 citations


Journal ArticleDOI
01 May 2005
TL;DR: The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite, which translates into an average IPC improvement of 8%.
Abstract: As processor speeds increase and memory latency becomes more critical, intelligent design and management of secondary caches becomes increasingly important. The efficiency of current set-associative caches is reduced because programs exhibit a non-uniform distribution of memory accesses across different cache sets. We propose a technique to vary the associativity of a cache on a per-set basis in response to the demands of the program. By increasing the number of tag-store entries relative to the number of data lines, we achieve the performance benefit of global replacement while maintaining the constant hit latency of a set-associative cache. The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite. This translates into an average IPC improvement of 8%.
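The sketch below illustrates the decoupling the abstract describes: a set-associative tag store with twice as many tag entries as data lines, plus a globally managed data store. Global LRU stands in for the paper's reuse-based replacement, and the sizes and tag-victim choice are illustrative assumptions, not the proposed design.

```python
# Condensed sketch of the V-Way idea: more tag entries than data lines, with
# data lines drawn from one global pool under a global replacement policy.

from collections import OrderedDict

class VWaySketch:
    def __init__(self, num_sets=4, tag_ways=8, num_data=16):   # 2x tags vs data
        self.num_sets, self.tag_ways = num_sets, tag_ways
        self.tags = [dict() for _ in range(num_sets)]   # set -> {tag: data_idx}
        self.data_lru = OrderedDict()                   # data_idx -> (set, tag)
        self.free = list(range(num_data))

    def _release(self, data_idx):
        self.data_lru.pop(data_idx, None)
        self.free.append(data_idx)

    def access(self, addr):
        s, tag = addr % self.num_sets, addr // self.num_sets
        tset = self.tags[s]
        if tag in tset:
            self.data_lru.move_to_end(tset[tag])        # hit: touch global LRU
            return True
        if len(tset) >= self.tag_ways:                  # local tag-store victim
            self._release(tset.pop(next(iter(tset))))   # (arbitrary stand-in)
        if self.free:                                   # take a free data line...
            data_idx = self.free.pop()
        else:                                           # ...or evict globally
            data_idx, (vs, vtag) = self.data_lru.popitem(last=False)
            self.tags[vs].pop(vtag, None)
        tset[tag] = data_idx
        self.data_lru[data_idx] = (s, tag)
        return False                                    # miss

cache = VWaySketch()
print([cache.access(a) for a in (0, 4, 0)])             # [False, False, True]
```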

204 citations


Proceedings ArticleDOI
Matteo Frigo1, Volker Strumpen1
20 Jun 2005
TL;DR: This work presents a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods, and it exploits temporal locality optimally throughout the entire memory hierarchy.
Abstract: We present a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an "ideal cache" of size Z, our algorithm saves a factor of Θ(Z^{1/n}) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy.
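For readers who want the shape of the algorithm, below is a compact Python rendering of the 1D trapezoidal space-time recursion usually presented for cache-oblivious stencils: wide trapezoids are cut in space along a slope of -1, tall ones are cut in time. The 3-point averaging kernel, fixed boundaries, and two-buffer storage are illustrative assumptions rather than the paper's code.

```python
# Sketch of the 1D trapezoidal recursion behind cache-oblivious stencil codes.

def walk1(t0, t1, x0, dx0, x1, dx1, grids, n):
    """Traverse the trapezoid covering time steps [t0, t1) whose left/right
    space edges start at x0/x1 and move with slopes dx0/dx1 per time step."""
    dt = t1 - t0
    if dt == 1:
        cur, nxt = grids[t0 % 2], grids[(t0 + 1) % 2]
        for x in range(x0, x1):
            if x == 0 or x == n - 1:
                nxt[x] = cur[x]                      # fixed boundary values
            else:
                nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:
            # wide trapezoid: cut in space through the centre with slope -1
            xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
            walk1(t0, t1, x0, dx0, xm, -1, grids, n)
            walk1(t0, t1, xm, -1, x1, dx1, grids, n)
        else:
            # tall trapezoid: cut in time halfway up
            s = dt // 2
            walk1(t0, t0 + s, x0, dx0, x1, dx1, grids, n)
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1, grids, n)

# Advance a 32-point line by 16 time steps; grids[t % 2] holds time step t.
n, T = 32, 16
grids = [[float(i) for i in range(n)], [0.0] * n]
walk1(0, T, 0, 0, n, 0, grids, n)
result = grids[T % 2]
```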

196 citations


Patent
22 Nov 2005
TL;DR: A cache server includes a media serving engine capable of distributing media content and a cache engine, coupled to the media serving engine, that caches media content; a set of cache policies accessible by the cache engine defines its operation.
Abstract: A cache server includes a media serving engine that is capable of distributing media content. A cache engine is coupled to the media serving engine and capable of caching media content. A set of cache policies is accessible by the cache engine to define the operation of the cache engine. The cache server can be configured to operate as either a cache server or an origin server. The cache server also includes a data communication interface coupled to the cache engine and the media serving engine to allow the cache engine to receive media content across a network and to allow the media serving engine to distribute media content across the network. The cache policies include policies for distributing media content from the media server, policies for handling cache misses, and policies for prefetching media content.

165 citations


Journal ArticleDOI
TL;DR: A GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables and contains a more complete picture of cache miss history, which reduces stale table data, improving accuracy and reducing memory traffic.
Abstract: Over the past couple of decades, trends in both microarchitecture and underlying semiconductor technology have significantly reduced microprocessor clock periods. These trends have significantly increased relative main-memory latencies as measured in processor clock cycles. To avoid large performance losses caused by long memory access delays, microprocessors rely heavily on a hierarchy of cache memories. But cache memories are not always effective, either because they are not large enough to hold a program's working set, or because memory access patterns don't exhibit behavior that matches a cache memory's demand-driven, line-structured organization. To partially overcome cache memories' limitations, we organize data cache prefetch information in a new way: a GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables. It reduces stale table data, improving accuracy and reducing memory traffic. It contains a more complete picture of cache miss history and is smaller than conventional tables.
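As a concrete illustration of that structure, the sketch below implements a GHB-style, PC-localized stride prefetcher: an index table points at the most recent miss from each PC, and each buffer entry links to the previous miss from the same PC. The table sizes, three-entry history walk, and stride rule are illustrative choices, not the article's exact configuration.

```python
# Minimal GHB-style prefetcher sketch (PC-localized stride flavour).

from collections import namedtuple

Entry = namedtuple("Entry", "addr prev")     # prev: index of older entry, or None

class GHBPrefetcher:
    def __init__(self, ghb_size=256, depth=2):
        self.ghb = [None] * ghb_size         # circular global history buffer
        self.head = 0                        # total entries ever inserted
        self.index = {}                      # PC -> index of most recent entry
        self.depth = depth                   # prefetches issued per prediction

    def miss(self, pc, addr):
        """Record a cache miss; return a list of addresses to prefetch."""
        self.ghb[self.head % len(self.ghb)] = Entry(addr, self.index.get(pc))
        self.index[pc] = self.head
        self.head += 1
        # Walk the per-PC linked list to collect the last few miss addresses.
        history, idx = [], self.index[pc]
        while idx is not None and self.head - idx <= len(self.ghb) and len(history) < 3:
            entry = self.ghb[idx % len(self.ghb)]
            history.append(entry.addr)
            idx = entry.prev
        if len(history) < 3:
            return []
        d1, d2 = history[0] - history[1], history[1] - history[2]
        if d1 == d2 and d1 != 0:             # constant stride detected
            return [addr + d1 * (i + 1) for i in range(self.depth)]
        return []

pf = GHBPrefetcher()
for a in (0x1000, 0x1040, 0x1080):
    hints = pf.miss(pc=0x400123, addr=a)
print(hints)                                 # [4288, 4352], i.e. 0x10C0, 0x1100
```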

Patent
31 Jan 2005
TL;DR: Data-aware cache as discussed by the authors is a method and system directed to improve effectiveness and efficiency of cache and data management by differentiating data based on certain attributes associated with the data and reducing the bottleneck to storage.
Abstract: A method and system directed to improving the effectiveness and efficiency of cache and data management by differentiating data based on certain attributes associated with the data and reducing the bottleneck to storage. The data-aware cache differentiates and manages data using a state machine having certain states. The data-aware cache may use data pattern and traffic statistics to retain frequently used data in cache longer by transitioning it into Sticky or StickyDirty states. The data-aware cache may also use content or application related attributes to differentiate and retain certain data in cache longer. Further, the data-aware cache may provide cache status and statistics information to a data-aware data flow manager, thus assisting the data-aware data flow manager in determining which data to cache and which data to pipe directly through, or in switching cache policies dynamically, thus avoiding some of the overhead associated with caches. The data-aware cache may also place clean and dirty data in separate states, enabling more efficient cache mirroring and flushing, thus improving system reliability and performance.
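Purely as an illustration of the state-machine idea, the sketch below moves blocks between Clean, Dirty, Sticky, and StickyDirty states based on access counts; the transition triggers, threshold, and eviction preference are assumptions for the example, not the patent's definitions.

```python
# Hypothetical per-block state machine in the spirit of a "data-aware" cache.

CLEAN, DIRTY, STICKY, STICKY_DIRTY = "Clean", "Dirty", "Sticky", "StickyDirty"

def next_state(state, event, access_count, hot_threshold=4):
    """event is 'read', 'write', or 'flush'; access_count is how often the
    block has been touched recently (tracked elsewhere)."""
    hot = access_count >= hot_threshold            # frequently used block
    if event == "flush":                           # dirty data written back
        return STICKY if state in (STICKY, STICKY_DIRTY) else CLEAN
    if event == "write":
        return STICKY_DIRTY if hot or state in (STICKY, STICKY_DIRTY) else DIRTY
    # read
    if state in (DIRTY, STICKY_DIRTY):
        return STICKY_DIRTY if hot else state
    return STICKY if hot or state == STICKY else CLEAN

# Evict cold clean blocks first; retain sticky (frequently used) data longest.
EVICTION_ORDER = [CLEAN, DIRTY, STICKY, STICKY_DIRTY]

print(next_state(CLEAN, "write", access_count=1))   # Dirty
print(next_state(DIRTY, "read", access_count=5))    # StickyDirty
```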

Proceedings ArticleDOI
04 Apr 2005
TL;DR: A new attack against a software implementation of the Advanced Encryption Standard, aimed at flushing elements of the SBOX from the cache, thus inducing a cache miss during the encryption phase, which can be used to recover part of the secret key.
Abstract: This paper presents a new attack against a software implementation of the Advanced Encryption Standard. The attack aims at flushing elements of the SBOX from the cache, thus inducing a cache miss during the encryption phase. The power trace is then used to detect when the cache miss occurs; if the miss happens in the first round of AES, then the information can be used to recover part of the secret key. The attack has been simulated using the Wattch simulation framework and a simple software implementation of AES (using a single table for the SBOX). The attack can be easily extended to more sophisticated versions of AES with more than one table. Finally, we present a simple countermeasure which does not require randomization.

Proceedings Article
30 Aug 2005
TL;DR: A notion of XPath query/view answerability is described, which allows us to reduce tree operations to string operations for matching a query/view pair, and it is shown how to store and maintain the cached views in relational tables, so that cache lookup is very efficient.
Abstract: In this paper, we propose a method for maintaining a semantic cache of materialized XPath views. The cached views include queries that have been previously asked, and additional selected views. The cache can be stored inside or outside the database. We describe a notion of XPath query/view answerability, which allows us to reduce tree operations to string operations for matching a query/view pair. We show how to store and maintain the cached views in relational tables, so that cache lookup is very efficient. We also describe a technique for view selection, given a warm-up workload. We experimentally demonstrate the efficiency of our caching techniques, and performance gains obtained by employing such a cache.

Journal ArticleDOI
TL;DR: An offline energy-optimal cache replacement algorithm using dynamic programming, which minimizes the disk energy consumption and an offline power-aware greedy algorithm that is more energy-efficient than Belady's offline algorithm (which minimizes cache misses only).
Abstract: Reducing energy consumption is an important issue for data centers. Among the various components of a data center, storage is one of the biggest energy consumers. Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time taken to spin down and spin up. This significantly limits the effectiveness of disk power management schemes. This article proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy. More specifically, we present an offline energy-optimal cache replacement algorithm using dynamic programming, which minimizes the disk energy consumption. We also present an offline power-aware greedy algorithm that is more energy-efficient than Belady's offline algorithm (which minimizes cache misses only). We also propose two online power-aware algorithms, PA-LRU and PB-LRU. Simulation results with both a real system and synthetic workloads show that, compared to LRU, our online algorithms can save up to 22 percent more disk energy and provide up to 64 percent better average response time. We have also investigated the effects of four storage cache write policies on disk energy consumption.

Proceedings Article
Binny S. Gill1, Dharmendra S. Modha1
10 Apr 2005
TL;DR: A self-tuning, low overhead, simple to implement, locally adaptive, novel cache management policy SARC is designed that dynamically and adaptively partitions the cache space amongst sequential and random streams so as to reduce the read misses.
Abstract: Sequentiality of reference is a ubiquitous access pattern dating back at least to Multics. Sequential workloads lend themselves to highly accurate prediction and prefetching. In spite of the simplicity of the workload, the design and analysis of a good sequential prefetching algorithm and associated cache replacement policy turn out to be surprisingly intricate. As a first contribution, we uncover and remedy an anomaly (akin to the famous Belady's anomaly) that plagues sequential prefetching when integrated with caching. Typical workloads contain a mix of sequential and random streams. As a second contribution, we design a self-tuning, low overhead, simple to implement, locally adaptive, novel cache management policy SARC that dynamically and adaptively partitions the cache space amongst sequential and random streams so as to reduce the read misses. As a third contribution, we implemented SARC along with two popular state-of-the-art LRU variants on hardware for IBM's flagship storage controller Shark. On Shark hardware with 8 GB cache and 16 RAID-5 arrays that is serving a workload akin to Storage Performance Council's widely adopted SPC-1 benchmark, SARC consistently and dramatically outperforms the two LRU variants, shifting the throughput-response time curve to the right and thus fundamentally increasing the capacity of the system. As anecdotal evidence, at the peak throughput, SARC has an average response time of 5.18ms as compared to 33.35ms and 8.92ms for the two LRU variants.
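The core of such a policy is deciding how much cache each stream type deserves. The hedged sketch below adapts a target size for the sequential list by comparing hit rates near the LRU end of the sequential and random lists; the window size, step, and utility estimate are illustrative stand-ins for SARC's actual adaptation rule.

```python
# Rough sketch of adaptive partitioning between a sequential (SEQ) and a
# random (RANDOM) LRU list: grow whichever list shows the higher marginal
# utility, estimated from hits near its LRU end. Constants are illustrative.

class SarcLikePartition:
    def __init__(self, cache_size, bottom_fraction=0.05, step=1):
        self.cache_size = cache_size
        self.desired_seq = cache_size // 2        # target size of the SEQ list
        self.window = max(1, int(cache_size * bottom_fraction))
        self.step = step
        self.seq_bottom_hits = 0
        self.rand_bottom_hits = 0

    def record_hit(self, which_list, depth_from_lru_end):
        """Call on every hit; depth_from_lru_end is the hit's distance from
        the LRU end of its list (0 = would have been evicted next)."""
        if depth_from_lru_end < self.window:
            if which_list == "seq":
                self.seq_bottom_hits += 1
            else:
                self.rand_bottom_hits += 1

    def adapt(self):
        """Call periodically: grow the list whose bottom is getting more hits."""
        if self.seq_bottom_hits > self.rand_bottom_hits:
            self.desired_seq = min(self.cache_size, self.desired_seq + self.step)
        elif self.rand_bottom_hits > self.seq_bottom_hits:
            self.desired_seq = max(0, self.desired_seq - self.step)
        self.seq_bottom_hits = self.rand_bottom_hits = 0
        return self.desired_seq   # replacement evicts from SEQ iff it exceeds this

part = SarcLikePartition(cache_size=1000)
part.record_hit("seq", depth_from_lru_end=10)     # sequential data barely survived
print(part.adapt())                               # 501: give SEQ a little more space
```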

Journal ArticleDOI
01 Jul 2005
TL;DR: An out-of-core multilevel minimization algorithm is designed and implemented, and its performance is tested on unstructured meshes composed of tens to hundreds of millions of triangles; the resulting layouts can significantly reduce the number of cache misses.
Abstract: We present a novel method for computing cache-oblivious layouts of large meshes that improve the performance of interactive visualization and geometric processing algorithms. Given that the mesh is accessed in a reasonably coherent manner, we assume no particular data access patterns or cache parameters of the memory hierarchy involved in the computation. Furthermore, our formulation extends directly to computing layouts of multi-resolution and bounding volume hierarchies of large meshes. We develop a simple and practical cache-oblivious metric for estimating cache misses. Computing a coherent mesh layout is reduced to a combinatorial optimization problem. We designed and implemented an out-of-core multilevel minimization algorithm and tested its performance on unstructured meshes composed of tens to hundreds of millions of triangles. Our layouts can significantly reduce the number of cache misses. We have observed 2-20 times speedups in view-dependent rendering, collision detection, and isocontour extraction without any modification of the algorithms or runtime applications.

Proceedings ArticleDOI
06 Jul 2005
TL;DR: A conservative polynomial algorithm is presented that extends real-time scheduling analysis to account for cache effects due to both the preempted and the preempting task, bounding the preemption delay for each task accurately.
Abstract: Accurate timing analysis is key to efficient embedded system synthesis and integration. Caches are needed to increase processor performance, but they are hard to use because of their complex behaviour, especially in preemptive scheduling. Current approaches use simplified assumptions or propose exponentially complex scheduling analysis algorithms to bound the cache-related preemption delay at a context switch. We present a conservative polynomial algorithm that extends real-time scheduling analysis to consider cache effects due to both the preempted and the preempting task when bounding the preemption delay. Dataflow analysis at the task level is combined with real-time scheduling analysis to determine the response time, including cache-related preemption delay, for each task accurately. The experiments show significant improvement in analysis precision over previous polynomial approaches for typical embedded benchmarks.
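A common way to bound this delay, shown in the sketch below, is to intersect the cache sets the preempted task can still reuse with the sets the preempting task may touch and charge one miss penalty per set in the intersection. This illustrates the idea of combining both tasks' cache footprints; it is not the paper's dataflow-based analysis, and the example numbers are hypothetical.

```python
# Set-intersection bound on cache-related preemption delay (CRPD): only cache
# sets that the preempted task may reuse ("useful" sets) and that the
# preempting task may touch ("evicting" sets) can add misses.

def crpd_bound(useful_sets, evicting_sets, miss_penalty_cycles):
    """useful_sets / evicting_sets: sets of cache-set indices from a per-task
    analysis; returns an upper bound on the delay of one preemption."""
    return len(useful_sets & evicting_sets) * miss_penalty_cycles

# Example with hypothetical analysis results for a 256-set cache:
ucb_preempted = {3, 4, 5, 17, 18, 200}      # sets the preempted task reuses
ecb_preempting = set(range(0, 32))          # sets the preempting task touches
print(crpd_bound(ucb_preempted, ecb_preempting, miss_penalty_cycles=30))  # 150
```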

Proceedings Article
10 Apr 2005
TL;DR: To use the L2 cache as efficiently as possible, an L2-conscious scheduling algorithm is proposed; using it, it is possible to reduce miss ratios in the L2 cache by 25-37% and improve processor throughput by 27-45%.
Abstract: We investigated how operating system design should be adapted for multithreaded chip multiprocessors (CMT) - a new generation of processors that exploit thread-level parallelism to mask the memory latency in modern workloads. We determined that the L2 cache is a critical shared resource on CMT and that an insufficient amount of L2 cache can undermine the ability to hide memory latency on these processors. To use the L2 cache as efficiently as possible, we propose an L2-conscious scheduling algorithm and quantify its performance potential. Using this algorithm it is possible to reduce miss ratios in the L2 cache by 25-37% and improve processor throughput by 27-45%.

Patent
Mason Cabot1
20 Dec 2005
TL;DR: In this article, a cache line eviction mechanism is proposed that enables programmers to mark portions of code with different cache priority levels based on anticipated or measured access patterns for those code portions.
Abstract: A method and apparatus to enable programmatic control of cache line eviction policies. A mechanism is provided that enables programmers to mark portions of code with different cache priority levels based on anticipated or measured access patterns for those code portions. Corresponding cues to assist in effecting the cache eviction policies associated with given priority levels are embedded in machine code generated from source- and/or assembly-level code. Cache architectures are provided that partition cache space into multiple pools, each pool being assigned a different priority. In response to execution of a memory access instruction, an appropriate cache pool is selected and searched based on information contained in the instruction's cue. On a cache miss, a cache line is selected from that pool to be evicted using a cache eviction policy associated with the pool. Implementations of the mechanism are described for both n-way set-associative caches and fully-associative caches.
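The sketch below illustrates the pool-partitioned behaviour described above: a hint carried with the access selects a pool, hits refresh that pool's LRU state, and evictions are confined to the selected pool. The pool layout, hint encoding, and per-pool LRU policy are assumptions for the example, not the patent's design.

```python
# Illustrative pool-partitioned cache with per-pool LRU eviction.

from collections import OrderedDict

class PooledCache:
    def __init__(self, pool_sizes):               # e.g. {0: 64, 1: 192} lines
        self.pools = {p: OrderedDict() for p in pool_sizes}
        self.sizes = dict(pool_sizes)

    def access(self, addr, priority_hint):
        pool = self.pools[priority_hint]
        if addr in pool:
            pool.move_to_end(addr)                # hit: refresh LRU position
            return True
        if len(pool) >= self.sizes[priority_hint]:
            pool.popitem(last=False)              # miss: evict within this pool only
        pool[addr] = True
        return False

cache = PooledCache({0: 2, 1: 4})                 # tiny pools for illustration
for a, hint in [(1, 0), (2, 0), (3, 0), (1, 0)]:
    hit = cache.access(a, hint)
print(hit)  # False: address 1 was already evicted from the small priority-0 pool
```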

Proceedings ArticleDOI
20 Mar 2005
TL;DR: A new method to accurately estimate the reliability of cache memories is presented and three different techniques are presented to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude.
Abstract: Cosmic-ray induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean-time-to-failure) of unprotected first-level (L1) caches for twenty programs taken from the SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache possesses an MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit). However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that for selected programs, a 64 KB first-level cache is more than 10 times as vulnerable to soft errors as a 16 KB cache memory. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability.
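The headline MTTF figure follows from simple FIT arithmetic, reproduced below under the assumption that every bit is equally vulnerable (no derating for masked errors).

```python
# Back-of-the-envelope MTTF for a 16 KB cache at 0.002 FIT/bit
# (FIT = failures per 10^9 device-hours).

bits = 16 * 1024 * 8                 # 131,072 bits in a 16 KB cache
fit_per_bit = 0.002
total_fit = bits * fit_per_bit       # ~262 failures per 10^9 hours
mttf_hours = 1e9 / total_fit
mttf_years = mttf_hours / (24 * 365)
print(round(mttf_years))             # ~435 years, consistent with "at least 400 years"
```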

Proceedings ArticleDOI
06 Jun 2005
TL;DR: StatCache is presented, a performance tool based on a statistical cache model that has a small run-time overhead while providing much of the flexibility of simulator-based tools and demonstrates how the flexibility can be used to better understand the characteristics of cache-related performance problems.
Abstract: Performance tools based on hardware counters can efficiently profile the cache behavior of an application and help software developers improve its cache utilization. Simulator-based tools can potentially provide more insights and flexibility and model many different cache configurations, but have the drawback of large run-time overhead. We present StatCache, a performance tool based on a statistical cache model. It has a small run-time overhead while providing much of the flexibility of simulator-based tools. A monitor process running in the background collects sparse memory access statistics about the analyzed application running natively on a host computer. Generic locality information is derived and presented in a code-centric and/or data-centric view. We evaluate the accuracy and performance of the tool using ten SPEC CPU2000 benchmarks. We also exemplify how the flexibility of the tool can be used to better understand the characteristics of cache-related performance problems.
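To give a flavour of what a statistical cache model looks like, the sketch below solves a fixed-point equation for the miss ratio from sampled reuse distances, assuming a random-replacement cache of L lines. It is a simplified stand-in for StatCache's actual model and sampling machinery.

```python
# Fixed-point estimate of miss ratio from sampled reuse distances (number of
# memory references between two uses of the same cache line), assuming a
# random-replacement cache with `cache_lines` lines.

def estimate_miss_ratio(reuse_distances, cache_lines, iters=100):
    # Probability that a line has been evicted after n intervening misses.
    evict = lambda n: 1.0 - (1.0 - 1.0 / cache_lines) ** n
    m = 0.5                                     # initial guess
    for _ in range(iters):
        # Each sampled reuse sees about m * r intervening misses on average.
        m = sum(evict(m * r) for r in reuse_distances) / len(reuse_distances)
    return m

# Example: mostly short reuses plus a tail of long ones, 4096-line cache.
samples = [10] * 80 + [100000] * 20
print(round(estimate_miss_ratio(samples, cache_lines=4096), 3))   # ~0.199
```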

Patent
Josh Sacks1
11 Aug 2005
TL;DR: In this paper, the authors proposed a cache-based approach to cache pre-computed map images (e.g., map tiles) to minimize the latency of a mapping application on high-latency and low-throughput networks.
Abstract: Techniques are disclosed that enable users to access and use digital mapping systems with constrained-resource services and/or mobile devices (e.g., cell phones and PDAs). In particular, latency of a mapping application on high-latency and low-throughput networks is minimized. One embodiment utilizes volatile and non-volatile storage of the mobile device to cache pre-computed map images (e.g., map tiles). An asynchronous cache can be used to prevent delays caused by potentially slow non-volatile storage. Meta-data about each map image and usage patterns can be stored and used by the cache to optimize hit rates.

Proceedings ArticleDOI
06 Jun 2005
TL;DR: This paper shows that kernel prefetching can have a significant impact on the relative performance in terms of the number of actual disk I/Os of many well-known replacement algorithms; it can not only narrow the performance gap but also change the relative performance benefits of different algorithms.
Abstract: A fundamental challenge in improving the file system performance is to design effective block replacement algorithms to minimize buffer cache misses. Despite the well-known interactions between prefetching and caching, almost all buffer cache replacement algorithms have been proposed and studied comparatively without taking into account file system prefetching which exists in all modern operating systems. This paper shows that such kernel prefetching can have a significant impact on the relative performance in terms of the number of actual disk I/Os of many well-known replacement algorithms; it can not only narrow the performance gap but also change the relative performance benefits of different algorithms. These results demonstrate the importance for buffer caching research to take file system prefetching into consideration.

Patent
22 Dec 2005
TL;DR: In this article, the authors present a method for run-time cache optimization based on profiling a program code during a runtime execution, logging the performance for producing a cache log, and rearranging a portion of program code in view of the cache log for producing rearranged portion.
Abstract: A method (400) and system (106) is provided for run-time cache optimization. The method includes profiling (402) the performance of program code during run-time execution, logging (408) the performance to produce a cache log, and rearranging (410) a portion of the program code in view of the cache log to produce a rearranged portion. The rearranged portion is supplied to a memory management unit (240) for managing at least one cache memory (110-140). The cache log can be collected during real-time operation of a communication device and is fed back to a linking process (244) to maximize cache locality at compile time. The method further includes loading a saved profile corresponding with a run-time operating mode, and reprogramming a new code image associated with the saved profile.

Patent
30 Sep 2005
TL;DR: In this paper, instruction-assisted cache management for efficient use of cache and memory is discussed, where hints (e.g., modifiers) are added to read and write memory access instructions to identify that a memory access is for temporal data.
Abstract: Instruction-assisted cache management for efficient use of cache and memory. Hints (e.g., modifiers) are added to read and write memory access instructions to identify that the memory access is for temporal data. In view of such hints, alternative cache and allocation policies are implemented that minimize cache and memory accesses. Under one policy, a write cache miss may result in a write of data to a partial cache line without a memory read/write cycle to fill the remainder of the line. Under another policy, a read cache miss may result in a read from memory without allocating or writing the read data to a cache line. A cache line soft-lock mechanism is also disclosed, wherein cache lines may be selectably soft-locked to indicate a preference for keeping those cache lines over non-locked lines.

Patent
30 Dec 2005
TL;DR: In this article, a method and system for providing granular timed invalidation of dynamically generated objects stored in a cache is presented, which can cache objects with expiry times down to very small intervals of time.
Abstract: The present invention is directed towards a method and system for providing granular timed invalidation of dynamically generated objects stored in a cache. The techniques of the present invention incorporate the ability to configure the expiration time of objects stored by the cache to fine granular time intervals, such as the granularity of time intervals provided by a packet processing timer of a packet processing engine. As such, the present invention can cache objects with expiry times down to very small intervals of time. This characteristic is referred to as "invalidation granularity." By providing this fine granularity in expiry time, the cache of the present invention can cache and serve objects that frequently change, sometimes even many times within a second. One technique is to leverage the packet processing timers used by the device of the present invention, which are able to operate at time increments on the order of milliseconds, to permit invalidation or expiry granularity down to 10 ms or less.
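In code, the essential behaviour is an object cache whose entries carry expiry times with millisecond resolution and are invalidated on the first access after expiry. The sketch below is illustrative only; the timer source and data structures are not the appliance's implementation.

```python
# Minimal object cache with millisecond-granularity expiry times.

import time

class TimedCache:
    def __init__(self):
        self.store = {}                       # key -> (value, expires_at)

    def put(self, key, value, ttl_ms):
        self.store[key] = (value, time.monotonic() + ttl_ms / 1000.0)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:    # invalidate on first access past expiry
            del self.store[key]
            return None
        return value

cache = TimedCache()
cache.put("/quotes/ACME", "42.17", ttl_ms=10)  # hypothetical object, valid for 10 ms
print(cache.get("/quotes/ACME"))               # "42.17" if read immediately
time.sleep(0.02)
print(cache.get("/quotes/ACME"))               # None: object has expired
```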

Journal ArticleDOI
TL;DR: This paper introduces IATAC (inter-access time per access count), a new hardware technique to reduce cache leakage for L2 caches that outperforms all previous state-of-the-art techniques.
Abstract: As technology evolves, power dissipation increases and cooling systems become more complex and expensive. There are two main sources of power dissipation in a processor: dynamic power and leakage. Dynamic power has been the most significant factor, but leakage will become increasingly significant in the future. It is predicted that leakage will shortly be the most significant cost, as it grows at about a 5× rate per generation. Thus, reducing leakage is essential for future processor design. Since large caches occupy most of the area, they are one of the leakiest structures in the chip and hence a main source of energy consumption for future processors. This paper introduces IATAC (inter-access time per access count), a new hardware technique to reduce cache leakage for L2 caches. IATAC dynamically adapts the cache size to the program requirements, turning off cache lines whose content is not likely to be reused. Our evaluation shows that this approach outperforms all previous state-of-the-art techniques. IATAC turns off 65% of the cache lines across different L2 cache configurations with a very small performance degradation of around 2%.
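The sketch below captures the decay-style intuition: a line is gated off once it has been idle for longer than a per-line prediction of its reuse interval. The way the threshold is derived from observed inter-access times here is an assumption for illustration, not IATAC's exact heuristic.

```python
# Illustrative leakage controller: turn off lines whose idle time exceeds a
# margin times their typical (estimated) inter-access interval.

class LeakageController:
    def __init__(self, margin=2.0, default_interval=10000):
        self.lines = {}        # line -> [last_access_cycle, avg_interval, accesses]
        self.margin = margin
        self.default = default_interval

    def on_access(self, line, now):
        last, avg, count = self.lines.get(line, [now, self.default, 0])
        if count > 0:
            avg = 0.5 * avg + 0.5 * (now - last)     # running inter-access time
        self.lines[line] = [now, avg, count + 1]

    def lines_to_turn_off(self, now):
        """Lines idle for longer than margin x their typical reuse interval."""
        victims = [l for l, (last, avg, _) in self.lines.items()
                   if now - last > self.margin * avg]
        for l in victims:
            del self.lines[l]                        # content is lost once gated off
        return victims

ctl = LeakageController()
for t in (0, 100, 200, 300):
    ctl.on_access("A", now=t)       # A shows a short reuse interval
ctl.on_access("B", now=50)          # B accessed once; no history yet
print(ctl.lines_to_turn_off(now=5000))   # ['A']: idle far longer than its reuse interval
```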

Journal Article
TL;DR: In this article, the complexity of finding an optimal placement of objects (or code) in the memory, in the sense that this placement reduces the number of cache misses during program execution to the minimum, was investigated.
Abstract: We investigate the complexity of finding an optimal placement of objects (or code) in memory, in the sense that this placement reduces the number of cache misses during program execution to the minimum. We show that this problem is one of the toughest amongst the interesting NP optimization problems in computer science. In particular, suppose one is given a sequence of memory accesses and one has to place the data in memory so as to minimize the number of cache misses for this sequence. We show that if P ≠ NP, then one cannot efficiently approximate the optimal solution even up to a very liberal approximation ratio. Thus, this problem joins the family of extremely inapproximable optimization problems. Two famous members of this family are minimum graph coloring and maximum clique. In light of this harsh lower bound, only mild approximation ratios can be obtained. We provide an algorithm that can map arbitrary access sequences within such a mild ratio. Next, we study the information loss when compressing the access sequence, keeping only pairwise relations. We show that the reduced information hides the optimal solution and highlights solutions that are far from optimal. Furthermore, we show that even if one restricts his attention to pairwise information, finding a good placement is computationally difficult.

Proceedings ArticleDOI
12 Feb 2005
TL;DR: This work proposes and analyzes a memory hierarchy that uses a unified compression scheme encompassing the last-level on-chip cache, the off-chip memory channel, and off-chip main memory, which achieves a peak improvement of 292%, compared to 165% and 83% for cache or bus compression alone.
Abstract: The memory system's large and growing contribution to system performance motivates more aggressive approaches to improving its efficiency. We propose and analyze a memory hierarchy that uses a unified compression scheme encompassing the last-level on-chip cache, the off-chip memory channel, and off-chip main memory. This scheme simultaneously increases the effective on-chip cache capacity, off-chip bandwidth, and main memory size, while avoiding compression and decompression overheads between levels. Simulations of the SPEC CPU2000 benchmarks using a 1MB cache and 128-byte blocks show an average speedup of 19%, while degrading performance by no more than 5%. The combined scheme achieves a peak improvement of 292%, compared to 165% and 83% for cache or bus compression alone. The compressed system generally provides even better performance as the block size is increased to 512 bytes.