
Showing papers on "Cache coloring" published in 2005


Proceedings ArticleDOI
12 Feb 2005
TL;DR: Three performance models are proposed that predict the impact of cache sharing on co-scheduled threads and the most accurate model, the inductive probability model, achieves an average error of only 3.9%.
Abstract: This paper studies the impact of L2 cache sharing on threads that simultaneously share the cache, on a chip multi-processor (CMP) architecture. Cache sharing impacts threads nonuniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is the isolated L2 cache stack distance or circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses for each thread due to cache sharing. The models differ by their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation that implements a dual-core CMP architecture, on fourteen pairs of mostly SPEC benchmarks. The most accurate model, the inductive probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact is presented.
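
The simplest style of model described above lends itself to a compact illustration. The hypothetical Python sketch below mimics a frequency-of-access style split: each thread is granted a share of the cache proportional to its access count, and its isolated stack-distance profile is re-read at that reduced size to estimate extra misses. The profile format, the proportional split, and the way-granular shares are illustrative assumptions, not the paper's exact formulation (its most accurate model is the inductive probability model).

    # Hedged sketch: estimate extra L2 misses under cache sharing from each
    # thread's isolated stack-distance profile (hit counts at LRU stack
    # positions 0..A-1, plus misses beyond), using a simple frequency-of-access
    # split of the shared cache. Profile format and heuristic are assumptions.

    def misses_with_ways(profile, ways):
        """Misses if the thread only had `ways` LRU positions available."""
        hits_kept = sum(profile["hits_per_position"][:ways])
        total = sum(profile["hits_per_position"]) + profile["misses"]
        return total - hits_kept

    def extra_misses_when_shared(profiles, total_ways):
        """profiles: list of {'hits_per_position': [...], 'misses': n, 'accesses': n}."""
        total_accesses = sum(p["accesses"] for p in profiles)
        extra = []
        for p in profiles:
            share = max(1, round(total_ways * p["accesses"] / total_accesses))
            alone = p["misses"]                      # isolated misses (whole cache available)
            shared = misses_with_ways(p, share)      # misses with its share only
            extra.append(shared - alone)
        return extra

    # Example with two made-up thread profiles competing for an 8-way cache:
    t0 = {"hits_per_position": [50, 30, 10, 5, 3, 1, 1, 0], "misses": 10, "accesses": 110}
    t1 = {"hits_per_position": [200, 5, 2, 1, 0, 0, 0, 0], "misses": 4, "accesses": 212}
    print(extra_misses_when_shared([t0, t1], total_ways=8))   # [10, 0]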

543 citations


Journal ArticleDOI
01 May 2005
TL;DR: This paper presents a new cache management policy, victim replication, which combines the advantages of private and shared schemes, and shows that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks.
Abstract: In this paper, we consider tiled chip multiprocessors (CMP) where each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic schemes: 1) each slice is treated as a private L2 cache for the tile; 2) all slices are treated as a single large L2 cache shared by all tiles. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, as each tile creates local copies of any line it touches. A shared L2 cache increases the effective cache capacity for shared data, but incurs long hit latencies when L2 data is on a remote tile. We present a new cache management policy, victim replication, which combines the advantages of private and shared schemes. Victim replication is a variant of the shared scheme which attempts to keep copies of local primary cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of both single-threaded and multi-threaded benchmarks running on an 8-processor tiled CMP. We show that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks, providing better overall performance than either private or shared schemes.
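
A minimal sketch of the replication decision is given below, assuming a simple priority order for finding room in the local L2 slice (an invalid entry, then an existing replica, then an unshared home line); the field names and the exact priority order are illustrative assumptions rather than the paper's precise policy.

    # Hedged sketch of a victim-replication decision: when a line is evicted
    # from the local L1, try to keep a replica in the local L2 slice rather
    # than only in its home slice. The victim-selection priority below is an
    # illustrative assumption, not necessarily the paper's exact rule.

    def try_replicate(local_slice_set, victim_line):
        """local_slice_set: list of cache-line records in the candidate L2 set."""
        def pick(pred):
            for line in local_slice_set:
                if pred(line):
                    return line
            return None

        # Prefer making room without displacing useful shared home data:
        slot = (pick(lambda l: not l["valid"])                        # 1) invalid entry
                or pick(lambda l: l["is_replica"])                    # 2) an existing replica
                or pick(lambda l: l["is_home"] and not l["shared"]))  # 3) unshared home line
        if slot is None:
            return False          # give up: never displace shared home data
        slot.update(valid=True, is_replica=True, is_home=False, shared=False,
                    tag=victim_line["tag"], data=victim_line["data"])
        return True

    ways = [{"valid": True, "is_replica": False, "is_home": True, "shared": True, "tag": 0x1, "data": b""},
            {"valid": False, "is_replica": False, "is_home": False, "shared": False, "tag": 0x0, "data": b""}]
    print(try_replicate(ways, {"tag": 0x7f, "data": b"victim"}))      # True: reuses the invalid way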

331 citations


Patent
07 Mar 2005
TL;DR: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies as mentioned in this paper, such as write-through, write and read-look-ahead.
Abstract: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies. Cache policies include write-through, write-back, and read-look-ahead. Write-through and write-back policies may improve speed. Read-look-ahead cache allows more efficient use of the bus between the buffer cache and non-volatile memory. A session command allows data to be maintained in volatile memory by guaranteeing against power loss.

256 citations


Patent
13 Jun 2005
TL;DR: A peer-to-peer name resolution protocol (PNRP) is proposed in this paper, which allows resolution of names which are mapped onto the circular number space through a hash function.
Abstract: A serverless name resolution protocol ensures convergence despite the size of the network, without requiring an ever-increasing cache and with a reasonable number of hops. This convergence is ensured through a multi-level cache and a proactive cache initialization strategy. The multi-level cache is built based on a circular number space. Each level contains information from different levels of slivers of the circular space. A mechanism is included to add a level to the multi-level cache when the node determines that the last level is full. A peer-to-peer name resolution protocol (PNRP) includes a mechanism to allow resolution of names which are mapped onto the circular number space through a hash function. Further, the PNRP may also operate with the domain name system by providing each node with an identification consisting of a domain name service (DNS) component and a unique number.

228 citations


Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is demonstrated that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
Abstract: We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the spectrum of degrees of sharing: unshared, in which each processor has a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of two or four works best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
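
The sketch below illustrates what a "sharing degree" means for bank mapping in such an organization: tiles are grouped into clusters of S sharers, and a line's home bank is chosen within that cluster's portion of the 256 banks. The modulo hash and the cluster layout are illustrative assumptions, not the paper's specific mapping policies.

    # Hedged sketch: with 16 processors and 256 L2 banks, a sharing degree S
    # groups the tiles into 16/S clusters, each sharing its own pool of
    # 256/(16/S) = 16*S banks. The mapping below is an illustrative assumption.

    NUM_PROCS, NUM_BANKS, LINE_BYTES = 16, 256, 64

    def home_bank(addr, proc_id, sharing_degree):
        clusters = NUM_PROCS // sharing_degree       # e.g. S=4 -> 4 clusters
        banks_per_cluster = NUM_BANKS // clusters    # e.g. 64 banks per cluster
        cluster = proc_id // sharing_degree          # which cluster this processor is in
        line = addr // LINE_BYTES
        return cluster * banks_per_cluster + (line % banks_per_cluster)

    # S=1: each processor has a private 16-bank pool; S=16: all 256 banks shared.
    for s in (1, 2, 4, 16):
        print(s, home_bank(0x12345, proc_id=5, sharing_degree=s))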

218 citations


Journal ArticleDOI
C.W. Slayman1
TL;DR: In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets, and the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability are covered.
Abstract: As the size of the SRAM cache and DRAM memory grows in servers and workstations, cosmic-ray errors are becoming a major concern for systems designers and end users. Several techniques exist to detect and mitigate the occurrence of cosmic-ray upset, such as error detection, error correction, cache scrubbing, and array interleaving. This paper covers the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability. In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets.

205 citations


Journal ArticleDOI
01 May 2005
TL;DR: The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite, which translates into an average IPC improvement of 8%.
Abstract: As processor speeds increase and memory latency becomes more critical, intelligent design and management of secondary caches becomes increasingly important. The efficiency of current set-associative caches is reduced because programs exhibit a non-uniform distribution of memory accesses across different cache sets. We propose a technique to vary the associativity of a cache on a per-set basis in response to the demands of the program. By increasing the number of tag-store entries relative to the number of data lines, we achieve the performance benefit of global replacement while maintaining the constant hit latency of a set-associative cache. The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite. This translates into an average IPC improvement of 8%.
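
A toy model of the idea is sketched below: each set holds twice as many tag entries as the average number of data lines per set, valid tags carry forward pointers into one global data array, and a reuse-counter sweep approximates the global replacement. Structure sizes, field names, and the sweep are illustrative assumptions rather than the paper's exact design.

    # Hedged sketch of a V-Way-style cache: extra tag entries per set plus a
    # decoupled, globally managed data array. Not the paper's exact mechanism.

    class VWayCache:
        def __init__(self, num_sets=4, tags_per_set=8, num_data=16):
            self.sets = [[None] * tags_per_set for _ in range(num_sets)]   # tag entries
            self.data = [{"reuse": 0, "back": None} for _ in range(num_data)]
            self.num_sets, self.ptr = num_sets, 0

        def _global_victim(self):
            # Clock-style sweep: take the first line with reuse count 0,
            # decrementing non-zero counts along the way.
            while True:
                d = self.data[self.ptr]
                self.ptr = (self.ptr + 1) % len(self.data)
                if d["reuse"] == 0:
                    return d
                d["reuse"] -= 1

        def access(self, addr):
            s, tag = addr % self.num_sets, addr // self.num_sets
            entries = self.sets[s]
            for e in entries:
                if e is not None and e["tag"] == tag:
                    e["data"]["reuse"] = min(3, e["data"]["reuse"] + 1)
                    return True                          # hit
            free = next((i for i, e in enumerate(entries) if e is None), None)
            if free is not None:
                dline = self._global_victim()            # global replacement
                if dline["back"] is not None:            # unlink its old tag entry
                    oset, oidx = dline["back"]
                    self.sets[oset][oidx] = None
                idx = free
            else:
                idx = 0                                  # set conflict: local eviction
                dline = entries[idx]["data"]
            dline.update(reuse=0, back=(s, idx))
            entries[idx] = {"tag": tag, "data": dline}
            return False

    c = VWayCache()
    print([c.access(a) for a in (0, 4, 0, 8, 0)])        # [False, False, True, False, True]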

204 citations


Proceedings ArticleDOI
Matteo Frigo1, Volker Strumpen1
20 Jun 2005
TL;DR: This work presents a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods, and it exploits temporal locality optimally throughout the entire memory hierarchy.
Abstract: We present a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an "ideal cache" of size Z, our algorithm saves a factor of Θ(Z^(1/n)) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy.
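
For the one-dimensional case the recursion can be written very compactly. The sketch below follows the well-known trapezoidal space-time decomposition (cut space when the region is wide, cut time when it is tall) for a 3-point stencil with unit slope and fixed boundary cells; the kernel, the boundary handling, and the driver are illustrative assumptions.

    # Hedged sketch: 1-D trapezoidal space-time recursion in the spirit of the
    # cache-oblivious stencil algorithm (3-point stencil, slope 1, fixed ends).

    N, T = 32, 16
    u = [[0.0] * N for _ in range(2)]
    u[0][N // 2] = 1.0                       # initial condition: a single spike

    def kernel(t, x):
        cur, nxt = u[t % 2], u[(t + 1) % 2]
        nxt[x] = 0.25 * cur[x - 1] + 0.5 * cur[x] + 0.25 * cur[x + 1]

    def walk(t0, t1, x0, dx0, x1, dx1):
        """Process the trapezoid with bottom edge [x0, x1) at time t0, whose
        left and right edges move by dx0 and dx1 cells per time step."""
        dt = t1 - t0
        if dt == 1:
            for x in range(x0, x1):
                kernel(t0, x)
        elif dt > 1:
            if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:     # wide: cut space
                xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
                walk(t0, t1, x0, dx0, xm, -1)
                walk(t0, t1, xm, -1, x1, dx1)
            else:                                               # tall: cut time
                s = dt // 2
                walk(t0, t0 + s, x0, dx0, x1, dx1)
                walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)

    walk(0, T, 1, 0, N - 1, 0)               # update interior cells only; ends stay fixed
    print(sum(u[T % 2][1:N - 1]))            # total mass stays close to 1.0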

196 citations


Patent
22 Nov 2005
TL;DR: In this paper, a cache server is coupled with a media serving engine that is capable of caching media content, and a set of cache policies is accessible by the cache engine to define the operation of the cache.
Abstract: A cache server includes a media serving engine that is capable of distributing media content. A cache engine is coupled to the media serving engine and capable of caching media content. A set of cache policies is accessible by the cache engine to define the operation of the cache engine. The cache server can be configured to operate as either a cache server or an origin server. The cache server also includes a data communication interface coupled to the cache engine and the media serving engine to allow the cache engine to receive media content across a network and to allow the media serving engine to distribute media content across the network. The cache policies include policies for distributing media content from the media server, policies for handling cache misses, and policies for prefetching media content.

165 citations


Journal ArticleDOI
TL;DR: A GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables and contains a more complete picture of cache miss history, which reduces stale table data, improving accuracy and reducing memory traffic.
Abstract: Over the past couple of decades, trends in both microarchitecture and underlying semiconductor technology have significantly reduced microprocessor clock periods. These trends have significantly increased relative main-memory latencies as measured in processor clock cycles. To avoid large performance losses caused by long memory access delays, microprocessors rely heavily on a hierarchy of cache memories. But cache memories are not always effective, either because they are not large enough to hold a program's working set, or because memory access patterns don't exhibit behavior that matches a cache memory's demand-driven, line-structured organization. To partially overcome cache memories' limitations, we organize data cache prefetch information in a new way: a GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables. It reduces stale table data, improving accuracy and reducing memory traffic. It contains a more complete picture of cache miss history and is smaller than conventional tables.
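
A minimal sketch of the structure is shown below: a small index table keyed by the missing load's PC points at that PC's newest entry in a circular global history buffer, and each entry links to the previous miss from the same PC, so recent per-PC miss streams can be reconstructed and, here, checked for a constant stride. The sizes, the PC key, the stride test, and the omission of stale-link validation are illustrative assumptions rather than a specific published configuration.

    # Hedged sketch of a GHB-style prefetcher: index table + circular history
    # buffer with per-PC link pointers. Parameters and policy are assumptions.

    GHB_SIZE, DEGREE = 256, 2

    class GHBPrefetcher:
        def __init__(self):
            self.ghb = [None] * GHB_SIZE     # entries: (miss_addr, link_to_prev_index)
            self.head = 0                    # next slot to overwrite (circular)
            self.index = {}                  # PC -> index of that PC's newest entry

        def _chain(self, pc, max_len=3):
            """Most-recent-first miss addresses for this PC, following links.
            (Stale links after buffer wrap-around are not validated here.)"""
            addrs, i, age = [], self.index.get(pc), 0
            while i is not None and self.ghb[i] is not None and age < GHB_SIZE and len(addrs) < max_len:
                addr, prev = self.ghb[i]
                addrs.append(addr)
                i, age = prev, age + 1
            return addrs

        def on_miss(self, pc, addr):
            self.ghb[self.head] = (addr, self.index.get(pc))   # link to prior miss of pc
            self.index[pc] = self.head
            self.head = (self.head + 1) % GHB_SIZE
            a = self._chain(pc)
            if len(a) == 3 and a[0] - a[1] == a[1] - a[2] != 0:  # constant stride seen twice
                stride = a[0] - a[1]
                return [a[0] + stride * k for k in range(1, DEGREE + 1)]
            return []

    pf = GHBPrefetcher()
    for miss in (0x100, 0x140, 0x180):
        print(pf.on_miss(pc=0x400123, addr=miss))   # third miss yields prefetches for 0x1c0, 0x200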

161 citations


Patent
31 Jan 2005
TL;DR: Data-aware cache as discussed by the authors is a method and system directed to improve effectiveness and efficiency of cache and data management by differentiating data based on certain attributes associated with the data and reducing the bottleneck to storage.
Abstract: A method and system directed to improve effectiveness and efficiency of cache and data management by differentiating data based on certain attributes associated with the data and reducing the bottleneck to storage. The data-aware cache differentiates and manages data using a state machine having certain states. The data-aware cache may use data pattern and traffic statistics to retain frequently used data in cache longer by transitioning it into Sticky or StickyDirty states. The data-aware cache may also use content or application related attributes to differentiate and retain certain data in cache longer. Further, the data-aware cache may provide cache status and statistics information to a data-aware data flow manager, thus assisting the data-aware data flow manager in determining which data to cache and which data to pipe directly through, or in switching cache policies dynamically, thus avoiding some of the overhead associated with caches. The data-aware cache may also place clean and dirty data in separate states, enabling more efficient cache mirroring and flush, thus improving system reliability and performance.

Proceedings Article
10 Apr 2005
TL;DR: To use the L2 cache as efficiently as possible, an L2-conscious scheduling algorithm is proposed and it is possible to reduce miss ratios in the L2 cache by 25-37% and improve processor throughput by 27-45%.
Abstract: We investigated how operating system design should be adapted for multithreaded chip multiprocessors (CMT) - a new generation of processors that exploit thread-level parallelism to mask the memory latency in modern workloads. We determined that the L2 cache is a critical shared resource on CMT and that an insufficient amount of L2 cache can undermine the ability to hide memory latency on these processors. To use the L2 cache as efficiently as possible, we propose an L2-conscious scheduling algorithm and quantify its performance potential. Using this algorithm it is possible to reduce miss ratios in the L2 cache by 25-37% and improve processor throughput by 27-45%.

Patent
Mason Cabot1
20 Dec 2005
TL;DR: In this article, a cache line eviction mechanism is proposed that enables programmers to mark portions of code with different cache priority levels based on anticipated or measured access patterns for those code portions.
Abstract: A method and apparatus to enable programmatic control of cache line eviction policies. A mechanism is provided that enables programmers to mark portions of code with different cache priority levels based on anticipated or measured access patterns for those code portions. Corresponding cues to assist in effecting the cache eviction policies associated with given priority levels are embedded in machine code generated from source- and/or assembly-level code. Cache architectures are provided that partition cache space into multiple pools, each pool being assigned a different priority. In response to execution of a memory access instruction, an appropriate cache pool is selected and searched based on information contained in the instruction's cue. On a cache miss, a cache line is selected from that pool to be evicted using a cache eviction policy associated with the pool. Implementations of the mechanism are described for both n-way set-associative caches and fully-associative caches.

Proceedings ArticleDOI
17 Sep 2005
TL;DR: A general-purpose compiler approach, called memory coloring, efficiently allocates the arrays in a program to an SPM by adapting an existing graph-colouring algorithm for register allocation to assign the arrays in the program to the register file.
Abstract: Scratchpad memory (SPM), a fast software-managed on-chip SRAM, is now widely used in modern embedded processors. Compared to hardware-managed cache, it is more efficient in performance, power and area cost, and has the added advantage of better time predictability. This paper introduces a general-purpose compiler approach, called memory coloring, for efficiently allocating the arrays in a program to an SPM. The novelty of our approach lies in partitioning an SPM into a "register file", splitting the live ranges of arrays to create potential data transfer statements between the SPM and off-chip memory, and finally, adapting an existing graph-colouring algorithm for register allocation to assign the arrays in the program to the register file. Our approach is efficient due to the practical efficiency of graph-colouring algorithms. We have implemented this work in SUIF and machSUIF. Preliminary results over benchmarks show that our approach represents a promising solution to automatic SPM management.
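
A stripped-down illustration of the idea follows: treat the SPM as a handful of fixed-size pseudo-registers, build an interference graph from array live ranges, and greedily color arrays onto the pseudo-registers, spilling the rest to off-chip memory. Live-range splitting and the multiple register classes described in the paper are omitted, and the data and ordering heuristic are made up for illustration.

    # Hedged sketch of "memory coloring": greedy graph coloring of array live
    # ranges onto scratchpad slots. A simplification of the paper's approach.

    def memory_color(arrays, live_ranges, spm_slots):
        """arrays: names; live_ranges: name -> (start, end); spm_slots: slot count.
        Returns name -> slot index for SPM-resident arrays, or None (stays off-chip)."""
        def interferes(a, b):
            (s1, e1), (s2, e2) = live_ranges[a], live_ranges[b]
            return s1 <= e2 and s2 <= e1                 # live ranges overlap

        assignment = {}
        # Color longer-lived arrays first (an assumed priority heuristic).
        for a in sorted(arrays, key=lambda x: live_ranges[x][1] - live_ranges[x][0], reverse=True):
            taken = {assignment[b] for b in assignment if interferes(a, b)}
            free = [s for s in range(spm_slots) if s not in taken and assignment.get(a) is None]
            assignment[a] = free[0] if free else None    # spill: keep in off-chip memory
        return assignment

    ranges = {"A": (0, 40), "B": (10, 20), "C": (25, 60), "D": (45, 90)}
    print(memory_color(list(ranges), ranges, spm_slots=2))   # D and A share slot 0; C and B share slot 1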

Proceedings ArticleDOI
20 Mar 2005
TL;DR: A new method to accurately estimate the reliability of cache memories is presented and three different techniques are presented to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude.
Abstract: Cosmic-ray induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean-time-to-failure) of unprotected first-level (L1) caches for twenty programs taken from the SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache possesses an MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit). However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that for selected programs, a 64 KB first-level cache is more than 10 times as vulnerable to soft errors as a 16 KB cache memory. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability.
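
The headline figure can be sanity-checked with raw FIT arithmetic, pessimistically assuming every bit is always vulnerable (1 FIT is one failure per 10^9 device-hours); the application-dependent vulnerability the paper measures would only raise the MTTF above this floor.

    # Back-of-the-envelope check of the quoted figure: 0.002 FIT/bit over a
    # 16 KB cache, assuming every upset in every bit causes a failure.

    def mttf_years(cache_bytes, fit_per_bit):
        bits = cache_bytes * 8
        total_fit = bits * fit_per_bit                # failures per 10^9 hours
        mttf_hours = 1e9 / total_fit
        return mttf_hours / (24 * 365)

    print(round(mttf_years(16 * 1024, 0.002)))        # ~435 years, i.e. "at least 400"
    print(round(mttf_years(64 * 1024, 0.002)))        # ~109 years from size alone; the >10x gap
                                                      # reported for some programs also reflects
                                                      # application-dependent vulnerability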

Proceedings ArticleDOI
12 Jun 2005
TL;DR: The impact of evolving memory system features, such as large on-chip caches, automatic prefetch, and the growing distance to main memory, on 3D stencil computations is investigated.
Abstract: In this work we investigate the impact of evolving memory system features, such as large on-chip caches, automatic prefetch, and the growing distance to main memory on 3D stencil computations. These calculations form the basis for a wide range of scientific applications from simple Jacobi iterations to complex multigrid and block structured adaptive PDE solvers. First we develop a simple benchmark to evaluate the effectiveness of prefetching in cache-based memory systems. Next we present a small parameterized probe and validate its use as a proxy for general stencil computations on three modern microprocessors. We then derive an analytical memory cost model for quantifying cache-blocking behavior and demonstrate its effectiveness in predicting the stencil-computation performance. Overall results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations.

Proceedings ArticleDOI
06 Jun 2005
TL;DR: StatCache is presented, a performance tool based on a statistical cache model that has a small run-time overhead while providing much of the flexibility of simulator-based tools and demonstrates how the flexibility can be used to better understand the characteristics of cache-related performance problems.
Abstract: Performance tools based on hardware counters can efficiently profile the cache behavior of an application and help software developers improve its cache utilization. Simulator-based tools can potentially provide more insights and flexibility and model many different cache configurations, but have the drawback of large run-time overhead. We present StatCache, a performance tool based on a statistical cache model. It has a small run-time overhead while providing much of the flexibility of simulator-based tools. A monitor process running in the background collects sparse memory access statistics about the analyzed application running natively on a host computer. Generic locality information is derived and presented in a code-centric and/or data-centric view. We evaluate the accuracy and performance of the tool using ten SPEC CPU2000 benchmarks. We also exemplify how the flexibility of the tool can be used to better understand the characteristics of cache-related performance problems.
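
One common way to write down the core of such a statistical model (in the spirit of StatCache, though not necessarily the paper's exact equations) is a fixed-point computation over sampled reuse distances: for a fully associative cache of L lines with random replacement, an access whose reuse distance is R misses with probability 1 - (1 - 1/L)^(R*m), where m is the overall miss ratio being solved for.

    # Hedged sketch of a statistical miss-ratio model: solve for the miss
    # ratio m as a fixed point over sampled reuse distances (counted in
    # intervening memory references). Sample data and sizes are made up.

    def estimate_miss_ratio(reuse_distances, cache_lines, iters=100):
        m, L = 0.5, cache_lines                   # initial guess for the miss ratio
        for _ in range(iters):
            miss_prob = [1.0 - (1.0 - 1.0 / L) ** (r * m) for r in reuse_distances]
            m = sum(miss_prob) / len(miss_prob)
        return m

    # Made-up sample: mostly short reuse distances plus a long-distance tail.
    samples = [20] * 80 + [5000] * 15 + [200000] * 5
    for lines in (512, 4096, 32768):
        print(lines, round(estimate_miss_ratio(samples, lines), 3))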

Patent
22 Dec 2005
TL;DR: In this article, the authors present a method for run-time cache optimization based on profiling a program code during a runtime execution, logging the performance for producing a cache log, and rearranging a portion of program code in view of the cache log for producing rearranged portion.
Abstract: A method (400) and system (106) are provided for run-time cache optimization. The method includes profiling (402) a performance of a program code during a run-time execution, logging (408) the performance for producing a cache log, and rearranging (410) a portion of program code in view of the cache log for producing a rearranged portion. The rearranged portion is supplied to a memory management unit (240) for managing at least one cache memory (110-140). The cache log can be collected during a real-time operation of a communication device and is fed back to a linking process (244) to maximize cache locality at compile time. The method further includes loading a saved profile corresponding with a run-time operating mode, and reprogramming a new code image associated with the saved profile.

Patent
30 Sep 2005
TL;DR: In this paper, instruction-assisted cache management for efficient use of cache and memory is discussed, where hints (e.g., modifiers) are added to read and write memory access instructions to identify the memory access for temporal data.
Abstract: Instruction-assisted cache management for efficient use of cache and memory. Hints (e.g., modifiers) are added to read and write memory access instructions to identify that the memory access is for temporal data. In view of such hints, alternative cache and allocation policies are implemented that minimize cache and memory access. Under one policy, a write cache miss may result in a write of data to a partial cache line without a memory read/write cycle to fill the remainder of the line. Under another policy, a read cache miss may result in a read from memory without allocating or writing the read data to a cache line. A cache line soft-lock mechanism is also disclosed, wherein cache lines may be selectably soft locked to indicate preference for keeping those cache lines over non-locked lines.

Journal ArticleDOI
TL;DR: This paper introduces IATAC (inter-access time per access count), a new hardware technique to reduce cache leakage for L2 caches that outperforms all previous state-of-the-art techniques.
Abstract: As technology evolves, power dissipation increases and cooling systems become more complex and expensive. There are two main sources of power dissipation in a processor: dynamic power and leakage. Dynamic power has been the most significant factor, but leakage will become increasingly significant in future. It is predicted that leakage will shortly be the most significant cost as it grows at about a 5× rate per generation. Thus, reducing leakage is essential for future processor design. Since large caches occupy most of the area, they are one of the leakiest structures in the chip and hence, a main source of energy consumption for future processors. This paper introduces IATAC (inter-access time per access count), a new hardware technique to reduce cache leakage for L2 caches. IATAC dynamically adapts the cache size to the program requirements, turning off cache lines whose content is not likely to be reused. Our evaluation shows that this approach outperforms all previous state-of-the-art techniques. IATAC turns off 65% of the cache lines across different L2 cache configurations with a very small performance degradation of around 2%.

Journal Article
TL;DR: In this article, the complexity of finding an optimal placement of objects (or code) in memory, in the sense that this placement reduces the number of cache misses during program execution to the minimum, is investigated.
Abstract: We investigate the complexity of finding an optimal placement of objects (or code) in the memory, in the sense that this placement reduces the number of cache misses during program execution to the minimum. We show that this problem is one of the toughest amongst the interesting NP optimization problems in computer science. In particular, suppose one is given a sequence of memory accesses and one has to place the data in the memory so as to minimize the number of cache misses for this sequence. We show that if P ≠ NP, then one cannot efficiently approximate the optimal solution even up to a very liberal approximation ratio. Thus, this problem joins the family of extremely inapproximable optimization problems. Two famous members in this family are minimum graph coloring and maximum clique. In light of this harsh lower bound, only mild approximation ratios can be obtained. We provide an algorithm that can map arbitrary access sequences within such a mild ratio. Next, we study the information loss when compressing the access sequence, keeping only pairwise relations. We show that the reduced information hides the optimal solution and highlights solutions that are far from optimal. Furthermore, we show that even if one restricts his attention to pairwise information, finding a good placement is computationally difficult.

Proceedings ArticleDOI
12 Feb 2005
TL;DR: This work proposes and analyzes a memory hierarchy that uses a unified compression scheme encompassing the last-level on-chip cache, the off-chip memory channel, and off-chip main memory, which achieves a peak improvement of 292%, compared to 165% and 83% for cache or bus compression alone.
Abstract: The memory system's large and growing contribution to system performance motivates more aggressive approaches to improving its efficiency. We propose and analyze a memory hierarchy that uses a unified compression scheme encompassing the last-level on-chip cache, the off-chip memory channel, and off-chip main memory. This scheme simultaneously increases the effective on-chip cache capacity, off-chip bandwidth, and main memory size, while avoiding compression and decompression overheads between levels. Simulations of the SPEC CPU2000 benchmarks using a 1MB cache and 128-byte blocks show an average speedup of 19%, while degrading performance by no more than 5%. The combined scheme achieves a peak improvement of 292%, compared to 165% and 83% for cache or bus compression alone. The compressed system generally provides even better performance as the block size is increased to 512 bytes.

Journal ArticleDOI
TL;DR: The implementation of the static hints scheme in the Open64 compiler for the Itanium processor shows a speedup of 10% on average on a set of pointer-intensive and regular loop-based programs and up to a 34% reduction in cache misses.

Journal ArticleDOI
TL;DR: A study of 23 programs drawn from the Powerstone, MediaBench, and Spec2000 benchmark suites shows that the configurable cache tuned to each program saved energy for every program compared to a conventional four-way set-associative cache as well as compared to a conventional direct-mapped cache, with an average savings of memory-access-related energy of over 40%.
Abstract: Energy consumption is a major concern in many embedded computing systems. Several studies have shown that cache memories account for about 50% of the total energy consumed in these systems. The performance of a given cache architecture is determined, to a large degree, by the behavior of the application executing on the architecture. Desktop systems have to accommodate a very wide range of applications and therefore the cache architecture is usually set by the manufacturer as a best compromise given current applications, technology, and cost. Unlike desktop systems, embedded systems are designed to run a small range of well-defined applications. In this context, a cache architecture that is tuned for that narrow range of applications can have both increased performance as well as lower energy consumption. We introduce a novel cache architecture intended for embedded microprocessor platforms. The cache has three software-configurable parameters that can be tuned to particular applications. First, the cache's associativity can be configured to be direct-mapped, two-way, or four-way set-associative, using a novel technique we call way concatenation. Second, the cache's total size can be configured by shutting down ways. Finally, the cache's line size can be configured to have 16, 32, or 64 bytes. A study of 23 programs drawn from Powerstone, MediaBench, and Spec2000 benchmark suites shows that the configurable cache tuned to each program saved energy for every program compared to a conventional four-way set-associative cache as well as compared to a conventional direct-mapped cache, with an average savings of energy related to memory access of over 40%.

Patent
Nimrod Megiddo1, Dharmendra S. Modha1
13 Jun 2005
TL;DR: In this article, an adaptive replacement cache policy dynamically maintains two lists of pages, a recency list and a frequency list, in addition to a cache directory, keeping these two lists to roughly the same size, the cache size c.
Abstract: An adaptive replacement cache policy dynamically maintains two lists of pages, a recency list and a frequency list, in addition to a cache directory. The policy keeps these two lists to roughly the same size, the cache size c. Together, the two lists remember twice the number of pages that would fit in the cache. At any time, the policy selects a variable number of the most recent pages to exclude from the two lists. The policy adaptively decides in response to an evolving workload how many top pages from each list to maintain in the cache at any given time. It achieves such online, on-the-fly adaptation by using a learning rule that allows the policy to track a workload quickly and effectively.
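
The policy described here is widely known as ARC (adaptive replacement cache). A compact sketch of its usual formulation follows; the patent's claims may differ in detail, and the integer adaptation step below is a common simplification.

    from collections import OrderedDict

    class ARC:
        """Sketch of ARC: T1/T2 hold resident pages (recency/frequency), B1/B2
        are their ghost lists, and p adapts the target size of T1."""
        def __init__(self, c):
            self.c, self.p = c, 0
            self.t1, self.t2 = OrderedDict(), OrderedDict()   # resident pages
            self.b1, self.b2 = OrderedDict(), OrderedDict()   # ghost lists (metadata only)

        def _replace(self, key):
            # Evict from T1 or T2 into the corresponding ghost list, guided by p.
            if self.t1 and (len(self.t1) > self.p or (key in self.b2 and len(self.t1) == self.p)):
                old, _ = self.t1.popitem(last=False)
                self.b1[old] = None
            else:
                old, _ = self.t2.popitem(last=False)
                self.b2[old] = None

        def access(self, key):
            if key in self.t1:                       # hit: promote to the frequency list
                self.t1.pop(key); self.t2[key] = None
                return True
            if key in self.t2:                       # hit: refresh position within T2
                self.t2.move_to_end(key)
                return True
            if key in self.b1:                       # ghost hit: recency was undervalued
                self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
                self._replace(key)
                self.b1.pop(key); self.t2[key] = None
                return False
            if key in self.b2:                       # ghost hit: frequency was undervalued
                self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
                self._replace(key)
                self.b2.pop(key); self.t2[key] = None
                return False
            # Complete miss.
            if len(self.t1) + len(self.b1) == self.c:
                if len(self.t1) < self.c:
                    self.b1.popitem(last=False)
                    self._replace(key)
                else:
                    self.t1.popitem(last=False)      # B1 empty: drop LRU of T1
            elif len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2) >= self.c:
                if len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2) == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(key)
            self.t1[key] = None                      # insert as most-recent in T1
            return False

    cache = ARC(c=3)
    print([cache.access(k) for k in (1, 2, 3, 1, 4, 1, 2, 5, 1, 2)])   # three hits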

Patent
Steven Paul Vanderwiel1
26 Oct 2005
TL;DR: In this paper, a selection mechanism selects lines evicted from the higher level cache for storage in the victim cache, only some of the evicted lines being selected for the victim.
Abstract: A computer system cache memory contains at least two levels. A lower level selective victim cache receives cache lines evicted from a higher level cache. A selection mechanism selects lines evicted from the higher level cache for storage in the victim cache, only some of the evicted lines being selected for the victim cache. Preferably, two priority bits associated with each cache line are used to select lines for the victim cache. The priority bits indicate whether the line has been re-referenced while in the higher level cache, and whether it has been reloaded after eviction from the higher level cache.
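
The admission idea can be sketched in a few lines: a line evicted from the higher-level cache enters the victim cache only if its priority bits say it was re-referenced or reloaded. The exact admission rule and the LRU victim-cache organization below are illustrative assumptions.

    # Hedged sketch of a selective victim cache: priority bits gate admission.

    from collections import OrderedDict

    class SelectiveVictimCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.lines = OrderedDict()                    # tag -> data, in LRU order

        def on_evict(self, tag, data, re_referenced, reloaded):
            """Called when the higher-level cache evicts a line."""
            if not (re_referenced or reloaded):           # low priority: drop it
                return False
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)            # evict the victim-cache LRU
            self.lines[tag] = data
            return True

        def lookup(self, tag):
            if tag in self.lines:
                self.lines.move_to_end(tag)
                return self.lines[tag]
            return None

    vc = SelectiveVictimCache(capacity=2)
    vc.on_evict(0x10, "A", re_referenced=True, reloaded=False)    # admitted
    vc.on_evict(0x20, "B", re_referenced=False, reloaded=False)   # filtered out
    print(vc.lookup(0x10), vc.lookup(0x20))                       # A None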

Patent
Robert W. Faber1
15 Sep 2005
TL;DR: In this paper, an apparatus and method to reduce the initialization time of a system is disclosed, where metadata associated with the cache line is stored in a distributed format in non-volatile memory with its associated cache line.
Abstract: An apparatus and method to reduce the initialization time of a system is disclosed. In one embodiment, upon a cache line update, metadata associated with the cache line is stored in a distributed format in non-volatile memory with its associated cache line. Upon indication of an expected shut down, metadata is copied from volatile memory and stored in non-volatile memory in a packed format. In the packed format, multiple metadata associated with multiple cache lines are stored together in, for example, a single memory block. Thus, upon system power up, if the system was shut down in an expected manner, metadata may be restored in volatile memory from the metadata stored in the packed format, with a significantly reduced boot time over restoring metadata from the metadata stored in the distributed format.

Patent
14 Oct 2005
TL;DR: In this article, a system and method for providing a shared RAM cache of a database, accessible by multiple processes, is described, and synchronization between the database and the shared cache is assured by using a unidirectional notification mechanism.
Abstract: A system and method are provided for providing a shared RAM cache of a database, accessible by multiple processes. By sharing a single cache rather than local copies of the database, memory is saved and synchronization of data accessed by different processes is assured. Synchronization between the database and the shared cache is assured by using a unidirectional notification mechanism between the database and the shared cache. Client APIs within the processes search the data within the shared cache directly, rather than by making a request to a database server. Therefore server load is not affected by the number of requesting applications and data fetch time is not affected by Inter-Process Communication delay or by additional context switching. A new synchronization scheme allows multiple processes to be used in building and maintaining the cache, greatly reducing start up time.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a way-halting cache, which is a four-way set-associative cache that stores the four lowest-order bits of all ways' tags into a fully associative memory.
Abstract: Caches contribute to much of a microprocessor system's power and energy consumption. Numerous new cache architectures, such as phased, pseudo-set-associative, way predicting, reactive-associative, way-shutdown, way-concatenating, and highly-associative, are intended to reduce power and/or energy, but they all impose some performance overhead. We have developed a new cache architecture, called a way-halting cache, that reduces energy further than previously mentioned architectures, while imposing no performance overhead. Our way-halting cache is a four-way set-associative cache that stores the four lowest-order bits of all ways' tags into a fully associative memory, which we call the halt tag array. The lookup in the halt tag array is done in parallel with, and is no slower than, the set-index decoding. The halt tag array predetermines which tags cannot match due to their low-order 4 bits mismatching. Further accesses to ways with known mismatching tags are then halted, thus saving power. Our halt tag array has an additional feature of using static logic only, rather than dynamic logic used in highly associative caches, making our cache simpler to design with existing tools. We provide data from experiments on 29 benchmarks drawn from Powerstone, Mediabench, and Spec 2000, based on our layouts in 0.18 micron CMOS technology. On average, we obtained 55% savings of memory-access related energy over a conventional four-way set-associative cache. We show that savings are greater than previous methods, and nearly twice that of highly associative caches, while imposing no performance overhead and only 2% cache area overhead.
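
A behavioral sketch of the halting step is shown below: a halt tag array keeps the low-order 4 bits of every way's tag, and only ways whose low bits match the incoming address go on to the full tag compare (and data read). Field widths, the set count, and the returned activation count are illustrative assumptions used to make the example concrete.

    # Hedged sketch of way halting: the low 4 tag bits filter which ways are
    # activated for the full compare. Parameters here are assumptions.

    NUM_SETS, NUM_WAYS, LOW_BITS = 64, 4, 4

    class WayHaltingCache:
        def __init__(self):
            self.tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
            self.halt = [[None] * NUM_WAYS for _ in range(NUM_SETS)]   # low 4 tag bits per way

        def fill(self, s, way, tag):
            self.tags[s][way] = tag
            self.halt[s][way] = tag & ((1 << LOW_BITS) - 1)

        def access(self, addr, line_bytes=32):
            line = addr // line_bytes
            s, tag = line % NUM_SETS, line // NUM_SETS
            low = tag & ((1 << LOW_BITS) - 1)
            # "Halting": only ways whose stored low bits match are considered;
            # the other ways never perform the full tag compare or data read.
            candidates = [w for w in range(NUM_WAYS) if self.halt[s][w] == low]
            hit = any(self.tags[s][w] == tag for w in candidates)
            return hit, len(candidates)        # ways actually activated

    c = WayHaltingCache()
    c.fill(5, 0, tag=0x3a1)
    c.fill(5, 1, tag=0x7b2)
    print(c.access((0x3a1 * NUM_SETS + 5) * 32))   # (True, 1): only one way activated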

Journal ArticleDOI
TL;DR: The range of methodologies that are being developed to overcome the challenges of exploring the cache design space for LCMP platforms is described, and a trace-driven approach to characterizing one key server workload (OLTP) in both a homogeneous and a heterogeneous workload environment is focused on.
Abstract: With the advent of dual-core chips in the marketplace, small-scale CMP (chip multiprocessor) architectures are becoming commonplace. We expect a continuing trend of increasing the number of cores on a die to maximize the performance/power efficiency of a single chip. We believe an era of large-scale CMPs (LCMPs) with several tens to hundreds of cores is on the way, but as of now architects have little understanding of how best to build a cache hierarchy given such a large number of cores/threads to support. With this in mind, our initial goals are to prune the cache design space for LCMPs by characterizing basic server workload behavior in such an environment. In this paper, we describe the range of methodologies that we are developing to overcome the challenges of exploring the cache design space for LCMP platforms. We then focus on employing a trace-driven approach to characterizing one key server workload (OLTP) in both a homogeneous and a heterogeneous workload environment. We study the effect of increasing threads (from 1 to 128) on a three-level cache hierarchy with emphasis on second and third level caches. We study the effect of varying sizes at these cache levels and show the effects of threads contending for cache space, the effects of prefetching instruction addresses, and the effects of inclusion. We make initial observations and conclusions about the factors on which LCMP cache hierarchy design decisions should be based and discuss future work.