
Showing papers on "Cache coloring" published in 2005


Proceedings ArticleDOI
12 Feb 2005
TL;DR: Three performance models are proposed that predict the impact of cache sharing on co-scheduled threads and the most accurate model, the inductive probability model, achieves an average error of only 3.9%.
Abstract: This paper studies the impact of L2 cache sharing on threads that simultaneously share the cache, on a chip multi-processor (CMP) architecture. Cache sharing impacts threads nonuniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is the isolated L2 cache stack distance or circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses for each thread due to cache sharing. The models differ by their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation that implements a dual-core CMP architecture, on fourteen pairs of mostly SPEC benchmarks. The most accurate model, the inductive probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact is presented.
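
The simplest style of model described above lends itself to a compact illustration. The hypothetical Python sketch below mimics a frequency-of-access style split: each thread is granted a share of the cache proportional to its access count, and its isolated stack-distance profile is re-read at that reduced size to estimate extra misses. The profile format, the proportional split, and the way-granular shares are illustrative assumptions, not the paper's exact formulation (its most accurate model is the inductive probability model).

    # Hedged sketch: estimate extra L2 misses under cache sharing from each
    # thread's isolated stack-distance profile (hit counts at LRU stack
    # positions 0..A-1, plus misses beyond), using a simple frequency-of-access
    # split of the shared cache. Profile format and heuristic are assumptions.

    def misses_with_ways(profile, ways):
        """Misses if the thread only had `ways` LRU positions available."""
        hits_kept = sum(profile["hits_per_position"][:ways])
        total = sum(profile["hits_per_position"]) + profile["misses"]
        return total - hits_kept

    def extra_misses_when_shared(profiles, total_ways):
        """profiles: list of {'hits_per_position': [...], 'misses': n, 'accesses': n}."""
        total_accesses = sum(p["accesses"] for p in profiles)
        extra = []
        for p in profiles:
            share = max(1, round(total_ways * p["accesses"] / total_accesses))
            alone = p["misses"]                      # isolated misses (whole cache available)
            shared = misses_with_ways(p, share)      # misses with its share only
            extra.append(shared - alone)
        return extra

    # Example with two made-up thread profiles competing for an 8-way cache:
    t0 = {"hits_per_position": [50, 30, 10, 5, 3, 1, 1, 0], "misses": 10, "accesses": 110}
    t1 = {"hits_per_position": [200, 5, 2, 1, 0, 0, 0, 0], "misses": 4, "accesses": 212}
    print(extra_misses_when_shared([t0, t1], total_ways=8))   # [10, 0]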

543 citations


Journal ArticleDOI
01 May 2005
TL;DR: This paper presents a new cache management policy, victim replication, which combines the advantages of private and shared schemes, and shows that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks.
Abstract: In this paper, we consider tiled chip multiprocessors (CMP) where each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic schemes: 1) each slice is treated as a private L2 cache for the tile; 2) all slices are treated as a single large L2 cache shared by all tiles. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, as each tile creates local copies of any line it touches. A shared L2 cache increases the effective cache capacity for shared data, but incurs long hit latencies when L2 data is on a remote tile. We present a new cache management policy, victim replication, which combines the advantages of private and shared schemes. Victim replication is a variant of the shared scheme which attempts to keep copies of local primary cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of both single-threaded and multi-threaded benchmarks running on an 8-processor tiled CMP. We show that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks, providing better overall performance than either private or shared schemes.
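
A minimal sketch of the replication decision is given below, assuming a simple priority order for finding room in the local L2 slice (an invalid entry, then an existing replica, then an unshared home line); the field names and the exact priority order are illustrative assumptions rather than the paper's precise policy.

    # Hedged sketch of a victim-replication decision: when a line is evicted
    # from the local L1, try to keep a replica in the local L2 slice rather
    # than only in its home slice. The victim-selection priority below is an
    # illustrative assumption, not necessarily the paper's exact rule.

    def try_replicate(local_slice_set, victim_line):
        """local_slice_set: list of cache-line records in the candidate L2 set."""
        def pick(pred):
            for line in local_slice_set:
                if pred(line):
                    return line
            return None

        # Prefer making room without displacing useful shared home data:
        slot = (pick(lambda l: not l["valid"])                        # 1) invalid entry
                or pick(lambda l: l["is_replica"])                    # 2) an existing replica
                or pick(lambda l: l["is_home"] and not l["shared"]))  # 3) unshared home line
        if slot is None:
            return False          # give up: never displace shared home data
        slot.update(valid=True, is_replica=True, is_home=False, shared=False,
                    tag=victim_line["tag"], data=victim_line["data"])
        return True

    ways = [{"valid": True, "is_replica": False, "is_home": True, "shared": True, "tag": 0x1, "data": b""},
            {"valid": False, "is_replica": False, "is_home": False, "shared": False, "tag": 0x0, "data": b""}]
    print(try_replicate(ways, {"tag": 0x7f, "data": b"victim"}))      # True: reuses the invalid way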

331 citations


Patent
07 Mar 2005
TL;DR: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies as mentioned in this paper, such as write-through, write and read-look-ahead.
Abstract: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies. Cache policies include write-through, write-back, and read-look-ahead. Write-through and write-back policies may improve speed. Read-look-ahead cache allows more efficient use of the bus between the buffer cache and non-volatile memory. A session command allows data to be maintained in volatile memory by guaranteeing against power loss.

256 citations


Patent
13 Jun 2005
TL;DR: A peer-to-peer name resolution protocol (PNRP) is proposed in this paper, which allows resolution of names which are mapped onto the circular number space through a hash function.
Abstract: A serverless name resolution protocol ensures convergence despite the size of the network, without requiring an ever-increasing cache and with a reasonable number of hops. This convergence is ensured through a multi-level cache and a proactive cache initialization strategy. The multi-level cache is built based on a circular number space. Each level contains information from different levels of slivers of the circular space. A mechanism is included to add a level to the multi-level cache when the node determines that the last level is full. A peer-to-peer name resolution protocol (PNRP) includes a mechanism to allow resolution of names which are mapped onto the circular number space through a hash function. Further, the PNRP may also operate with the domain name system by providing each node with an identification consisting of a domain name service (DNS) component and a unique number.

228 citations


Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is demonstrated that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
Abstract: We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the spectrum of degrees of sharing: unshared, in which each processor has a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of two or four works best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
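
The sketch below illustrates what a "sharing degree" means for bank mapping in such an organization: tiles are grouped into clusters of S sharers, and a line's home bank is chosen within that cluster's portion of the 256 banks. The modulo hash and the cluster layout are illustrative assumptions, not the paper's specific mapping policies.

    # Hedged sketch: with 16 processors and 256 L2 banks, a sharing degree S
    # groups the tiles into 16/S clusters, each sharing its own pool of
    # 256/(16/S) = 16*S banks. The mapping below is an illustrative assumption.

    NUM_PROCS, NUM_BANKS, LINE_BYTES = 16, 256, 64

    def home_bank(addr, proc_id, sharing_degree):
        clusters = NUM_PROCS // sharing_degree       # e.g. S=4 -> 4 clusters
        banks_per_cluster = NUM_BANKS // clusters    # e.g. 64 banks per cluster
        cluster = proc_id // sharing_degree          # which cluster this processor is in
        line = addr // LINE_BYTES
        return cluster * banks_per_cluster + (line % banks_per_cluster)

    # S=1: each processor has a private 16-bank pool; S=16: all 256 banks shared.
    for s in (1, 2, 4, 16):
        print(s, home_bank(0x12345, proc_id=5, sharing_degree=s))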

218 citations


Journal ArticleDOI
C.W. Slayman1
TL;DR: In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets, and the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability are covered.
Abstract: As the size of the SRAM cache and DRAM memory grows in servers and workstations, cosmic-ray errors are becoming a major concern for systems designers and end users. Several techniques exist to detect and mitigate the occurrence of cosmic-ray upset, such as error detection, error correction, cache scrubbing, and array interleaving. This paper covers the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability. In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets.

205 citations


Journal ArticleDOI
01 May 2005
TL;DR: The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite, which translates into an average IPC improvement of 8%.
Abstract: As processor speeds increase and memory latency becomes more critical, intelligent design and management of secondary caches becomes increasingly important. The efficiency of current set-associative caches is reduced because programs exhibit a non-uniform distribution of memory accesses across different cache sets. We propose a technique to vary the associativity of a cache on a per-set basis in response to the demands of the program. By increasing the number of tag-store entries relative to the number of data lines, we achieve the performance benefit of global replacement while maintaining the constant hit latency of a set-associative cache. The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite. This translates into an average IPC improvement of 8%.
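
A toy model of the idea is sketched below: each set holds twice as many tag entries as the average number of data lines per set, valid tags carry forward pointers into one global data array, and a reuse-counter sweep approximates the global replacement. Structure sizes, field names, and the sweep are illustrative assumptions rather than the paper's exact design.

    # Hedged sketch of a V-Way-style cache: extra tag entries per set plus a
    # decoupled, globally managed data array. Not the paper's exact mechanism.

    class VWayCache:
        def __init__(self, num_sets=4, tags_per_set=8, num_data=16):
            self.sets = [[None] * tags_per_set for _ in range(num_sets)]   # tag entries
            self.data = [{"reuse": 0, "back": None} for _ in range(num_data)]
            self.num_sets, self.ptr = num_sets, 0

        def _global_victim(self):
            # Clock-style sweep: take the first line with reuse count 0,
            # decrementing non-zero counts along the way.
            while True:
                d = self.data[self.ptr]
                self.ptr = (self.ptr + 1) % len(self.data)
                if d["reuse"] == 0:
                    return d
                d["reuse"] -= 1

        def access(self, addr):
            s, tag = addr % self.num_sets, addr // self.num_sets
            entries = self.sets[s]
            for e in entries:
                if e is not None and e["tag"] == tag:
                    e["data"]["reuse"] = min(3, e["data"]["reuse"] + 1)
                    return True                          # hit
            free = next((i for i, e in enumerate(entries) if e is None), None)
            if free is not None:
                dline = self._global_victim()            # global replacement
                if dline["back"] is not None:            # unlink its old tag entry
                    oset, oidx = dline["back"]
                    self.sets[oset][oidx] = None
                idx = free
            else:
                idx = 0                                  # set conflict: local eviction
                dline = entries[idx]["data"]
            dline.update(reuse=0, back=(s, idx))
            entries[idx] = {"tag": tag, "data": dline}
            return False

    c = VWayCache()
    print([c.access(a) for a in (0, 4, 0, 8, 0)])        # [False, False, True, False, True]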

204 citations


Proceedings ArticleDOI
Matteo Frigo1, Volker Strumpen1
20 Jun 2005
TL;DR: This work presents a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods, and it exploits temporal locality optimally throughout the entire memory hierarchy.
Abstract: We present a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an "ideal cache" of size Z, our algorithm saves a factor of Θ(Z^(1/n)) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy.
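
For the one-dimensional case the recursion can be written very compactly. The sketch below follows the well-known trapezoidal space-time decomposition (cut space when the region is wide, cut time when it is tall) for a 3-point stencil with unit slope and fixed boundary cells; the kernel, the boundary handling, and the driver are illustrative assumptions.

    # Hedged sketch: 1-D trapezoidal space-time recursion in the spirit of the
    # cache-oblivious stencil algorithm (3-point stencil, slope 1, fixed ends).

    N, T = 32, 16
    u = [[0.0] * N for _ in range(2)]
    u[0][N // 2] = 1.0                       # initial condition: a single spike

    def kernel(t, x):
        cur, nxt = u[t % 2], u[(t + 1) % 2]
        nxt[x] = 0.25 * cur[x - 1] + 0.5 * cur[x] + 0.25 * cur[x + 1]

    def walk(t0, t1, x0, dx0, x1, dx1):
        """Process the trapezoid with bottom edge [x0, x1) at time t0, whose
        left and right edges move by dx0 and dx1 cells per time step."""
        dt = t1 - t0
        if dt == 1:
            for x in range(x0, x1):
                kernel(t0, x)
        elif dt > 1:
            if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:     # wide: cut space
                xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
                walk(t0, t1, x0, dx0, xm, -1)
                walk(t0, t1, xm, -1, x1, dx1)
            else:                                               # tall: cut time
                s = dt // 2
                walk(t0, t0 + s, x0, dx0, x1, dx1)
                walk(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1)

    walk(0, T, 1, 0, N - 1, 0)               # update interior cells only; ends stay fixed
    print(sum(u[T % 2][1:N - 1]))            # total mass stays close to 1.0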

196 citations


Patent
22 Nov 2005
TL;DR: In this paper, a cache server is coupled with a media serving engine that is capable of caching media content, and a set of cache policies is accessible by the cache engine to define the operation of the cache.
Abstract: A cache server includes a media serving engine that is capable of distributing media content. A cache engine is coupled to the media serving engine and capable of caching media content. A set of cache policies is accessible by the cache engine to define the operation of the cache engine. The cache server can be configured to operate as either a cache server or an origin server. The cache server also includes a data communication interface coupled to the cache engine and the media serving engine to allow the cache engine to receive media content across a network and to allow the media serving engine to distribute media content across the network. The cache policies include policies for distributing media content from the media server, policies for handling cache misses, and policies for prefetching media content.

165 citations


Journal ArticleDOI
TL;DR: A GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables and contains a more complete picture of cache miss history, which reduces stale table data, improving accuracy and reducing memory traffic.
Abstract: Over the past couple of decades, trends in both microarchitecture and underlying semiconductor technology have significantly reduced microprocessor clock periods. These trends have significantly increased relative main-memory latencies as measured in processor clock cycles. To avoid large performance losses caused by long memory access delays, microprocessors rely heavily on a hierarchy of cache memories. But cache memories are not always effective, either because they are not large enough to hold a program's working set, or because memory access patterns don't exhibit behavior that matches a cache memory's demand-driven, line-structured organization. To partially overcome cache memories' limitations, we organize data cache prefetch information in a new way: a GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables. It reduces stale table data, improving accuracy and reducing memory traffic. It contains a more complete picture of cache miss history and is smaller than conventional tables.
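
A minimal sketch of the structure is shown below: a small index table keyed by the missing load's PC points at that PC's newest entry in a circular global history buffer, and each entry links to the previous miss from the same PC, so recent per-PC miss streams can be reconstructed and, here, checked for a constant stride. The sizes, the PC key, the stride test, and the omission of stale-link validation are illustrative assumptions rather than a specific published configuration.

    # Hedged sketch of a GHB-style prefetcher: index table + circular history
    # buffer with per-PC link pointers. Parameters and policy are assumptions.

    GHB_SIZE, DEGREE = 256, 2

    class GHBPrefetcher:
        def __init__(self):
            self.ghb = [None] * GHB_SIZE     # entries: (miss_addr, link_to_prev_index)
            self.head = 0                    # next slot to overwrite (circular)
            self.index = {}                  # PC -> index of that PC's newest entry

        def _chain(self, pc, max_len=3):
            """Most-recent-first miss addresses for this PC, following links.
            (Stale links after buffer wrap-around are not validated here.)"""
            addrs, i, age = [], self.index.get(pc), 0
            while i is not None and self.ghb[i] is not None and age < GHB_SIZE and len(addrs) < max_len:
                addr, prev = self.ghb[i]
                addrs.append(addr)
                i, age = prev, age + 1
            return addrs

        def on_miss(self, pc, addr):
            self.ghb[self.head] = (addr, self.index.get(pc))   # link to prior miss of pc
            self.index[pc] = self.head
            self.head = (self.head + 1) % GHB_SIZE
            a = self._chain(pc)
            if len(a) == 3 and a[0] - a[1] == a[1] - a[2] != 0:  # constant stride seen twice
                stride = a[0] - a[1]
                return [a[0] + stride * k for k in range(1, DEGREE + 1)]
            return []

    pf = GHBPrefetcher()
    for miss in (0x100, 0x140, 0x180):
        print(pf.on_miss(pc=0x400123, addr=miss))   # third miss yields prefetches for 0x1c0, 0x200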

161 citations


Patent
31 Jan 2005
TL;DR: Data-aware cache as discussed by the authors is a method and system directed to improve effectiveness and efficiency of cache and data management by differentiating data based on certain attributes associated with the data and reducing the bottleneck to storage.
Abstract: A method and system directed to improve effectiveness and efficiency of cache and data management by differentiating data based on certain attributes associated with the data and reducing the bottleneck to storage. The data-aware cache differentiates and manages data using a state machine having certain states. The data-aware cache may use data pattern and traffic statistics to retain frequently used data in cache longer by transitioning it into Sticky or StickyDirty states. The data-aware cache may also use content or application related attributes to differentiate and retain certain data in cache longer. Further, the data-aware cache may provide cache status and statistics information to a data-aware data flow manager, thus assisting the data-aware data flow manager in determining which data to cache and which data to pipe directly through, or in switching cache policies dynamically, thus avoiding some of the overhead associated with caches. The data-aware cache may also place clean and dirty data in separate states, enabling more efficient cache mirroring and flush, thus improving system reliability and performance.

Proceedings Article
10 Apr 2005
TL;DR: To use the L2 cache as efficiently as possible, an L2-conscious scheduling algorithm is proposed and it is possible to reduce miss ratios in the L2 cache by 25-37% and improve processor throughput by 27-45%.
Abstract: We investigated how operating system design should be adapted for multithreaded chip multiprocessors (CMT) - a new generation of processors that exploit thread-level parallelism to mask the memory latency in modern workloads. We determined that the L2 cache is a critical shared resource on CMT and that an insufficient amount of L2 cache can undermine the ability to hide memory latency on these processors. To use the L2 cache as efficiently as possible, we propose an L2-conscious scheduling algorithm and quantify its performance potential. Using this algorithm it is possible to reduce miss ratios in the L2 cache by 25-37% and improve processor throughput by 27-45%.

Patent
Mason Cabot1
20 Dec 2005
TL;DR: In this article, a cache line eviction mechanism is proposed that enables programmers to mark portions of code with different cache priority levels based on anticipated or measured access patterns for those code portions.
Abstract: A method and apparatus to enable programmatic control of cache line eviction policies. A mechanism is provided that enables programmers to mark portions of code with different cache priority levels based on anticipated or measured access patterns for those code portions. Corresponding cues to assist in effecting the cache eviction policies associated with given priority levels are embedded in machine code generated from source- and/or assembly-level code. Cache architectures are provided that partition cache space into multiple pools, each pool being assigned a different priority. In response to execution of a memory access instruction, an appropriate cache pool is selected and searched based on information contained in the instruction's cue. On a cache miss, a cache line is selected from that pool to be evicted using a cache eviction policy associated with the pool. Implementations of the mechanism are described for both n-way set-associative caches and fully-associative caches.

Proceedings ArticleDOI
17 Sep 2005
TL;DR: A general-purpose compiler approach, called memory coloring, efficiently allocates the arrays in a program to an SPM by adapting an existing graph-colouring algorithm for register allocation to assign the arrays in the program to the register file.
Abstract: Scratchpad memory (SPM), a fast software-managed on-chip SRAM, is now widely used in modern embedded processors. Compared to hardware-managed cache, it is more efficient in performance, power and area cost, and has the added advantage of better time predictability. This paper introduces a general-purpose compiler approach, called memory coloring, for efficiently allocating the arrays in a program to an SPM. The novelty of our approach lies in partitioning an SPM into a "register file", splitting the live ranges of arrays to create potential data transfer statements between the SPM and off-chip memory, and finally, adapting an existing graph-colouring algorithm for register allocation to assign the arrays in the program to the register file. Our approach is efficient due to the practical efficiency of graph-colouring algorithms. We have implemented this work in SUIF and machSUIF. Preliminary results over benchmarks show that our approach represents a promising solution to automatic SPM management.
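
A stripped-down illustration of the idea follows: treat the SPM as a handful of fixed-size pseudo-registers, build an interference graph from array live ranges, and greedily color arrays onto the pseudo-registers, spilling the rest to off-chip memory. Live-range splitting and the multiple register classes described in the paper are omitted, and the data and ordering heuristic are made up for illustration.

    # Hedged sketch of "memory coloring": greedy graph coloring of array live
    # ranges onto scratchpad slots. A simplification of the paper's approach.

    def memory_color(arrays, live_ranges, spm_slots):
        """arrays: names; live_ranges: name -> (start, end); spm_slots: slot count.
        Returns name -> slot index for SPM-resident arrays, or None (stays off-chip)."""
        def interferes(a, b):
            (s1, e1), (s2, e2) = live_ranges[a], live_ranges[b]
            return s1 <= e2 and s2 <= e1                 # live ranges overlap

        assignment = {}
        # Color longer-lived arrays first (an assumed priority heuristic).
        for a in sorted(arrays, key=lambda x: live_ranges[x][1] - live_ranges[x][0], reverse=True):
            taken = {assignment[b] for b in assignment if interferes(a, b)}
            free = [s for s in range(spm_slots) if s not in taken and assignment.get(a) is None]
            assignment[a] = free[0] if free else None    # spill: keep in off-chip memory
        return assignment

    ranges = {"A": (0, 40), "B": (10, 20), "C": (25, 60), "D": (45, 90)}
    print(memory_color(list(ranges), ranges, spm_slots=2))   # D and A share slot 0; C and B share slot 1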

Proceedings ArticleDOI
20 Mar 2005
TL;DR: A new method to accurately estimate the reliability of cache memories is presented and three different techniques are presented to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude.
Abstract: Cosmic-ray induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean-time-to-failure) of unprotected first-level (L1) caches for twenty programs taken from the SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache possesses an MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit). However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that for selected programs, a 64 KB first-level cache is more than 10 times as vulnerable to soft errors as a 16 KB cache memory. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability.
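
The headline figure can be sanity-checked with raw FIT arithmetic, pessimistically assuming every bit is always vulnerable (1 FIT is one failure per 10^9 device-hours); the application-dependent vulnerability the paper measures would only raise the MTTF above this floor.

    # Back-of-the-envelope check of the quoted figure: 0.002 FIT/bit over a
    # 16 KB cache, assuming every upset in every bit causes a failure.

    def mttf_years(cache_bytes, fit_per_bit):
        bits = cache_bytes * 8
        total_fit = bits * fit_per_bit                # failures per 10^9 hours
        mttf_hours = 1e9 / total_fit
        return mttf_hours / (24 * 365)

    print(round(mttf_years(16 * 1024, 0.002)))        # ~435 years, i.e. "at least 400"
    print(round(mttf_years(64 * 1024, 0.002)))        # ~109 years from size alone; the >10x gap
                                                      # reported for some programs also reflects
                                                      # application-dependent vulnerability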

Proceedings ArticleDOI
12 Jun 2005
TL;DR: The impact of evolving memory system features, such as large on-chip caches, automatic prefetch, and the growing distance to main memory, on 3D stencil computations is investigated.
Abstract: In this work we investigate the impact of evolving memory system features, such as large on-chip caches, automatic prefetch, and the growing distance to main memory on 3D stencil computations. These calculations form the basis for a wide range of scientific applications from simple Jacobi iterations to complex multigrid and block structured adaptive PDE solvers. First we develop a simple benchmark to evaluate the effectiveness of prefetching in cache-based memory systems. Next we present a small parameterized probe and validate its use as a proxy for general stencil computations on three modern microprocessors. We then derive an analytical memory cost model for quantifying cache-blocking behavior and demonstrate its effectiveness in predicting the stencil-computation performance. Overall results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations.

Proceedings ArticleDOI
06 Jun 2005
TL;DR: StatCache is presented, a performance tool based on a statistical cache model that has a small run-time overhead while providing much of the flexibility of simulator-based tools and demonstrates how the flexibility can be used to better understand the characteristics of cache-related performance problems.
Abstract: Performance tools based on hardware counters can efficiently profile the cache behavior of an application and help software developers improve its cache utilization. Simulator-based tools can potentially provide more insights and flexibility and model many different cache configurations, but have the drawback of large run-time overhead. We present StatCache, a performance tool based on a statistical cache model. It has a small run-time overhead while providing much of the flexibility of simulator-based tools. A monitor process running in the background collects sparse memory access statistics about the analyzed application running natively on a host computer. Generic locality information is derived and presented in a code-centric and/or data-centric view. We evaluate the accuracy and performance of the tool using ten SPEC CPU2000 benchmarks. We also exemplify how the flexibility of the tool can be used to better understand the characteristics of cache-related performance problems.
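
One common way to write down the core of such a statistical model (in the spirit of StatCache, though not necessarily the paper's exact equations) is a fixed-point computation over sampled reuse distances: for a fully associative cache of L lines with random replacement, an access whose reuse distance is R misses with probability 1 - (1 - 1/L)^(R*m), where m is the overall miss ratio being solved for.

    # Hedged sketch of a statistical miss-ratio model: solve for the miss
    # ratio m as a fixed point over sampled reuse distances (counted in
    # intervening memory references). Sample data and sizes are made up.

    def estimate_miss_ratio(reuse_distances, cache_lines, iters=100):
        m, L = 0.5, cache_lines                   # initial guess for the miss ratio
        for _ in range(iters):
            miss_prob = [1.0 - (1.0 - 1.0 / L) ** (r * m) for r in reuse_distances]
            m = sum(miss_prob) / len(miss_prob)
        return m

    # Made-up sample: mostly short reuse distances plus a long-distance tail.
    samples = [20] * 80 + [5000] * 15 + [200000] * 5
    for lines in (512, 4096, 32768):
        print(lines, round(estimate_miss_ratio(samples, lines), 3))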

Patent
22 Dec 2005
TL;DR: In this article, the authors present a method for run-time cache optimization based on profiling a program code during a runtime execution, logging the performance for producing a cache log, and rearranging a portion of program code in view of the cache log for producing rearranged portion.
Abstract: A method (400) and system (106) are provided for run-time cache optimization. The method includes profiling (402) a performance of a program code during a run-time execution, logging (408) the performance for producing a cache log, and rearranging (410) a portion of program code in view of the cache log for producing a rearranged portion. The rearranged portion is supplied to a memory management unit (240) for managing at least one cache memory (110-140). The cache log can be collected during a real-time operation of a communication device and is fed back to a linking process (244) to maximize cache locality at compile time. The method further includes loading a saved profile corresponding with a run-time operating mode, and reprogramming a new code image associated with the saved profile.

Patent
30 Sep 2005
TL;DR: In this paper, instruction-assisted cache management for efficient use of cache and memory is discussed, where hints (e.g., modifiers) are added to read and write memory access instructions to identify the memory access for temporal data.
Abstract: Instruction-assisted cache management for efficient use of cache and memory. Hints (e.g., modifiers) are added to read and write memory access instructions to identify that the memory access is for temporal data. In view of such hints, alternative cache and allocation policies are implemented that minimize cache and memory access. Under one policy, a write cache miss may result in a write of data to a partial cache line without a memory read/write cycle to fill the remainder of the line. Under another policy, a read cache miss may result in a read from memory without allocating or writing the read data to a cache line. A cache line soft-lock mechanism is also disclosed, wherein cache lines may be selectably soft locked to indicate preference for keeping those cache lines over non-locked lines.

Journal ArticleDOI
TL;DR: This paper introduces IATAC (inter-access time per access count), a new hardware technique to reduce cache leakage for L2 caches that outperforms all previous state-of-the-art techniques.
Abstract: As technology evolves, power dissipation increases and cooling systems become more complex and expensive. There are two main sources of power dissipation in a processor: dynamic power and leakage. Dynamic power has been the most significant factor, but leakage will become increasingly significant in future. It is predicted that leakage will shortly be the most significant cost as it grows at about a 5× rate per generation. Thus, reducing leakage is essential for future processor design. Since large caches occupy most of the area, they are one of the leakiest structures in the chip and hence, a main source of energy consumption for future processors. This paper introduces IATAC (inter-access time per access count), a new hardware technique to reduce cache leakage for L2 caches. IATAC dynamically adapts the cache size to the program requirements, turning off cache lines whose content is not likely to be reused. Our evaluation shows that this approach outperforms all previous state-of-the-art techniques. IATAC turns off 65% of the cache lines across different L2 cache configurations with a very small performance degradation of around 2%.

Journal Article
TL;DR: In this article, the complexity of finding an optimal placement of objects (or code) in memory, in the sense that this placement reduces the number of cache misses during program execution to the minimum, is investigated.
Abstract: We investigate the complexity of finding an optimal placement of objects (or code) in the memory, in the sense that this placement reduces the number of cache misses during program execution to the minimum. We show that this problem is one of the toughest amongst the interesting NP optimization problems in computer science. In particular, suppose one is given a sequence of memory accesses and one has to place the data in the memory so as to minimize the number of cache misses for this sequence. We show that if P ≠ NP, then one cannot efficiently approximate the optimal solution even up to a very liberal approximation ratio. Thus, this problem joins the family of extremely inapproximable optimization problems. Two famous members in this family are minimum graph coloring and maximum clique. In light of this harsh lower bound, only mild approximation ratios can be obtained. We provide an algorithm that can map arbitrary access sequences within such a mild ratio. Next, we study the information loss when compressing the access sequence, keeping only pairwise relations. We show that the reduced information hides the optimal solution and highlights solutions that are far from optimal. Furthermore, we show that even if one restricts his attention to pairwise information, finding a good placement is computationally difficult.

Proceedings ArticleDOI
12 Feb 2005
TL;DR: This work proposes and analyzes a memory hierarchy that uses a unified compression scheme encompassing the last-level on-chip cache, the off-chip memory channel, and off-chip main memory, which achieves a peak improvement of 292%, compared to 165% and 83% for cache or bus compression alone.
Abstract: The memory system's large and growing contribution to system performance motivates more aggressive approaches to improving its efficiency. We propose and analyze a memory hierarchy that uses a unified compression scheme encompassing the last-level on-chip cache, the off-chip memory channel, and off-chip main memory. This scheme simultaneously increases the effective on-chip cache capacity, off-chip bandwidth, and main memory size, while avoiding compression and decompression overheads between levels. Simulations of the SPEC CPU2000 benchmarks using a 1MB cache and 128-byte blocks show an average speedup of 19%, while degrading performance by no more than 5%. The combined scheme achieves a peak improvement of 292%, compared to 165% and 83% for cache or bus compression alone. The compressed system generally provides even better performance as the block size is increased to 512 bytes.

Journal ArticleDOI
TL;DR: The implementation of the static hints scheme in the Open64 compiler for the Itanium processor shows a speedup of 10% on average on a set of pointer-intensive and regular loop-based programs and up to a 34% reduction in cache misses.

Journal ArticleDOI
TL;DR: A study of 23 programs drawn from the Powerstone, MediaBench, and Spec2000 benchmark suites shows that the configurable cache tuned to each program saved energy for every program compared to a conventional four-way set-associative cache as well as compared to a conventional direct-mapped cache, with an average savings of memory-access-related energy of over 40%.
Abstract: Energy consumption is a major concern in many embedded computing systems. Several studies have shown that cache memories account for about 50% of the total energy consumed in these systems. The performance of a given cache architecture is determined, to a large degree, by the behavior of the application executing on the architecture. Desktop systems have to accommodate a very wide range of applications and therefore the cache architecture is usually set by the manufacturer as a best compromise given current applications, technology, and cost. Unlike desktop systems, embedded systems are designed to run a small range of well-defined applications. In this context, a cache architecture that is tuned for that narrow range of applications can have both increased performance as well as lower energy consumption. We introduce a novel cache architecture intended for embedded microprocessor platforms. The cache has three software-configurable parameters that can be tuned to particular applications. First, the cache's associativity can be configured to be direct-mapped, two-way, or four-way set-associative, using a novel technique we call way concatenation. Second, the cache's total size can be configured by shutting down ways. Finally, the cache's line size can be configured to have 16, 32, or 64 bytes. A study of 23 programs drawn from Powerstone, MediaBench, and Spec2000 benchmark suites shows that the configurable cache tuned to each program saved energy for every program compared to a conventional four-way set-associative cache as well as compared to a conventional direct-mapped cache, with an average savings of energy related to memory access of over 40%.

Patent
Nimrod Megiddo1, Dharmendra S. Modha1
13 Jun 2005
TL;DR: In this article, an adaptive replacement cache policy dynamically maintains two lists of pages, a recency list and a frequency list, in addition to a cache directory, keeping these two lists to roughly the same size, the cache size c.
Abstract: An adaptive replacement cache policy dynamically maintains two lists of pages, a recency list and a frequency list, in addition to a cache directory. The policy keeps these two lists to roughly the same size, the cache size c. Together, the two lists remember twice the number of pages that would fit in the cache. At any time, the policy selects a variable number of the most recent pages to exclude from the two lists. The policy adaptively decides in response to an evolving workload how many top pages from each list to maintain in the cache at any given time. It achieves such online, on-the-fly adaptation by using a learning rule that allows the policy to track a workload quickly and effectively.
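
The policy described here is widely known as ARC (adaptive replacement cache). A compact sketch of its usual formulation follows; the patent's claims may differ in detail, and the integer adaptation step below is a common simplification.

    from collections import OrderedDict

    class ARC:
        """Sketch of ARC: T1/T2 hold resident pages (recency/frequency), B1/B2
        are their ghost lists, and p adapts the target size of T1."""
        def __init__(self, c):
            self.c, self.p = c, 0
            self.t1, self.t2 = OrderedDict(), OrderedDict()   # resident pages
            self.b1, self.b2 = OrderedDict(), OrderedDict()   # ghost lists (metadata only)

        def _replace(self, key):
            # Evict from T1 or T2 into the corresponding ghost list, guided by p.
            if self.t1 and (len(self.t1) > self.p or (key in self.b2 and len(self.t1) == self.p)):
                old, _ = self.t1.popitem(last=False)
                self.b1[old] = None
            else:
                old, _ = self.t2.popitem(last=False)
                self.b2[old] = None

        def access(self, key):
            if key in self.t1:                       # hit: promote to the frequency list
                self.t1.pop(key); self.t2[key] = None
                return True
            if key in self.t2:                       # hit: refresh position within T2
                self.t2.move_to_end(key)
                return True
            if key in self.b1:                       # ghost hit: recency was undervalued
                self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
                self._replace(key)
                self.b1.pop(key); self.t2[key] = None
                return False
            if key in self.b2:                       # ghost hit: frequency was undervalued
                self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
                self._replace(key)
                self.b2.pop(key); self.t2[key] = None
                return False
            # Complete miss.
            if len(self.t1) + len(self.b1) == self.c:
                if len(self.t1) < self.c:
                    self.b1.popitem(last=False)
                    self._replace(key)
                else:
                    self.t1.popitem(last=False)      # B1 empty: drop LRU of T1
            elif len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2) >= self.c:
                if len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2) == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(key)
            self.t1[key] = None                      # insert as most-recent in T1
            return False

    cache = ARC(c=3)
    print([cache.access(k) for k in (1, 2, 3, 1, 4, 1, 2, 5, 1, 2)])   # three hits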

Patent
Steven Paul Vanderwiel1
26 Oct 2005
TL;DR: In this paper, a selection mechanism selects lines evicted from the higher level cache for storage in the victim cache, only some of the evicted lines being selected for the victim.
Abstract: A computer system cache memory contains at least two levels. A lower level selective victim cache receives cache lines evicted from a higher level cache. A selection mechanism selects lines evicted from the higher level cache for storage in the victim cache, only some of the evicted lines being selected for the victim cache. Preferably, two priority bits associated with each cache line are used to select lines for the victim cache. The priority bits indicate whether the line has been re-referenced while in the higher level cache, and whether it has been reloaded after eviction from the higher level cache.
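
The admission idea can be sketched in a few lines: a line evicted from the higher-level cache enters the victim cache only if its priority bits say it was re-referenced or reloaded. The exact admission rule and the LRU victim-cache organization below are illustrative assumptions.

    # Hedged sketch of a selective victim cache: priority bits gate admission.

    from collections import OrderedDict

    class SelectiveVictimCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.lines = OrderedDict()                    # tag -> data, in LRU order

        def on_evict(self, tag, data, re_referenced, reloaded):
            """Called when the higher-level cache evicts a line."""
            if not (re_referenced or reloaded):           # low priority: drop it
                return False
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)            # evict the victim-cache LRU
            self.lines[tag] = data
            return True

        def lookup(self, tag):
            if tag in self.lines:
                self.lines.move_to_end(tag)
                return self.lines[tag]
            return None

    vc = SelectiveVictimCache(capacity=2)
    vc.on_evict(0x10, "A", re_referenced=True, reloaded=False)    # admitted
    vc.on_evict(0x20, "B", re_referenced=False, reloaded=False)   # filtered out
    print(vc.lookup(0x10), vc.lookup(0x20))                       # A None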

Patent
Robert W. Faber1
15 Sep 2005
TL;DR: In this paper, an apparatus and method to reduce the initialization time of a system is disclosed, where metadata associated with the cache line is stored in a distributed format in non-volatile memory with its associated cache line.
Abstract: An apparatus and method to reduce the initialization time of a system is disclosed. In one embodiment, upon a cache line update, metadata associated with the cache line is stored in a distributed format in non-volatile memory with its associated cache line. Upon indication of an expected shut down, metadata is copied from volatile memory and stored in non-volatile memory in a packed format. In the packed format, multiple metadata associated with multiple cache lines are stored together in, for example, a single memory block. Thus, upon system power up, if the system was shut down in an expected manner, metadata may be restored in volatile memory from the metadata stored in the packed format, with a significantly reduced boot time over restoring metadata from the metadata stored in the distributed format.

Patent
14 Oct 2005
TL;DR: In this article, a system and method for providing a shared RAM cache of a database, accessible by multiple processes, is described, and synchronization between the database and the shared cache is assured by using a unidirectional notification mechanism.
Abstract: A system and method are provided for providing a shared RAM cache of a database, accessible by multiple processes. By sharing a single cache rather than local copies of the database, memory is saved and synchronization of data accessed by different processes is assured. Synchronization between the database and the shared cache is assured by using a unidirectional notification mechanism between the database and the shared cache. Client APIs within the processes search the data within the shared cache directly, rather than by making a request to a database server. Therefore server load is not affected by the number of requesting applications and data fetch time is not affected by Inter-Process Communication delay or by additional context switching. A new synchronization scheme allows multiple processes to be used in building and maintaining the cache, greatly reducing start up time.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a way-halting cache, which is a four-way set-associative cache that stores the four lowest-order bits of all ways' tags into a fully associative memory.
Abstract: Caches contribute to much of a microprocessor system's power and energy consumption. Numerous new cache architectures, such as phased, pseudo-set-associative, way predicting, reactive-associative, way-shutdown, way-concatenating, and highly-associative, are intended to reduce power and/or energy, but they all impose some performance overhead. We have developed a new cache architecture, called a way-halting cache, that reduces energy further than previously mentioned architectures, while imposing no performance overhead. Our way-halting cache is a four-way set-associative cache that stores the four lowest-order bits of all ways' tags into a fully associative memory, which we call the halt tag array. The lookup in the halt tag array is done in parallel with, and is no slower than, the set-index decoding. The halt tag array predetermines which tags cannot match due to their low-order 4 bits mismatching. Further accesses to ways with known mismatching tags are then halted, thus saving power. Our halt tag array has an additional feature of using static logic only, rather than dynamic logic used in highly associative caches, making our cache simpler to design with existing tools. We provide data from experiments on 29 benchmarks drawn from Powerstone, Mediabench, and Spec 2000, based on our layouts in 0.18 micron CMOS technology. On average, we obtained 55% savings of memory-access related energy over a conventional four-way set-associative cache. We show that savings are greater than previous methods, and nearly twice that of highly associative caches, while imposing no performance overhead and only 2% cache area overhead.
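
A behavioral sketch of the halting step is shown below: a halt tag array keeps the low-order 4 bits of every way's tag, and only ways whose low bits match the incoming address go on to the full tag compare (and data read). Field widths, the set count, and the returned activation count are illustrative assumptions used to make the example concrete.

    # Hedged sketch of way halting: the low 4 tag bits filter which ways are
    # activated for the full compare. Parameters here are assumptions.

    NUM_SETS, NUM_WAYS, LOW_BITS = 64, 4, 4

    class WayHaltingCache:
        def __init__(self):
            self.tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
            self.halt = [[None] * NUM_WAYS for _ in range(NUM_SETS)]   # low 4 tag bits per way

        def fill(self, s, way, tag):
            self.tags[s][way] = tag
            self.halt[s][way] = tag & ((1 << LOW_BITS) - 1)

        def access(self, addr, line_bytes=32):
            line = addr // line_bytes
            s, tag = line % NUM_SETS, line // NUM_SETS
            low = tag & ((1 << LOW_BITS) - 1)
            # "Halting": only ways whose stored low bits match are considered;
            # the other ways never perform the full tag compare or data read.
            candidates = [w for w in range(NUM_WAYS) if self.halt[s][w] == low]
            hit = any(self.tags[s][w] == tag for w in candidates)
            return hit, len(candidates)        # ways actually activated

    c = WayHaltingCache()
    c.fill(5, 0, tag=0x3a1)
    c.fill(5, 1, tag=0x7b2)
    print(c.access((0x3a1 * NUM_SETS + 5) * 32))   # (True, 1): only one way activated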

Journal ArticleDOI
TL;DR: The range of methodologies that are being developed to overcome the challenges of exploring the cache design space for LCMP platforms is described, and a trace-driven approach to characterizing one key server workload (OLTP) in both a homogeneous and a heterogeneous workload environment is focused on.
Abstract: With the advent of dual-core chips in the marketplace, small-scale CMP (chip multiprocessor) architectures are becoming commonplace. We expect a continuing trend of increasing the number of cores on a die to maximize the performance/power efficiency of a single chip. We believe an era of large-scale CMPs (LCMPs) with several tens to hundreds of cores is on the way, but as of now architects have little understanding of how best to build a cache hierarchy given such a large number of cores/threads to support. With this in mind, our initial goals are to prune the cache design space for LCMPs by characterizing basic server workload behavior in such an environment. In this paper, we describe the range of methodologies that we are developing to overcome the challenges of exploring the cache design space for LCMP platforms. We then focus on employing a trace-driven approach to characterizing one key server workload (OLTP) in both a homogeneous and a heterogeneous workload environment. We study the effect of increasing threads (from 1 to 128) on a three-level cache hierarchy with emphasis on second and third level caches. We study the effect of varying sizes at these cache levels and show the effects of threads contending for cache space, the effects of prefetching instruction addresses, and the effects of inclusion. We make initial observations and conclusions about the factors on which LCMP cache hierarchy design decisions should be based and discuss future work.