Showing papers on "Cache invalidation" published in 2006


Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
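
As a rough software analogue of the mechanism described above (the paper's utility monitors are hardware circuits and its partitioning algorithm is more elaborate), the sketch below greedily hands out ways to whichever application gains the most additional hits; the `hits_at_ways` table and the example numbers are hypothetical.

```python
# Minimal software sketch of utility-based cache partitioning (UCP).
# Hypothetical illustration only: the paper implements utility monitors
# (UMONs) in hardware; here hits_at_ways[app][w] is assumed to hold the
# number of hits application `app` would get with `w` ways.

def partition_ways(hits_at_ways, total_ways):
    """Greedily assign cache ways to the applications that gain the most."""
    apps = list(hits_at_ways.keys())
    alloc = {app: 0 for app in apps}            # ways given to each app
    for _ in range(total_ways):
        def marginal_utility(app):
            w = alloc[app]
            return hits_at_ways[app][w + 1] - hits_at_ways[app][w]
        best = max(apps, key=marginal_utility)  # largest miss reduction
        alloc[best] += 1
    return alloc

# Example: app A saturates quickly, app B keeps benefiting from more ways.
hits = {
    "A": [0, 90, 95, 96, 96, 96, 96, 96, 96],   # index = number of ways
    "B": [0, 20, 40, 60, 80, 100, 120, 140, 160],
}
print(partition_ways(hits, total_ways=8))       # -> {'A': 1, 'B': 7}
```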

1,083 citations


Proceedings ArticleDOI
09 Dec 2006
TL;DR: This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors that can provide a differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees.
Abstract: This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect to overcome non-uniform cache access latency for good program performance and to reduce on-chip network traffic and related power consumption. Unlike previously studied hardware-based private and shared cache designs implementing a "fixed" caching policy, the proposed OS-microarchitecture approach is flexible; it can easily implement a wide spectrum of L2 caching policies without complex hardware support. Furthermore, our approach can provide a differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees. We discuss key design issues of the proposed approach and present preliminary experimental results showing the promise of our approach.
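
A minimal sketch of the OS-level placement idea, under the assumption (ours, for illustration) that a page's physical frame number modulo the number of L2 slices determines which slice caches it; the `PageAllocator` class and its policies are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of OS-level page allocation steering data to L2 slices.
# Assumption for illustration: the physical frame number modulo the number of
# L2 slices decides which slice caches that page, so the OS can pick a slice
# simply by picking the page frame.

from collections import defaultdict

NUM_SLICES = 16  # e.g. one L2 slice per tile in a 16-core CMP

class PageAllocator:
    def __init__(self, num_frames):
        # One free list per "color" (target L2 slice).
        self.free = defaultdict(list)
        for frame in range(num_frames):
            self.free[frame % NUM_SLICES].append(frame)

    def alloc(self, preferred_slices):
        """Allocate a frame mapping to a preferred slice, else any slice."""
        for s in preferred_slices:
            if self.free[s]:
                return self.free[s].pop()
        for s in range(NUM_SLICES):
            if self.free[s]:
                return self.free[s].pop()
        raise MemoryError("out of physical frames")

# A "private-like" policy maps a process's pages to its local slice only;
# a "shared-like" policy spreads them over every slice.
alloc = PageAllocator(num_frames=1 << 12)
private_frame = alloc.alloc(preferred_slices=[3])          # keep near core 3
shared_frame = alloc.alloc(preferred_slices=list(range(NUM_SLICES)))
print(private_frame % NUM_SLICES, shared_frame % NUM_SLICES)
```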

377 citations


Journal ArticleDOI
01 May 2006
TL;DR: This paper presents CMP cooperative caching, a unified framework to manage a CMP's aggregate on-chip cache resources by forming an aggregate "shared" cache through cooperation among private caches; the scheme performs robustly over a range of system/cache sizes and memory latencies.
Abstract: This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper. Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache, and a local cache hit rate similar to that of using private caches. Cooperative caching performs robustly over a range of system/cache sizes and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.
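
The following sketch illustrates two of the cooperative policies named in the abstract in simplified form: replication-aware victim selection and spilling an evicted singlet block into a peer's private cache. The function names and the probability parameter are illustrative assumptions, not the paper's protocol.

```python
# Simplified sketch of two cooperative-caching policies from the abstract:
# (1) replication-aware replacement: prefer to evict blocks that are
#     replicated in another private cache, and
# (2) spilling: give an evicted unique ("singlet") block one more chance by
#     placing it in a randomly chosen peer's private cache.

import random

def choose_victim(cache_set, sharers_of):
    """cache_set: block addresses ordered from LRU to MRU.
    sharers_of(addr) -> number of private caches holding addr."""
    for addr in cache_set:                       # scan from the LRU side
        if sharers_of(addr) > 1:                 # replica exists elsewhere
            return addr                          # safe to drop locally
    return cache_set[0]                          # otherwise plain LRU victim

def evict(addr, local_id, private_caches, sharers_of, spill_prob=1.0):
    if sharers_of(addr) == 1 and random.random() < spill_prob:
        peers = [i for i in range(len(private_caches)) if i != local_id]
        private_caches[random.choice(peers)].add(addr)   # keep it on chip
    private_caches[local_id].discard(addr)

caches = [set() for _ in range(4)]
caches[0].update({0x10, 0x20})
sharers = lambda a: sum(a in c for c in caches)
victim = choose_victim([0x10, 0x20], sharers)   # both blocks are singlets here
evict(victim, local_id=0, private_caches=caches, sharers_of=sharers)
print(victim in caches[0], any(victim in c for c in caches[1:]))  # False True
```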

377 citations


Journal ArticleDOI
01 May 2006
TL;DR: Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%, and a novel, low-hardware overhead mechanism called sampling based adaptive replacement (SBAR) is proposed to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory related stalls.
Abstract: Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses - some misses occur in isolation while some occur in parallel with other misses. Isolated misses are more costly on performance than parallel misses. However, traditional cache replacement is not aware of the MLP-dependent cost differential between different misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement by using a runtime technique to compute the MLP-based cost for each cache miss. It then describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware overhead mechanism called Sampling Based Adaptive Replacement (SBAR), to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.
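
A hedged sketch of the core replacement idea: combine recency with an MLP-based cost so that old blocks whose misses would be serviced in parallel (cheap) are evicted before blocks whose misses would be isolated (costly). The weighting and the example numbers are assumptions for illustration, not the paper's exact cost function.

```python
# Illustrative sketch of MLP-aware replacement: each block carries an
# MLP-based cost (low for misses serviced in parallel, high for isolated
# misses), and the victim minimizes a combination of recency and that cost.

def select_victim(blocks):
    """blocks: list of dicts with 'recency' (0 = MRU, larger = older)
    and 'mlp_cost' (estimated stall cost of re-fetching the block)."""
    def badness(b):
        # Prefer evicting old blocks whose misses would be cheap to service.
        return b["recency"] - 0.5 * b["mlp_cost"]
    return max(blocks, key=badness)

ways = [
    {"addr": 0xA, "recency": 3, "mlp_cost": 8.0},  # old but costly to miss on
    {"addr": 0xB, "recency": 2, "mlp_cost": 1.0},  # parallel (cheap) miss
    {"addr": 0xC, "recency": 0, "mlp_cost": 2.0},
]
print(hex(select_victim(ways)["addr"]))            # -> 0xb
```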

316 citations


Proceedings ArticleDOI
16 Sep 2006
TL;DR: It is found that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
Abstract: As chip multiprocessors (CMPs) become increasingly mainstream, architects have likewise become more interested in how best to share a cache hierarchy among multiple simultaneous threads of execution. The complexity of this problem is exacerbated as the number of simultaneous threads grows from two or four to the tens or hundreds. However, there is no consensus in the architectural community on what "best" means in this context. Some papers in the literature seek to equalize each thread's performance loss due to sharing, while others emphasize maximizing overall system performance. Furthermore, the specific effect of these goals varies depending on the metric used to define "performance". In this paper we label equal performance targets as Communist cache policies and overall performance targets as Utilitarian cache policies. We compare both of these models to the most common current model of a free-for-all cache (a Capitalist policy). We consider various performance metrics, including miss rates, bandwidth usage, and IPC, including both absolute and relative values of each metric. Using analytical models and behavioral cache simulation, we find that the optimal partitioning of a shared cache can vary greatly as different but reasonable definitions of optimality are applied. We also find that, although Communist and Utilitarian targets are generally compatible, each policy has workloads for which it provides poor overall performance or poor fairness, respectively. Finally, we find that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
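
The toy model below makes the paper's distinction concrete: with hypothetical per-thread IPC-versus-ways curves, a Utilitarian objective (maximize total IPC) and a Communist objective (minimize the worst relative slowdown) pick very different partitions of an 8-way cache.

```python
# Toy illustration of how "best" depends on the target: given per-thread IPC
# as a function of allocated cache ways (made-up numbers), a Utilitarian
# policy maximizes total IPC while a Communist policy equalizes each thread's
# slowdown relative to running with the whole cache to itself.

ipc = {  # ipc[thread][ways] for an 8-way shared cache (hypothetical values)
    "t0": [0.0, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.38, 0.40],
    "t1": [0.0, 1.00, 1.20, 1.40, 1.60, 1.70, 1.80, 1.90, 2.00],
}

def partitions(total_ways):
    for w0 in range(1, total_ways):          # each thread gets at least 1 way
        yield {"t0": w0, "t1": total_ways - w0}

def utilitarian_score(p):                    # overall throughput
    return ipc["t0"][p["t0"]] + ipc["t1"][p["t1"]]

def communist_score(p):                      # negative worst relative slowdown
    return -max(ipc[t][8] / ipc[t][p[t]] for t in p)

print(max(partitions(8), key=utilitarian_score))  # -> {'t0': 1, 't1': 7}
print(max(partitions(8), key=communist_score))    # -> {'t0': 5, 't1': 3}
```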

198 citations


Patent
24 Apr 2006
TL;DR: In this paper, a method and system of managing data access in a shared memory cache of a processor is described, which includes probing one or more memory addresses that map to a subset of the shared memory cache and sensing a plurality of events in the one or more memory addresses.
Abstract: A method and system of managing data access in a shared memory cache of a processor are disclosed. The method includes probing one or more memory addresses that map to a subset of the shared memory cache and sensing a plurality of events in the one or more memory addresses. Cache utilization information is then obtained by reading a hardware performance counter of the processor. The hardware performance counter is incremented based on the occurrence of the plurality of events. Based upon the cache utilization information, an occurrence of one of the plurality of events is reduced.

140 citations


Proceedings ArticleDOI
09 Dec 2006
TL;DR: This paper proposes an implementation of the cache coherence protocol within the network, embedding directories within each router node that manage and steer requests towards nearby data copies, enabling in-transit optimization of memory access delay.
Abstract: With the trend towards an increasing number of processor cores in future chip architectures, scalable directory-based protocols for maintaining cache coherence will be needed. However, directory-based protocols face well-known problems in delay and scalability. Most current protocol optimizations targeting these problems maintain a firm abstraction of the interconnection network fabric as a communication medium: protocol optimizations consist of end-to-end messages between requestor, directory and sharer nodes, while network optimizations separately target lowering communication latency for coherence messages. In this paper, we propose an implementation of the cache coherence protocol within the network, embedding directories within each router node that manage and steer requests towards nearby data copies, enabling in-transit optimization of memory access delay. Simulation results across a range of SPLASH-2 benchmarks demonstrate significant performance improvement and good system scalability, with up to 44.5% and 56% savings in average memory access latency for 16- and 64-node systems, respectively, when compared against the baseline directory cache coherence protocol. Detailed microarchitecture and implementation characterization affirms the low area and delay impact of in-network coherence.

128 citations


Proceedings ArticleDOI
27 Feb 2006
TL;DR: This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads and shows that a shared 32 MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32 MB private cache configuration by several orders of magnitude.
Abstract: With the continuing growth in the amount of genetic data, members of the bioinformatics community are developing a variety of data-mining applications to understand the data and discover meaningful information. These applications are important in defining the design and performance decisions of future high performance microprocessors. This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads. For a CMP with a three-level cache hierarchy, we model the last-level of the cache hierarchy as either multiple private caches or a single cache shared amongst different cores of the CMP. Our experiments show that the bioinformatics workloads exhibit significant data-sharing - 50-95% of the data cache is shared by the different threads of the workload. Furthermore, regardless of the amount of data cache shared, for some workloads, as many as 98% of the accesses to the last-level cache are to shared data cache lines. Additionally, the amount of data-sharing exhibited by the workloads is a function of the total cache size available - the larger the data cache the better the sharing behavior. Thus, partitioning the available last-level cache silicon area into multiple private caches can cause applications to lose their inherent data-sharing behavior. For the workloads in this study, a shared 32 MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32 MB private cache configuration by several orders of magnitude. Specifically, with shared last-level caches, the bandwidth demands beyond the last-level cache can be reduced by factors of 3-625 when compared to private last-level caches.

125 citations


Patent
06 Apr 2006
TL;DR: In this paper, the authors present a system that facilitates selectively unmarking load-marked cache lines during transactional program execution, where load-marked cache lines are monitored during the transactional execution to detect interfering accesses from other threads.
Abstract: One embodiment of the present invention provides a system that facilitates selectively unmarking load-marked cache lines during transactional program execution, wherein load-marked cache lines are monitored during transactional execution to detect interfering accesses from other threads. During operation, the system encounters a release instruction during transactional execution of a block of instructions. In response to the release instruction, the system modifies the state of cache lines, which are specially load-marked to indicate they can be released from monitoring, to account for the release instruction being encountered. In doing so, the system can potentially cause the specially load-marked cache lines to become unmarked. In a variation on this embodiment, upon encountering a commit-and-start-new-transaction instruction, the system modifies load-marked cache lines to account for the commit-and-start-new-transaction instruction being encountered. In doing so, the system causes normally load-marked cache lines to become unmarked, while other specially load-marked cache lines may remain load-marked past the commit-and-start-new-transaction instruction.

112 citations


Journal ArticleDOI
TL;DR: It is claimed that a sufficient number of good Web cache replacement policies already exists; the article identifies appropriate policies for proxies with different characteristics, such as proxies with a small cache, limited bandwidth, or limited processing power, and suggests policies for different types of proxies.
Abstract: Research involving Web cache replacement policy has been active for at least a decade. In this article we would like to claim that there is a sufficient number of good policies, and further proposals would only produce minute improvements. We argue that the focus should be fitness for purpose rather than proposing any new policies. Up to now, almost all policies were purported to perform better than others, creating confusion as to which policy should be used. Actually, a policy only performs well in certain environments. Therefore, the goal of this article is to identify the appropriate policies for proxies with different characteristics, such as proxies with a small cache, limited bandwidth, and limited processing power, as well as to suggest policies for different types of proxies, such as ISP-level and root-level proxies.

112 citations


Journal ArticleDOI
01 Mar 2006
TL;DR: Simulation results indicate that the proposed aggregate caching mechanism and a broadcast-based Simple Search algorithm can significantly improve Imanet performance in terms of throughput and average number of hops to access data items.
Abstract: Internet-based mobile ad hoc network (Imanet) is an emerging technique that combines a wired network (e.g. Internet) and a mobile ad hoc network (Manet) for developing a ubiquitous communication infrastructure. To fulfill users' demand to access various kinds of information, however, an Imanet has several limitations such as limited accessibility to the wired Internet, insufficient wireless bandwidth, and longer message latency. In this paper, we address the issues involved in information search and access in Imanets. An aggregate caching mechanism and a broadcast-based Simple Search (SS) algorithm are proposed for improving the information accessibility and reducing average communication latency in Imanets. As a part of the aggregate cache, a cache admission control policy and a cache replacement policy, called Time and Distance Sensitive (TDS) replacement, are developed to reduce the cache miss ratio and improve the information accessibility. We evaluate the impact of caching, cache management, and the number of access points that are connected to the Internet, through extensive simulation. The simulation results indicate that the proposed aggregate caching mechanism can significantly improve an Imanet performance in terms of throughput and average number of hops to access data items.
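
The abstract does not give the TDS scoring function, so the sketch below only illustrates the general idea with an assumed score: cached items that are close to a node that can supply them again, and that are about to expire anyway, are the cheapest to evict. All names and numbers are hypothetical.

```python
# Hypothetical sketch of a Time-and-Distance-Sensitive style replacement
# decision. The exact TDS scoring used in the paper is not given in the
# abstract; here we simply assume that items whose origin (an Internet access
# point or another cached copy) is close, and whose remaining lifetime is
# short, are the cheapest to evict.

def tds_victim(cached_items, now):
    """cached_items: list of dicts with 'ttl_expiry' (absolute time) and
    'distance' (hops to the nearest node that can supply the item again)."""
    def eviction_score(item):
        remaining_life = max(item["ttl_expiry"] - now, 0)
        # Low remaining life and low re-fetch distance => good eviction victim.
        return remaining_life * item["distance"]
    return min(cached_items, key=eviction_score)

items = [
    {"id": "a", "ttl_expiry": 120, "distance": 1},   # close by, expiring soon
    {"id": "b", "ttl_expiry": 300, "distance": 4},   # far away, long-lived
    {"id": "c", "ttl_expiry": 115, "distance": 2},
]
print(tds_victim(items, now=100)["id"])              # -> 'a'
```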

Proceedings ArticleDOI
09 Dec 2006
TL;DR: A novel and general scheme is presented by which any two cache management algorithms can be combined, adaptively switching between them to closely track the locality characteristics of a given program; the scheme is inspired by recent work in virtual memory management at the operating system level.
Abstract: We present and evaluate the idea of adaptive processor cache management. Specifically, we describe a novel and general scheme by which we can combine any two cache management algorithms (e.g., LRU, LFU, FIFO, Random) and adaptively switch between them, closely tracking the locality characteristics of a given program. The scheme is inspired by recent work in virtual memory management at the operating system level, which has shown that it is possible to adapt over two replacement policies to provide an aggregate policy that always performs within a constant factor of the better component policy. A hardware implementation of adaptivity requires very simple logic but duplicate tag structures. To reduce the overhead, we use partial tags, which achieve good performance with a small hardware cost. In particular, adapting between LRU and LFU replacement policies on an 8-way 512KB L2 cache yields a 12.7% improvement in average CPI on applications that exhibit a non-negligible L2 miss ratio. Our approach increases total cache storage by 4.0%, but it still provides slightly better performance than a conventional 10-way set-associative 640KB cache, which requires 25% more storage.
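
A minimal sketch of the adaptive idea (not the paper's partial-tag hardware): run LRU and LFU as shadow tag directories for a set, count which policy would have missed less, and let the current winner choose the victim in the real cache. The class and its tie-breaking rules are illustrative assumptions.

```python
# Minimal sketch of adaptively switching between two replacement policies.
# Two "shadow" directories (LRU and LFU) are simulated per set; the policy
# with fewer shadow misses so far picks the victim for the real contents.

from collections import OrderedDict, Counter

class AdaptiveSet:
    def __init__(self, ways):
        self.ways = ways
        self.lru = OrderedDict()              # shadow tags in LRU order
        self.lfu = Counter()                  # shadow tags with access counts
        self.misses = {"lru": 0, "lfu": 0}
        self.real = set()                     # tags actually resident

    def _shadow_access(self, tag):
        if tag in self.lru:
            self.lru.move_to_end(tag)
        else:
            self.misses["lru"] += 1
            if len(self.lru) >= self.ways:
                self.lru.popitem(last=False)  # drop least recently used
            self.lru[tag] = True
        if tag not in self.lfu:
            self.misses["lfu"] += 1
            if len(self.lfu) >= self.ways:
                del self.lfu[min(self.lfu, key=self.lfu.get)]
        self.lfu[tag] += 1

    def access(self, tag):
        self._shadow_access(tag)
        if tag in self.real:
            return "hit"
        if len(self.real) >= self.ways:       # evict per the better policy
            if self.misses["lru"] <= self.misses["lfu"]:
                order = {t: i for i, t in enumerate(self.lru)}  # oldest first
            else:
                order = dict(self.lfu)        # lowest count first
            victim = min(self.real, key=lambda t: order.get(t, -1))
            self.real.remove(victim)
        self.real.add(tag)
        return "miss"

s = AdaptiveSet(ways=2)
for t in [1, 2, 1, 3, 1, 2, 1, 3]:
    s.access(t)
print(s.misses)   # shadow miss counts steer which policy evicts
```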

Proceedings ArticleDOI
05 Jul 2006
TL;DR: Experimental results provided in the paper show that with an appropriate selection of regions and cache contents, the worst-case performance of applications with locked instruction caches is competitive with the worst-case performance of unlocked caches.
Abstract: Cache memories have been extensively used to bridge the gap between high speed processors and relatively slower main memories. However, they are sources of predictability problems because of their dynamic and adaptive behavior, and thus need special attention to be used in hard real-time systems. A lot of progress has been achieved in the last ten years to statically predict worst-case execution times (WCETs) of tasks on architectures with caches. However, cache-aware WCET analysis techniques are not always applicable due to the lack of documentation in hardware manuals concerning the cache replacement policies. Moreover, they tend to be pessimistic with some cache replacement policies (e.g. random replacement policies). Lastly, caches are sources of timing anomalies in dynamically scheduled processors (a cache miss may in some cases result in a shorter execution time than a hit). To reconcile performance and predictability of caches, we propose in this paper algorithms for software control of instruction caches. The proposed algorithms statically divide the code of tasks into regions, for which the cache contents are statically selected. At run-time, at every transition between regions, the cache contents computed off-line are loaded into the cache and the cache replacement policy is disabled (the cache is locked). Experimental results provided in the paper show that with an appropriate selection of regions and cache contents, the worst-case performance of applications with locked instruction caches is competitive with the worst-case performance of unlocked caches.
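
As a hedged illustration of the select-and-lock approach, the sketch below picks, for each region, the instruction lines with the highest estimated execution counts that fit in the cache and computes a simple worst-case fetch bound with those lines locked; the paper's actual selection algorithms and timing model are more refined.

```python
# Hedged sketch of the "select-and-lock" idea: statically choose, per region,
# the instruction lines worth locking, so the region's transition code can
# preload and lock exactly those lines at run time.

def select_locked_lines(region_lines, cache_lines):
    """region_lines: dict {line_address: estimated worst-case exec count}.
    Returns the set of line addresses to preload and lock for the region."""
    ranked = sorted(region_lines, key=region_lines.get, reverse=True)
    return set(ranked[:cache_lines])

def region_wcet(region_lines, locked, hit_cycles=1, miss_cycles=30):
    """Worst-case instruction-fetch cycles with a locked cache: every fetch to
    a locked line hits; every other fetch is (pessimistically) a miss."""
    return sum(
        count * (hit_cycles if addr in locked else miss_cycles)
        for addr, count in region_lines.items()
    )

loop_region = {0x100: 1000, 0x140: 1000, 0x180: 990, 0x1C0: 5, 0x200: 5}
locked = select_locked_lines(loop_region, cache_lines=3)
print(sorted(hex(a) for a in locked))     # the three hot loop lines
print(region_wcet(loop_region, locked))   # predictable fetch-cycle bound
```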

Patent
29 Nov 2006
TL;DR: In this paper, the state of the cache during a cache preload operation is monitored to avoid the need to restart the cache pre-load operation from the beginning, and any data that has been preloaded into a cache prior to a failure may be retained after a failover occurs.
Abstract: An apparatus, program product and method monitor the state of a cache during a cache preload operation in a clustered computer system such that the monitored state can be used after a failover to potentially avoid the need to restart the cache preload operation from the beginning. In particular, by monitoring the state of the cache during a cache preload operation, any data that has been preloaded into a cache prior to a failure may be retained after a failover occurs, thus enabling the cache preload operation to continue from the point at which it was interrupted as a result of the failure.

Patent
04 Jan 2006
TL;DR: In this article, a method for routing a data request received by a caching system is described, where the data request is transmitted without determining whether the requested data are stored in the cache.
Abstract: A method for routing a data request received by a caching system is described. The caching system includes a router and a cache, and the data request identifies a source platform, a destination platform, and requested data. Where the source and destination platforms correspond to an entry in a list automatically generated by the caching system, the data request is transmitted without determining whether the requested data are stored in the cache.

Patent
27 Dec 2006
TL;DR: In this article, a permanent cache list of files not to be removed from a cache is determined in response to a user selection of an application to be added to the cache by adding a file to cache list if the file is a static dependency of the application or if a file has a high probability of being used in the future by the application.
Abstract: In some embodiments a permanent cache list of files not to be removed from a cache is determined in response to a user selection of an application to be added to the cache. The determination is made by adding a file to the cache list if the file is a static dependency of the application, or if a file has a high probability of being used in the future by the application. Other embodiments are described and claimed.

Patent
21 Jul 2006
TL;DR: In this article, the authors propose a method of securely synchronizing cache contents of a mobile browser with a server, which includes initiating a session between the browser and server, including transmission of browser state information regarding the cache contents and an authentication key to the server, maintaining a record of data sent from the server to the browser for storage in the cache, and transmitting data requests from the browser to the web server, in response to which the server uses the key as a seed generation function and accesses each the record of the data and returns only data that does not already form part of
Abstract: A method of securely synchronizing cache contents of a mobile browser with a server includes initiating a session between the browser and server, including transmission of browser state information regarding the cache contents and an authentication key to the server; maintaining a record of data sent from the server to the browser for storage in the cache; maintaining a record of the state information regarding the cache contents transmitted from the browser to the server; and transmitting data requests from the browser to the server, in response to which the server uses the key as a seed generation function and accesses each the record of data and returns only data that does not already form part of the cache contents, and wherein the data includes a result of a hash of data generated by the generation function for authentication by the browser before updating the cache contents with the data.

Patent
27 Mar 2006
TL;DR: A method is described for optimising the management of a server cache for dynamic pages that may be consulted by client terminals with differing characteristics, which requires the provision of discrete versions of a dynamic page in the cache.
Abstract: The invention relates to a method for optimising the management of a server cache for dynamic pages which may be consulted by client terminals with differing characteristics, which requires the provision of discrete versions (10) of a dynamic page in the cache. According to the method, when a terminal requests (11) a dynamic page, a verification step (12) for the presence of at least one version of the dynamic page in the cache is carried out, such that if the verification is positive the following complementary steps are carried out: procurement (13) of a set of characteristics specific to the type of client terminal, determination (14) of a subset of necessary characteristics from amongst the specific characteristics for the reproduction of the dynamic page on a client terminal, search (15) among the version(s) of the dynamic page in the cache for a suitable version using the subset of necessary characteristics, and allocation (16) of the suitable version to the client terminal.

Journal ArticleDOI
26 Jun 2006
TL;DR: This paper is the first to propose an analytical model which predicts the performance of cache replacement policies; the model is based on probability theory and relies solely on the statistical properties of the application, without relying on heuristics or rules of thumb.
Abstract: Due to the increasing gap between CPU and memory speed, cache performance plays an increasingly critical role in determining the overall performance of microprocessor systems. One of the important factors that affect cache performance is the cache replacement policy. Despite this importance, current analytical cache performance models ignore the impact of cache replacement policies on cache performance. To the best of our knowledge, this paper is the first to propose an analytical model which predicts the performance of cache replacement policies. The input to our model is a simple circular sequence profiling of each application, which requires very little storage overhead. The output of the model is the predicted miss rates of an application under different replacement policies. The model is based on probability theory and utilizes Markov processes to compute each cache access's miss probability. The model makes realistic assumptions and relies solely on the statistical properties of the application, without relying on heuristics or rules of thumb. The model's run time is less than 0.1 seconds, much lower than that of trace simulations. We validate the model by comparing the predicted miss rates of seventeen Spec2000 and NAS benchmark applications against miss rates obtained by detailed execution-driven simulations, across a range of different cache sizes, associativities, and four replacement policies, and show that the model is very accurate. The model's average prediction error is 1.41%, and there are only 14 out of 952 validation points in which the prediction errors are larger than 10%.
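
The abstract does not reproduce the model's equations. As a much simpler relative of the same idea (predict miss rates from a compact profile instead of simulating), the sketch below estimates LRU miss rates for a fully associative cache from a reuse-distance histogram; the paper's model goes further, covering set-associative caches and other policies via Markov processes. The histogram values are made up.

```python
# Predict LRU miss rates of a fully associative cache from a reuse (stack)
# distance histogram instead of running a trace simulation.

def lru_miss_rate(stack_distance_hist, cold_misses, cache_blocks):
    """stack_distance_hist[d] = number of accesses whose reuse distance is d,
    where d counts the distinct blocks touched since the previous access to
    the same block, including that block itself (so a hit needs d <= size)."""
    total = cold_misses + sum(stack_distance_hist.values())
    misses = cold_misses + sum(
        count for d, count in stack_distance_hist.items() if d > cache_blocks
    )
    return misses / total

# Example profile (hypothetical): most reuses are short, a few are long.
hist = {1: 500, 2: 300, 8: 120, 64: 80}
for size in (4, 16, 128):
    print(size, round(lru_miss_rate(hist, cold_misses=50, cache_blocks=size), 3))
```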

Patent
Sebastien Pouliot1
24 Aug 2006
TL;DR: In this article, a system and method for providing multiple assembly caches for storing shared application resources is described, where each assembly cache may be associated with different security policies, locations, internal structures and management.
Abstract: The invention relates to a system and method for providing multiple assembly caches for storing shared application resources. Each assembly cache may be associated with different security policies, locations, internal structures and management. An application may be determined to have access to an assembly cache based on the permission and security policy of the application and the security policy of the assembly cache. Additionally, one or more assembly caches may have other policies for cache retention, resolution, and creation.

Journal ArticleDOI
TL;DR: This paper presents a cache-oblivious algorithm for matrix multiplication that uses a block-recursive structure and an element ordering based on Peano curves, which leads to asymptotically optimal spatial and temporal locality of data accesses.
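
The sketch below shows only the generic block-recursive structure behind cache-oblivious matrix multiplication (recurse until blocks fit in whatever cache level, without knowing its size); the paper's specific contribution, the Peano-curve element ordering, is not reproduced here.

```python
# Generic cache-oblivious matrix multiplication by recursive blocking.
# The recursion keeps subdividing until the working blocks fit in some level
# of the cache hierarchy, whatever its (unknown) size.

def matmul_recursive(A, B, C, i0, j0, k0, n, base=16):
    """Accumulate A[i0:i0+n, k0:k0+n] @ B[k0:k0+n, j0:j0+n] into C[i0.., j0..].
    n is assumed to be a power of two."""
    if n <= base:
        for i in range(i0, i0 + n):
            for k in range(k0, k0 + n):
                a = A[i][k]
                for j in range(j0, j0 + n):
                    C[i][j] += a * B[k][j]
        return
    h = n // 2
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):   # the recursion order could follow a Peano curve
                matmul_recursive(A, B, C, i0 + di, j0 + dj, k0 + dk, h, base)

n = 64
A = [[(i + j) % 7 for j in range(n)] for i in range(n)]
B = [[(i * j) % 5 for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
matmul_recursive(A, B, C, 0, 0, 0, n)
```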

Proceedings ArticleDOI
24 Jan 2006
TL;DR: A method is given to rapidly find the L1 cache miss rate of an application, and an energy model and an execution time model are developed to find the best cache configuration for the given embedded application.
Abstract: Modern embedded systems execute a single application or a class of applications repeatedly. An emerging methodology for designing embedded systems utilizes configurable processors where the cache size, associativity, and line size can be chosen by the designer. In this paper, a method is given to rapidly find the L1 cache miss rate of an application. An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. Using benchmarks from Mediabench, we find that our method explores the design space on average 45 times faster than Dinero IV while still having 100% accuracy.
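
A hedged sketch of the exploration loop implied by the abstract: evaluate every (size, associativity, line size) configuration with simple energy and execution-time models fed by a miss-rate estimate, and keep the best one. The miss-rate curve and the per-access cost formulas below are placeholders, not the paper's calibrated models.

```python
# Design-space exploration over cache configurations using placeholder
# energy and execution-time models; the best configuration minimizes the
# energy-delay product.

from itertools import product

def exec_time(accesses, miss_rate, hit_cycles, miss_penalty):
    return accesses * (hit_cycles + miss_rate * miss_penalty)

def energy(accesses, miss_rate, e_hit_nj, e_miss_nj):
    return accesses * (e_hit_nj + miss_rate * e_miss_nj)

def best_config(accesses, miss_rate_of, configs):
    best = None
    for size_kb, assoc, line in configs:
        mr = miss_rate_of(size_kb, assoc, line)
        # Larger, more associative caches cost more per hit (placeholder).
        hit_nj = 0.05 * (size_kb ** 0.5) * (1 + 0.1 * assoc)
        t = exec_time(accesses, mr, hit_cycles=1 + assoc // 4, miss_penalty=40)
        e = energy(accesses, mr, hit_nj, e_miss_nj=5.0)
        score = e * t
        if best is None or score < best[0]:
            best = (score, (size_kb, assoc, line))
    return best[1]

def fake_miss_rate(size_kb, assoc, line):              # placeholder curve
    return 0.2 / (size_kb ** 0.5) / (1 + 0.2 * assoc) * (64 / line) ** 0.3

configs = list(product([2, 4, 8, 16, 32], [1, 2, 4], [16, 32, 64]))
print(best_config(accesses=1_000_000, miss_rate_of=fake_miss_rate, configs=configs))
```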

Patent
Marcus L. Kornegay1, Ngan N. Pham1
22 Nov 2006
TL;DR: A method for cache management of lines of binary code stored in the cache is described, where a cache manager can identify a line, or lines, to be evicted by accessing eviction and reload data kept in a cache directory.
Abstract: A method for cache management is disclosed. The method can assign or determine identifiers for lines of binary code that are, or will be, stored in cache. The method can create a cache directory that utilizes the identifier to keep an eviction count and/or a reload count for cached lines. Thus, each time a line is entered into, or evicted from, the cache, the cache eviction log can be amended accordingly. When a processor receives or creates an instruction that requests that a line be evicted from cache, a cache manager can identify a line, or lines, of binary code to be evicted based on this data by accessing the cache directory, and then the line(s) can be evicted.

Proceedings ArticleDOI
04 Apr 2006
TL;DR: This work bounds the penalty of cache interference for real-time tasks by providing accurate predictions of the data cache behavior across preemptions, and it is shown that such accurate modeling of data cache behavior in preemptive systems significantly improves the WCET predictions for a task.
Abstract: Caches have become invaluable for higher-end architectures to hide, in part, the increasing gap between processor speed and memory access times. While the effect of caches on timing predictability of single real-time tasks has been the focus of much research, bounding the overhead of cache warm-ups after preemptions remains a challenging problem, particularly for data caches. In this paper, we bound the penalty of cache interference for real-time tasks by providing accurate predictions of the data cache behavior across preemptions. For every task, we derive data cache reference patterns for all scalar and non-scalar references. Partial timing of a task is performed up to a preemption point using these patterns. The effects of cache interference are then analyzed using a set-theoretic approach, which identifies the number and location of additional misses due to preemption. A feedback mechanism provides the means to interact with the timing analyzer, which subsequently times another interval of a task bounded by the next preemption. Our experimental results demonstrate that it is sufficient to consider the n most expensive preemption points, where n is the maximum possible number of preemptions. Further, it is shown that such accurate modeling of data cache behavior in preemptive systems significantly improves the WCET predictions for a task. To the best of our knowledge, our work of bounding preemption delay for data caches is unprecedented.

Journal ArticleDOI
TL;DR: This paper presents the modeling approach for assessing the impact of soft errors using architectural simulators, and describes a new technique for reducing the vulnerability of data caches: refetching.
Abstract: Data caches are a fundamental component of most modern microprocessors. They provide for efficient read/write access to data memory. Errors occurring in the data cache can corrupt data values or state, and can easily propagate throughout the memory hierarchy. One of the main threats to data cache reliability is soft (transient, nonreproducible) errors. These errors can occur more often than hard (permanent) errors, and most often arise from single event upsets (SEUs) caused by strikes from energetic particles such as neutrons and alpha particles. Many protection techniques exist for data caches; the most common are ECC (error correcting codes) and parity. These protection techniques detect all single bit errors and, in the case of ECC, correct them. To make proper design decisions about which protection technique to use, accurate design-time modeling of cache reliability is crucial. In addition, as caches increase in storage capacity, another important goal is to reduce the failure rate of a cache, to limit disruption to normal system operation. In this paper, we present our modeling approach for assessing the impact of soft errors using architectural simulators. We also describe a new technique for reducing the vulnerability of data caches: refetching. By selectively refetching cache lines from the ECC-protected L2 cache, we can significantly reduce the vulnerability of the L1 data cache. We discuss and present results for two different algorithms that perform selective refetch. Experimental results show that we can obtain an 85 percent decrease in vulnerability when running the SPEC2K benchmark suite while only experiencing a slight decrease in performance. Our results demonstrate that selective refetch can cost-effectively decrease the error rate of an L1 data cache.
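
One plausible reading of the selective-refetch idea, sketched below: a parity-only L1 line grows more vulnerable the longer it sits unrefreshed, so clean lines idle beyond a threshold are re-read from the ECC-protected L2. The threshold rule is an assumption; the paper's two refetch algorithms are not detailed in this abstract.

```python
# Sketch of selective refetch: clean L1 lines that have been resident too long
# without being refreshed are re-read from the ECC-protected L2, overwriting
# any silently corrupted bits.

REFRESH_THRESHOLD = 10_000   # cycles a clean line may sit before refetching

class L1Line:
    def __init__(self, tag, fill_cycle):
        self.tag = tag
        self.dirty = False
        self.last_safe_cycle = fill_cycle   # last fill or refetch from L2

def refetch_candidates(lines, now):
    """Return the clean lines that have been idle too long; the controller
    would re-read these from L2."""
    return [
        ln for ln in lines
        if not ln.dirty and now - ln.last_safe_cycle > REFRESH_THRESHOLD
    ]

lines = [L1Line(0x40, fill_cycle=0), L1Line(0x80, fill_cycle=9_500)]
for ln in refetch_candidates(lines, now=12_000):
    ln.last_safe_cycle = 12_000             # model the refetch from L2
    print(hex(ln.tag), "refetched")
```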

Proceedings ArticleDOI
09 Dec 2006
TL;DR: Through simulation studies, this paper establishes the superiority of molecular caches (caches built as aggregations of molecules), which offer a 29% power advantage over an equivalently performing traditional cache.
Abstract: CMPs enable simultaneous execution of multiple applications on the same platforms that share cache resources. Diversity in the cache access patterns of these simultaneously executing applications can potentially trigger inter-application interference, leading to cache pollution. Whereas a large cache can ameliorate this problem, the issues of larger power consumption with increasing cache size, amplified at sub-100nm technologies, make this solution prohibitive. In this paper, in order to address the issues relating to power-aware performance of caches, we propose a caching structure that addresses the following: 1. Definition of application-specific cache partitions as an aggregation of caching units (molecules). The parameters of each molecule, namely size, associativity and line size, are chosen so that the power consumed by it and its access time are optimal for the given technology. 2. Application-specific resizing of cache partitions with variable and adaptive associativity per cache line, way size and variable line size. 3. A replacement policy that is transparent to the partition in terms of size, heterogeneity in associativity and line size. Through simulation studies we establish the superiority of molecular caches (caches built as aggregations of molecules), which offer a 29% power advantage over an equivalently performing traditional cache.

Patent
31 Oct 2006
TL;DR: In this article, a cache is provided for operatively coupling a processor with a main memory, which includes a cache memory and a cache controller operatively coupled with the cache memory.
Abstract: A cache is provided for operatively coupling a processor with a main memory. The cache includes a cache memory and a cache controller operatively coupled with the cache memory. The cache controller is configured to receive memory requests to be satisfied by the cache memory or the main memory. In addition, the cache controller is configured to process cache activity information to cause at least one of the memory requests to bypass the cache memory.

Proceedings ArticleDOI
01 Oct 2006
TL;DR: This work deconstructs and compares the two dominant existing approaches for L1 data cache (L1D) error protection and presents a new error protection scheme, called the punctured ECC recovery cache (PERC), that achieves the best features of both existing schemes.
Abstract: We deconstruct and compare the two dominant existing approaches for L1 data cache (L1D) error protection, with respect to performance, L2 cache bandwidth, power, and area. The two approaches are: (1) parity on the L1D with write-through to an ECC-protected L2, and (2) ECC protection on the L1D. Qualitatively, the first approach requires a write-through L1D, which places a large bandwidth and power demand on the L2. The second approach adds more bits in the L1D for error protection, which adds to the L1D's area and power while degrading its performance. Our quantitative results show that the relative costs of the second approach are small and that its benefits outweigh these costs. We also present a new error protection scheme, called the Punctured ECC Recovery Cache (PERC), that achieves the best features of both existing schemes.

Patent
18 Sep 2006
TL;DR: A data processing apparatus and a method of managing at least one cache within such an apparatus are provided, in which identification logic monitors, for each cache, data traffic within the data processing apparatus and based thereon generates a preferred-for-eviction identification identifying one or more of the data values as preferred for eviction.
Abstract: A data processing apparatus, and method of managing at least one cache within such an apparatus, are provided. The data processing apparatus has at least one processing unit for executing a sequence of instructions, with each such processing unit having a cache associated therewith, each cache having a plurality of cache lines for storing data values for access by the associated processing unit when executing the sequence of instructions. Identification logic is provided which, for each cache, monitors data traffic within the data processing apparatus and based thereon generates a preferred for eviction identification identifying one or more of the data values as preferred for eviction. Cache maintenance logic is then arranged, for each cache, to implement a cache maintenance operation during which selection of one or more data values for eviction from that cache is performed having regard to any preferred for eviction identification generated by the identification logic for data values stored in that cache. It has been found that such an approach provides a very flexible technique for seeking to improve cache storage utilisation.

Patent
07 Sep 2006
TL;DR: In this article, a technique for use in managing a data cache involves receiving one or more data objects to be written to a storage device, and assigning a temperature value to the data objects before storing them in the data cache.
Abstract: A technique for use in managing a data cache involves receiving one or more data objects to be written to a storage device. A temperature value is assigned to the one or more data objects before storing the data objects in the data cache. The temperature value assigned to the one or more data objects is compared with a threshold value. A copy of the one or more data objects is stored in the data cache if the assigned temperature value exceeds the threshold value.