
Showing papers on "Cache algorithms published in 2006"


Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
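A minimal sketch of the allocation step, assuming per-application utility curves (miss counts as a function of allocated ways) are already available from UMON-style monitors; the simple greedy loop below is illustrative and not the paper's exact partitioning algorithm or hardware:

```python
def greedy_partition(misses, total_ways):
    """Assign cache ways to competing applications by marginal utility.

    misses[a][w] is the number of misses application `a` would incur with
    `w` ways (w = 0 .. total_ways), assumed to come from UMON-style monitors.
    """
    n_apps = len(misses)
    alloc = [1] * n_apps                    # start with one way per application
    remaining = total_ways - n_apps
    while remaining > 0:
        # marginal miss reduction from giving each application one more way
        gains = [misses[a][alloc[a]] - misses[a][alloc[a] + 1]
                 for a in range(n_apps)]
        best = max(range(n_apps), key=lambda a: gains[a])
        alloc[best] += 1
        remaining -= 1
    return alloc

# Example: application 0 keeps benefiting from extra ways, application 1 does not.
curve0 = [1000, 700, 450, 300, 220, 180, 160, 150, 145]
curve1 = [1000, 980, 970, 965, 962, 960, 959, 958, 958]
print(greedy_partition([curve0, curve1], total_ways=8))   # -> [7, 1]
```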

1,083 citations


Proceedings ArticleDOI
Seon-Yeong Park1, Dawoon Jung1, Jeong-Uk Kang1, Jin-Soo Kim1, Joonwon Lee1 
22 Oct 2006
TL;DR: The Clean-First LRU (CFLRU) replacement algorithm is proposed; it exploits the characteristics of flash memory and reduces the average replacement cost by 28.4% in the swap system and by 26.2% in the buffer cache, compared with the LRU algorithm.
Abstract: In most operating systems customized for disk-based storage systems, the replacement algorithm considers only the number of memory hits. However, flash memory has different read and write costs in terms of both time and energy, so a replacement algorithm for flash memory should consider not only the hit count but also the replacement cost incurred by selecting dirty victims. The replacement cost of a dirty page is higher than that of a clean page with regard to both access time and energy consumption. In this paper, we propose the Clean-First LRU (CFLRU) replacement algorithm, which exploits these characteristics of flash memory. CFLRU splits the LRU list into a working region and a clean-first region and adopts a policy that preferentially evicts clean pages in the clean-first region, as long as the number of page hits in the working region is kept at a suitable level. In trace-driven simulation, the proposed algorithm reduces the average replacement cost by 28.4% in the swap system and by 26.2% in the buffer cache, compared with the LRU algorithm. We also implement the CFLRU algorithm in the Linux kernel and present some optimization issues.
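A minimal sketch of the clean-first idea, with the window size and dirty-page bookkeeping greatly simplified relative to the paper's kernel implementation:

```python
from collections import OrderedDict

class CFLRUCache:
    """Simplified CFLRU: an LRU list whose last `window` entries form the
    clean-first region; clean pages there are evicted before dirty ones."""

    def __init__(self, capacity, window):
        self.capacity = capacity
        self.window = window            # size of the clean-first region
        self.pages = OrderedDict()      # page -> dirty flag, MRU at the end

    def access(self, page, write=False):
        if page in self.pages:
            dirty = self.pages.pop(page) or write
            self.pages[page] = dirty    # move to MRU position
            return "hit"
        if len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page] = write
        return "miss"

    def _evict(self):
        lru_order = list(self.pages)    # index 0 is least recently used
        region = lru_order[:self.window]
        # Prefer the least-recently-used *clean* page in the clean-first region.
        for victim in region:
            if not self.pages[victim]:
                del self.pages[victim]  # clean page: drop without a flash write
                return
        del self.pages[lru_order[0]]    # all dirty: fall back to plain LRU
```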

434 citations


Journal ArticleDOI
01 May 2006
TL;DR: This paper presents CMP cooperative caching, a unified framework to manage a CMP's aggregate on-chip cache resources by forming an aggregate "shared" cache through cooperation among private caches; the scheme performs robustly over a range of system/cache sizes and memory latencies.
Abstract: This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper. Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache, and a local cache hit rate similar to that of using private caches. Cooperative caching performs robustly over a range of system/cache sizes and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.

377 citations


Proceedings ArticleDOI
09 Dec 2006
TL;DR: This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors that can provide differentiated execution environments to running programs by dynamically controlling data placement and cache sharing degrees.
Abstract: This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect to overcome non-uniform cache access latency for good program performance and to reduce on-chip network traffic and related power consumption. Unlike previously studied hardware-based private and shared cache designs implementing a "fixed" caching policy, the proposed OS-microarchitecture approach is flexible; it can easily implement a wide spectrum of L2 caching policies without complex hardware support. Furthermore, our approach can provide differentiated execution environments to running programs by dynamically controlling data placement and cache sharing degrees. We discuss key design issues of the proposed approach and present preliminary experimental results showing the promise of our approach.

377 citations


Journal ArticleDOI
01 May 2006
TL;DR: Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%, and a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR) is proposed to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls.
Abstract: Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses - some misses occur in isolation while some occur in parallel with other misses. Isolated misses are more costly on performance than parallel misses. However, traditional cache replacement is not aware of the MLP-dependent cost differential between different misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement by using a runtime technique to compute the MLP-based cost for each cache miss. It then describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware overhead mechanism called Sampling Based Adaptive Replacement (SBAR), to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.
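A rough sketch of victim selection that mixes recency with an MLP-based cost, assuming each block's cost has already been quantized by the runtime technique described above; the weighting and names are illustrative, not the paper's mechanism:

```python
def pick_victim(blocks, cost_weight=1):
    """Pick a victim from one cache set by combining recency and MLP cost.

    Each block is a dict with:
      'recency'  - 0 for MRU up to assoc-1 for LRU
      'mlp_cost' - quantized MLP-based cost of re-fetching the block
                   (isolated misses are costly, parallel misses are cheap)
    The block with the lowest combined value (old AND cheap to re-fetch)
    is evicted.
    """
    assoc = len(blocks)
    def value(b):
        return (assoc - 1 - b['recency']) + cost_weight * b['mlp_cost']
    return min(blocks, key=value)

# Example: block B is least recently used, but it would miss in isolation,
# so the cheaper-to-refetch block C is chosen as the victim instead.
blocks = [{'id': 'A', 'recency': 0, 'mlp_cost': 2},
          {'id': 'B', 'recency': 2, 'mlp_cost': 3},
          {'id': 'C', 'recency': 1, 'mlp_cost': 0}]
print(pick_victim(blocks)['id'])   # -> 'C'
```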

316 citations


Proceedings ArticleDOI
09 Dec 2006
TL;DR: Adaptive Selective Replication (ASR) as discussed by the authors replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses).
Abstract: The large working sets of commercial and scientific workloads stress the L2 caches of Chip Multiprocessors (CMPs). Some CMPs use a shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses. Others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals use selective replication to balance latency and capacity, but their static replication rules result in performance degradation for some combinations of workloads and system configurations. This paper proposes Adaptive Selective Replication (ASR), a mechanism that dynamically monitors workload behavior to control replication. ASR replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses). Full-system simulations of 8-processor CMPs show that ASR provides robust performance: improving performance by as much as 29% versus shared caches, 19% versus private caches, and 12% versus CMP-NuRapid [9] and Victim Replication [41]. Furthermore, while ASR does not improve the performance of all workloads, it provides performance stability by always performing at least comparably to the best alternative including Cooperative Caching [8].
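The decision rule can be pictured as a small benefit/cost controller; the latency constants, counter names, and probability ladder below are assumptions for illustration, not the paper's hardware:

```python
class ReplicationController:
    """Rough sketch of an ASR-style decision rule: replicate only while the
    estimated latency saved by local replica hits exceeds the estimated
    latency added by the extra misses that replication causes."""

    LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]    # candidate replication probabilities

    def __init__(self, remote_hit_cycles=40, miss_cycles=300):
        self.remote_hit_cycles = remote_hit_cycles   # cost of a remote L2 hit
        self.miss_cycles = miss_cycles               # cost of an off-chip miss
        self.level = 2                               # start in the middle
        self.replica_hits = 0                        # hits served by a local replica
        self.replication_misses = 0                  # misses blamed on lost capacity

    def record_replica_hit(self):
        self.replica_hits += 1

    def record_replication_miss(self):
        self.replication_misses += 1

    def end_of_interval(self):
        benefit = self.replica_hits * self.remote_hit_cycles
        cost = self.replication_misses * self.miss_cycles
        if benefit > cost and self.level < len(self.LEVELS) - 1:
            self.level += 1                          # replication is paying off
        elif cost > benefit and self.level > 0:
            self.level -= 1                          # replication hurts: back off
        self.replica_hits = self.replication_misses = 0
        return self.LEVELS[self.level]
```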

237 citations


Proceedings ArticleDOI
16 Sep 2006
TL;DR: This paper designs architectural support for OS to efficiently manage shared caches with a wide variety of policies and demonstrates that the scheme can support a wide range of policies including policies that provide passive performance differentiation, reactive fairness by miss-rate equalization and reactive performance differentiation.
Abstract: The role of the operating system (OS) in managing shared resources such as CPU time, memory, peripherals, and even energy is well motivated and understood [23]. Unfortunately, one key resource — lower-level shared cache in chip multi-processors — is commonly managed purely in hardware by rudimentary replacement policies such as least-recently-used (LRU). The rigid nature of the hardware cache management policy poses a serious problem since there is no single best cache management policy across all sharing scenarios. For example, the cache management policy for a scenario where applications from a single organization are running under "best effort" performance expectation is likely to be different from the policy for a scenario where applications from competing business entities (say, at a third party data center) are running under a minimum service level expectation. When it comes to managing shared caches, there is an inherent tension between flexibility and performance. On one hand, managing the shared cache in the OS offers immense policy flexibility since it may be implemented in software. Unfortunately, it is prohibitively expensive in terms of performance for the OS to be involved in managing temporally fine-grain events such as cache allocation. On the other hand, sophisticated hardware-only cache management techniques to achieve fair sharing or throughput maximization have been proposed. But they offer no policy flexibility. This paper addresses this problem by designing architectural support for OS to efficiently manage shared caches with a wide variety of policies. Our scheme consists of a hardware cache quota management mechanism, an OS interface and a set of OS level quota orchestration policies. The hardware mechanism guarantees that OS-specified quotas are enforced in shared caches, thus eliminating the need for (and the performance penalty of) temporally fine-grained OS intervention. The OS retains policy flexibility since it can tune the quotas during regularly scheduled OS interventions. We demonstrate that our scheme can support a wide range of policies including policies that provide (a) passive performance differentiation, (b) reactive fairness by miss-rate equalization and (c) reactive performance differentiation.
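One way to picture quota enforcement at eviction time is the sketch below, where OS-specified per-thread quotas constrain victim selection in each set; the interface and fallback rules are illustrative assumptions, not the paper's mechanism:

```python
def select_victim(set_blocks, requester, quota):
    """Quota-enforcing victim selection for one cache set.

    set_blocks: blocks in LRU-first order, each a (owner_thread, tag) pair.
    quota:      dict mapping thread -> blocks it may occupy in each set
                (the OS-specified value in a scheme like the one above).
    Returns the index of the block to evict.
    """
    occupancy = {}
    for owner, _ in set_blocks:
        occupancy[owner] = occupancy.get(owner, 0) + 1

    # A requester at or over its quota recycles its own least-recent block.
    if occupancy.get(requester, 0) >= quota.get(requester, 0):
        for i, (owner, _) in enumerate(set_blocks):
            if owner == requester:
                return i
    # Otherwise evict the LRU block of any thread exceeding its quota,
    # falling back to the global LRU block.
    for i, (owner, _) in enumerate(set_blocks):
        if occupancy[owner] > quota.get(owner, 0):
            return i
    return 0
```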

215 citations


Journal ArticleDOI
TL;DR: This article proposes SDC (Static Dynamic Cache), a new caching strategy aimed at efficiently exploiting the temporal and spatial locality present in the stream of processed queries; the hit ratio of SDC is further improved by an adaptive prefetching strategy.
Abstract: This article discusses efficiency and effectiveness issues in caching the results of queries submitted to a Web search engine (WSE). We propose SDC (Static Dynamic Cache), a new caching strategy aimed at efficiently exploiting the temporal and spatial locality present in the stream of processed queries. SDC extracts from historical usage data the results of the most frequently submitted queries and stores them in a static, read-only portion of the cache. The remaining entries of the cache are dynamically managed according to a given replacement policy and are used for those queries that cannot be satisfied by the static portion. Moreover, we improve the hit ratio of SDC by using an adaptive prefetching strategy, which anticipates future requests while introducing only a limited overhead on the back-end WSE. We experimentally demonstrate the superiority of SDC over purely static and dynamic policies by measuring the hit ratio achieved on three large query logs while varying the cache parameters and the replacement policy used for managing the dynamic part of the cache. Finally, we deploy and measure the throughput achieved by a concurrent version of our caching system. Our tests show how the SDC cache can be efficiently exploited by many threads that concurrently serve the queries of different users.
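A minimal sketch of the static/dynamic split, assuming a hypothetical backend(query) callable standing in for the back-end WSE and omitting the prefetching component:

```python
from collections import Counter, OrderedDict

class SDCCache:
    """Static-Dynamic Cache sketch: a read-only static part holding the
    results of the historically most frequent queries, plus an LRU-managed
    dynamic part for everything else."""

    def __init__(self, query_log, static_size, dynamic_size, backend):
        top = Counter(query_log).most_common(static_size)
        self.static = {q: backend(q) for q, _ in top}   # filled offline
        self.dynamic = OrderedDict()                    # LRU, MRU at the end
        self.dynamic_size = dynamic_size
        self.backend = backend

    def lookup(self, query):
        if query in self.static:                        # static hit
            return self.static[query]
        if query in self.dynamic:                       # dynamic hit
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        result = self.backend(query)                    # miss: ask the WSE
        self.dynamic[query] = result
        if len(self.dynamic) > self.dynamic_size:
            self.dynamic.popitem(last=False)            # evict the LRU entry
        return result
```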

213 citations


Book ChapterDOI
17 Aug 2006
TL;DR: This work shows that access-driven cache-based attacks are becoming easier to understand and analyze, and when such attacks are mounted against systems performing AES, only a very limited number of encryptions are required to recover the whole key with a high probability of success.
Abstract: An access-driven attack is a class of cache-based side channel analysis. Like the time-driven attack, the cache's timings are under inspection as a source of information leakage. Access-driven attacks scrutinize the cache behavior with a finer granularity, rather than evaluating the overall execution time. Access-driven attacks leverage the ability to detect whether a cache line has been evicted, or not, as the primary mechanism for mounting an attack. In this paper we focus on the case of AES and we show that the vast majority of processors suffer from this cache-based vulnerability. Our best results are in fact obtained on a processor without multi-threading capabilities -- in contrast to previous works in this area, which had suggested that multi-threading actually improved, or even made possible, this class of attack. Despite the technical difficulties involved in mounting such attacks, our work shows that access-driven cache-based attacks are becoming easier to understand and analyze. Also, when such attacks are mounted against systems performing AES, only a very limited number of encryptions are required to recover the whole key with a high probability of success, thanks to our last-round analysis from the ciphertext.

208 citations


Proceedings ArticleDOI
16 Sep 2006
TL;DR: It is found that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
Abstract: As chip multiprocessors (CMPs) become increasingly mainstream, architects have likewise become more interested in how best to share a cache hierarchy among multiple simultaneous threads of execution. The complexity of this problem is exacerbated as the number of simultaneous threads grows from two or four to the tens or hundreds. However, there is no consensus in the architectural community on what "best" means in this context. Some papers in the literature seek to equalize each thread's performance loss due to sharing, while others emphasize maximizing overall system performance. Furthermore, the specific effect of these goals varies depending on the metric used to define "performance". In this paper we label equal performance targets as Communist cache policies and overall performance targets as Utilitarian cache policies. We compare both of these models to the most common current model of a free-for-all cache (a Capitalist policy). We consider various performance metrics, including miss rates, bandwidth usage, and IPC, including both absolute and relative values of each metric. Using analytical models and behavioral cache simulation, we find that the optimal partitioning of a shared cache can vary greatly as different but reasonable definitions of optimality are applied. We also find that, although Communist and Utilitarian targets are generally compatible, each policy has workloads for which it provides poor overall performance or poor fairness, respectively. Finally, we find that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.

198 citations


Journal Article
TL;DR: In this paper, a new cache-based timing attack on AES is presented, which can be used to obtain secret keys of remote cryptosystems if the server under attack runs on a multitasking or simultaneous multithreading system with a large enough workload.
Abstract: We introduce a new robust cache-based timing attack on AES. We present experiments and concrete evidence that our attack can be used to obtain secret keys of remote cryptosystems if the server under attack runs on a multitasking or simultaneous multithreading system with a large enough workload. This is an important difference from recent cache-based timing attacks, which either did not provide any supporting experimental results indicating whether they can be applied remotely, or are not realistically remote attacks.

Proceedings ArticleDOI
22 Oct 2006
TL;DR: Several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor are examined, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure.
Abstract: Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.
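The cache-aware blocking evaluated here has the general shape sketched below (shown in Python/NumPy purely to illustrate the tiling structure; the block sizes are placeholders, and real stencil kernels would be written in C or Fortran):

```python
import numpy as np

def blocked_stencil_sweep(a, bi=64, bj=64):
    """One Jacobi-style 4-neighbor stencil sweep with 2-D loop blocking.
    Block sizes bi, bj would be chosen to fit the target cache; the values
    here are placeholders."""
    n, m = a.shape
    out = a.copy()
    for ii in range(1, n - 1, bi):              # iterate over tiles
        for jj in range(1, m - 1, bj):
            i_end = min(ii + bi, n - 1)
            j_end = min(jj + bj, m - 1)
            # update one cache-sized tile before moving on to the next
            out[ii:i_end, jj:j_end] = 0.25 * (
                a[ii - 1:i_end - 1, jj:j_end] + a[ii + 1:i_end + 1, jj:j_end] +
                a[ii:i_end, jj - 1:j_end - 1] + a[ii:i_end, jj + 1:j_end + 1])
    return out

grid = np.random.rand(512, 512)
grid = blocked_stencil_sweep(grid)
```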

Proceedings ArticleDOI
12 Nov 2006
TL;DR: A novel caching algorithm for P2P traffic, based on object segmentation and partial admission and eviction of objects, is developed; its byte hit rate is close to that achieved by an off-line optimal algorithm with complete knowledge of future requests.
Abstract: Peer-to-peer (P2P) file sharing systems generate a major portion of the Internet traffic, and this portion is expected to increase in the future. We explore the potential of deploying proxy caches in different autonomous systems (ASes) with the goal of reducing the cost incurred by Internet service providers and alleviating the load on the Internet backbone. We conduct a measurement study to model the popularity of P2P objects in different ASes. Our study shows that the popularity of P2P objects can be modeled by a Mandelbrot-Zipf distribution, regardless of the AS. Guided by our findings, we develop a novel caching algorithm for P2P traffic that is based on object segmentation, and partial admission and eviction of objects. Our trace-based simulations show that with a relatively small cache size, less than 10% of the total traffic, a byte hit rate of up to 35% can be achieved by our algorithm, which is close to the byte hit rate achieved by an off-line optimal algorithm with complete knowledge of future requests. Our results also show that our algorithm achieves a byte hit rate that is at least 40% more, and at most triple, the byte hit rate of the common Web caching algorithms. Furthermore, our algorithm is robust in the face of aborted downloads, which is a common case in P2P systems.
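A loose sketch of segment-granularity partial admission and eviction; the per-object value metric and segment accounting are illustrative assumptions, not the algorithm evaluated in the paper:

```python
class P2PSegmentCache:
    """Sketch of partial caching for P2P objects: each request admits at most
    one more segment of the requested object, and eviction removes segments
    from the least valuable object, where value = requests seen per cached
    segment (an illustrative metric)."""

    def __init__(self, capacity_segments, segment_count):
        self.capacity = capacity_segments
        self.segment_count = segment_count        # segments per object
        self.cached = {}                          # object -> segments cached
        self.requests = {}                        # object -> request count

    def request(self, obj):
        self.requests[obj] = self.requests.get(obj, 0) + 1
        have = self.cached.get(obj, 0)
        served_from_cache = have                  # segments hit in the cache
        if have < self.segment_count:             # partial admission: +1 segment
            if sum(self.cached.values()) >= self.capacity:
                self._evict_one(skip=obj)
            if sum(self.cached.values()) < self.capacity:
                self.cached[obj] = have + 1
        return served_from_cache

    def _evict_one(self, skip):
        def value(o):
            return self.requests.get(o, 0) / max(self.cached[o], 1)
        victims = [o for o in self.cached if o != skip and self.cached[o] > 0]
        if not victims:
            return
        victim = min(victims, key=value)
        self.cached[victim] -= 1                  # partial eviction: -1 segment
        if self.cached[victim] == 0:
            del self.cached[victim]
```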

Patent
24 Apr 2006
TL;DR: In this paper, a method and system of managing data access in a shared memory cache of a processor are described, which include probing one or more memory addresses that map to a subset of the shared memory cache.
Abstract: A method and system of managing data access in a shared memory cache of a processor are disclosed. The method includes probing one or more memory addresses that map to a subset of the shared memory cache and sensing a plurality of events in the one or more memory addresses. Cache utilization information is then obtained by reading a hardware performance counter of the processor. The hardware performance counter is incremented based on the occurrence of the plurality of events. Based upon the cache utilization information, an occurrence of one of the plurality of events is reduced.

Proceedings ArticleDOI
27 Feb 2006
TL;DR: This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads and shows that a shared 32 MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32 MB private cache configuration by several orders of magnitude.
Abstract: With the continuing growth in the amount of genetic data, members of the bioinformatics community are developing a variety of data-mining applications to understand the data and discover meaningful information. These applications are important in defining the design and performance decisions of future high performance microprocessors. This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads. For a CMP with a three-level cache hierarchy, we model the last-level of the cache hierarchy as either multiple private caches or a single cache shared amongst different cores of the CMP. Our experiments show that the bioinformatics workloads exhibit significant data-sharing - 50-95% of the data cache is shared by the different threads of the workload. Furthermore, regardless of the amount of data cache shared, for some workloads, as many as 98% of the accesses to the last-level cache are to shared data cache lines. Additionally, the amount of data-sharing exhibited by the workloads is a function of the total cache size available - the larger the data cache the better the sharing behavior. Thus, partitioning the available last-level cache silicon area into multiple private caches can cause applications to lose their inherent data-sharing behavior. For the workloads in this study, a shared 32 MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32 MB private cache configuration by several orders of magnitude. Specifically, with shared last-level caches, the bandwidth demands beyond the last-level cache can be reduced by factors of 3-625 when compared to private last-level caches.

Book ChapterDOI
04 Dec 2006
TL;DR: This paper presents an efficient trace-driven cache attack on a widely used implementation of the AES cryptosystem, and develops an accurate mathematical model that is used in the cost analysis of the attack.
Abstract: Cache-based side-channel attacks have recently attracted significant attention due to new developments in the field. In this paper, we present an efficient trace-driven cache attack on a widely used implementation of the AES cryptosystem. We also evaluate the cost of the proposed attack in detail under the assumption of a noiseless environment. We develop an accurate mathematical model that we use in the cost analysis of our attack. We use two different metrics, specifically, the expected number of necessary traces and the cost of the analysis phase, for the cost evaluation purposes. Each of these metrics represents the cost of a different phase of the attack.

Patent
06 Apr 2006
TL;DR: In this paper, the authors present a system that facilitates selectively unmarking load-marked cache lines during transactional program execution, where load-marked cache lines are monitored during the transactional execution to detect interfering accesses from other threads.
Abstract: One embodiment of the present invention provides a system that facilitates selectively unmarking load-marked cache lines during transactional program execution, wherein load-marked cache lines are monitored during transactional execution to detect interfering accesses from other threads. During operation, the system encounters a release instruction during transactional execution of a block of instructions. In response to the release instruction, the system modifies the state of cache lines, which are specially load-marked to indicate they can be released from monitoring, to account for the release instruction being encountered. In doing so, the system can potentially cause the specially load-marked cache lines to become unmarked. In a variation on this embodiment, upon encountering a commit-and-start-new-transaction instruction, the system modifies load-marked cache lines to account for the commit-and-start-new-transaction instruction being encountered. In doing so, the system causes normally load-marked cache lines to become unmarked, while other specially load-marked cache lines may remain load-marked past the commit-and-start-new-transaction instruction.

Journal ArticleDOI
TL;DR: It is claimed that there is already a sufficient number of good Web cache replacement policies; the article identifies appropriate policies for proxies with different characteristics, such as a small cache, limited bandwidth, or limited processing power, and suggests policies for different types of proxies.
Abstract: Research involving Web cache replacement policies has been active for at least a decade. In this article we claim that there is a sufficient number of good policies, and that further proposals would only produce minute improvements. We argue that the focus should be fitness for purpose rather than proposing any new policies. Up to now, almost all policies were purported to perform better than others, creating confusion as to which policy should be used. In reality, a policy only performs well in certain environments. Therefore, the goal of this article is to identify the appropriate policies for proxies with different characteristics, such as proxies with a small cache, limited bandwidth, and limited processing power, as well as to suggest policies for different types of proxies, such as ISP-level and root-level proxies.

Journal ArticleDOI
01 Mar 2006
TL;DR: Simulation results indicate that the proposed aggregate caching mechanism and a broadcast-based Simple Search algorithm can significantly improve Imanet performance in terms of throughput and average number of hops to access data items.
Abstract: An Internet-based mobile ad hoc network (Imanet) is an emerging technique that combines a wired network (e.g. the Internet) and a mobile ad hoc network (Manet) to develop a ubiquitous communication infrastructure. However, to fulfill users' demand for access to various kinds of information, an Imanet must overcome several limitations, such as limited accessibility to the wired Internet, insufficient wireless bandwidth, and long message latency. In this paper, we address the issues involved in information search and access in Imanets. An aggregate caching mechanism and a broadcast-based Simple Search (SS) algorithm are proposed for improving information accessibility and reducing average communication latency in Imanets. As part of the aggregate cache, a cache admission control policy and a cache replacement policy, called Time and Distance Sensitive (TDS) replacement, are developed to reduce the cache miss ratio and improve information accessibility. We evaluate the impact of caching, cache management, and the number of access points that are connected to the Internet through extensive simulation. The simulation results indicate that the proposed aggregate caching mechanism can significantly improve Imanet performance in terms of throughput and average number of hops to access data items.

Proceedings ArticleDOI
12 Nov 2006
TL;DR: This article presents a polynomial-time centralized approximation algorithm that provably delivers a solution whose benefit is at least 1/4 (1/2 for uniform-size data items) of the optimal benefit of the cache placement problem of minimizing total data access cost in ad hoc networks with multiple data items and nodes with limited memory capacity.
Abstract: Data caching can significantly improve the efficiency of information access in a wireless ad hoc network by reducing the access latency and bandwidth usage. However, designing efficient distributed caching algorithms is non-trivial when network nodes have limited memory. In this article, we consider the cache placement problem of minimizing total data access cost in ad hoc networks with multiple data items and nodes with limited memory capacity. The above optimization problem is known to be NP-hard. Defining benefit as the reduction in total access cost, we present a polynomial-time centralized approximation algorithm that provably delivers a solution whose benefit is at least one-fourth (one-half for uniform-size data items) of the optimal benefit. The approximation algorithm is amenable to localized distributed implementation, which is shown via simulations to perform close to the approximation algorithm. Our distributed algorithm naturally extends to networks with mobile nodes. We simulate our distributed algorithm using a network simulator (ns2), and demonstrate that it significantly outperforms another existing caching technique (by Yin and Cao [30]) in all important performance metrics. The performance differential is particularly large in more challenging scenarios, such as higher access frequency and smaller memory.
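The flavor of the centralized greedy approximation can be sketched as follows, assuming unit-size data items and known hop distances and access frequencies; this is an illustration of benefit-driven placement, not the paper's exact algorithm or its distributed variant:

```python
def greedy_cache_placement(dist, freq, origin, memory):
    """Greedy sketch of benefit-maximizing cache placement (unit-size items).

    dist[u][v]  - hop distance between nodes u and v
    freq[v][i]  - how often node v requests item i
    origin[i]   - server node that permanently holds item i
    memory[v]   - number of items node v can cache
    Repeatedly place the (node, item) pair giving the largest reduction in
    total access cost until no placement still helps.
    """
    nodes, items = range(len(dist)), range(len(origin))
    placed = {v: set() for v in nodes}

    def access_cost(v, i):
        holders = [origin[i]] + [u for u in nodes if i in placed[u]]
        return min(dist[v][u] for u in holders)

    while True:
        best, best_gain = None, 0
        for v in nodes:
            if len(placed[v]) >= memory[v]:
                continue
            for i in items:
                if i in placed[v]:
                    continue
                # benefit: total access-cost reduction if item i is cached at v
                gain = sum(freq[w][i] * max(access_cost(w, i) - dist[w][v], 0)
                           for w in nodes)
                if gain > best_gain:
                    best, best_gain = (v, i), gain
        if best is None:
            return placed
        placed[best[0]].add(best[1])
```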

Proceedings ArticleDOI
09 Dec 2006
TL;DR: A novel and general scheme by which any two cache management algorithms can be combined, adaptively switching between them to closely track the locality characteristics of a given program; the scheme is inspired by recent work in virtual memory management at the operating system level.
Abstract: We present and evaluate the idea of adaptive processor cache management. Specifically, we describe a novel and general scheme by which we can combine any two cache management algorithms (e.g., LRU, LFU, FIFO, Random) and adaptively switch between them, closely tracking the locality characteristics of a given program. The scheme is inspired by recent work in virtual memory management at the operating system level, which has shown that it is possible to adapt over two replacement policies to provide an aggregate policy that always performs within a constant factor of the better component policy. A hardware implementation of adaptivity requires very simple logic but duplicate tag structures. To reduce the overhead, we use partial tags, which achieve good performance with a small hardware cost. In particular, adapting between LRU and LFU replacement policies on an 8-way 512KB L2 cache yields a 12.7% improvement in average CPI on applications that exhibit a non-negligible L2 miss ratio. Our approach increases total cache storage by 4.0%, but it still provides slightly better performance than a conventional 10-way set-associative 640KB cache, which requires 25% more storage.
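A toy sketch of the adaptation idea for a single fully associative cache: both component policies are simulated on tag-only shadow structures, and the one with fewer accumulated misses drives real evictions; the paper's design instead relies on partial tags and hardware-level adaptation:

```python
class AdaptiveReplacement:
    """Sketch of adapting between two replacement policies (LRU and LFU here):
    both are tracked on tag-only shadow directories, and the policy with
    fewer misses so far would drive the real cache's evictions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lru = []                       # tags, LRU at index 0
        self.lfu = {}                       # tag -> access count
        self.misses = {'lru': 0, 'lfu': 0}

    def _touch_lru(self, tag):
        if tag in self.lru:
            self.lru.remove(tag)            # hit: refresh recency
        else:
            self.misses['lru'] += 1
            if len(self.lru) >= self.capacity:
                self.lru.pop(0)             # evict the LRU tag
        self.lru.append(tag)

    def _touch_lfu(self, tag):
        if tag not in self.lfu:
            self.misses['lfu'] += 1
            if len(self.lfu) >= self.capacity:
                victim = min(self.lfu, key=self.lfu.get)
                del self.lfu[victim]        # evict the least frequently used tag
            self.lfu[tag] = 0
        self.lfu[tag] += 1

    def access(self, tag):
        self._touch_lru(tag)
        self._touch_lfu(tag)

    def winning_policy(self):
        # The real cache would evict according to whichever shadow policy
        # has accumulated fewer misses.
        return min(self.misses, key=self.misses.get)
```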

Proceedings ArticleDOI
05 Jul 2006
TL;DR: Experimental results provided in the paper show that with an appropriate selection of regions and cache contents, the worst-case performance of applications with locked instruction caches is competitive with the worst-case performance of unlocked caches.
Abstract: Cache memories have been extensively used to bridge the gap between high-speed processors and relatively slower main memories. However, they are sources of predictability problems because of their dynamic and adaptive behavior, and thus need special attention to be used in hard real-time systems. A lot of progress has been achieved in the last ten years to statically predict worst-case execution times (WCETs) of tasks on architectures with caches. However, cache-aware WCET analysis techniques are not always applicable due to the lack of documentation in hardware manuals concerning the cache replacement policies. Moreover, they tend to be pessimistic with some cache replacement policies (e.g. random replacement policies). Lastly, caches are sources of timing anomalies in dynamically scheduled processors (a cache miss may in some cases result in a shorter execution time than a hit). To reconcile performance and predictability of caches, we propose in this paper algorithms for software control of instruction caches. The proposed algorithms statically divide the code of tasks into regions, for which the cache contents are statically selected. At run-time, at every transition between regions, the cache contents computed off-line are loaded into the cache and the cache replacement policy is disabled (the cache is locked). Experimental results provided in the paper show that with an appropriate selection of regions and cache contents, the worst-case performance of applications with locked instruction caches is competitive with the worst-case performance of unlocked caches.

Patent
29 Nov 2006
TL;DR: In this paper, the state of the cache during a cache preload operation is monitored to avoid the need to restart the cache pre-load operation from the beginning, and any data that has been preloaded into a cache prior to a failure may be retained after a failover occurs.
Abstract: An apparatus, program product and method monitor the state of a cache during a cache preload operation in a clustered computer system such that the monitored state can be used after a failover to potentially avoid the need to restart the cache preload operation from the beginning. In particular, by monitoring the state of the cache during a cache preload operation, any data that has been preloaded into a cache prior to a failure may be retained after a failover occurs, thus enabling the cache preload operation to continue from the point at which it was interrupted as a result of the failure.

Patent
04 Jan 2006
TL;DR: In this article, a method for routing a data request received by a caching system is described, where the data request is transmitted without determining whether the requested data are stored in the cache.
Abstract: A method for routing a data request received by a caching system is described. The caching system includes a router and a cache, and the data request identifies a source platform, a destination platform, and requested data. Where the source and destination platforms correspond to an entry in a list automatically generated by the caching system, the data request is transmitted without determining whether the requested data are stored in the cache.

Proceedings ArticleDOI
26 Mar 2006
TL;DR: This paper discusses the design choices when building thread-shared code caches, enumerates the difficulties of thread-local storage, synchronization, trace building, in-cache lookup tables, and cache eviction, and presents efficient solutions to these problems that both scale well and do not require thread suspension.
Abstract: Software code caches are increasingly being used to amortize the runtime overhead of dynamic optimizers, simulators, emulators, dynamic translators, dynamic compilers, and other tools. Despite the now-widespread use of code caches, techniques for efficiently sharing them across multiple threads have not been fully explored. Some systems simply do not support threads, while others resort to thread-private code caches. Although thread-private caches are much simpler to manage, synchronize, and provide scratch space for, they simply do not scale when applied to many-threaded programs. Thread-shared code caches are needed to target server applications, which employ hundreds of worker threads all performing similar tasks. Yet, those systems that do share their code caches often have brute-force, inefficient solutions to the challenges of concurrent code cache access: a single global lock on runtime system code and suspension of all threads for any cache management action. This limits the possibilities for cache design and has performance problems with applications that require frequent cache invalidations to maintain cache consistency. In this paper, we discuss the design choices when building thread-shared code caches and enumerate the difficulties of thread-local storage, synchronization, trace building, in-cache lookup tables, and cache eviction. We present efficient solutions to these problems that both scale well and do not require thread suspension. We evaluate our results in DynamoRIO, an industrial-strength dynamic binary translation system, on real-world server applications. On these applications our thread-shared caches use an order of magnitude less memory and improve throughput by up to four times compared to thread-private caches.

Patent
27 Dec 2006
TL;DR: In this article, a permanent cache list of files not to be removed from a cache is determined in response to a user selection of an application to be added to the cache by adding a file to cache list if the file is a static dependency of the application or if a file has a high probability of being used in the future by the application.
Abstract: In some embodiments a permanent cache list of files not to be removed from a cache is determined in response to a user selection of an application to be added to the cache. The determination is made by adding a file to the cache list if the file is a static dependency of the application, or if a file has a high probability of being used in the future by the application. Other embodiments are described and claimed.

Patent
Charles E. Narad1
07 Dec 2006
TL;DR: In this paper, the authors discuss how data shared between two memory-accessing agents may be stored in a shared partition of the shared cache, and how data accessed by only one of the agents may be stored in one or more private partitions of the shared cache.
Abstract: Some of the embodiments discussed herein may utilize partitions within a shared cache in various computing environments. In an embodiment, data shared between two memory accessing agents may be stored in a shared partition of the shared cache. Additionally, data accessed by one of the memory accessing agents may be stored in one or more private partitions of the shared cache.

Proceedings ArticleDOI
19 Mar 2006
TL;DR: A novel sample-based method for analyzing the data locality of a multithreaded application and the result produced is shown to be consistent with results from a traditional (and much slower) architecture simulation.
Abstract: The introduction of general-purpose microprocessors running multiple threads will put a focus on methods and tools helping a programmer to write efficient parallel applications. Such a tool should be fast enough to meet a software developer's need for short turn-around time, but also be accurate and flexible enough to provide trend-correct and intuitive feedback. This paper presents a novel sample-based method for analyzing the data locality of a multithreaded application. Very sparse data is collected during a single execution of the studied application. The architectural-independent information collected during the execution is fed to a mathematical memory-system model for predicting the cache miss ratio. The sparse data can be used to characterize the application's data locality with respect to almost any possible memory system, such as complicated multiprocessor multilevel cache hierarchies. Any combination of cache size, cache-line size and degree of sharing can be modeled. Each modeled design point takes only a fraction of a second to evaluate, even though the application from which the sampled data was collected may have executed for hours. This makes the tool not just usable for software developers, but also for hardware developers who need to evaluate a huge memory-system design space. The accuracy of the method is evaluated using a large number of commercial and technical multi-threaded applications. The result produced by the algorithm is shown to be consistent with results from a traditional (and much slower) architecture simulation.
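As a rough illustration of feeding sparse samples into a mathematical memory-system model, the sketch below solves a simple fixed-point equation for the miss ratio of a random-replacement cache from sampled reuse distances; the equation and constants are illustrative and not necessarily the paper's formulation:

```python
def estimate_miss_ratio(reuse_distances, cache_lines, iters=50):
    """Illustrative statistical cache model: given sparsely sampled reuse
    distances d (number of memory accesses between two uses of the same
    cache line), iterate the fixed point
        m = avg_i [ 1 - (1 - 1/L)^(m * d_i) ]
    for the miss ratio m of an L-line cache with random replacement."""
    m = 0.5                                   # initial guess for the miss ratio
    for _ in range(iters):
        m = sum(1.0 - (1.0 - 1.0 / cache_lines) ** (m * d)
                for d in reuse_distances) / len(reuse_distances)
    return m

# Toy usage: mostly short reuse distances give a low predicted miss ratio.
print(estimate_miss_ratio([10, 50, 200, 1000, 40000], cache_lines=4096))
```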

Patent
11 Dec 2006
TL;DR: In this article, a cache memory preprocessor consists of a command inputter, which receives a multiple-way cache memory processing command from the processor, and a command implementer, which performs background processing upon multiple ways of the cache memory.
Abstract: A cache memory preprocessor prepares a cache memory for use by a processor. The processor accesses a main memory via a cache memory, which serves a data cache for the main memory. The cache memory preprocessor consists of a command inputter, which receives a multiple-way cache memory processing command from the processor, and a command implementer. The command implementer performs background processing upon multiple ways of the cache memory in order to implement the cache memory processing command received by the command inputter.