
Showing papers on "Cache invalidation" published in 2009


Proceedings ArticleDOI
20 Jun 2009
TL;DR: Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache, is proposed.
Abstract: Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.
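The sketch below is not the authors' implementation; it only illustrates the idea of class-based placement in a tiled last-level cache. The three access classes and the simplified rules (private data in the requester's local slice, shared data interleaved across all slices, instructions interleaved within a small cluster of tiles) as well as the tile and cluster counts are assumptions for illustration.

```python
# Minimal sketch of class-based block placement in a tiled LLC (assumed parameters).
NUM_TILES = 16
CLUSTER_SIZE = 4  # assumed cluster size for instruction interleaving

def classify(access_type, is_shared):
    """Classify an access; a real design would derive this from OS page metadata."""
    if access_type == "instruction":
        return "instruction"
    return "shared-data" if is_shared else "private-data"

def place(block_addr, requesting_tile, access_class):
    """Return the tile whose LLC slice should hold the block."""
    if access_class == "private-data":
        return requesting_tile                      # local slice, no replication needed
    if access_class == "shared-data":
        return block_addr % NUM_TILES               # interleave across all slices
    # Instructions: interleave within the requester's cluster so copies stay nearby.
    cluster_base = (requesting_tile // CLUSTER_SIZE) * CLUSTER_SIZE
    return cluster_base + (block_addr % CLUSTER_SIZE)

if __name__ == "__main__":
    print(place(0x1234, 5, classify("data", is_shared=False)))   # local slice: 5
    print(place(0x1234, 5, classify("data", is_shared=True)))    # interleaved slice
    print(place(0x1234, 5, classify("instruction", is_shared=False)))
```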

436 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This work proposes a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism.
Abstract: Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.
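As a rough illustration of how insertion and promotion can be combined in one mechanism, the sketch below models a single set as a recency stack whose per-core insertion position and promotion step are tunable parameters; the class name and the concrete parameter values are assumptions, not the paper's design.

```python
# Minimal sketch: a cache set with parameterized insertion position and
# incremental promotion. Inserting one core's blocks near the LRU end and
# promoting them only a little on each hit keeps them from monopolizing the set.
class InsertionPromotionSet:
    def __init__(self, ways=8):
        self.ways = ways
        self.stack = []          # index 0 = MRU end, last index = LRU end

    def access(self, tag, insert_pos, promote_by):
        if tag in self.stack:                         # hit: promote a few positions
            i = self.stack.index(tag)
            self.stack.insert(max(0, i - promote_by), self.stack.pop(i))
            return True
        if len(self.stack) >= self.ways:              # miss: evict from the LRU end
            self.stack.pop()
        self.stack.insert(min(insert_pos, len(self.stack)), tag)
        return False

s = InsertionPromotionSet(ways=4)
for addr in [10, 11, 12, 13]:
    s.access(addr, insert_pos=3, promote_by=1)   # a "polluting" core: insert near LRU
s.access(42, insert_pos=0, promote_by=4)         # a reuse-friendly core: insert at MRU
print(s.stack)                                   # [42, 10, 11, 12]
```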

334 citations


Patent
03 Jun 2009
TL;DR: In this paper, the authors present a method and a node for finding the shortest path to a cache node in a content delivery network (CDN) comprising requested content and a method for creating a virtual representation of a network.
Abstract: Embodiments of the present invention provide a method and a node for finding the shortest path to a cache node in a content delivery network (CDN) comprising requested content, and a method for creating a virtual representation of a network. According to an embodiment of the present invention, the virtual representation is in the form of a virtual, hierarchical topology, and the cache nodes correspond to the cache nodes of the real network. All cache nodes are arranged at a first level, with the virtual nodes arranged at higher levels. In the virtual representation, all nodes (cache and virtual) are connected with virtual links such that there exists only one path between any two arbitrary cache nodes. Further, costs are assigned to the virtual links such that the path cost between any two arbitrary cache nodes in the virtual representation generally corresponds to the lowest path cost between the corresponding cache nodes in the real network.

146 citations


Proceedings ArticleDOI
12 Oct 2009
TL;DR: A scheduling strategy for real-time tasks with both timing and cache space constraints is presented, which allows each task to use a fixed number of cache partitions, and makes sure that at any time a cache partition is occupied by at most one running task.
Abstract: The major obstacle to using multicores for real-time applications is that we cannot predict or provide any guarantee on the real-time properties of embedded software on such platforms; the way on-chip shared resources such as the L2 cache are handled may have a significant impact on timing predictability. In this paper, we propose to use cache space isolation techniques to avoid cache contention for hard real-time tasks running on multicores with shared caches. We present a scheduling strategy for real-time tasks with both timing and cache space constraints, which allows each task to use a fixed number of cache partitions, and makes sure that at any time a cache partition is occupied by at most one running task. In this way, the cache spaces of tasks are isolated at run-time. As technical contributions, we have developed a sufficient schedulability test for non-preemptive fixed-priority scheduling for multicores with a shared L2 cache, encoded as a linear programming problem. To improve the scalability of the test, we then present a second schedulability test of quadratic complexity, which is an over-approximation of the first test. To evaluate the performance and scalability of our techniques, we use randomly generated task sets. Our experiments show that the first test, which employs an LP solver, can easily handle task sets with thousands of tasks in minutes on a desktop computer. It is also shown that the second test is comparable to the first in terms of precision, but scales much better due to its low complexity, and is therefore a good candidate for efficient schedulability tests in the design loop for embedded systems or as an on-line test for admission control.

131 citations


Proceedings ArticleDOI
06 Mar 2009
TL;DR: This paper proposes three hardware-software approaches to defend against software cache-based attacks - they present different tradeoffs between hardware complexity and performance overhead and proposes novel software permutation to replace the random permutation hardware in the RPcache.
Abstract: Software cache-based side channel attacks present serious threats to modern computer systems. Using caches as a side channel, these attacks are able to derive secret keys used in cryptographic operations through legitimate activities. Among existing countermeasures, software solutions are typically application specific and incur substantial performance overhead. Recent hardware proposals including the Partition-Locked cache (PLcache) and Random-Permutation cache (RPcache) [23], although very effective in reducing performance overhead while enhancing the security level, may still be vulnerable to advanced cache attacks. In this paper, we propose three hardware-software approaches to defend against software cache-based attacks - they present different tradeoffs between hardware complexity and performance overhead. First, we propose to use preloading to secure the PLcache. Second, we leverage informing loads, which is a lightweight architectural support originally proposed to improve memory performance, to protect the RPcache. Third, we propose novel software permutation to replace the random permutation hardware in the RPcache. This way, regular caches can be protected with hardware support for informing loads. In our experiments, we analyze various processor models for their vulnerability to cache attacks and demonstrate that even to the processor model that is most vulnerable to cache attacks, our proposed software-hardware integrated schemes provide strong security protection.

129 citations


Proceedings ArticleDOI
Moinuddin K. Qureshi
06 Mar 2009
TL;DR: A simple extension of DSR is proposed that provides Quality of Service (QoS) by guaranteeing that the worst-case performance of each application remains similar to that with no spilling, while still providing an average throughput improvement of 17.5%.
Abstract: In a Chip Multi-Processor (CMP) with private caches, the last level cache is statically partitioned between all the cores. This prevents such CMPs from sharing cache capacity in response to the requirement of individual cores. Capacity sharing can be provided in private caches by spilling a line evicted from one cache to another cache. However, naively allowing all caches to spill evicted lines to other caches has limited performance benefit, as such spilling does not take into account which cores benefit from extra capacity and which cores can provide extra capacity. This paper proposes Dynamic Spill-Receive (DSR) for efficient capacity sharing. In a DSR architecture, each cache uses Set Dueling to learn whether it should act as a “spiller cache” or “receiver cache” for best overall performance. We evaluate DSR for a Quad-core system with 1MB private caches using 495 multi-programmed workloads. DSR improves average throughput by 18% (weighted-speedup by 13% and harmonic-mean fairness metric by 36%) compared to no spilling. DSR requires a total storage overhead of less than two bytes per core, does not require any changes to the existing cache structure, and is scalable to a large number of cores (16 in our evaluation). Furthermore, we propose a simple extension of DSR that provides Quality of Service (QoS) by guaranteeing that the worst-case performance of each application remains similar to that with no spilling, while still providing an average throughput improvement of 17.5%.
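A rough sketch of the Set Dueling idea behind spill/receive learning is shown below; the number of dedicated sets, the counter width, and the single-cache view (ignoring interactions between caches) are assumptions made only to illustrate the mechanism.

```python
# Minimal sketch of set dueling for a spill-vs-receive decision (assumed parameters).
NUM_SETS = 1024
PSEL_MAX = 1023            # assumed 10-bit saturating counter

class DSRPolicy:
    def __init__(self):
        self.psel = PSEL_MAX // 2
        # Assumed sampling: the first 32 sets always spill, the last 32 always receive.
        self.spill_sets = set(range(0, 32))
        self.receive_sets = set(range(NUM_SETS - 32, NUM_SETS))

    def record_miss(self, set_index):
        """Misses in the dedicated sets steer the counter toward the other policy."""
        if set_index in self.spill_sets:
            self.psel = min(PSEL_MAX, self.psel + 1)   # spilling looks bad: lean receive
        elif set_index in self.receive_sets:
            self.psel = max(0, self.psel - 1)          # receiving looks bad: lean spill

    def follower_action(self):
        """Policy applied by all non-dedicated (follower) sets of this cache."""
        return "spill" if self.psel < PSEL_MAX // 2 else "receive"

p = DSRPolicy()
for s in range(0, 32):          # many misses in the always-spill sets...
    p.record_miss(s)
print(p.follower_action())      # ...so this cache acts as a receiver
```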

121 citations


Proceedings ArticleDOI
12 Dec 2009
TL;DR: Pseudo-last-in-first-out (pseudo-LIFO), a fundamentally new family of replacement heuristics that manages each cache set as a fill stack (as opposed to the traditional access recency stack), is introduced; it delivers better average performance for the selected single-threaded and multiprogrammed workloads and similar average performance for the multi-threaded workloads compared to a dead block prediction-based replacement policy.
Abstract: Cache blocks often exhibit a small number of uses during their life time in the last-level cache. Past research has exploited this property in two different ways. First, replacement policies have been designed to evict dead blocks early and retain the potentially live blocks. Second, dynamic insertion policies attempt to victimize single-use blocks (dead on fill) as early as possible, thereby leaving most of the working set undisturbed in the cache. However, we observe that as the last-level cache grows in capacity and associativity, the traditional dead block prediction-based replacement policy loses effectiveness because often the LRU block itself is dead leading to an LRU replacement decision. The benefit of dynamic insertion policies is also small in a large class of applications that exhibit a significant number of cache blocks with small, yet more than one, uses. To address these drawbacks, we introduce pseudo-last-in-first-out (pseudo-LIFO), a fundamentally new family of replacement heuristics that manages each cache set as a fill stack (as opposed to the traditional access recency stack). We specify three members of this family, namely, dead block prediction LIFO, probabilistic escape LIFO, and probabilistic counter LIFO. The probabilistic escape LIFO (peLIFO) policy is the central contribution of this paper. It dynamically learns the use probabilities of cache blocks beyond each fill stack position to implement a new replacement policy. Our detailed simulation results show that peLIFO, while having less than 1% storage overhead, reduces the execution time by 10% on average compared to a baseline LRU replacement policy for a set of fourteen single-threaded applications on a 2 MB 16-way set associative L2 cache. It reduces the average CPI by 19% on average for a set of twelve multiprogrammed workloads while satisfying a strong fairness requirement on a four-core chip-multiprocessor with an 8 MB 16-way set associative shared L2 cache. Further, it reduces the parallel execution time by 17% on average for a set of six multi-threaded programs on an eight-core chip-multiprocessor with a 4 MB 16-way set associative shared L2 cache. For the architectures considered in this paper, the storage overhead of the peLIFO policy is one-fifth to half of that of a state-of-the-art dead block prediction-based replacement policy. However, the peLIFO policy delivers better average performance for the selected single-threaded and multiprogrammed workloads and similar average performance for the multi-threaded workloads compared to the dead block prediction-based replacement policy.
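The sketch below is a highly simplified illustration of the fill-stack bookkeeping behind the pseudo-LIFO family, not the peLIFO policy itself: the fixed escape_point parameter stands in for the use probabilities that peLIFO learns dynamically, and the class and parameter names are assumptions.

```python
# Minimal sketch: manage a set in fill order and evict near the top of the fill stack.
class PseudoLIFOSet:
    def __init__(self, ways=8, escape_point=2):
        self.ways = ways
        self.fill_stack = []         # last element = most recently filled (stack top)
        self.escape_point = escape_point

    def access(self, tag):
        if tag in self.fill_stack:   # hits do not reorder the fill stack
            return True
        if len(self.fill_stack) >= self.ways:
            # Blocks below the escape point (filled earliest) are protected; the
            # victim is the most recently filled unprotected block.
            victim_index = len(self.fill_stack) - 1
            if victim_index < self.escape_point:
                victim_index = 0     # degenerate case: nothing above the escape point
            self.fill_stack.pop(victim_index)
        self.fill_stack.append(tag)  # a newly filled block goes on top of the stack
        return False

s = PseudoLIFOSet(ways=4, escape_point=2)
for a in [1, 2, 3, 4, 5, 6]:
    s.access(a)
print(s.fill_stack)   # blocks 1 and 2 stay resident; churn happens near the stack top
```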

114 citations


Patent
11 May 2009
TL;DR: In this paper, a poll-based notification system is used in distributed caches for tracking changes to cache items, where the server can maintain the changes in an efficient fashion (in blocks) and return the changes to clients that perform the appropriate filtering.
Abstract: Systems and methods that supply a poll-based notification system in a distributed cache for tracking changes to cache items. Local caches on the client can employ the notification system to keep the local objects in sync with the backend cache service, and can further dynamically adjust the “scope” of notifications required based on the number and distribution of keys in the local cache. The server can maintain the changes in an efficient fashion (in blocks) and return the changes to clients, which perform the appropriate filtering. Notifications can be associated with a session and/or an application.

110 citations


Proceedings ArticleDOI
08 Jun 2009
TL;DR: On applications manipulating large amounts of null data blocks, such a ZC cache significantly reduces the miss rate and memory traffic, and therefore increases performance at a small hardware overhead.
Abstract: It has been observed that some applications manipulate large amounts of null data. Moreover, these zero data often exhibit high spatial locality. On some applications more than 20% of the data accesses concern null data blocks. Representing a null block in a cache on a standard cache line appears to be a waste of resources. In this paper, we propose the Zero-Content Augmented cache, the ZCA cache. A ZCA cache consists of a conventional cache augmented with a specialized cache for memorizing null blocks, the Zero-Content cache or ZC cache. In the ZC cache, a data block is represented by its address tag and a validity bit. Moreover, as null blocks generally exhibit high spatial locality, several null blocks can be associated with a single address tag in the ZC cache. For instance, a ZC cache mapping 32MB of zero 64-byte lines uses less than 80KB of storage. Decompression of a null block is very simple, therefore the read access time of the ZCA cache is in the same range as that of a conventional cache. On applications manipulating large amounts of null data blocks, such a ZC cache significantly reduces the miss rate and memory traffic, and therefore increases performance at a small hardware overhead. In particular, the write-back traffic on null blocks is limited. For applications with a low null block rate, no performance loss is observed.
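A small functional sketch of the Zero-Content cache idea follows, under assumed parameters (64-byte blocks, one address tag covering 64 contiguous blocks): a null block costs one validity bit instead of a full line, and a non-zero store clears the bit.

```python
# Minimal sketch of a Zero-Content cache: one tag plus a per-block null bitmap per region.
BLOCK_SIZE = 64
BLOCKS_PER_REGION = 64          # assumed: one tag covers 64 contiguous blocks

class ZeroContentCache:
    def __init__(self):
        self.regions = {}       # region tag -> bitmap of known-null blocks

    def _split(self, addr):
        block = addr // BLOCK_SIZE
        return block // BLOCKS_PER_REGION, block % BLOCKS_PER_REGION

    def lookup(self, addr):
        """Return True if the block is a recorded null block (hit in the ZC cache)."""
        tag, idx = self._split(addr)
        return bool(self.regions.get(tag, 0) & (1 << idx))

    def record_null(self, addr):
        tag, idx = self._split(addr)
        self.regions[tag] = self.regions.get(tag, 0) | (1 << idx)

    def record_nonzero_write(self, addr):
        """A store of non-zero data invalidates the null-block marking."""
        tag, idx = self._split(addr)
        if tag in self.regions:
            self.regions[tag] &= ~(1 << idx)

zc = ZeroContentCache()
zc.record_null(0x10000)
print(zc.lookup(0x10000), zc.lookup(0x10040))   # True False
zc.record_nonzero_write(0x10000)
print(zc.lookup(0x10000))                       # False
```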

109 citations


Patent
22 Dec 2009
TL;DR: In this paper, a content delivery network includes first and second sets of cache servers, a domain name server, and an anycast island controller, which is configured to respond to a request for content from a client system and provide the content to the client system.
Abstract: A content delivery network includes first and second sets of cache servers, a domain name server, and an anycast island controller. The first set of cache servers is hosted by a first autonomous system and the second set of cache servers is hosted by a second autonomous system. The cache servers are configured to respond to an anycast address for the content delivery network, to receive a request for content from a client system, and provide the content to the client system. The first and second autonomous systems are configured to balance the load across the first and second sets of cache servers, respectively. The domain name server is configured to receive a request from a requestor for a cache server address, and provide the anycast address to the requestor in response to the request. The anycast island controller is configured to receive load information from each of the cache servers, determine an amount of requests to transfer from the first autonomous system to the second autonomous system; send an instruction to the first autonomous system to transfer the amount of requests to the second autonomous system.

106 citations


Proceedings ArticleDOI
19 Apr 2009
TL;DR: This work addresses cooperative caching in mobile ad hoc networks where information is exchanged in a peer-to-peer fashion among the network nodes, and devises a fully-distributed caching strategy called Hamlet, which creates content diversity within the nodes' neighborhood.
Abstract: We address cooperative caching in mobile ad hoc networks where information is exchanged in a peer-to-peer fashion among the network nodes. Our objective is to devise a fully-distributed caching strategy whereby nodes, independently of each other, decide whether or not to cache some content, and for how long. Each node takes this decision according to its perception of what nearby users may be storing in their caches, with the aim of differentiating its own cache content from the others'. We aptly named this algorithm "Hamlet". The result is the creation of content diversity within the nodes' neighborhood, so that a requesting user likely finds the desired information nearby. We simulate our caching algorithm in an urban scenario, featuring vehicular mobility, as well as in a mall scenario with pedestrians carrying mobile devices. Comparison with other caching schemes under different forwarding strategies confirms that Hamlet succeeds in creating the desired content diversity, thus leading to resource-efficient information access.

Journal ArticleDOI
TL;DR: This paper shows how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system.
Abstract: Multicore processors contain new hardware characteristics that are different from previous generation single-core systems or traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We demonstrate three case studies on how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention in the shared cache among applications, (2) detecting memory regions with bad cache usage, which helps in isolating these regions to reduce cache pollution, and (3) detecting sharing among threads, which helps in clustering threads to improve locality. Using realistic applications from standard benchmark suites, the following performance improvements were achieved: (1) up to 27% improvement in IPC (instructions-per-cycle) due to cache partition sizing; (2) up to 10% reduction in cache miss rates due to reduced cache pollution, resulting in up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses due to thread clustering, resulting in up to 7% application-level improvement.

Patent
30 Oct 2009
TL;DR: In this article, a method for accessing data stored in a distributed storage system is provided, which comprises determining whether a copy of first data is stored in a distributed cache system, where data in the distributed cache system is located in free storage space of the distributed storage system.
Abstract: A method for accessing data stored in a distributed storage system is provided. The method comprises determining whether a copy of first data is stored in a distributed cache system, where data in the distributed cache system is stored in free storage space of the distributed storage system; accessing the copy of the first data from the distributed cache system if the copy of the first data is stored in a first data storage medium at a first computing system in a network; and requesting a second computing system in the network to access the copy of the first data from the distributed cache system if the copy of the first data is stored in a second data storage medium at the second computing system. If the copy of the first data is not stored in the distributed cache system, the first data is accessed from the distributed storage system.

Proceedings ArticleDOI
06 Mar 2009
TL;DR: A 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data is postulated and it is shown that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network.
Abstract: Cache hierarchies in future many-core processors are expected to grow in size and contribute a large fraction of overall processor power and performance. In this paper, we postulate a 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data. We then propose a heterogeneous reconfigurable cache design that takes advantage of the high density of DRAM and the superior power/delay characteristics of SRAM to efficiently meet the working set demands of each individual core. Finally, we analyze the communication patterns for such a processor and show that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network. The above proposals are synergistic: each proposal is made more compelling because of its combination with the other innovations described in this paper. The proposed reconfigurable cache model improves performance by up to 19% along with 48% savings in network power.

Patent
18 Sep 2009
TL;DR: In this article, a plurality of mid-tier databases form a single, consistent cache grid for data in a one or more backend data sources, such as a database system, and consistency in the cache grid is maintained by ownership locks.
Abstract: A plurality of mid-tier databases form a single, consistent cache grid for data in one or more backend data sources, such as a database system. The mid-tier databases may be standard relational databases. Cache agents at each mid-tier database swap in data from the backend database as needed. Consistency in the cache grid is maintained by ownership locks. Cache agents prevent database operations that will modify cached data in a mid-tier database unless and until ownership of the cached data can be acquired for the mid-tier database. Cache groups define what backend data may be cached, as well as a general structure in which the backend data is to be cached. Metadata for cache groups is shared to ensure that data is cached in the same form throughout the entire grid. Ownership of cached data can then be tracked through a mapping of cached instances of data to particular mid-tier databases.

Journal ArticleDOI
Matteo Frigo, Volker Strumpen
TL;DR: It is shown that a multithreaded cache oblivious matrix multiplication incurs cache misses when executed by the Cilk work-stealing scheduler on a machine with P processors, each with a cache of size Z, with high probability.
Abstract: We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs cache misses when executed by the Cilk scheduler on a machine with P processors, each with a cache of size Z, with high probability. This bound is tighter than previously published bounds. We also present a new multithreaded cache oblivious algorithm for 1D stencil computations incurring cache misses with high probability, one for Gaussian elimination and back substitution, and one for the length computation part of the longest common subsequence problem incurring cache misses with high probability.
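For readers unfamiliar with cache-oblivious matrix multiplication, the sketch below shows the standard recursive divide-and-conquer formulation of the kind analyzed in the paper: the largest dimension is halved until the subproblem fits in whatever cache is present, without the algorithm ever naming the cache size Z. The Cilk scheduling and the miss bounds are not reproduced here, and the base-case cutoff is an arbitrary choice.

```python
# Cache-oblivious matrix multiplication: C[ci:ci+n, cj:cj+p] += A * B (sketch).
def matmul_add(A, B, C, ai, aj, bi, bj, ci, cj, n, m, p, base=16):
    if max(n, m, p) <= base:                     # small enough: multiply directly
        for i in range(n):
            for k in range(m):
                a = A[ai + i][aj + k]
                for j in range(p):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    if n >= m and n >= p:                        # split the rows of A and C
        h = n // 2
        matmul_add(A, B, C, ai, aj, bi, bj, ci, cj, h, m, p, base)
        matmul_add(A, B, C, ai + h, aj, bi, bj, ci + h, cj, n - h, m, p, base)
    elif p >= m:                                 # split the columns of B and C
        h = p // 2
        matmul_add(A, B, C, ai, aj, bi, bj, ci, cj, n, m, h, base)
        matmul_add(A, B, C, ai, aj, bi, bj + h, ci, cj + h, n, m, p - h, base)
    else:                                        # split the shared dimension
        h = m // 2
        matmul_add(A, B, C, ai, aj, bi, bj, ci, cj, n, h, p, base)
        matmul_add(A, B, C, ai, aj + h, bi + h, bj, ci, cj, n, m - h, p, base)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
matmul_add(A, B, C, 0, 0, 0, 0, 0, 0, 2, 2, 2, base=1)
print(C)   # [[19, 22], [43, 50]]
```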

Patent
09 Sep 2009
TL;DR: In this article, the contents of a non-volatile memory device may be relied upon as accurately reflecting data stored on disk storage across a power transition such as a reboot, and cache metadata may be efficiently accessed and reliably saved and restored across power transitions.
Abstract: Embodiments of the invention provide techniques for ensuring that the contents of a non-volatile memory device may be relied upon as accurately reflecting data stored on disk storage across a power transition such as a reboot. For example, some embodiments of the invention provide techniques for determining whether the cache contents and/or disk contents are modified during a power transition, causing cache contents to no longer accurately reflect data stored in disk storage. Further, some embodiments provide techniques for managing cache metadata during normal ("steady state") operations and across power transitions, ensuring that cache metadata may be efficiently accessed and reliably saved and restored across power transitions.

Proceedings ArticleDOI
19 Apr 2009
TL;DR: This paper seeks to construct an analytical framework based on optimal control theory and dynamic programming, to help form an in-depth understanding of optimal strategies to design cache replacement algorithms in peer-assisted VoD systems.
Abstract: Peer-assisted Video-on-Demand (VoD) systems have not only received substantial recent research attention, but also been implemented and deployed with success in large-scale real-world streaming systems, such as PPLive. Peer-assisted Video-on-Demand systems are designed to take full advantage of peer upload bandwidth contributions with a cache on each peer. Since the size of such a cache on each peer is limited, it is imperative that an appropriate cache replacement algorithm is designed. There exists a tremendous level of flexibility in the design space of such cache replacement algorithms, including the simplest alternatives such as Least Recently Used (LRU). Which algorithm is the best to minimize server bandwidth costs, so that when peers need a media segment, it is most likely available from caches of other peers? Such a question, however, is arguably non-trivial to answer, as both the demand and supply of media segments are stochastic in nature. In this paper, we seek to construct an analytical framework based on optimal control theory and dynamic programming, to help us form an in-depth understanding of optimal strategies to design cache replacement algorithms. With such analytical insights, we have shown with extensive simulations that, the performance margin enjoyed by optimal strategies over the simplest algorithms is not substantial, when it comes to reducing server bandwidth costs. In most cases, the simplest choices are good enough as cache replacement algorithms in peer-assisted VoD systems.

Patent
31 Mar 2009
TL;DR: In this article, a computer-implemented method for controlling initialization of a fingerprint cache for data deduplication associated with a single-instance-storage computing subsystem is proposed.
Abstract: A computer-implemented method for controlling initialization of a fingerprint cache for data deduplication associated with a single-instance-storage computing subsystem may comprise: 1) detecting a request to store a data selection to the single-instance-storage computing subsystem, 2) leveraging a client-side fingerprint cache associated with a previous storage of the data selection to the single-instance-storage computing subsystem to initialize a new client-side fingerprint cache, and 3) utilizing the new client-side fingerprint cache for data deduplication associated with the request to store the data selection to the single-instance-storage computing subsystem. Other exemplary methods of controlling initialization of a fingerprint cache for data deduplication, as well as corresponding exemplary systems and computer-readable-storage media, are also disclosed.
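A loose sketch of client-side fingerprint caching for deduplication follows. The fixed 4 KB chunking and the idea of seeding the new cache from the cache of a previous backup of the same data are assumptions used only to illustrate the flow described above, not the patent's exact mechanism.

```python
# Minimal sketch: fingerprint-cache-based deduplication of a backup stream.
import hashlib

CHUNK_SIZE = 4096

def fingerprints(data):
    for off in range(0, len(data), CHUNK_SIZE):
        yield hashlib.sha256(data[off:off + CHUNK_SIZE]).hexdigest()

def backup(data, fingerprint_cache):
    """Return the chunk indices that actually need to be sent/stored."""
    new_chunks = []
    for i, fp in enumerate(fingerprints(data)):
        if fp not in fingerprint_cache:      # unseen chunk: store it, remember its fp
            fingerprint_cache.add(fp)
            new_chunks.append(i)
    return new_chunks

previous_cache = set()
data_v1 = b"A" * 8192 + b"B" * 4096
print(backup(data_v1, previous_cache))        # [0, 2]: the duplicate "A" chunk is skipped

# Initialize the new cache by leveraging the previous backup's cache, then
# back up a slightly modified version of the same data.
new_cache = set(previous_cache)
data_v2 = b"A" * 8192 + b"C" * 4096
print(backup(data_v2, new_cache))             # [2]: only the changed chunk is new
```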

Proceedings ArticleDOI
12 Dec 2009
TL;DR: This paper presents a technique that aims to balance the pressure on the cache sets by detecting when it may be beneficial to associate sets, displacing lines from stressed sets to underutilized ones.
Abstract: Efficient memory hierarchy design is critical due to the increasing gap between the speed of the processors and the memory. One of the sources of inefficiency in current caches is the non-uniform distribution of the memory accesses on the cache sets. Its consequence is that while some cache sets may have working sets that are far from fitting in them, other sets may be underutilized because their working set has fewer lines than the set. In this paper we present a technique that aims to balance the pressure on the cache sets by detecting when it may be beneficial to associate sets, displacing lines from stressed sets to underutilized ones. This new technique, called Set Balancing Cache or SBC, achieved an average reduction of 13% in the miss rate of ten benchmarks from the SPEC CPU2006 suite, resulting in an average IPC improvement of 5%.
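The following sketch conveys the set-balancing intuition only: each set keeps a saturating pressure counter, and a stressed set may displace its victim into a lightly used partner set instead of dropping it, with lookups checking both sets. The static pairing of set i with set i XOR 1 and the thresholds are assumptions for illustration; SBC associates sets dynamically.

```python
# Minimal sketch of displacing victims from stressed sets into underutilized ones.
class SetBalancingCache:
    def __init__(self, num_sets=8, ways=2, threshold=4):
        self.ways = ways
        self.threshold = threshold
        self.sets = [[] for _ in range(num_sets)]       # each set is an LRU list (MRU first)
        self.pressure = [0] * num_sets                  # saturating per-set miss counters

    def _lookup(self, idx, tag):
        if tag in self.sets[idx]:
            self.sets[idx].remove(tag)
            self.sets[idx].insert(0, tag)
            return True
        return False

    def access(self, set_index, tag):
        partner = set_index ^ 1
        if self._lookup(set_index, tag) or self._lookup(partner, tag):
            return True                                  # hit in the native or partner set
        self.pressure[set_index] = min(self.threshold * 2, self.pressure[set_index] + 1)
        if len(self.sets[set_index]) >= self.ways:
            victim = self.sets[set_index].pop()          # LRU victim of the stressed set
            stressed = self.pressure[set_index] >= self.threshold
            underused = self.pressure[partner] < self.threshold
            if stressed and underused and len(self.sets[partner]) < self.ways:
                self.sets[partner].insert(0, victim)     # displace instead of evicting
        self.sets[set_index].insert(0, tag)
        return False

c = SetBalancingCache()
for tag in ["a", "b", "c", "d", "e"]:
    c.access(0, tag)          # set 0 is hammered while set 1 stays empty
print(c.access(0, "b"))       # True: "b" was displaced into set 1 and is still found
```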

Patent
27 Mar 2009
TL;DR: In this paper, a shared cache controller enables or disables access separately to each of the cache ways based upon the corresponding source of a received memory request, and the control of the accessibility of the shared cache ways via altering stored values in the configuration and status registers (CSRs) may be used to create a pseudo-RAM structure within the shared cache and to progressively reduce the size of the shared cache during a power-down sequence while the cache continues operation.
Abstract: A system and method for data allocation in a shared cache memory of a computing system are contemplated. Each cache way of a shared set-associative cache is accessible to multiple sources, such as one or more processor cores, a graphics processing unit (GPU), an input/output (I/O) device, or multiple different software threads. A shared cache controller enables or disables access separately to each of the cache ways based upon the corresponding source of a received memory request. One or more configuration and status registers (CSRs) store encoded values used to alter accessibility to each of the shared cache ways. The control of the accessibility of the shared cache ways via altering stored values in the CSRs may be used to create a pseudo-RAM structure within the shared cache and to progressively reduce the size of the shared cache during a power-down sequence while the shared cache continues operation.

Proceedings ArticleDOI
22 Sep 2009
TL;DR: Two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers are introduced, which outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred.
Abstract: The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.

Patent
10 Oct 2009
TL;DR: In this article, a method for efficiently using a large secondary cache is presented, in which a plurality of data tracks is accumulated in the secondary cache and, in the event that a subset of the tracks makes up a full stride, the subset is destaged from the secondary cache.
Abstract: A method for efficiently using a large secondary cache is disclosed herein. In certain embodiments, such a method may include accumulating, in a secondary cache, a plurality of data tracks. These data tracks may include modified data and/or unmodified data. The method may determine if a subset of the plurality of data tracks makes up a full stride. In the event the subset makes up a full stride, the method may destage the subset from the secondary cache. By destaging full strides, the method reduces the number of disk operations that are required to destage data from the secondary cache. A corresponding computer program product and apparatus are also disclosed and claimed herein.
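A short sketch of full-stride destaging is shown below. The stride geometry (eight consecutive tracks per stride) is an assumed parameter; the point is only that tracks are held until every member of a stride is present, so the destage becomes a single sequential write instead of several small ones.

```python
# Minimal sketch: accumulate tracks in a secondary cache and destage full strides.
from collections import defaultdict

TRACKS_PER_STRIDE = 8      # assumed stride width

class SecondaryCache:
    def __init__(self):
        self.strides = defaultdict(set)     # stride id -> set of cached track numbers

    def add_track(self, track_number):
        stride_id = track_number // TRACKS_PER_STRIDE
        self.strides[stride_id].add(track_number)
        if len(self.strides[stride_id]) == TRACKS_PER_STRIDE:
            self.destage(stride_id)

    def destage(self, stride_id):
        tracks = sorted(self.strides.pop(stride_id))
        # One sequential disk operation covers the whole stride.
        print(f"destaging full stride {stride_id}: tracks {tracks[0]}..{tracks[-1]}")

cache = SecondaryCache()
for t in [16, 18, 17, 19, 21, 20, 23, 22]:   # tracks of stride 2 arrive out of order
    cache.add_track(t)                        # destage fires once the stride is complete
```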

Proceedings ArticleDOI
24 Aug 2009
TL;DR: A novel approach to analyzing the worst-case cache interferences and bounding the WCET for threads running on multi-core processors with shared direct-mapped L2 instruction caches is proposed, using an Extended ILP (Integer Linear Programming) to model all the possible inter-thread cache conflicts.
Abstract: In a multi-core processor, different cores typically share the last-level cache, and threads running on different cores may interfere with each other in accessing the shared cache. Therefore, a multi-core WCET (Worst-Case Execution Time) analyzer must be able to safely and accurately estimate the worst-case inter-thread cache interferences, which is not supported by current WCET analysis techniques that mainly focus on analyzing uniprocessors. This paper proposes a novel approach to analyzing the worst-case cache interferences and bounding the WCET for threads running on multi-core processors with shared direct-mapped L2 instruction caches. We propose to use an Extended ILP (Integer Linear Programming) to model all the possible inter-thread cache conflicts, based on which we can accurately calculate the worst-case inter-thread cache interferences and derive the WCET. Compared to a recently proposed multi-core static analysis technique based on control flow information alone, this approach improves the tightness of WCET estimation by 13.7% on average.

Patent
Daniel J. Colglazier
17 Jun 2009
TL;DR: In this paper, an improved sectored cache replacement algorithm is implemented via a method and computer program product, which selects a cache sector among a plurality of cache sectors for replacement in a computer system.
Abstract: An improved sectored cache replacement algorithm is implemented via a method and computer program product The method and computer program product select a cache sector among a plurality of cache sectors for replacement in a computer system The method may comprise selecting a cache sector to be replaced that is not the most recently used and that has the least amount of modified data In the case in which there is a tie among cache sectors, the sector to be replaced may be the sector among such cache sectors with the least amount of valid data In the case in which there is still a tie among cache sectors, the sector to be replaced may be randomly selected among such cache sectors Unlike conventional sectored cache replacement algorithms, the improved algorithm implemented by the method and computer program product accounts for both hit rate and bus utilization

Proceedings ArticleDOI
12 Sep 2009
TL;DR: Experimental results show that in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces the execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, NAS benchmarks and a computational kernel set.
Abstract: Performance degradation of memory-intensive programs caused by the LRU policy's inability to handle weak-locality data accesses in the last level cache is increasingly serious for two reasons. First, the last-level cache remains in the CPU's critical path, where only simple management mechanisms, such as LRU, can be used, precluding some sophisticated hardware mechanisms to address the problem. Second, the commonly used shared cache structure of multi-core processors has made this critical path even more performance-sensitive due to intensive inter-thread contention for shared cache resources. Researchers have recently made efforts to address the problem with the LRU policy by partitioning the cache using hardware or OS facilities guided by run-time locality information. Such approaches often rely on special hardware support or lack enough accuracy. In contrast, for a large class of programs, the locality information can be accurately predicted if access patterns are recognized through small training runs at the data object level. To achieve this goal, we present a system-software framework referred to as Soft-OLP (Software-based Object-Level cache Partitioning). We first collect per-object reuse distance histograms and inter-object interference histograms via memory-trace sampling. With several low-cost training runs, we are able to determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among data objects with a heuristic algorithm, in order to reduce cache misses through segregation of contending objects. The object-level cache partitioning framework has been implemented with a modified Linux kernel, and tested on a commodity multi-core processor. Experimental results show that in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces the execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, NAS benchmarks and a computational kernel set.

Patent
Marcus L. Kornegay, Ngan N. Pham
19 Jun 2009
TL;DR: A write-back coherency data cache for temporarily holding cache lines is proposed in this article; it holds data being written back to main memory for a period of time prior to the data being written to main memory.
Abstract: A write-back coherency data cache for temporarily holding cache lines. Upon receiving a processor request for data, a determination is made from a coherency directory whether a copy of the data is cached in a write-back cache located in a memory controller hardware. The write-back cache holds data being written back to main memory for a period of time prior to writing the data to main memory. If the data is cached in the write-back cache, the data is removed from the write-back cache and forwarded to the requesting processor. The cache coherency state in the coherency directory entry for the data is updated to reflect the current cache coherency state of the data based on the requesting processor's intended use of the data.

Proceedings ArticleDOI
01 Jul 2009
TL;DR: This paper presents simulation-based results on representative examples that illustrate the problem of performance anomalies with standard cache replacement policies and shows that, by eliminating dependencies on access history, randomized replacement greatly reduces the risk of these cache-based performance anomalies, enabling probabilistic worst-case execution time analysis.
Abstract: While hardware caches are generally effective at improving application performance, they greatly complicate performance prediction. Slight changes in memory layout or data access patterns can lead to large and systematic increases in cache misses, degrading performance. In the worst case, these misses can effectively render the cache useless. These pathological cases, or "cache risk patterns", are difficult to predict, test or debug, and their presence limits the usefulness of caches in safety-critical real-time systems, especially in hard real-time environments. In this paper, we explore the effect of randomized cache replacement policies in real-time systems with stringent timing constraints. We present simulation-based results on representative examples that illustrate the problem of performance anomalies with standard cache replacement policies. We show that, by eliminating dependencies on access history, randomized replacement greatly reduces the risk of these cache-based performance anomalies, enabling probabilistic worst-case execution time analysis.
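The tiny simulation below contrasts LRU with randomized replacement on the classic pathological pattern of cyclically touching one more block than the set holds; the set size, trace length, and seed are arbitrary choices made only to illustrate the behavior the paper leverages for probabilistic timing analysis.

```python
# Minimal sketch: LRU thrashes on a cyclic over-capacity pattern, random replacement does not.
import random

def run(policy, ways=4, rounds=1000, working_set=5, seed=1):
    rng = random.Random(seed)
    cache, misses = [], 0
    for i in range(rounds * working_set):
        block = i % working_set                 # cyclic access pattern
        if block in cache:
            if policy == "lru":                 # move to the MRU position on a hit
                cache.remove(block)
                cache.append(block)
            continue
        misses += 1
        if len(cache) >= ways:
            if policy == "lru":
                cache.pop(0)                    # evict the least recently used block
            else:
                cache.pop(rng.randrange(ways))  # evict a uniformly random block
        cache.append(block)
    return misses / (rounds * working_set)

print("LRU miss rate:   ", run("lru"))          # 1.0 on this pattern
print("Random miss rate:", run("random"))       # noticeably below 1.0
```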

Patent
30 Apr 2009
TL;DR: In this paper, a cache-policy switching module emulates the caching of the storage data with a plurality of cache configurations, and upon a determination that one of the plurality of caches performs better than the first cache configuration, automatically applies the better cache configuration to the cache memory for caching the data.
Abstract: A method implements a cache-policy switching module in a storage system. The storage system includes a cache memory to cache storage data. The cache memory uses a first cache configuration. The cache-policy switching module emulates the caching of the storage data with a plurality of cache configurations. Upon a determination that one of the plurality of cache configurations performs better than the first cache configuration, the cache-policy switching module automatically applies the better performing cache configuration to the cache memory for caching the storage data.
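A compact sketch of the switching idea follows: shadow-emulate several candidate cache configurations on the observed access trace and adopt whichever one would have achieved the highest hit rate. The LRU model and the candidate configurations below are placeholders, not the patent's actual policies.

```python
# Minimal sketch: emulate candidate cache configurations and switch to the best one.
from collections import OrderedDict

def emulate_lru(trace, capacity):
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)     # evict the LRU entry
            cache[block] = True
    return hits / len(trace)

def pick_best_configuration(trace, candidates, current):
    scores = {name: emulate_lru(trace, capacity) for name, capacity in candidates.items()}
    best = max(scores, key=scores.get)
    # Switch only if an emulated configuration beats the currently applied one.
    return best if scores[best] > scores[current] else current

trace = [1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5] * 20
candidates = {"small": 3, "large": 5}
print(pick_best_configuration(trace, candidates, current="small"))   # "large"
```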

Proceedings ArticleDOI
18 Aug 2009
TL;DR: The evaluation shows that the cache partitioning method can work efficiently and improve the performance of co-scheduled applications running within different VMs, and that the VMM-based implementation is fully transparent to the guest OSes.
Abstract: Virtualization is often used in systems for the purpose of offering isolation among applications running in separate virtual machines (VMs). Current virtual machine monitors (VMMs) have done a decent job in resource isolation in memory, CPU and I/O devices. However, when looking further into the usage of the lower-level shared cache, we notice that one virtual machine’s cache behavior may interfere with another’s due to the uncontrolled cache sharing. In this situation, performance isolation cannot be guaranteed. This paper presents a cache partitioning approach which can be implemented in the VMM. We have implemented this mechanism in the Xen VMM using the page coloring technique traditionally applied to the OS. Our VMM-based implementation is fully transparent to the guest OSes. It thus shows the advantages of simplicity and flexibility. Our evaluation shows that our cache partitioning method can work efficiently and improve the performance of co-scheduled applications running within different VMs. In the concurrent workloads selected from the SPEC CPU 2006 benchmarks, our technique achieves a performance improvement of up to 19% for the most sensitive workloads.
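A minimal sketch of page coloring as used for cache partitioning is shown below. The cache geometry (4 MB, 16-way, 64-byte lines, 4 KB pages) is an assumed example: the color of a physical page is the slice of set-index bits above the page offset, so assigning each VM a disjoint color range confines it to a disjoint portion of the shared cache.

```python
# Minimal sketch of page-color computation and color-constrained frame allocation.
CACHE_SIZE = 4 * 1024 * 1024
WAYS = 16
LINE_SIZE = 64
PAGE_SIZE = 4096

SETS = CACHE_SIZE // (WAYS * LINE_SIZE)                 # 4096 sets
NUM_COLORS = (SETS * LINE_SIZE) // PAGE_SIZE            # 64 colors

def page_color(physical_frame_number):
    """Cache color of a physical page: the set-index bits that lie above the page offset."""
    return physical_frame_number % NUM_COLORS

def allocate_frame(vm_colors, free_frames):
    """Pick a free frame whose color belongs to the VM's assigned color set."""
    for frame in free_frames:
        if page_color(frame) in vm_colors:
            free_frames.remove(frame)
            return frame
    raise MemoryError("no free frame with an allowed color")

free_frames = list(range(0, 4096))
vm0_colors = set(range(0, 32))          # VM0 gets half of the colors...
vm1_colors = set(range(32, 64))         # ...VM1 gets the other half
print(page_color(allocate_frame(vm0_colors, free_frames)))   # a color in 0..31
print(page_color(allocate_frame(vm1_colors, free_frames)))   # a color in 32..63
```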