
Showing papers on "Smart Cache published in 2009"


Proceedings ArticleDOI
20 Jun 2009
TL;DR: Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache, is proposed.
Abstract: Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.

436 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper discusses and evaluates two types of hybrid cache architectures: inter cache Level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra cache level or cache Region based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology.
Abstract: Caching techniques have been an efficient mechanism for mitigating the effects of the processor-memory speed gap. Traditional multi-level SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D chips and 3D stacked chips. Caches fabricated in these technologies offer dramatically different power and performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this paper, we propose to take advantage of the best characteristics that each technology offers, through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter cache Level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra cache level or cache Region based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 7% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 25 workloads. A more aggressive RHCA-based design provides 12% IPC improvement over the baseline. Finally, a 2-layer 3D cache stack (3DHCA) of high density memory technology within the same chip footprint gives 18% IPC improvement over the baseline. Furthermore, up to 70% reduction in power consumption over a baseline SRAM-only design is achieved.

375 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This work proposes a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism.
Abstract: Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.
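The combined insertion/promotion idea can be illustrated with a small sketch. The Python toy below is not the paper's exact policy; the per-core insertion depths and the single-position promotion step are illustrative assumptions. It manages one cache set as a recency stack in which each core inserts new blocks at its own depth and a hit promotes a block by only one position, which approximates a capacity partition without explicitly allocating ways.

```python
# Toy sketch (not the paper's exact mechanism): one set managed as a recency
# stack; each core inserts at its own position and a hit promotes by one slot.
class InsertionPromotionSet:
    def __init__(self, ways, insert_pos):
        self.ways = ways              # associativity
        self.insert_pos = insert_pos  # per-core insertion position, 0 = MRU end
        self.stack = []               # list of (core, tag); index 0 = MRU end

    def access(self, core, tag):
        for i, (_owner, t) in enumerate(self.stack):
            if t == tag:
                # Hit: promote by a single position instead of jumping to MRU.
                if i > 0:
                    self.stack[i - 1], self.stack[i] = self.stack[i], self.stack[i - 1]
                return True
        # Miss: evict from the LRU end if the set is full ...
        if len(self.stack) >= self.ways:
            self.stack.pop()
        # ... then insert at the requesting core's assigned position.
        pos = min(self.insert_pos[core], len(self.stack))
        self.stack.insert(pos, (core, tag))
        return False

# Core 0 inserts closer to the MRU end than core 1, so it effectively
# receives a larger share of this 8-way set.
s = InsertionPromotionSet(ways=8, insert_pos={0: 2, 1: 6})
print(s.access(0, 0x100), s.access(1, 0x200), s.access(0, 0x100))
```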

334 citations


Proceedings ArticleDOI
13 Nov 2009
TL;DR: Experimental results demonstrate that two resource management approaches are effective in isolating the cache interference that one VM may have on another VM; these approaches are incorporated into the resource management framework of the example cloud infrastructure, which enables the deployment of VMs with isolation-enhanced SLAs.
Abstract: The cloud infrastructure provider (CIP) in a cloud computing platform must provide security and isolation guarantees to a service provider (SP), who builds the service(s) for such a platform. We identify last level cache (LLC) sharing as one of the impediments to the finer-grain isolation required by a service, and advocate two resource management approaches to provide performance and security isolation in the shared cloud infrastructure - cache hierarchy aware core assignment and page coloring based cache partitioning. Experimental results demonstrate that these approaches are effective in isolating the cache interference that a VM may have on another VM. We also incorporate these approaches in the resource management (RM) framework of our example cloud infrastructure, which enables the deployment of VMs with isolation-enhanced SLAs.

234 citations


Proceedings ArticleDOI
19 Apr 2009
TL;DR: This paper considers a network in which each router has a local cache that caches files passing through it and develops a simple content caching, location, and routing system that adopts an implicit, transparent, and best-effort approach towards caching.
Abstract: For several years, web caching has been used to meet the ever-increasing Web access loads. A fundamental capability of all such systems is that of inter-cache coordination, which can be divided into two main types: explicit and implicit coordination. While the former allows for greater control over resource allocation, the latter does not suffer from the additional communication overhead needed for coordination. In this paper, we consider a network in which each router has a local cache that caches files passing through it. By additionally storing minimal information regarding caching history, we develop a simple content caching, location, and routing system that adopts an implicit, transparent, and best-effort approach towards caching. Though only best effort, the policy outperforms classic policies that allow explicit coordination between caches.
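As a rough illustration of implicit, best-effort on-path caching, the sketch below gives every router on a fixed path a small LRU cache: a request is served by the first router that holds the file (or by the origin), and the file is then cached by the routers it traverses on the way back. The fixed path, the capacities, and the plain LRU policy are assumptions; the paper's use of minimal caching-history information is not modeled.

```python
from collections import OrderedDict

class Router:
    def __init__(self, capacity):
        self.cache = OrderedDict()   # file_id -> True, ordered by recency
        self.capacity = capacity

    def lookup(self, file_id):
        if file_id in self.cache:
            self.cache.move_to_end(file_id)   # refresh recency
            return True
        return False

    def store(self, file_id):
        self.cache[file_id] = True
        self.cache.move_to_end(file_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used file

def fetch(path, file_id):
    """Walk routers from client toward origin; serve from the first hit,
    otherwise from the origin, caching the file on the return path."""
    for hop, router in enumerate(path):
        if router.lookup(file_id):
            source = f"router {hop}"
            break
    else:
        hop, source = len(path), "origin server"
    for router in path[:hop]:         # best-effort caching on the way back
        router.store(file_id)
    return source

path = [Router(capacity=2) for _ in range(3)]
print(fetch(path, "a.html"))   # origin server
print(fetch(path, "a.html"))   # router 0: now cached next to the client
```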

199 citations


Proceedings ArticleDOI
12 Oct 2009
TL;DR: A scheduling strategy for real-time tasks with both timing and cache space constraints is presented, which allows each task to use a fixed number of cache partitions, and makes sure that at any time a cache partition is occupied by at most one running task.
Abstract: The major obstacle to using multicores for real-time applications is that we cannot predict or provide any guarantee on the real-time properties of embedded software on such platforms; the way the on-chip shared resources, such as the L2 cache, are handled may have a significant impact on timing predictability. In this paper, we propose to use cache space isolation techniques to avoid cache contention for hard real-time tasks running on multicores with shared caches. We present a scheduling strategy for real-time tasks with both timing and cache space constraints, which allows each task to use a fixed number of cache partitions, and makes sure that at any time a cache partition is occupied by at most one running task. In this way, the cache spaces of tasks are isolated at run-time. As technical contributions, we have developed a sufficient schedulability test for non-preemptive fixed-priority scheduling for multicores with shared L2 cache, encoded as a linear programming problem. To improve the scalability of the test, we then present our second schedulability test of quadratic complexity, which is an over-approximation of the first test. To evaluate the performance and scalability of our techniques, we use randomly generated task sets. Our experiments show that the first test, which employs an LP solver, can easily handle task sets with thousands of tasks in minutes using a desktop computer. It is also shown that the second test is comparable with the first one in terms of precision, but scales much better due to its low complexity, and is therefore a good candidate for efficient schedulability tests in the design loop for embedded systems or as an on-line test for admission control.

131 citations


Proceedings ArticleDOI
06 Mar 2009
TL;DR: This paper proposes three hardware-software approaches to defend against software cache-based attacks - they present different tradeoffs between hardware complexity and performance overhead and proposes novel software permutation to replace the random permutation hardware in the RPcache.
Abstract: Software cache-based side channel attacks present serious threats to modern computer systems. Using caches as a side channel, these attacks are able to derive secret keys used in cryptographic operations through legitimate activities. Among existing countermeasures, software solutions are typically application specific and incur substantial performance overhead. Recent hardware proposals including the Partition-Locked cache (PLcache) and Random-Permutation cache (RPcache) [23], although very effective in reducing performance overhead while enhancing the security level, may still be vulnerable to advanced cache attacks. In this paper, we propose three hardware-software approaches to defend against software cache-based attacks - they present different tradeoffs between hardware complexity and performance overhead. First, we propose to use preloading to secure the PLcache. Second, we leverage informing loads, which is a lightweight architectural support originally proposed to improve memory performance, to protect the RPcache. Third, we propose a novel software permutation scheme to replace the random permutation hardware in the RPcache. This way, regular caches can be protected with hardware support for informing loads. In our experiments, we analyze various processor models for their vulnerability to cache attacks and demonstrate that even for the processor model that is most vulnerable to cache attacks, our proposed software-hardware integrated schemes provide strong security protection.

129 citations


Proceedings ArticleDOI
Moinuddin K. Qureshi1
06 Mar 2009
TL;DR: A simple extension of DSR is proposed that provides Quality of Service (QoS) by guaranteeing that the worst-case performance of each application remains similar to that with no spilling, while still providing an average throughput improvement of 17.5%.
Abstract: In a Chip Multi-Processor (CMP) with private caches, the last level cache is statically partitioned between all the cores. This prevents such CMPs from sharing cache capacity in response to the requirements of individual cores. Capacity sharing can be provided in private caches by spilling a line evicted from one cache to another cache. However, naively allowing all caches to spill evicted lines to other caches has limited performance benefit, as such spilling does not take into account which cores benefit from extra capacity and which cores can provide extra capacity. This paper proposes Dynamic Spill-Receive (DSR) for efficient capacity sharing. In a DSR architecture, each cache uses Set Dueling to learn whether it should act as a “spiller cache” or “receiver cache” for best overall performance. We evaluate DSR for a Quad-core system with 1MB private caches using 495 multi-programmed workloads. DSR improves average throughput by 18% (weighted-speedup by 13% and harmonic-mean fairness metric by 36%) compared to no spilling. DSR requires a total storage overhead of less than two bytes per core, does not require any changes to the existing cache structure, and is scalable to a large number of cores (16 in our evaluation). Furthermore, we propose a simple extension of DSR that provides Quality of Service (QoS) by guaranteeing that the worst-case performance of each application remains similar to that with no spilling, while still providing an average throughput improvement of 17.5%.
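The Set Dueling decision at the heart of DSR can be sketched as follows; the number of dedicated sets, the counter width, and the exact update rule are illustrative assumptions rather than the paper's parameters.

```python
# Illustrative sketch of Set Dueling for the spill/receive decision.
class SetDuelingMonitor:
    def __init__(self, num_sets, dedicated=32, counter_bits=10):
        # A handful of sets is permanently dedicated to each policy.
        self.always_spill   = set(range(0, dedicated))
        self.always_receive = set(range(dedicated, 2 * dedicated))
        self.max_psel = (1 << counter_bits) - 1
        self.psel = self.max_psel // 2        # saturating policy-selection counter

    def record_miss(self, set_index):
        # Misses in spill-dedicated sets argue against spilling, and vice versa.
        if set_index in self.always_spill:
            self.psel = min(self.psel + 1, self.max_psel)
        elif set_index in self.always_receive:
            self.psel = max(self.psel - 1, 0)

    def role(self, set_index):
        if set_index in self.always_spill:
            return "spiller"
        if set_index in self.always_receive:
            return "receiver"
        # Follower sets adopt whichever dedicated policy is missing less.
        return "receiver" if self.psel > self.max_psel // 2 else "spiller"

mon = SetDuelingMonitor(num_sets=1024)
for s in range(8):                 # a burst of misses in spill-dedicated sets
    mon.record_miss(s)
print(mon.role(500))               # follower sets now behave as receivers
```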

121 citations


Book ChapterDOI
28 Mar 2009
TL;DR: Surprisingly, despite the large number of deaggregated /24 subnets, caching uni-class prefixes can effectively curb the increase of FIB sizes, and the distribution of traffic across prefixes is becoming increasingly skewed, making route caching more appealing.
Abstract: Internet routers' forwarding tables (FIBs), which must be stored in expensive fast memory for high-speed packet forwarding, are growing quickly in size due to increased multihoming, finer-grained traffic engineering, and deployment of IPv6 and VPNs. To address this problem, several Internet architectures have been proposed to reduce FIB size by returning to the earlier approach of route caching: storing only the working set of popular routes in the FIB. This paper revisits route caching. We build upon previous work by studying flat, uni-class (/24) prefix caching, with modern traffic traces from more than 60 routers in a tier-1 ISP. We first characterize routers' working sets and then evaluate route-caching performance under different cache replacement strategies and cache sizes. Surprisingly, despite the large number of deaggregated /24 subnets, caching uni-class prefixes can effectively curb the increase of FIB sizes. Moreover, uni-class prefixes substantially simplify a cache design by eliminating longest-prefix matching, enabling FIB design with slower memory technologies. Finally, by comparing our results with previous work, we show that the distribution of traffic across prefixes is becoming increasingly skewed, making route caching more appealing.
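A flat uni-class route cache reduces the FIB to an exact-match table with LRU replacement, as in the sketch below; the cache size and the stand-in slow-path lookup are placeholders.

```python
import ipaddress
from collections import OrderedDict

class Uniclass24Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.fib = OrderedDict()                 # /24 prefix -> next hop

    @staticmethod
    def key(dst_ip):
        # Collapse the destination address to its covering /24 prefix.
        return ipaddress.ip_network(f"{dst_ip}/24", strict=False)

    def lookup(self, dst_ip, slow_path):
        prefix = self.key(dst_ip)
        if prefix in self.fib:                   # exact match, no LPM needed
            self.fib.move_to_end(prefix)
            return self.fib[prefix]
        next_hop = slow_path(prefix)             # full routing-table lookup on a miss
        self.fib[prefix] = next_hop
        if len(self.fib) > self.capacity:
            self.fib.popitem(last=False)         # evict the LRU prefix
        return next_hop

cache = Uniclass24Cache(capacity=2)
rib = lambda prefix: f"next-hop-for-{prefix}"    # stand-in for the full table
print(cache.lookup("192.0.2.17", rib))
print(cache.lookup("192.0.2.99", rib))           # same /24, served from the cache
```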

114 citations


Patent
11 May 2009
TL;DR: In this paper, a poll-based notification system is used in distributed caches for tracking changes to cache items, where the server can maintain the changes in an efficient fashion (in blocks) and return the changes to clients that perform the appropriate filtering.
Abstract: Systems and methods that supply a poll-based notification system in a distributed cache for tracking changes to cache items. Local caches on the client can employ the notification system to keep the local objects in sync with the backend cache service; and can further dynamically adjust the “scope” of notifications required based on the number and distribution of keys in the local cache. The server can maintain the changes in an efficient fashion (in blocks) and return the changes to clients that perform the appropriate filtering. Notifications can be associated with a session and/or an application.

110 citations


Proceedings ArticleDOI
08 Jun 2009
TL;DR: On applications manipulating large amounts of null data blocks, such a ZC cache significantly reduces the miss rate and memory traffic, and therefore increases performance at a small hardware overhead.
Abstract: It has been observed that some applications manipulate large amounts of null data. Moreover, these zero data often exhibit high spatial locality. On some applications more than 20% of the data accesses concern null data blocks. Representing a null block in a cache on a standard cache line appears as a waste of resources. In this paper, we propose the Zero-Content Augmented cache, the ZCA cache. A ZCA cache consists of a conventional cache augmented with a specialized cache for memorizing null blocks, the Zero-Content cache or ZC cache. In the ZC cache, the data block is represented by its address tag and a validity bit. Moreover, as null blocks generally exhibit high spatial locality, several null blocks can be associated with a single address tag in the ZC cache. For instance, a ZC cache mapping 32MB of zero 64-byte lines uses less than 80KB of storage. Decompression of a null block is very simple, therefore read access time on the ZCA cache is in the same range as that of a conventional cache. On applications manipulating large amounts of null data blocks, such a ZC cache significantly reduces the miss rate and memory traffic, and therefore increases performance at a small hardware overhead. In particular, the write-back traffic on null blocks is limited. For applications with a low null block rate, no performance loss is observed.
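A minimal functional sketch of the ZC cache's representation is given below: a null 64-byte block is recorded only as a region tag plus one validity bit, so adjacent null blocks share a single tag. The region size covered by one tag (64 blocks, i.e. 4 KB) is an assumption made for illustration.

```python
BLOCK = 64                      # bytes per cache block
BLOCKS_PER_REGION = 64          # assumed: one tag covers a 4 KB region

class ZeroContentCache:
    def __init__(self):
        # region tag -> bit vector; bit i set means block i of the region is all-zero
        self.entries = {}

    @staticmethod
    def split(addr):
        block = addr // BLOCK
        return block // BLOCKS_PER_REGION, block % BLOCKS_PER_REGION

    def is_null_block(self, addr):
        region, idx = self.split(addr)
        return bool(self.entries.get(region, 0) & (1 << idx))

    def record_null_block(self, addr):
        region, idx = self.split(addr)
        self.entries[region] = self.entries.get(region, 0) | (1 << idx)

    def invalidate(self, addr):
        # A non-zero write to the block clears its validity bit.
        region, idx = self.split(addr)
        if region in self.entries:
            self.entries[region] &= ~(1 << idx)

zc = ZeroContentCache()
zc.record_null_block(0x10000)
zc.record_null_block(0x10040)            # adjacent block, same region tag
print(zc.is_null_block(0x10040))         # True: the read returns all-zero data
print(len(zc.entries))                   # 1: both blocks share one tag entry
```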

Proceedings ArticleDOI
19 Apr 2009
TL;DR: This work addresses cooperative caching in mobile ad hoc networks where information is exchanged in a peer-to-peer fashion among the network nodes, and devises a fully-distributed caching strategy called Hamlet, which creates content diversity within the nodes' neighborhood.
Abstract: We address cooperative caching in mobile ad hoc networks where information is exchanged in a peer-to-peer fashion among the network nodes. Our objective is to devise a fully-distributed caching strategy whereby nodes, independently of each other, decide whether or not to cache some content, and for how long. Each node takes this decision according to its perception of what nearby users may be storing in their caches and with the aim of differentiating its own cache content from the others'. We aptly named this algorithm "Hamlet". The result is the creation of content diversity within the nodes' neighborhood, so that a requesting user likely finds the desired information nearby. We simulate our caching algorithm in an urban scenario, featuring vehicular mobility, as well as in a mall scenario with pedestrians carrying mobile devices. Comparison with other caching schemes under different forwarding strategies confirms that Hamlet succeeds in creating the desired content diversity, thus leading to resource-efficient information access.

Journal ArticleDOI
TL;DR: This paper shows how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system.
Abstract: Multicore processors contain new hardware characteristics that are different from previous generation single-core systems or traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We demonstrate three case studies on how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention in the shared cache among applications, (2) detecting memory regions with bad cache usage, which helps in isolating these regions to reduce cache pollution, and (3) detecting sharing among threads, which helps in clustering threads to improve locality. Using realistic applications from standard benchmark suites, the following performance improvements were achieved: (1) up to 27% improvement in IPC (instructions-per-cycle) due to cache partition sizing; (2) up to 10% reduction in cache miss rates due to reduced cache pollution, resulting in up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses due to thread clustering, resulting in up to 7% application-level improvement.

Journal ArticleDOI
TL;DR: To address the latency scaling problem and the increased demand of the larger 80-way SMP size, the z10 processor cache subsystem introduces new innovative concepts and solutions.
Abstract: With the introduction of the high-frequency IBM System z10™ processor design, a new, robust cache hierarchy was needed to enable up to 80 of these processors aggregated into a tightly coupled symmetric multiprocessor (SMP) system to reach their performance potential. Typically, each time the processor frequency increases by a significant factor, as did the z10™ processor over the predecessor IBM System z9® processor, the access time of data, as measured by the number of processor cycles beyond the level 1 cache on an identical processor cache subsystem, would increase proportionally as well because the flight time on the chip interconnects across multiple hardware packaging levels has stayed relatively constant in nanoseconds. To address the latency scaling problem and the increased demand of the larger 80-way SMP size, the z10 processor cache subsystem introduces new innovative concepts and solutions.

Proceedings ArticleDOI
06 Mar 2009
TL;DR: A 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data is postulated and it is shown that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network.
Abstract: Cache hierarchies in future many-core processors are expected to grow in size and contribute a large fraction of overall processor power and performance. In this paper, we postulate a 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data. We then propose a heterogeneous reconfigurable cache design that takes advantage of the high density of DRAM and the superior power/delay characteristics of SRAM to efficiently meet the working set demands of each individual core. Finally, we analyze the communication patterns for such a processor and show that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network. The above proposals are synergistic: each proposal is made more compelling because of its combination with the other innovations described in this paper. The proposed reconfigurable cache model improves performance by up to 19% along with 48% savings in network power.

Journal ArticleDOI
Matteo Frigo, Volker Strumpen
TL;DR: A high-probability bound is shown on the number of cache misses incurred by a multithreaded cache oblivious matrix multiplication executed by the Cilk work-stealing scheduler on a machine with P processors, each with a cache of size Z.
Abstract: We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We prove a high-probability bound on the number of cache misses incurred by a multithreaded cache oblivious matrix multiplication executed by the Cilk scheduler on a machine with P processors, each with a cache of size Z; this bound is tighter than previously published bounds. We also present new multithreaded cache oblivious algorithms, with high-probability cache-miss bounds, for 1D stencil computations, for Gaussian elimination and back substitution, and for the length-computation part of the longest common subsequence problem.

Proceedings ArticleDOI
12 Dec 2009
TL;DR: This paper presents a technique that aims to balance the pressure on the cache sets by detecting when it may be beneficial to associate sets, displacing lines from stressed sets to underutilized ones.
Abstract: Efficient memory hierarchy design is critical due to the increasing gap between the speed of the processors and the memory. One of the sources of inefficiency in current caches is the non-uniform distribution of the memory accesses on the cache sets. Its consequence is that while some cache sets may have working sets that are far from fitting in them, other sets may be underutilized because their working set has fewer lines than the set. In this paper we present a technique that aims to balance the pressure on the cache sets by detecting when it may be beneficial to associate sets, displacing lines from stressed sets to underutilized ones. This new technique, called Set Balancing Cache or SBC, achieved an average reduction of 13% in the miss rate of ten benchmarks from the SPEC CPU2006 suite, resulting in an average IPC improvement of 5%.
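The set-association idea can be pictured with per-set pressure counters, as in the toy below; the threshold, the counter updates, and the pairing rule are assumptions made for the sketch, not the SBC design itself.

```python
from collections import OrderedDict

class SetBalancingCache:
    def __init__(self, num_sets=4, ways=2, threshold=4):
        self.ways = ways
        self.threshold = threshold
        self.sets = [OrderedDict() for _ in range(num_sets)]   # tag -> True (LRU order)
        self.pressure = [0] * num_sets                         # per-set pressure proxy
        self.partner = {}                                      # stressed set -> helper set

    def _find(self, idx, tag):
        for s in (idx, self.partner.get(idx)):
            if s is not None and tag in self.sets[s]:
                self.sets[s].move_to_end(tag)
                return True
        return False

    def access(self, idx, tag):
        if self._find(idx, tag):
            self.pressure[idx] = max(self.pressure[idx] - 1, 0)
            return "hit"
        self.pressure[idx] += 1
        victim_set = self.sets[idx]
        if len(victim_set) >= self.ways:
            evicted, _ = victim_set.popitem(last=False)
            if self.pressure[idx] >= self.threshold:
                # Associate with the least pressured set and displace the line there.
                helper = min(range(len(self.sets)), key=lambda s: self.pressure[s])
                if helper != idx:
                    self.partner[idx] = helper
                    self.sets[helper][evicted] = True
                    if len(self.sets[helper]) > self.ways:
                        self.sets[helper].popitem(last=False)
        victim_set[tag] = True
        return "miss"

c = SetBalancingCache()
for t in range(6):
    c.access(0, t)            # set 0 is stressed; overflow spills into a helper set
print(c.access(0, 3))         # "hit": found in the associated helper set
```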

Patent
27 Mar 2009
TL;DR: A shared cache controller enables or disables access separately to each of the cache ways based upon the corresponding source of a received memory request; controlling the accessibility of the shared cache ways by altering stored values in the configuration and status registers (CSRs) may be used to create a pseudo-RAM structure within the shared cache and to progressively reduce the size of the shared cache during a power-down sequence while the cache continues operation.
Abstract: A system and method for data allocation in a shared cache memory of a computing system are contemplated. Each cache way of a shared set-associative cache is accessible to multiple sources, such as one or more processor cores, a graphics processing unit (GPU), an input/output (I/O) device, or multiple different software threads. A shared cache controller enables or disables access separately to each of the cache ways based upon the corresponding source of a received memory request. One or more configuration and status registers (CSRs) store encoded values used to alter accessibility to each of the shared cache ways. The control of the accessibility of the shared cache ways via altering stored values in the CSRs may be used to create a pseudo-RAM structure within the shared cache and to progressively reduce the size of the shared cache during a power-down sequence while the shared cache continues operation.

Proceedings ArticleDOI
24 Aug 2009
TL;DR: A novel approach to analyzing the worst-case cache interferences and bounding the WCET for threads running on multi-core processors with shared direct-mapped L2 instruction caches is proposed, using an Extended ILP (Integer Linear Programming) to model all the possible inter-thread cache conflicts.
Abstract: In a multi-core processor, different cores typically share the last-level cache, and threads running on different cores may interfere with each other in accessing the shared cache. Therefore, multi-core WCET (Worst-Case Execution Time) analyzer must be able to safely and accurately estimate the worst-case inter-thread cache interferences, which is not supported by current WCET analysis techniques that mainly focus on analyzing uniprocessors. This paper proposes a novel approach to analyzing the worst-case cache interferences and bounding the WCET for threads running on multi-core processors with shared direct-mapped L2 instruction caches. We propose to use an Extended ILP (Integer Linear Programming) to model all the possible inter-thread cache conflicts, based on which we can accurately calculate the worst-case inter-thread cache interferences and derive the WCET. Compared to a recently proposed multi-core static analysis technique based on control flow information alone, this approach improves the tightness of WCET estimation by 13.7% on average.

Journal ArticleDOI
TL;DR: This work proposes a new framework for conducting comprehensive studies and characterization on the reliability behavior of cache memories, based on the development of new lifetime models for data and tag arrays residing in both the data and instruction caches that facilitate the characterization of cache vulnerability of stored items at various lifetime phases.
Abstract: Soft errors induced by energetic particle strikes in on-chip cache memories have become an increasing challenge in designing new generation reliable microprocessors. Previous efforts have exploited information redundancy via parity/ECC codings or cacheline duplication for information integrity in on-chip cache memories. Due to various performance, area/size, and energy constraints in various target systems, many existing unoptimized protection schemes may eventually prove significantly inadequate and ineffective. In this paper, we propose a new framework for conducting comprehensive studies and characterization on the reliability behavior of cache memories, in order to provide insight into cache vulnerability to soft errors as well as design guidance to architects for highly efficient reliable on-chip cache memory design. Our work is based on the development of new lifetime models for data and tag arrays residing in both the data and instruction caches. Those models facilitate the characterization of cache vulnerability of stored items at various lifetime phases. We then exemplify this design methodology by proposing reliability schemes targeting at specific vulnerable phases. Benchmarking is carried out to showcase the effectiveness of our approach.

Journal ArticleDOI
01 Aug 2009
TL;DR: This paper proposes a hybrid system method called MCC-DB for accelerating executions of warehouse-style queries, which relies on the DBMS knowledge of data access patterns to minimize LLC conflicts in multi-core systems through an enhanced OS facility of cache partitioning.
Abstract: In a typical commercial multi-core processor, the last level cache (LLC) is shared by two or more cores. Existing studies have shown that the shared LLC is beneficial to concurrent query processes with commonly shared data sets. However, the shared LLC can also be a performance bottleneck to concurrent queries, each of which has private data structures, such as a hash table for the widely used hash join operator, causing serious cache conflicts. We show that cache conflicts on multi-core processors can significantly degrade overall database performance. In this paper, we propose a hybrid system method called MCC-DB for accelerating executions of warehouse-style queries, which relies on the DBMS knowledge of data access patterns to minimize LLC conflicts in multi-core systems through an enhanced OS facility of cache partitioning. MCC-DB consists of three components: (1) a cache-aware query optimizer carefully selects query plans in order to balance the numbers of cache-sensitive and cache-insensitive plans; (2) a query execution scheduler makes decisions to co-run queries with an objective of minimizing LLC conflicts; and (3) an enhanced OS kernel facility partitions the shared LLC according to each query's cache capacity need and locality strength. We have implemented MCC-DB by patching the three components in PostgreSQL and Linux kernel. Our intensive measurements on an Intel multi-core system with warehouse-style queries show that MCC-DB can reduce query execution times by up to 33%.

Patent
Daniel J. Colglazier
17 Jun 2009
TL;DR: In this paper, an improved sectored cache replacement algorithm is implemented via a method and computer program product, which selects a cache sector among a plurality of cache sectors for replacement in a computer system.
Abstract: An improved sectored cache replacement algorithm is implemented via a method and computer program product. The method and computer program product select a cache sector among a plurality of cache sectors for replacement in a computer system. The method may comprise selecting a cache sector to be replaced that is not the most recently used and that has the least amount of modified data. In the case in which there is a tie among cache sectors, the sector to be replaced may be the sector among such cache sectors with the least amount of valid data. In the case in which there is still a tie among cache sectors, the sector to be replaced may be randomly selected among such cache sectors. Unlike conventional sectored cache replacement algorithms, the improved algorithm implemented by the method and computer program product accounts for both hit rate and bus utilization.
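The selection rule described above reads almost directly as code; the sketch below follows that description and is not the patented implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Sector:
    modified_lines: int    # amount of modified (dirty) data in the sector
    valid_lines: int       # amount of valid data in the sector
    mru: bool = False      # is this the most recently used sector?

def choose_victim(sectors):
    # Never pick the most recently used sector (unless nothing else exists).
    candidates = [s for s in sectors if not s.mru] or list(sectors)
    # Prefer the least amount of modified data.
    least_dirty = min(s.modified_lines for s in candidates)
    candidates = [s for s in candidates if s.modified_lines == least_dirty]
    # Break ties on the least amount of valid data.
    least_valid = min(s.valid_lines for s in candidates)
    candidates = [s for s in candidates if s.valid_lines == least_valid]
    return random.choice(candidates)       # final tie-break is random

sectors = [Sector(3, 4, mru=True), Sector(0, 4), Sector(0, 2), Sector(2, 3)]
print(choose_victim(sectors))              # the clean sector with the least valid data
```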

Proceedings ArticleDOI
12 Sep 2009
TL;DR: Experimental results show that in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces the execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, NAS benchmarks and a computational kernel set.
Abstract: Performance degradation of memory-intensive programs caused by the LRU policy's inability to handle weak-locality data accesses in the last level cache is increasingly serious for two reasons. First, the last-level cache remains in the CPU's critical path, where only simple management mechanisms, such as LRU, can be used, precluding some sophisticated hardware mechanisms to address the problem. Second, the commonly used shared cache structure of multi-core processors has made this critical path even more performance-sensitive due to intensive inter-thread contention for shared cache resources. Researchers have recently made efforts to address the problem with the LRU policy by partitioning the cache using hardware or OS facilities guided by run-time locality information. Such approaches often rely on special hardware support or lack enough accuracy. In contrast, for a large class of programs, the locality information can be accurately predicted if access patterns are recognized through small training runs at the data object level. To achieve this goal, we present a system-software framework referred to as Soft-OLP (Software-based Object-Level cache Partitioning). We first collect per-object reuse distance histograms and inter-object interference histograms via memory-trace sampling. With several low-cost training runs, we are able to determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among data objects with a heuristic algorithm, in order to reduce cache misses through segregation of contending objects. The object-level cache partitioning framework has been implemented with a modified Linux kernel, and tested on a commodity multi-core processor. Experimental results show that in comparison with a standard L2 cache managed by LRU, Soft-OLP significantly reduces the execution time by reducing L2 cache misses across inputs for a set of single- and multi-threaded programs from the SPEC CPU2000 benchmark suite, NAS benchmarks and a computational kernel set.

Patent
30 Apr 2009
TL;DR: In this paper, a cache-policy switching module emulates the caching of the storage data with a plurality of cache configurations, and upon a determination that one of the plurality of caches performs better than the first cache configuration, automatically applies the better cache configuration to the cache memory for caching the data.
Abstract: A method implements a cache-policy switching module in a storage system. The storage system includes a cache memory to cache storage data. The cache memory uses a first cache configuration. The cache-policy switching module emulates the caching of the storage data with a plurality of cache configurations. Upon a determination that one of the plurality of cache configurations performs better than the first cache configuration, the cache-policy switching module automatically applies the better performing cache configuration to the cache memory for caching the storage data.
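One way to picture the emulation is with shadow caches that observe the same access stream under alternative configurations and report their hit rates; the sketch below is a conceptual illustration, not the patented method, and the emulated configurations differ only in size under plain LRU.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()
        self.hits, self.accesses = 0, 0

    def access(self, block):
        self.accesses += 1
        if block in self.data:
            self.hits += 1
            self.data.move_to_end(block)
        else:
            self.data[block] = True
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)

    def hit_rate(self):
        return self.hits / self.accesses if self.accesses else 0.0

class PolicySwitcher:
    def __init__(self, configs, active):
        self.emulated = {name: LRUCache(cap) for name, cap in configs.items()}
        self.active = active

    def observe(self, block):
        for cache in self.emulated.values():     # every configuration sees every access
            cache.access(block)

    def maybe_switch(self):
        best = max(self.emulated, key=lambda name: self.emulated[name].hit_rate())
        if best != self.active:
            self.active = best                   # apply the better-performing configuration
        return self.active

sw = PolicySwitcher({"small": 2, "large": 8}, active="small")
for block in [1, 2, 3, 4, 1, 2, 3, 4]:           # working set of four blocks
    sw.observe(block)
print(sw.maybe_switch())                         # "large" wins on this access stream
```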

Proceedings ArticleDOI
18 Aug 2009
TL;DR: The evaluation shows that the cache partitioning method can work efficiently and improve the performance of co-scheduled applications running within different VMs, and that the VMM-based implementation is fully transparent to the guest OSes.
Abstract: Virtualization is often used in systems for the purpose of offering isolation among applications running in separate virtual machines (VM). Current virtual machine monitors (VMMs) have done a decent job in resource isolation in memory, CPU and I/O devices. However, when looking further into the usage of the lower-level shared cache, we notice that one virtual machine's cache behavior may interfere with another's due to the uncontrolled cache sharing. In this situation, performance isolation cannot be guaranteed. This paper presents a cache partitioning approach which can be implemented in the VMM. We have implemented this mechanism in the Xen VMM using the page coloring technique traditionally applied to the OS. Our VMM-based implementation is fully transparent to the guest OSes. It thus shows the advantages of simplicity and flexibility. Our evaluation shows that our cache partitioning method can work efficiently and improve the performance of co-scheduled applications running within different VMs. In the concurrent workloads selected from the SPEC CPU 2006 benchmarks, our technique achieves a performance improvement of up to 19% for the most sensitive workloads.
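The page-coloring arithmetic that such partitioning relies on can be sketched as follows; the cache geometry and the per-VM color split are examples, not Xen internals.

```python
PAGE_SIZE  = 4096                  # bytes
CACHE_SIZE = 4 * 1024 * 1024       # example: 4 MB shared last-level cache
LINE_SIZE  = 64
WAYS       = 16

num_sets      = CACHE_SIZE // (LINE_SIZE * WAYS)   # 4096 sets
sets_per_page = PAGE_SIZE // LINE_SIZE             # one page spans 64 consecutive sets
num_colors    = num_sets // sets_per_page          # 64 colors

def color_of(phys_page_number):
    """Color = the group of cache sets selected by the low page-number bits."""
    return phys_page_number % num_colors

# Give VM0 the even colors and VM1 the odd ones: each VM is confined to half
# of the cache sets, so one VM cannot evict the other's lines.
vm_colors = {"vm0": set(range(0, num_colors, 2)),
             "vm1": set(range(1, num_colors, 2))}

def allocate_page(vm, free_pages):
    for page in free_pages:
        if color_of(page) in vm_colors[vm]:
            free_pages.remove(page)
            return page
    raise MemoryError("no free page of a permitted color")

free = list(range(1000, 1064))
print(num_colors, color_of(1000), allocate_page("vm1", free))
```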

Proceedings ArticleDOI
14 Nov 2009
TL;DR: This work proposes affordable and lightweight hardware support to coordinate with OS-based cache management policies; the proposed methods are scalable to many-cores, perform comparably with other proposed hardware solutions, but have much lower overheads, and can therefore be easily adopted in commodity processors.
Abstract: The management of shared caches in multicore processors is a critical and challenging task. Many hardware and OS-based methods have been proposed. However, they may be hardly adopted in practice due to their non-trivial overheads, high complexities, and/or limited abilities to handle increasingly complicated scenarios of cache contention caused by many-cores. In order to turn cache partitioning methods into reality in the management of multicore processors, we propose to provide affordable and lightweight hardware support to coordinate with OS-based cache management policies. The proposed methods are scalable to many-cores, and perform comparably with other proposed hardware solutions, but have much lower overheads, and can therefore be easily adopted in commodity processors. Having conducted extensive experiments with 37 multi-programming workloads, we show the effectiveness and scalability of the proposed methods. For example, on 8-core systems, one of our proposed policies improves performance over LRU-based hardware cache management by 14.5% on average.

Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper proposes the Mergeable cache architecture that detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements, which leads to reductions in off-chip memory accesses and overall power usage, and increases in application performance.
Abstract: While microprocessor designers turn to multicore architectures to sustain performance expectations, the dramatic increase in parallelism of such architectures will put substantial demands on off-chip bandwidth and make the memory wall more significant than ever. This paper demonstrates that one profitable application of multicore processors is the execution of many similar instantiations of the same program. We identify that this model of execution is used in several practical scenarios and term it as "multi-execution." Often, each such instance utilizes very similar data. In conventional cache hierarchies, each instance would cache its own data independently. We propose the Mergeable cache architecture that detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements. This leads to reductions in off-chip memory accesses and overall power usage, and increases in application performance. We present cycle-accurate simulation results of 8 benchmarks (6 from SPEC2000) to demonstrate that our technique provides a scalable solution and leads to significant speedups due to reductions in main memory accesses. For 8 cores running 8 similar executions of the same application and sharing an exclusive 4-MB, 8-way L2 cache, the Mergeable cache shows a speedup in execution by 2.5x on average (ranging from 0.93x to 6.92x), while posing an overhead of only 4.28% on cache area and 5.21% on power when it is used.

Patent
13 May 2009
TL;DR: The disclosure relates to data storage systems having multiple caches and to the management of cache activity in such systems.
Abstract: The disclosure is related to data storage systems having multiple cache and to management of cache activity in data storage systems having multiple cache. In a particular embodiment, a data storage device includes a volatile memory having a first read cache and a first write cache, a non-volatile memory having a second read cache and a second write cache and a controller coupled to the volatile memory and the non-volatile memory. The memory can be configured to selectively transfer read data from the first read cache to the second read cache based on a least recently used indicator of the read data and selectively transfer write data from the first write cache to the second write cache based on a least recently written indicator of the write data.

Book ChapterDOI
26 May 2009
TL;DR: Experimental results have revealed that the proposed approach has better performance compared to the most common caching policies and has improved the performance of client-side caching substantially.
Abstract: Web caching is a well-known strategy for improving the performance of Web-based systems by keeping web objects that are likely to be used in the near future close to the client. Most current Web browsers still employ traditional caching policies that are not efficient in web caching. This research proposes splitting the client-side web cache into two caches, a short-term cache and a long-term cache. Initially, a web object is stored in the short-term cache, and the web objects that are visited more than a pre-specified threshold value are moved to the long-term cache, while other objects are removed by the Least Recently Used (LRU) algorithm when the short-term cache is full. More significantly, when the long-term cache saturates, the trained neuro-fuzzy system is employed to classify each object stored in the long-term cache as cacheable or uncacheable. The old uncacheable objects are candidates for removal from the long-term cache. By implementing this mechanism, cache pollution can be mitigated and the cache space can be utilized effectively. Experimental results have revealed that the proposed approach has better performance compared to the most common caching policies and has improved the performance of client-side caching substantially.
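The two-level structure described above can be sketched as follows; the threshold, the capacities, and the stand-in classifier (replacing the trained neuro-fuzzy system) are placeholders, not the paper's trained model.

```python
from collections import OrderedDict

THRESHOLD = 3        # visits needed before promotion to the long-term cache

class TwoLevelWebCache:
    def __init__(self, short_capacity, long_capacity, classifier):
        self.short = OrderedDict()          # url -> visit count, LRU ordered
        self.long = {}                      # url -> visit count
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity
        self.classifier = classifier        # returns True if an object is still cacheable

    def access(self, url):
        hit = url in self.short or url in self.long
        if url in self.long:
            self.long[url] += 1
            return hit
        count = self.short.pop(url, 0) + 1
        if count > THRESHOLD:               # promote frequently revisited objects
            self._insert_long(url, count)
        else:
            self.short[url] = count
            if len(self.short) > self.short_capacity:
                self.short.popitem(last=False)     # plain LRU in the short-term cache
        return hit

    def _insert_long(self, url, count):
        if len(self.long) >= self.long_capacity:
            # Evict an object the classifier deems no longer cacheable, if any.
            for victim in list(self.long):
                if not self.classifier(victim):
                    del self.long[victim]
                    break
            else:
                self.long.pop(next(iter(self.long)))   # fallback eviction
        self.long[url] = count

cache = TwoLevelWebCache(2, 2, classifier=lambda url: url.endswith(".css"))
for _ in range(5):
    cache.access("http://example.com/site.css")
print("http://example.com/site.css" in cache.long)     # True: promoted after 4 visits
```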

Proceedings ArticleDOI
17 Mar 2009
TL;DR: This work proposes to adapt the cache organization to simplify worst-case execution time (WCET) analysis, in which unknown abstract cache states during the analysis result in conservative WCET bounds.
Abstract: Caches are a mandatory feature of current processors to deliver instructions and data to a fast processor pipeline. However, standard cache organizations are designed to increase the average case performance. They are hard to model for worst-case execution time (WCET) analysis. Unknown abstract cache states during the analysis result in conservative WCET bounds. Therefore, we propose to adapt the cache organization to simplify the analysis. The data cache is split into several independent caches for the stack, static data, constants, and heap allocated data.