Showing papers on "Cache pollution" published in 2009


Proceedings ArticleDOI
20 Jun 2009
TL;DR: Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache, is proposed.
Abstract: Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.
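As a rough illustration of the "reacts to the class" idea (not the paper's implementation), the sketch below dispatches on the access classes named in the R-NUCA description -- instructions, private data, and shared data -- and returns a home L2 slice. The tile count, cluster size, and the simple modulo indexing are placeholders for R-NUCA's actual rotational interleaving.

```c
#include <stdio.h>

/* Hypothetical tiled CMP: TILES L2 slices, instructions replicated per
 * cluster of CLUSTER_SIZE tiles; both constants are illustrative. */
#define TILES        16u
#define CLUSTER_SIZE  4u

enum access_class { CLASS_INSTRUCTION, CLASS_PRIVATE_DATA, CLASS_SHARED_DATA };

/* Pick the L2 slice that should hold a block, given the class the OS
 * assigned to its page and the tile issuing the request. */
static unsigned home_slice(enum access_class cls, unsigned requesting_tile,
                           unsigned long block_addr)
{
    switch (cls) {
    case CLASS_PRIVATE_DATA:   /* keep private data at the local slice */
        return requesting_tile;
    case CLASS_INSTRUCTION:    /* one copy per cluster of neighboring tiles */
        return (requesting_tile / CLUSTER_SIZE) * CLUSTER_SIZE
               + (unsigned)(block_addr % CLUSTER_SIZE);
    case CLASS_SHARED_DATA:    /* spread shared data across all slices */
    default:
        return (unsigned)(block_addr % TILES);
    }
}

int main(void)
{
    printf("private data from tile 5 -> slice %u\n",
           home_slice(CLASS_PRIVATE_DATA, 5, 0x1040));
    printf("instruction  from tile 5 -> slice %u\n",
           home_slice(CLASS_INSTRUCTION, 5, 0x1040));
    printf("shared data  from tile 5 -> slice %u\n",
           home_slice(CLASS_SHARED_DATA, 5, 0x1040));
    return 0;
}
```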

436 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper discusses and evaluates two types of hybrid cache architectures: inter cache Level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra cache level or cache Region based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology.
Abstract: Caching techniques have been an efficient mechanism for mitigating the effects of the processor-memory speed gap. Traditional multi-level SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D chips and 3D stacked chips. Caches fabricated in these technologies offer dramatically different power and performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this paper, we propose to take advantage of the best characteristics that each technology offers, through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter cache Level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra cache level or cache Region based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 7% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 25 workloads. A more aggressive RHCA-based design provides 12% IPC improvement over the baseline. Finally, a 2-layer 3D cache stack (3DHCA) of high density memory technology within the same chip footprint gives 18% IPC improvement over the baseline. Furthermore, up to 70% reduction in power consumption over a baseline SRAM-only design is achieved.

375 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This work proposes a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism.
Abstract: Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.
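To make "insertion and promotion policies" concrete, here is a minimal, single-set sketch of the general policy family: a recency stack with a configurable insertion position and promotion step. The paper's actual per-core selection of these parameters is not modeled, and the way count and trace below are arbitrary.

```c
#include <stdio.h>

#define WAYS 8

/* One cache set, tracked as a recency stack: position 0 is MRU, WAYS-1 is LRU.
 * A generic insertion/promotion policy is characterized by two knobs:
 *   insert_pos - recency position given to a newly inserted (missing) block
 *   promo_step - how many positions a block moves toward MRU on a hit
 * Classic LRU corresponds to insert_pos = 0 and promo_step = WAYS. */
typedef struct { int tag[WAYS]; } cache_set_t;

static void touch(cache_set_t *s, int tag, int insert_pos, int promo_step)
{
    int pos = -1;
    for (int i = 0; i < WAYS; i++)
        if (s->tag[i] == tag) { pos = i; break; }

    if (pos < 0) {                       /* miss: victim is the LRU slot */
        pos = WAYS - 1;
        if (insert_pos < pos) {          /* slide others down, insert here */
            for (int i = pos; i > insert_pos; i--) s->tag[i] = s->tag[i - 1];
            pos = insert_pos;
        }
        s->tag[pos] = tag;
    } else {                             /* hit: promote toward MRU */
        int dst = pos - promo_step; if (dst < 0) dst = 0;
        for (int i = pos; i > dst; i--) s->tag[i] = s->tag[i - 1];
        s->tag[dst] = tag;
    }
}

int main(void)
{
    cache_set_t s;
    for (int i = 0; i < WAYS; i++) s.tag[i] = -1;    /* -1 = empty slot */
    int trace[] = { 1, 2, 2, 3, 2, 4, 5 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        touch(&s, trace[i], /*insert_pos=*/WAYS - 2, /*promo_step=*/1);
    for (int i = 0; i < WAYS; i++) printf("%d ", s.tag[i]);
    printf("\n");
    return 0;
}
```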

334 citations


Proceedings ArticleDOI
01 Apr 2009
TL;DR: This paper proposes a hot-page coloring approach enforcing coloring on only a small set of frequently accessed (or hot) pages for each process, and demonstrates that hot page identification and selective coloring can significantly alleviate the coloring-induced adverse effects in practice.
Abstract: Modern multi-core processors present new resource management challenges due to the subtle interactions of simultaneously executing processes sharing on-chip resources (particularly the L2 cache). Recent research demonstrates that the operating system may use the page coloring mechanism to control cache partitioning, and consequently to achieve fair and efficient cache utilization. However, page coloring places additional constraints on memory space allocation, which may conflict with application memory needs. Further, adaptive adjustments of cache partitioning policies in a multi-programmed execution environment may incur substantial overhead for page recoloring (or copying). This paper proposes a hot-page coloring approach enforcing coloring on only a small set of frequently accessed (or hot) pages for each process. The cost of identifying hot pages online is reduced by leveraging the knowledge of spatial locality during a page table scan of access bits. Our results demonstrate that hot page identification and selective coloring can significantly alleviate the coloring-induced adverse effects in practice. However, we also reach the somewhat negative conclusion that without additional hardware support, adaptive page coloring is only beneficial when recoloring is performed infrequently (meaning long scheduling time quanta in multi-programmed executions).
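For context, the page-coloring mechanism the paper builds on derives a page's "color" from the physical-address bits that index cache sets above the page offset; pages of the same color compete for the same cache sets, so the OS can partition the cache by steering allocations to disjoint colors. A minimal sketch follows, with an illustrative cache geometry that is not taken from the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative geometry: 4 MiB, 16-way, 4 KiB pages. */
#define CACHE_BYTES   (4u << 20)
#define WAYS          16u
#define PAGE_BYTES    4096u

/* Number of page colors = per-way cache footprint / page size (64 here). */
#define COLORS ((CACHE_BYTES / WAYS) / PAGE_BYTES)

static unsigned page_color(uint64_t phys_addr)
{
    /* The color is just the low set-index bits of the physical page number. */
    return (unsigned)((phys_addr / PAGE_BYTES) % COLORS);
}

int main(void)
{
    printf("colors available: %u\n", (unsigned)COLORS);
    printf("color of physical address 0x1234000: %u\n",
           page_color(0x1234000ull));
    return 0;
}
```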

311 citations


Proceedings ArticleDOI
12 Sep 2009
TL;DR: This paper presents fundamental details of the newly introduced Intel Nehalem microarchitecture with its integrated memory controller, Quick Path Interconnect, and ccNUMA architecture, based on sophisticated benchmarks to measure the latency and bandwidth between different locations in the memory subsystem.
Abstract: Today's microprocessors have complex memory subsystems with several cache levels. The efficient use of this memory hierarchy is crucial to gain optimal performance, especially on multicore processors. Unfortunately, many implementation details of these processors are not publicly available. In this paper we present such fundamental details of the newly introduced Intel Nehalem microarchitecture with its integrated memory controller, Quick Path Interconnect, and ccNUMA architecture. Our analysis is based on sophisticated benchmarks to measure the latency and bandwidth between different locations in the memory subsystem. Special care is taken to control the coherency state of the data to gain insight into performance relevant implementation details of the cache coherency protocol. Based on these benchmarks we present undocumented performance data and architectural properties.
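The paper's benchmark suite is not reproduced here; the sketch below shows the generic dependent-load (pointer-chasing) technique that such latency measurements are commonly built on, with an arbitrary 32 MiB working set and hop count.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase a random cyclic permutation so every load depends on the previous
 * one; the average time per hop approximates the latency of whichever
 * level of the memory hierarchy the buffer spills into. */
int main(void)
{
    const size_t n = (32u << 20) / sizeof(size_t);    /* 32 MiB working set */
    size_t *next  = malloc(n * sizeof *next);
    size_t *order = malloc(n * sizeof *order);
    if (!next || !order) return 1;

    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];                    /* close the cycle */

    const size_t hops = 1u << 24;
    size_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++) p = next[p];    /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg %.1f ns per load (checksum %zu)\n", ns / (double)hops, p);
    free(next); free(order);
    return 0;
}
```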

243 citations


Patent
05 Jan 2009
TL;DR: In this paper, the cache memory is configured to store at less capacity per memory cell and finer granularity of write units compared to the main memory, and the logical addresses of data are partitioned into zones to limit the size of the indices for the cache.
Abstract: A portion of a nonvolatile memory is partitioned from a main multi-level memory array to operate as a cache. The cache memory is configured to store at less capacity per memory cell and finer granularity of write units compared to the main memory. In a block-oriented memory architecture, the cache has multiple functions, not merely to improve access speed, but is an integral part of a sequential update block system. The cache memory has a capacity dynamically increased by allocation of blocks from the main memory in response to a demand to increase the capacity. Preferably, a block with an endurance count higher than average is allocated. The logical addresses of data are partitioned into zones to limit the size of the indices for the cache.

164 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: A novel technique, Memory Mapped ECC, is presented, which reduces the cost of providing error correction for SRAM caches and only dedicates SRAM for error detection while the ECC bits are stored within the memory hierarchy as data.
Abstract: This paper presents a novel technique, Memory Mapped ECC, which reduces the cost of providing error correction for SRAM caches. It is important to limit such overheads as processor resources become constrained and error propensity increases. The continuing decrease in SRAM cell size and the growing capacity of caches increases the likelihood of errors in SRAM arrays. To address this, redundant information can be used to correct a value after an error occurs. Information redundancy is typically provided through error-correcting codes (ECC), which append bits to every SRAM row and increase the array's area and energy consumption. We make three observations regarding error protection and utilize them in our architecture: (1) much of the data in a cache is replicated throughout the hierarchy and is inherently redundant; (2) error-detection is necessary for every cache access and is cheaper than error correction, which is very infrequent; (3) redundant information for correction need not be stored in high-cost SRAM. Our unique architecture only dedicates SRAM for error detection while the ECC bits are stored within the memory hierarchy as data. We associate a physical memory address with each cache line for ECC storage and rely on locality to minimize the impact. The cache is dynamically and transparently partitioned between data and ECC with the fraction of ECC growing with the number of dirty cache lines. We show that this has little impact on both performance (1.3% average and

138 citations


Proceedings ArticleDOI
12 Oct 2009
TL;DR: A scheduling strategy for real-time tasks with both timing and cache space constraints is presented, which allows each task to use a fixed number of cache partitions, and makes sure that at any time a cache partition is occupied by at most one running task.
Abstract: The major obstacle to using multicores for real-time applications is that we cannot predict or provide any guarantee on the real-time properties of embedded software on such platforms; the way of handling on-chip shared resources such as the L2 cache may have a significant impact on timing predictability. In this paper, we propose to use cache space isolation techniques to avoid cache contention for hard real-time tasks running on multicores with shared caches. We present a scheduling strategy for real-time tasks with both timing and cache space constraints, which allows each task to use a fixed number of cache partitions, and makes sure that at any time a cache partition is occupied by at most one running task. In this way, the cache spaces of tasks are isolated at run-time. As technical contributions, we have developed a sufficient schedulability test for non-preemptive fixed-priority scheduling for multicores with a shared L2 cache, encoded as a linear programming problem. To improve the scalability of the test, we then present our second schedulability test of quadratic complexity, which is an over-approximation of the first test. To evaluate the performance and scalability of our techniques, we use randomly generated task sets. Our experiments show that the first test, which employs an LP solver, can easily handle task sets with thousands of tasks in minutes using a desktop computer. It is also shown that the second test is comparable with the first one in terms of precision, but scales much better due to its low complexity, and is therefore a good candidate for efficient schedulability tests in the design loop for embedded systems or as an on-line test for admission control.

131 citations


Proceedings ArticleDOI
12 Dec 2009
TL;DR: A set of sophisticated benchmarks for latency and bandwidth measurements to arbitrary locations in the memory subsystem is presented, and the coherency state of cache lines is considered to analyze the cache coherency protocols and their performance impact.
Abstract: Across a broad range of applications, multicore technology is the most important factor that drives today's microprocessor performance improvements. Closely coupled is a growing complexity of the memory subsystems with several cache levels that need to be exploited efficiently to gain optimal application performance. Many important implementation details of these memory subsystems are undocumented. We therefore present a set of sophisticated benchmarks for latency and bandwidth measurements to arbitrary locations in the memory subsystem. We consider the coherency state of cache lines to analyze the cache coherency protocols and their performance impact. The potential of our approach is demonstrated with an in-depth comparison of ccNUMA multiprocessor systems with AMD (Shanghai) and Intel (Nehalem-EP) quad-core x86-64 processors that both feature integrated memory controllers and coherent point-to-point interconnects. Using our benchmarks we present fundamental memory performance data and architectural properties of both processors. Our comparison reveals in detail how the microarchitectural differences tremendously affect the performance of the memory subsystem.

130 citations


Proceedings ArticleDOI
01 Dec 2009
TL;DR: This paper develops a timing analysis method for concurrent software running on multi-cores with a shared instruction cache that progressively improves the lifetime estimates of tasks that execute concurrently on multiple cores, in order to estimate potential conflicts in the shared cache.
Abstract: Memory accesses form an important source of timing unpredictability. Timing analysis of real-time embedded software thus requires bounding the time for memory accesses. Multiprocessing, a popular approach for performance enhancement, opens up the opportunity for concurrent execution. However due to contention for any shared memory by different processing cores, memory access behavior becomes more unpredictable, and hence harder to analyze. In this paper, we develop a timing analysis method for concurrent software running on multi-cores with a shared instruction cache. Communication across tasks is by message passing where the message mailboxes are accessed via interrupt service routines. We do not handle data cache, shared memory synchronization and code sharing across tasks. Our method progressively improves the lifetime estimates of tasks that execute concurrently on multiple cores, in order to estimate potential conflicts in the shared cache. Possible conflicts arising from overlapping task lifetimes are accounted for in the hit-miss classification of accesses to the shared cache, to provide safe execution time bounds. We show that our method produces lower worst-case response time (WCRT) estimates than existing shared-cache analysis on a real-world embedded application.

130 citations


Patent
05 Jan 2009
TL;DR: A portion of nonvolatile memory is partitioned from a main multi-level memory array to operate as a cache as mentioned in this paper, where the cache memory is configured to store at less capacity per memory cell and finer granularity of write units compared to the main memory.
Abstract: A portion of a nonvolatile memory is partitioned from a main multi-level memory array to operate as a cache. The cache memory is configured to store at less capacity per memory cell and finer granularity of write units compared to the main memory. In a block-oriented memory architecture, the cache has multiple functions, not merely to improve access speed, but is an integral part of a sequential update block system. Decisions to write data to the cache memory or directly to the main memory depend on the attributes and characteristics of the data to be written, the state of the blocks in the main memory portion and the state of the blocks in the cache portion.

Proceedings ArticleDOI
06 Mar 2009
TL;DR: This paper proposes three hardware-software approaches to defend against software cache-based attacks, which present different tradeoffs between hardware complexity and performance overhead, and proposes a novel software permutation to replace the random permutation hardware in the RPcache.
Abstract: Software cache-based side channel attacks present serious threats to modern computer systems. Using caches as a side channel, these attacks are able to derive secret keys used in cryptographic operations through legitimate activities. Among existing countermeasures, software solutions are typically application specific and incur substantial performance overhead. Recent hardware proposals including the Partition-Locked cache (PLcache) and Random-Permutation cache (RPcache) [23], although very effective in reducing performance overhead while enhancing the security level, may still be vulnerable to advanced cache attacks. In this paper, we propose three hardware-software approaches to defend against software cache-based attacks - they present different tradeoffs between hardware complexity and performance overhead. First, we propose to use preloading to secure the PLcache. Second, we leverage informing loads, which is a lightweight architectural support originally proposed to improve memory performance, to protect the RPcache. Third, we propose novel software permutation to replace the random permutation hardware in the RPcache. This way, regular caches can be protected with hardware support for informing loads. In our experiments, we analyze various processor models for their vulnerability to cache attacks and demonstrate that even to the processor model that is most vulnerable to cache attacks, our proposed software-hardware integrated schemes provide strong security protection.

Proceedings ArticleDOI
Moinuddin K. Qureshi
06 Mar 2009
TL;DR: A simple extension of DSR is proposed that provides Quality of Service (QoS) by guaranteeing that the worst-case performance of each application remains similar to that with no spilling, while still providing an average throughput improvement of 17.5%.
Abstract: In a Chip Multi-Processor (CMP) with private caches, the last level cache is statically partitioned between all the cores. This prevents such CMPs from sharing cache capacity in response to the requirements of individual cores. Capacity sharing can be provided in private caches by spilling a line evicted from one cache to another cache. However, naively allowing all caches to spill evicted lines to other caches has limited performance benefit, as such spilling does not take into account which cores benefit from extra capacity and which cores can provide extra capacity. This paper proposes Dynamic Spill-Receive (DSR) for efficient capacity sharing. In a DSR architecture, each cache uses Set Dueling to learn whether it should act as a “spiller cache” or “receiver cache” for best overall performance. We evaluate DSR for a quad-core system with 1MB private caches using 495 multi-programmed workloads. DSR improves average throughput by 18% (weighted-speedup by 13% and harmonic-mean fairness metric by 36%) compared to no spilling. DSR requires a total storage overhead of less than two bytes per core, does not require any changes to the existing cache structure, and is scalable to a large number of cores (16 in our evaluation). Furthermore, we propose a simple extension of DSR that provides Quality of Service (QoS) by guaranteeing that the worst-case performance of each application remains similar to that with no spilling, while still providing an average throughput improvement of 17.5%.
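The Set Dueling mechanism the paper relies on can be pictured as follows: a handful of sets in each private cache are hard-wired to always spill, another handful to always receive, and a saturating counter trained by their misses decides the policy of the remaining "follower" sets. The set counts, counter width, and set-selection hash below are illustrative rather than the paper's.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_SETS        1024u
#define DEDICATED_SETS    32u        /* per policy, illustrative */
#define PSEL_MAX        1023         /* 10-bit saturating counter */

static int psel = PSEL_MAX / 2;      /* one counter per private cache */

typedef enum { ALWAYS_SPILL, ALWAYS_RECEIVE, FOLLOWER } set_kind_t;

/* A simple hash dedicates a few sets to each policy; the rest follow. */
static set_kind_t classify_set(unsigned set)
{
    if (set % (NUM_SETS / DEDICATED_SETS) == 0) return ALWAYS_SPILL;
    if (set % (NUM_SETS / DEDICATED_SETS) == 1) return ALWAYS_RECEIVE;
    return FOLLOWER;
}

/* Dedicated sets train PSEL on their misses; followers adopt whichever
 * policy is currently winning (suffering fewer misses). */
static bool should_spill(unsigned set, bool is_miss)
{
    set_kind_t k = classify_set(set);
    if (is_miss) {
        if (k == ALWAYS_SPILL   && psel < PSEL_MAX) psel++;  /* spill sets miss: lean to receive */
        if (k == ALWAYS_RECEIVE && psel > 0)        psel--;  /* receive sets miss: lean to spill */
    }
    if (k == ALWAYS_SPILL)   return true;
    if (k == ALWAYS_RECEIVE) return false;
    return psel < PSEL_MAX / 2;
}

int main(void)
{
    /* Toy training: pretend a receive-dedicated set misses repeatedly. */
    for (unsigned i = 0; i < 200; i++) should_spill(1, true);
    printf("PSEL=%d -> follower sets %s\n", psel,
           should_spill(5, false) ? "spill" : "receive");
    return 0;
}
```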

Patent
11 May 2009
TL;DR: In this paper, a poll-based notification system is used in distributed caches for tracking changes to cache items, where the server can maintain the changes in an efficient fashion (in blocks) and return the changes to clients that perform the appropriate filtering.
Abstract: Systems and methods that supply a poll-based notification system in a distributed cache for tracking changes to cache items. Local caches on the client can employ the notification system to keep the local objects in sync with the backend cache service, and can further dynamically adjust the “scope” of notifications required based on the number and distribution of keys in the local cache. The server can maintain the changes in an efficient fashion (in blocks) and return the changes to clients, which perform the appropriate filtering. Notifications can be associated with a session and/or an application.

Proceedings ArticleDOI
08 Jun 2009
TL;DR: On applications manipulating large amounts of null data blocks, such a ZC cache significantly reduces the miss rate and memory traffic, and therefore increases performance for a small hardware overhead.
Abstract: It has been observed that some applications manipulate large amounts of null data. Moreover, these zero data often exhibit high spatial locality. On some applications, more than 20% of the data accesses concern null data blocks. Representing a null block in a cache on a standard cache line appears to be a waste of resources. In this paper, we propose the Zero-Content Augmented cache, the ZCA cache. A ZCA cache consists of a conventional cache augmented with a specialized cache for memorizing null blocks, the Zero-Content cache or ZC cache. In the ZC cache, the data block is represented by its address tag and a validity bit. Moreover, as null blocks generally exhibit high spatial locality, several null blocks can be associated with a single address tag in the ZC cache. For instance, a ZC cache mapping 32MB of zero 64-byte lines uses less than 80KB of storage. Decompression of a null block is very simple, therefore the read access time on the ZCA cache is in the same range as that of a conventional cache. On applications manipulating large amounts of null data blocks, such a ZC cache significantly reduces the miss rate and memory traffic, and therefore increases performance for a small hardware overhead. In particular, the write-back traffic on null blocks is limited. For applications with a low null block rate, no performance loss is observed.
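A rough sketch of the ZC cache lookup described above: one entry carries a single region tag plus one validity bit per 64-byte line known to be null, so many null blocks share a tag. The entry count, region size, and direct-mapped organization here are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES    64u
#define LINES_PER_ENT 64u                 /* one 64-bit null bitmap per tag */
#define ZC_ENTRIES    256u                /* illustrative, direct-mapped */

/* One Zero-Content entry covers a 4 KiB region: a single address tag plus
 * one validity bit per 64-byte line that is known to be all zeroes. */
typedef struct {
    uint64_t region_tag;                  /* addr / (LINE_BYTES*LINES_PER_ENT) */
    uint64_t null_bits;
    bool     valid;
} zc_entry_t;

static zc_entry_t zc[ZC_ENTRIES];

static bool zc_lookup(uint64_t addr)      /* true: the line is a known null block */
{
    uint64_t region = addr / (LINE_BYTES * LINES_PER_ENT);
    unsigned line   = (unsigned)((addr / LINE_BYTES) % LINES_PER_ENT);
    zc_entry_t *e   = &zc[region % ZC_ENTRIES];
    return e->valid && e->region_tag == region && ((e->null_bits >> line) & 1);
}

static void zc_insert_null(uint64_t addr) /* a null block was filled/written back */
{
    uint64_t region = addr / (LINE_BYTES * LINES_PER_ENT);
    unsigned line   = (unsigned)((addr / LINE_BYTES) % LINES_PER_ENT);
    zc_entry_t *e   = &zc[region % ZC_ENTRIES];
    if (!e->valid || e->region_tag != region) {  /* allocate/replace the entry */
        e->valid = true; e->region_tag = region; e->null_bits = 0;
    }
    e->null_bits |= 1ull << line;
}

int main(void)
{
    zc_insert_null(0x10040);              /* second line of region 0x10000 */
    printf("0x10040 null? %d\n", zc_lookup(0x10040));
    printf("0x10080 null? %d\n", zc_lookup(0x10080));
    return 0;
}
```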

Proceedings ArticleDOI
06 Mar 2009
TL;DR: This paper proposes a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page to reduce the book-keeping overhead, and decides when and where to migrate an entire page of data to amortize the performance overhead.
Abstract: As the last-level on-chip caches in chip-multiprocessors increase in size, the physical locality of on-chip data becomes important for delivering high performance. The non-uniform access latency seen by a core to different independent banks of a large cache spread over the chip necessitates active mechanisms for improving data locality. The central proposal of this paper is a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page to reduce the book-keeping overhead and decides when and where to migrate an entire page of data to amortize the performance overhead. The page-grain migration mechanism is compared against two variants of previously proposed cache block-grain dynamic migration mechanisms and two OS-assisted static locality management mechanisms. Our detailed execution-driven simulation of an eight-core chip-multiprocessor with a shared 16 MB L2 cache employing a bidirectional ring to connect the cores and the L2 cache banks shows that hardwired dynamic page migration, while using only 4.8% of extra storage out of the total L2 cache and book-keeping budget, delivers the best performance and energy-efficiency across a set of shared memory parallel applications selected from the SPLASH-2, SPEC OMP, DARPA DIS, and FFTW suites and multiprogrammed workloads prepared out of the SPEC 2000 and BioBench suites. It reduces execution time by 18.7% and 12.6% on average (geometric mean) respectively for the shared memory applications and the multiprogrammed workloads compared to a baseline architecture that distributes the pages round-robin across the L2 cache banks.

Patent
05 Jan 2009
TL;DR: A portion of nonvolatile memory is partitioned from a main multi-level memory array to operate as a cache as mentioned in this paper, where the cache memory is configured to store at less capacity per memory cell and finer granularity of write units compared to the main memory.
Abstract: A portion of a nonvolatile memory is partitioned from a main multi-level memory array to operate as a cache. The cache memory is configured to store at less capacity per memory cell and finer granularity of write units compared to the main memory. In a block-oriented memory architecture, the cache has multiple functions, not merely to improve access speed, but is an integral part of a sequential update block system. Decisions to archive data from the cache memory to the main memory depend on the attributes of the data to be archived, the state of the blocks in the main memory portion and the state of the blocks in the cache portion.

Journal ArticleDOI
TL;DR: This paper shows how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system.
Abstract: Multicore processors contain new hardware characteristics that are different from previous generation single-core systems or traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We demonstrate three case studies on how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention in the shared cache among applications, (2) detecting memory regions with bad cache usage, which helps in isolating these regions to reduce cache pollution, and (3) detecting sharing among threads, which helps in clustering threads to improve locality. Using realistic applications from standard benchmark suites, the following performance improvements were achieved: (1) up to 27% improvement in IPC (instructions-per-cycle) due to cache partition sizing; (2) up to 10% reduction in cache miss rates due to reduced cache pollution, resulting in up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses due to thread clustering, resulting in up to 7% application-level improvement.

Journal ArticleDOI
TL;DR: To address the latency scaling problem and the increased demand of the larger 80-way SMP size, the z10 processor cache subsystem introduces new innovative concepts and solutions.
Abstract: With the introduction of the high-frequency IBM System z10™ processor design, a new, robust cache hierarchy was needed to enable up to 80 of these processors aggregated into a tightly coupled symmetric multiprocessor (SMP) system to reach their performance potential. Typically, each time the processor frequency increases by a significant factor, as did the z10™ processor over the predecessor IBM System z9® processor, the access time of data, as measured by the number of processor cycles beyond the level 1 cache on an identical processor cache subsystem, would increase proportionally as well because the flight time on the chip interconnects across multiple hardware packaging levels has stayed relatively constant in nanoseconds. To address the latency scaling problem and the increased demand of the larger 80-way SMP size, the z10 processor cache subsystem introduces new innovative concepts and solutions.

Proceedings ArticleDOI
06 Mar 2009
TL;DR: A 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data is postulated and it is shown that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network.
Abstract: Cache hierarchies in future many-core processors are expected to grow in size and contribute a large fraction of overall processor power and performance. In this paper, we postulate a 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data. We then propose a heterogeneous reconfigurable cache design that takes advantage of the high density of DRAM and the superior power/delay characteristics of SRAM to efficiently meet the working set demands of each individual core. Finally, we analyze the communication patterns for such a processor and show that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network. The above proposals are synergistic: each proposal is made more compelling because of its combination with the other innovations described in this paper. The proposed reconfigurable cache model improves performance by up to 19% along with 48% savings in network power.

Journal ArticleDOI
Matteo Frigo, Volker Strumpen
TL;DR: It is shown that a multithreaded cache oblivious matrix multiplication incurs a bounded number of cache misses when executed by the Cilk work-stealing scheduler on a machine with P processors, each with a cache of size Z, with high probability.
Abstract: We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs a bounded number of cache misses when executed by the Cilk scheduler on a machine with P processors, each with a cache of size Z, with high probability. This bound is tighter than previously published bounds. We also present a new multithreaded cache oblivious algorithm for 1D stencil computations incurring a bounded number of cache misses with high probability, one for Gaussian elimination and back substitution, and one for the length computation part of the longest common subsequence problem incurring a bounded number of cache misses with high probability.
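For readers unfamiliar with the algorithm being analyzed, below is a sequential sketch of the classic cache-oblivious matrix multiplication: a recursive divide-and-conquer that takes no cache-size parameter. The paper's multithreaded version spawns these recursive calls under the Cilk scheduler, which this sketch does not model; the matrix size and base-case cutoff are arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>

/* Recursive multiply-accumulate C += A*B on n x n row-major matrices with
 * leading dimension ld (n a power of two for simplicity).  The recursion
 * keeps splitting until the subproblem fits in whatever cache exists,
 * without ever knowing its size -- the "cache oblivious" property. */
static void matmul(const double *A, const double *B, double *C, int n, int ld)
{
    if (n <= 16) {                               /* small base case */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    int h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B + h * ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C + h * ld + h;

    matmul(A11, B11, C11, h, ld); matmul(A12, B21, C11, h, ld);
    matmul(A11, B12, C12, h, ld); matmul(A12, B22, C12, h, ld);
    matmul(A21, B11, C21, h, ld); matmul(A22, B21, C21, h, ld);
    matmul(A21, B12, C22, h, ld); matmul(A22, B22, C22, h, ld);
}

int main(void)
{
    int n = 256;
    double *A = calloc((size_t)n * n, sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    if (!A || !B || !C) return 1;
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    matmul(A, B, C, n, n);
    printf("C[0][0] = %.1f (expect %.1f)\n", C[0], 2.0 * n);
    free(A); free(B); free(C);
    return 0;
}
```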

Proceedings ArticleDOI
Andrew J. Herdrich, Ramesh Illikkal, Ravi Iyer, Donald Newell, Vineet Chadha, Jaideep Moses
08 Jun 2009
TL;DR: This paper evaluates two rate throttling mechanisms (clock modulation and frequency scaling) for effectively managing the interference between applications running in a CMP platform and delivering QoS/performance management, and shows that clock modulation is much more applicable to cache/memory QoS than frequency scaling.
Abstract: As we embrace the era of chip multi-processors (CMP), we are faced with two major architectural challenges: (i) QoS or performance management of disparate applications running on CPU cores contending for shared cache/memory resources and (ii) global/local power management techniques to stay within the overall platform constraints. The problem is exacerbated as the number of cores sharing the resources in a chip increases. In the past, researchers have proposed independent solutions for these two problems. In this paper, we show that rate-based techniques that are employed to address power management can be adapted to address cache/memory QoS issues. The basic approach is to throttle down the processing rate of a core if it is running a low-priority task and its execution is interfering with the performance of a high-priority task due to platform resource contention (i.e. cache or memory contention). We evaluate two rate throttling mechanisms (clock modulation and frequency scaling) for effectively managing the interference between applications running in a CMP platform and delivering QoS/performance management. We show that clock modulation is much more applicable to cache/memory QoS than frequency scaling and that resource monitoring along with rate control provides effective power-performance management in CMP platforms.

Patent
09 Sep 2009
TL;DR: In this article, the contents of a non-volatile memory device may be relied upon as accurately reflecting data stored on disk storage across a power transition such as a reboot, and cache metadata may be efficiently accessed and reliably saved and restored across power transitions.
Abstract: Embodiments of the invention provide techniques for ensuring that the contents of a non-volatile memory device may be relied upon as accurately reflecting data stored on disk storage across a power transition such as a reboot. For example, some embodiments of the invention provide techniques for determining whether the cache contents and/or disk contents are modified during a power transition, causing cache contents to no longer accurately reflect data stored in disk storage. Further, some embodiments provide techniques for managing cache metadata during normal ("steady state") operations and across power transitions, ensuring that cache metadata may be efficiently accessed and reliably saved and restored across power transitions.

Proceedings ArticleDOI
19 Apr 2009
TL;DR: This paper seeks to construct an analytical framework based on optimal control theory and dynamic programming, to help form an in-depth understanding of optimal strategies to design cache replacement algorithms in peer-assisted VoD systems.
Abstract: Peer-assisted Video-on-Demand (VoD) systems have not only received substantial recent research attention, but also been implemented and deployed with success in large-scale real-world streaming systems, such as PPLive. Peer-assisted Video-on-Demand systems are designed to take full advantage of peer upload bandwidth contributions with a cache on each peer. Since the size of such a cache on each peer is limited, it is imperative that an appropriate cache replacement algorithm is designed. There exists a tremendous level of flexibility in the design space of such cache replacement algorithms, including the simplest alternatives such as Least Recently Used (LRU). Which algorithm is the best to minimize server bandwidth costs, so that when peers need a media segment, it is most likely available from caches of other peers? Such a question, however, is arguably non-trivial to answer, as both the demand and supply of media segments are stochastic in nature. In this paper, we seek to construct an analytical framework based on optimal control theory and dynamic programming, to help us form an in-depth understanding of optimal strategies to design cache replacement algorithms. With such analytical insights, we have shown with extensive simulations that the performance margin enjoyed by optimal strategies over the simplest algorithms is not substantial, when it comes to reducing server bandwidth costs. In most cases, the simplest choices are good enough as cache replacement algorithms in peer-assisted VoD systems.
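As a point of reference for the "simplest alternatives" the paper compares against, here is a minimal Least Recently Used segment cache for a single peer; the cache size and access trace are arbitrary, and the supply/demand dynamics the paper models are of course absent.

```c
#include <stdio.h>

#define CACHE_SEGS 4                     /* illustrative per-peer cache size */

/* A peer's segment cache with LRU replacement: each slot remembers which
 * media segment it holds and when that segment was last served. */
typedef struct { int seg_id; unsigned long last_used; } slot_t;
static slot_t cache[CACHE_SEGS];
static unsigned long now;

/* Returns 1 on a local hit, 0 if the segment had to be fetched (from another
 * peer or the server) and cached, evicting the least recently used slot. */
static int access_segment(int seg_id)
{
    int victim = 0;
    now++;
    for (int i = 0; i < CACHE_SEGS; i++) {
        if (cache[i].seg_id == seg_id) { cache[i].last_used = now; return 1; }
        if (cache[i].last_used < cache[victim].last_used) victim = i;
    }
    cache[victim].seg_id = seg_id;       /* miss: replace the LRU segment */
    cache[victim].last_used = now;
    return 0;
}

int main(void)
{
    for (int i = 0; i < CACHE_SEGS; i++) { cache[i].seg_id = -1; cache[i].last_used = 0; }
    int trace[] = { 1, 2, 3, 4, 1, 5, 2 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("segment %d: %s\n", trace[i],
               access_segment(trace[i]) ? "hit" : "miss");
    return 0;
}
```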

Patent
19 Feb 2009
TL;DR: In this paper, the authors present methods, systems, and computer programs for managing storage in a computer system using a solid state drive (SSD) read cache memory, which includes receiving a read request, which causes a miss in a cache memory.
Abstract: Methods, systems, and computer programs for managing storage in a computer system using a solid state drive (SSD) read cache memory are presented. The method includes receiving a read request, which causes a miss in a cache memory. After the cache miss, the method determines whether the data to satisfy the read request is available in the SSD memory. If the data is in SSD memory, the read request is served from the SSD memory. Otherwise, SSD memory tracking logic is invoked and the read request is served from a hard disk drive (HDD). Additionally, the SSD memory tracking logic monitors access requests to pages in memory, and if a predefined criteria is met for a certain page in memory, then the page is loaded in the SSD. The use of the SSD as a read cache improves memory performance for random data reads.
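A rough sketch of the read path the patent describes: RAM cache first, then the SSD read cache, then the HDD, with tracking logic promoting frequently read pages into the SSD. The page descriptor, promotion threshold, and counters below are hypothetical illustration, not the patent's actual data structures.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PROMOTE_THRESHOLD 3   /* illustrative: reads before a page earns SSD space */

/* Hypothetical page descriptor used only for this sketch. */
typedef struct {
    uint64_t page_no;
    bool     in_ram_cache;
    bool     in_ssd;
    unsigned access_count;
} page_t;

/* Serve a read: RAM cache, then SSD read cache, then HDD.  Pages read from
 * the HDD are promoted to the SSD only after enough accesses, so the SSD
 * ends up absorbing repeated random reads. */
static const char *serve_read(page_t *p)
{
    if (p->in_ram_cache) return "RAM cache";
    if (p->in_ssd)       return "SSD read cache";

    p->access_count++;                       /* tracking logic on SSD misses */
    if (p->access_count >= PROMOTE_THRESHOLD)
        p->in_ssd = true;                    /* load the page into the SSD */
    return "HDD";
}

int main(void)
{
    page_t p = { .page_no = 42 };
    for (int i = 0; i < 5; i++)
        printf("read %d of page %llu served from %s\n",
               i + 1, (unsigned long long)p.page_no, serve_read(&p));
    return 0;
}
```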

Proceedings ArticleDOI
12 Dec 2009
TL;DR: This paper presents a technique that aims to balance the pressure on the cache sets by detecting when it may be beneficial to associate sets, displacing lines from stressed sets to underutilized ones.
Abstract: Efficient memory hierarchy design is critical due to the increasing gap between the speed of the processors and the memory. One of the sources of inefficiency in current caches is the non-uniform distribution of the memory accesses on the cache sets. Its consequence is that while some cache sets may have working sets that are far from fitting in them, other sets may be underutilized because their working set has fewer lines than the set. In this paper we present a technique that aims to balance the pressure on the cache sets by detecting when it may be beneficial to associate sets, displacing lines from stressed sets to underutilized ones. This new technique, called Set Balancing Cache or SBC, achieved an average reduction of 13% in the miss rate of ten benchmarks from the SPEC CPU2006 suite, resulting in an average IPC improvement of 5%.

Patent
27 Mar 2009
TL;DR: In this paper, a shared cache controller enables or disables access separately to each of the cache ways based upon the corresponding source of a received memory request, and the control of the accessibility of the shared cache ways via altering stored values in the configuration and status registers (CSRs) may be used to create a pseudo-RAM structure within the shared caches and to progressively reduce the size of the share cache during a power-down sequence while the cache continues operation.
Abstract: A system and method for data allocation in a shared cache memory of a computing system are contemplated. Each cache way of a shared set-associative cache is accessible to multiple sources, such as one or more processor cores, a graphics processing unit (GPU), an input/output (I/O) device, or multiple different software threads. A shared cache controller enables or disables access separately to each of the cache ways based upon the corresponding source of a received memory request. One or more configuration and status registers (CSRs) store encoded values used to alter accessibility to each of the shared cache ways. The control of the accessibility of the shared cache ways via altering stored values in the CSRs may be used to create a pseudo-RAM structure within the shared cache and to progressively reduce the size of the shared cache during a power-down sequence while the shared cache continues operation.
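The way-enable idea can be pictured as one configuration register per request source holding a bit per cache way, with allocation permitted only into enabled ways; clearing a way for every source lets it drain before being powered down, and reserving ways for a single source approximates a private RAM region inside the shared cache. The sources, way count, and bit assignments below are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 16u

/* One hypothetical configuration register per request source: bit i is set
 * if that source may allocate into way i of the shared cache. */
enum source { SRC_CORE0, SRC_CORE1, SRC_GPU, SRC_IO, SRC_COUNT };

static uint16_t way_enable_csr[SRC_COUNT];

static int may_allocate(enum source src, unsigned way)
{
    return (way_enable_csr[src] >> way) & 1u;
}

int main(void)
{
    way_enable_csr[SRC_CORE0] = 0x00FF;   /* core 0: ways 0-7  */
    way_enable_csr[SRC_CORE1] = 0x3F00;   /* core 1: ways 8-13 */
    way_enable_csr[SRC_GPU]   = 0xC000;   /* GPU: ways 14-15   */
    way_enable_csr[SRC_IO]    = 0x0000;   /* I/O: lookups only, no allocation */

    printf("core0 -> way 3:  %d\n", may_allocate(SRC_CORE0, 3));
    printf("core0 -> way 12: %d\n", may_allocate(SRC_CORE0, 12));
    printf("io    -> way 0:  %d\n", may_allocate(SRC_IO, 0));
    return 0;
}
```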

Proceedings ArticleDOI
22 Sep 2009
TL;DR: Two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers are introduced, which outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred.
Abstract: The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.

Proceedings ArticleDOI
24 Aug 2009
TL;DR: A novel approach to analyzing the worst-case cache interferences and bounding the WCET for threads running on multi-core processors with shared direct-mapped L2 instruction caches is proposed, using an Extended ILP (Integer Linear Programming) to model all the possible inter-thread cache conflicts.
Abstract: In a multi-core processor, different cores typically share the last-level cache, and threads running on different cores may interfere with each other in accessing the shared cache. Therefore, a multi-core WCET (Worst-Case Execution Time) analyzer must be able to safely and accurately estimate the worst-case inter-thread cache interferences, which is not supported by current WCET analysis techniques that mainly focus on analyzing uniprocessors. This paper proposes a novel approach to analyzing the worst-case cache interferences and bounding the WCET for threads running on multi-core processors with shared direct-mapped L2 instruction caches. We propose to use an Extended ILP (Integer Linear Programming) to model all the possible inter-thread cache conflicts, based on which we can accurately calculate the worst-case inter-thread cache interferences and derive the WCET. Compared to a recently proposed multi-core static analysis technique based on control flow information alone, this approach improves the tightness of WCET estimation by 13.7% on average.

Journal ArticleDOI
TL;DR: This work proposes a new framework for conducting comprehensive studies and characterization on the reliability behavior of cache memories, based on the development of new lifetime models for data and tag arrays residing in both the data and instruction caches that facilitate the characterization of cache vulnerability of stored items at various lifetime phases.
Abstract: Soft errors induced by energetic particle strikes in on-chip cache memories have become an increasing challenge in designing new-generation reliable microprocessors. Previous efforts have exploited information redundancy via parity/ECC coding or cacheline duplication for information integrity in on-chip cache memories. Due to the performance, area/size, and energy constraints of different target systems, many existing unoptimized protection schemes may eventually prove significantly inadequate and ineffective. In this paper, we propose a new framework for conducting comprehensive studies and characterization of the reliability behavior of cache memories, in order to provide insight into cache vulnerability to soft errors as well as design guidance to architects for highly efficient, reliable on-chip cache memory design. Our work is based on the development of new lifetime models for data and tag arrays residing in both the data and instruction caches. These models facilitate the characterization of cache vulnerability of stored items at various lifetime phases. We then exemplify this design methodology by proposing reliability schemes targeting specific vulnerable phases. Benchmarking is carried out to showcase the effectiveness of our approach.