
Showing papers on "Cache coloring published in 2006"


Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
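
As a rough illustration of the partitioning step only (the paper's utility monitors are hardware circuits; the miss curves and way counts below are invented inputs), a greedy allocator can hand out cache ways to whichever application saves the most misses from one additional way:

```python
# Minimal sketch of utility-based way partitioning (illustrative only).
# Each application supplies misses_at[w]: estimated misses if given w ways,
# which UCP derives in hardware from its monitoring circuits; here the
# curves are made-up inputs.

def greedy_partition(miss_curves, total_ways):
    """Assign cache ways to applications by greatest marginal miss reduction."""
    n = len(miss_curves)
    alloc = [1] * n                      # give everyone at least one way
    for _ in range(total_ways - n):
        # utility of one more way = misses saved by going from alloc[i] to alloc[i]+1
        gains = [miss_curves[i][alloc[i]] - miss_curves[i][alloc[i] + 1]
                 for i in range(n)]
        winner = max(range(n), key=lambda i: gains[i])
        alloc[winner] += 1
    return alloc

if __name__ == "__main__":
    # misses_at[w] for w = 0..16 ways (hypothetical, non-increasing curves)
    app_a = [1000 - 50 * w for w in range(17)]           # keeps benefiting
    app_b = [1000 - 200 * min(w, 4) for w in range(17)]  # saturates at 4 ways
    print(greedy_partition([app_a, app_b], total_ways=16))  # -> [12, 4]
```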

1,083 citations


Proceedings ArticleDOI
09 Dec 2006
TL;DR: This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors that can provide a differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees.
Abstract: This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect to overcome non-uniform cache access latency for good program performance and to reduce on-chip network traffic and related power consumption. Unlike previously studied hardware-based private and shared cache designs implementing a "fixed" caching policy, the proposed OS-microarchitecture approach is flexible; it can easily implement a wide spectrum of L2 caching policies without complex hardware support. Furthermore, our approach can provide a differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees. We discuss key design issues of the proposed approach and present preliminary experimental results showing the promise of our approach.
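
The following sketch illustrates the general idea of steering data to L2 cache slices through OS page allocation. The slice-index function (low bits of the physical frame number) and the two placement policies are simplifying assumptions, not the paper's exact mechanism:

```python
# Sketch of OS-level page allocation steering data to L2 cache slices.
# Assumption: the slice that caches a page is determined by low bits of the
# physical frame number (pfn % NUM_SLICES); the real mapping and policy
# knobs in the paper may differ.

NUM_SLICES = 16          # one L2 slice per tile (illustrative)

class FrameAllocator:
    def __init__(self, num_frames):
        # free lists bucketed by the cache slice each frame maps to
        self.free_by_slice = {s: [] for s in range(NUM_SLICES)}
        for pfn in range(num_frames):
            self.free_by_slice[pfn % NUM_SLICES].append(pfn)

    def alloc_page(self, preferred_slice):
        """Return a free frame mapping to the preferred slice if possible."""
        for s in [preferred_slice] + list(range(NUM_SLICES)):
            if self.free_by_slice[s]:
                return self.free_by_slice[s].pop()
        raise MemoryError("out of frames")

def slice_for_private_data(core_id):
    # "private" flavor of policy: keep a core's pages in its local slice
    return core_id % NUM_SLICES

def slice_for_shared_data(vpage):
    # "shared" flavor of policy: spread pages across all slices
    return vpage % NUM_SLICES

if __name__ == "__main__":
    alloc = FrameAllocator(num_frames=1024)
    pfn = alloc.alloc_page(slice_for_private_data(core_id=3))
    print("core 3 private page placed in slice", pfn % NUM_SLICES)
```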

377 citations


Journal ArticleDOI
01 May 2006
TL;DR: This paper presents CMP cooperative caching, a unified framework to manage a CMP's aggregate on-chip cache resources by forming an aggregate "shared" cache through cooperation among private caches; the approach performs robustly over a range of system/cache sizes and memory latencies.
Abstract: This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper. Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache, and a local cache hit rate similar to that of using private caches. Cooperative caching performs robustly over a range of system/cache sizes and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.

377 citations


Journal ArticleDOI
01 May 2006
TL;DR: Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%; a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR) is proposed to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls.
Abstract: Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses - some misses occur in isolation while some occur in parallel with other misses. Isolated misses are more costly to performance than parallel misses. However, traditional cache replacement is not aware of the MLP-dependent cost differential between different misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement by using a runtime technique to compute the MLP-based cost for each cache miss. It then describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR), to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.
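
A minimal sketch of the flavor of MLP-aware victim selection, assuming a cost of roughly 1/k for a miss serviced alongside k-1 others and a simple linear combination of recency and cost; the paper's actual LIN/SBAR hardware differs:

```python
# Illustrative sketch of MLP-aware victim selection: combine recency with an
# MLP-based cost per block (cheap parallel misses vs. expensive isolated
# misses). The cost values, scaling factor, and update rule are assumptions.

def mlp_cost(outstanding_misses_during_miss):
    # A miss serviced alongside k-1 others shares the stall: cost ~ 1/k.
    return 1.0 / max(1, outstanding_misses_during_miss)

def choose_victim(cache_set, cost_weight=4.0):
    """cache_set: list of dicts with 'tag', 'last_use' (higher = more recent),
    and 'cost' (MLP-based cost of re-fetching the block)."""
    # Evict the block with the lowest combined score: old AND cheap to re-fetch.
    def score(block):
        return block["last_use"] + cost_weight * block["cost"]
    return min(cache_set, key=score)

if __name__ == "__main__":
    s = [
        {"tag": 0xA, "last_use": 10, "cost": mlp_cost(4)},  # parallel miss, cheap
        {"tag": 0xB, "last_use": 12, "cost": mlp_cost(1)},  # isolated miss, costly
        {"tag": 0xC, "last_use": 30, "cost": mlp_cost(2)},
    ]
    print(hex(choose_victim(s)["tag"]))   # -> 0xa: old and cheap to re-fetch
```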

316 citations


Journal ArticleDOI
TL;DR: This research proposes a dynamic allocation methodology for global and stack data and program code that accounts for changing program requirements at runtime, has no software-caching tags, requires no runtime checks, has extremely low overheads, and yields 100% predictable memory access times.
Abstract: In this research, we propose a highly predictable, low overhead, and, yet, dynamic, memory-allocation strategy for embedded systems with scratch pad memory. A scratch pad is a fast compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus cache and by its significantly lower overheads in energy consumption, area, and overall runtime, even with a simple allocation scheme. Scratch pad allocation methods are primarily of two types. First, software-caching schemes emulate the workings of a hardware cache in software. Instructions are inserted before each load/store to check the software-maintained cache tags. Such methods incur large overheads in runtime, code size, energy consumption, and SRAM space for tags and deliver poor real-time guarantees just like hardware caches. A second category of algorithms partitions variables at compile-time into the two banks. However, a drawback of such static allocation schemes is that they do not account for dynamic program behavior. It is easy to see why a data allocation that never changes at runtime cannot achieve the full locality benefits of a cache. We propose a dynamic allocation methodology for global and stack data and program code that (i) accounts for changing program requirements at runtime, (ii) has no software-caching tags, (iii) requires no runtime checks, (iv) has extremely low overheads, and (v) yields 100% predictable memory access times. In this method, data that is about to be accessed frequently is copied into the scratch pad using compiler-inserted code at fixed and infrequent points in the program. Earlier data is evicted if necessary. When compared to a provably optimal static allocation, results show that our scheme reduces runtime by up to 39.8% and energy by up to 31.3%, on average, for our benchmarks, depending on the SRAM size used. The actual gain depends on the SRAM size, but our results show that close to the maximum benefit in runtime and energy is achieved for a substantial range of small SRAM sizes commonly found in embedded systems. Our comparison with a direct mapped cache shows that our method performs roughly as well as a cached architecture.

240 citations


Proceedings ArticleDOI
16 Sep 2006
TL;DR: This paper designs architectural support for the OS to efficiently manage shared caches with a wide variety of policies and demonstrates that the scheme can support a wide range of policies, including policies that provide passive performance differentiation, reactive fairness by miss-rate equalization, and reactive performance differentiation.
Abstract: The role of the operating system (OS) in managing shared resources such as CPU time, memory, peripherals, and even energy is well motivated and understood [23]. Unfortunately, one key resource — lower-level shared cache in chip multi-processors — is commonly managed purely in hardware by rudimentary replacement policies such as least-recently-used (LRU). The rigid nature of the hardware cache management policy poses a serious problem since there is no single best cache management policy across all sharing scenarios. For example, the cache management policy for a scenario where applications from a single organization are running under "best effort" performance expectation is likely to be different from the policy for a scenario where applications from competing business entities (say, at a third party data center) are running under a minimum service level expectation. When it comes to managing shared caches, there is an inherent tension between flexibility and performance. On one hand, managing the shared cache in the OS offers immense policy flexibility since it may be implemented in software. Unfortunately, it is prohibitively expensive in terms of performance for the OS to be involved in managing temporally fine-grain events such as cache allocation. On the other hand, sophisticated hardware-only cache management techniques to achieve fair sharing or throughput maximization have been proposed. But they offer no policy flexibility. This paper addresses this problem by designing architectural support for the OS to efficiently manage shared caches with a wide variety of policies. Our scheme consists of a hardware cache quota management mechanism, an OS interface and a set of OS level quota orchestration policies. The hardware mechanism guarantees that OS-specified quotas are enforced in shared caches, thus eliminating the need for (and the performance penalty of) temporally fine-grained OS intervention. The OS retains policy flexibility since it can tune the quotas during regularly scheduled OS interventions. We demonstrate that our scheme can support a wide range of policies including policies that provide (a) passive performance differentiation, (b) reactive fairness by miss-rate equalization, and (c) reactive performance differentiation.
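
A hypothetical sketch of the division of labor described above: hardware enforces per-application way quotas on every allocation, while the OS only adjusts quotas at coarse-grained intervals. The interface names and the simple per-set enforcement rule are inventions for illustration, not the paper's design:

```python
# Hypothetical quota-enforced cache set: the OS calls set_quota() infrequently;
# the per-access enforcement below stands in for the hardware mechanism.

class QuotaEnforcedCacheSet:
    def __init__(self, assoc, quotas):
        self.assoc = assoc
        self.quotas = quotas                 # app_id -> max ways in this set
        self.blocks = []                     # list of (app_id, tag), LRU order

    def set_quota(self, app_id, ways):       # OS-level knob, tuned at intervals
        self.quotas[app_id] = ways

    def access(self, app_id, tag):
        if (app_id, tag) in self.blocks:     # hit: move to MRU position
            self.blocks.remove((app_id, tag))
            self.blocks.append((app_id, tag))
            return "hit"
        # miss: if the requester is at its quota, it victimizes its own LRU block
        owned = [b for b in self.blocks if b[0] == app_id]
        if len(owned) >= self.quotas.get(app_id, 0):
            self.blocks.remove(owned[0])
        elif len(self.blocks) >= self.assoc:
            self.blocks.pop(0)               # otherwise evict the global LRU block
        self.blocks.append((app_id, tag))
        return "miss"

if __name__ == "__main__":
    s = QuotaEnforcedCacheSet(assoc=8, quotas={"A": 6, "B": 2})
    for t in range(8):
        s.access("B", t)                     # B can never hold more than 2 ways
    print(sum(1 for b in s.blocks if b[0] == "B"))   # -> 2
```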

215 citations


Proceedings ArticleDOI
16 Sep 2006
TL;DR: It is found that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
Abstract: As chip multiprocessors (CMPs) become increasingly mainstream, architects have likewise become more interested in how best to share a cache hierarchy among multiple simultaneous threads of execution. The complexity of this problem is exacerbated as the number of simultaneous threads grows from two or four to the tens or hundreds. However, there is no consensus in the architectural community on what "best" means in this context. Some papers in the literature seek to equalize each thread's performance loss due to sharing, while others emphasize maximizing overall system performance. Furthermore, the specific effect of these goals varies depending on the metric used to define "performance". In this paper we label equal performance targets as Communist cache policies and overall performance targets as Utilitarian cache policies. We compare both of these models to the most common current model of a free-for-all cache (a Capitalist policy). We consider various performance metrics, including miss rates, bandwidth usage, and IPC, including both absolute and relative values of each metric. Using analytical models and behavioral cache simulation, we find that the optimal partitioning of a shared cache can vary greatly as different but reasonable definitions of optimality are applied. We also find that, although Communist and Utilitarian targets are generally compatible, each policy has workloads for which it provides poor overall performance or poor fairness, respectively. Finally, we find that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
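
A toy illustration of how the two optimization targets can disagree, using made-up per-thread IPC-versus-ways curves; the paper's metrics and workloads are far richer:

```python
# "Utilitarian" = maximize total IPC; "Communist" = minimize the worst
# per-thread slowdown relative to owning the whole cache. Curves are invented.

from itertools import product

def best_partition(ipc_curves, total_ways, objective):
    n = len(ipc_curves)
    alone = [curve[total_ways] for curve in ipc_curves]   # IPC with whole cache
    best, best_score = None, None
    for split in product(range(1, total_ways + 1), repeat=n):
        if sum(split) != total_ways:          # only exact partitions, >=1 way each
            continue
        ipcs = [ipc_curves[i][split[i]] for i in range(n)]
        if objective == "utilitarian":
            score = sum(ipcs)                 # throughput
        else:                                 # "communist": equalize slowdowns
            score = -max(alone[i] / ipcs[i] for i in range(n))
        if best_score is None or score > best_score:
            best, best_score = split, score
    return best

if __name__ == "__main__":
    ways = 8
    thread_a = [0.2 + 0.10 * w for w in range(ways + 1)]  # gains a lot per way
    thread_b = [0.1 + 0.05 * w for w in range(ways + 1)]  # gains less per way
    curves = [thread_a, thread_b]
    print("utilitarian:", best_partition(curves, ways, "utilitarian"))  # (7, 1)
    print("communist:  ", best_partition(curves, ways, "communist"))    # (4, 4)
```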

198 citations


Proceedings ArticleDOI
22 Oct 2006
TL;DR: The client request behavior in web servers allows the proposed architecture to show that the primary drawbacks of flash memory (endurance and long write latencies) can easily be overcome.
Abstract: We propose an architecture that uses NAND flash memory to reduce main memory power in web server platforms. Our architecture uses a two level file buffer cache composed of a relatively small DRAM, which includes a primary file buffer cache, and a flash memory secondary file buffer cache. Compared to a conventional DRAM-only architecture, our architecture consumes orders of magnitude less idle power while remaining cost effective. This is a result of using flash memory, which consumes orders of magnitude less idle power than DRAM and is twice as dense. The client request behavior in web servers allows us to show that the primary drawbacks of flash memory (endurance and long write latencies) can easily be overcome. In fact, the wear-level aware management techniques that we propose are not heavily used.
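
A rough sketch of the two-level file buffer cache idea, with a small DRAM primary level backed by a larger flash secondary level; capacities, the LRU policy, and the demotion path are illustrative assumptions rather than the paper's configuration:

```python
# Two-level file buffer cache sketch: DRAM primary, flash secondary.
# DRAM victims are demoted to flash; flash hits are promoted back to DRAM.

from collections import OrderedDict

class TwoLevelFileCache:
    def __init__(self, dram_blocks, flash_blocks):
        self.dram = OrderedDict()    # block_id -> data, LRU order
        self.flash = OrderedDict()
        self.dram_cap = dram_blocks
        self.flash_cap = flash_blocks

    def read(self, block_id, fetch_from_disk):
        if block_id in self.dram:                 # fast path: DRAM hit
            self.dram.move_to_end(block_id)
            return self.dram[block_id]
        if block_id in self.flash:                # flash hit: promote to DRAM
            data = self.flash.pop(block_id)
        else:                                     # miss: go to disk
            data = fetch_from_disk(block_id)
        self._insert_dram(block_id, data)
        return data

    def _insert_dram(self, block_id, data):
        self.dram[block_id] = data
        if len(self.dram) > self.dram_cap:
            victim, vdata = self.dram.popitem(last=False)
            self._insert_flash(victim, vdata)     # demote DRAM victim to flash

    def _insert_flash(self, block_id, data):
        self.flash[block_id] = data
        if len(self.flash) > self.flash_cap:
            self.flash.popitem(last=False)        # drop flash LRU entry

if __name__ == "__main__":
    cache = TwoLevelFileCache(dram_blocks=2, flash_blocks=8)

    def disk(b):
        return f"contents-of-{b}"

    for b in [1, 2, 3, 1]:        # block 1 is demoted to flash, then re-read
        cache.read(b, disk)
    print(sorted(cache.dram), sorted(cache.flash))
```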

152 citations


Proceedings ArticleDOI
22 Oct 2006
TL;DR: Several optimizations are examined on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5 and the heterogeneous multicore design of the Cell processor, including both an implicit cache-oblivious approach and a cache-aware algorithm blocked to match the cache structure.
Abstract: Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.
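
A minimal sketch of the cache-aware blocking referred to above, for a 2D 5-point stencil; the tile size is an assumption, whereas the paper tunes blocking to each machine's cache structure:

```python
# Cache-blocked 2D 5-point stencil sweep: traverse the grid in tiles sized to
# fit in cache so each loaded block is reused before eviction.

import numpy as np

def stencil_sweep_blocked(src, dst, tile=64):
    n, m = src.shape
    for ii in range(1, n - 1, tile):              # loop over cache-sized tiles
        for jj in range(1, m - 1, tile):
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, m - 1)):
                    dst[i, j] = 0.25 * (src[i - 1, j] + src[i + 1, j] +
                                        src[i, j - 1] + src[i, j + 1])

if __name__ == "__main__":
    a = np.random.rand(256, 256)
    b = np.zeros_like(a)
    stencil_sweep_blocked(a, b)
```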

150 citations


Patent
24 Apr 2006
TL;DR: In this paper, a method and system of managing data access in a shared memory cache of a processor are described; the method includes probing one or more memory addresses that map to a subset of the shared memory cache.
Abstract: A method and system of managing data access in a shared memory cache of a processor are disclosed. The method includes probing one or more memory addresses that map to a subset of the shared memory cache and sensing a plurality of events in the one or more memory addresses. Cache utilization information is then obtained by reading a hardware performance counter of the processor. The hardware performance counter is incremented based on the occurrence of the plurality of events. Based upon the cache utilization information, an occurrence of one of the plurality of events is reduced.

140 citations


Proceedings ArticleDOI
27 Feb 2006
TL;DR: This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads and shows that a shared 32 MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32 MB private cache configuration by several orders of magnitude.
Abstract: With the continuing growth in the amount of genetic data, members of the bioinformatics community are developing a variety of data-mining applications to understand the data and discover meaningful information. These applications are important in defining the design and performance decisions of future high performance microprocessors. This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads. For a CMP with a three-level cache hierarchy, we model the last-level of the cache hierarchy as either multiple private caches or a single cache shared amongst different cores of the CMP. Our experiments show that the bioinformatics workloads exhibit significant data-sharing - 50-95% of the data cache is shared by the different threads of the workload. Furthermore, regardless of the amount of data cache shared, for some workloads, as many as 98% of the accesses to the last-level cache are to shared data cache lines. Additionally, the amount of data-sharing exhibited by the workloads is a function of the total cache size available - the larger the data cache the better the sharing behavior. Thus, partitioning the available last-level cache silicon area into multiple private caches can cause applications to lose their inherent data-sharing behavior. For the workloads in this study, a shared 32 MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32 MB private cache configuration by several orders of magnitude. Specifically, with shared last-level caches, the bandwidth demands beyond the last-level cache can be reduced by factors of 3-625 when compared to private last-level caches.

Patent
06 Apr 2006
TL;DR: In this paper, the authors present a system that facilitates selectively unmarking load-marked cache lines during transactional program execution, where load-marked cache lines are monitored during the transactional execution to detect interfering accesses from other threads.
Abstract: One embodiment of the present invention provides a system that facilitates selectively unmarking load-marked cache lines during transactional program execution, wherein load-marked cache lines are monitored during transactional execution to detect interfering accesses from other threads. During operation, the system encounters a release instruction during transactional execution of a block of instructions. In response to the release instruction, the system modifies the state of cache lines, which are specially load-marked to indicate they can be released from monitoring, to account for the release instruction being encountered. In doing so, the system can potentially cause the specially load-marked cache lines to become unmarked. In a variation on this embodiment, upon encountering a commit-and-start-new-transaction instruction, the system modifies load-marked cache lines to account for the commit-and-start-new-transaction instruction being encountered. In doing so, the system causes normally load-marked cache lines to become unmarked, while other specially load-marked cache lines may remain load-marked past the commit-and-start-new-transaction instruction.

Journal ArticleDOI
TL;DR: It is claimed that there is a sufficient number of good policies for proxies with different characteristics, such as proxies with a small cache, limited bandwidth, and limited processing power, and the article suggests policies for different types of proxies.
Abstract: Research involving Web cache replacement policy has been active for at least a decade. In this article we would like to claim that there is a sufficient number of good policies, and further proposals would only produce minute improvements. We argue that the focus should be fitness for purpose rather than proposing any new policies. Up to now, almost all policies were purported to perform better than others, creating confusion as to which policy should be used. Actually, a policy only performs well in certain environments. Therefore, the goal of this article is to identify the appropriate policies for proxies with different characteristics, such as proxies with a small cache, limited bandwidth, and limited processing power, as well as to suggest policies for different types of proxies, such as ISP-level and root-level proxies.

Journal ArticleDOI
01 Mar 2006
TL;DR: Simulation results indicate that the proposed aggregate caching mechanism and a broadcast-based Simple Search algorithm can significantly improve Imanet performance in terms of throughput and average number of hops to access data items.
Abstract: Internet-based mobile ad hoc network (Imanet) is an emerging technique that combines a wired network (e.g. Internet) and a mobile ad hoc network (Manet) for developing a ubiquitous communication infrastructure. However, to fulfill users' demand to access various kinds of information, an Imanet has several limitations such as limited accessibility to the wired Internet, insufficient wireless bandwidth, and longer message latency. In this paper, we address the issues involved in information search and access in Imanets. An aggregate caching mechanism and a broadcast-based Simple Search (SS) algorithm are proposed for improving the information accessibility and reducing average communication latency in Imanets. As a part of the aggregate cache, a cache admission control policy and a cache replacement policy, called Time and Distance Sensitive (TDS) replacement, are developed to reduce the cache miss ratio and improve the information accessibility. We evaluate the impact of caching, cache management, and the number of access points that are connected to the Internet, through extensive simulation. The simulation results indicate that the proposed aggregate caching mechanism can significantly improve Imanet performance in terms of throughput and average number of hops to access data items.
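
A heavily hedged sketch of what a "time and distance sensitive" eviction choice might look like: drop items that are stale and whose source is only a few hops away, and keep fresh items from distant sources. The scoring function below is an assumption for illustration; the paper defines its own admission and replacement rules:

```python
# Illustrative victim selection weighing item age against hop distance to the
# item's source. The exact weighting is invented, not the paper's TDS formula.

import time

def tds_victim(cache_entries, now=None):
    """cache_entries: dict item_id -> {'fetched_at': seconds, 'hops': int}."""
    now = now if now is not None else time.time()

    def keep_value(entry):
        age = now - entry["fetched_at"]
        # low value = old item that is cheap to re-fetch (few hops away)
        return entry["hops"] / (1.0 + age)

    return min(cache_entries, key=lambda k: keep_value(cache_entries[k]))

if __name__ == "__main__":
    entries = {
        "song.mp3":  {"fetched_at": 0.0,   "hops": 2},   # old, nearby source
        "news.html": {"fetched_at": 900.0, "hops": 6},   # fresh, far away
    }
    print(tds_victim(entries, now=1000.0))               # -> song.mp3
```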

Patent
13 Dec 2006
TL;DR: In this article, the temporary storage address of data following access commands from the upper-level device is the volatile cache memory, which is backed by a non-volatile memory that can continue to retain data irrespective of whether or not power is supplied.
Abstract: A storage system comprises a volatile cache memory and a non-volatile memory, which is a type of memory that can continue to retain data irrespective of whether or not power is supplied. The temporary storage address of data following access commands from the upper-level device shall be the volatile cache memory. If power is not supplied from the primary power source to the volatile cache memory, power supplied from a battery is used to copy data stored in the volatile cache memory to the non-volatile memory.

Proceedings ArticleDOI
09 Dec 2006
TL;DR: A novel and general scheme by which any two cache management algorithms can be combined, adaptively switching between them while closely tracking the locality characteristics of a given program; the scheme is inspired by recent work in virtual memory management at the operating system level.
Abstract: We present and evaluate the idea of adaptive processor cache management. Specifically, we describe a novel and general scheme by which we can combine any two cache management algorithms (e.g., LRU, LFU, FIFO, Random) and adaptively switch between them, closely tracking the locality characteristics of a given program. The scheme is inspired by recent work in virtual memory management at the operating system level, which has shown that it is possible to adapt over two replacement policies to provide an aggregate policy that always performs within a constant factor of the better component policy. A hardware implementation of adaptivity requires very simple logic but duplicate tag structures. To reduce the overhead, we use partial tags, which achieve good performance with a small hardware cost. In particular, adapting between LRU and LFU replacement policies on an 8-way 512KB L2 cache yields a 12.7% improvement in average CPI on applications that exhibit a non-negligible L2 miss ratio. Our approach increases total cache storage by 4.0%, but it still provides slightly better performance than a conventional 10-way set-associative 640KB cache which requires 25% more storage.
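
A simplified sketch of the control idea: two shadow tag directories simulate LRU and LFU over the same access stream, and the real cache follows whichever policy has fewer (decayed) misses. The paper's implementation uses partial tags and hardware counters; the decay factor and data structures below are assumptions:

```python
# Adapting between two replacement policies via per-policy shadow directories.

class ShadowSet:
    def __init__(self, assoc, policy):
        self.assoc, self.policy = assoc, policy
        self.tags, self.freq, self.stamp, self.clock = [], {}, {}, 0
        self.misses = 0.0

    def access(self, tag):
        self.clock += 1
        self.freq[tag] = self.freq.get(tag, 0) + 1
        self.stamp[tag] = self.clock
        if tag in self.tags:
            return
        self.misses += 1
        if len(self.tags) >= self.assoc:
            if self.policy == "LRU":
                victim = min(self.tags, key=lambda t: self.stamp[t])
            else:                                   # "LFU"
                victim = min(self.tags, key=lambda t: self.freq[t])
            self.tags.remove(victim)
        self.tags.append(tag)

class AdaptiveSet:
    def __init__(self, assoc, decay=0.99):
        self.shadow = {"LRU": ShadowSet(assoc, "LRU"),
                       "LFU": ShadowSet(assoc, "LFU")}
        self.decay = decay

    def winner(self):
        return min(self.shadow, key=lambda p: self.shadow[p].misses)

    def access(self, tag):
        for s in self.shadow.values():
            s.misses *= self.decay                  # age old miss counts
            s.access(tag)
        return self.winner()                        # policy the real cache follows

if __name__ == "__main__":
    s = AdaptiveSet(assoc=4)
    # a scan pattern mixed with a small hot set: LFU wins on this stream
    for tag in ([1, 2, 3, 4] * 5 + list(range(10, 30))) * 3:
        chosen = s.access(tag)
    print("currently following:", chosen)
```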

Proceedings ArticleDOI
05 Jul 2006
TL;DR: Experimental results provided in the paper show that with an appropriate selection of regions and cache contents, the worst-case performance of applications with locked instruction caches is competitive with the worst-case performance of unlocked caches.
Abstract: Cache memories have been extensively used to bridge the gap between high speed processors and relatively slower main memories. However, they are sources of predictability problems because of their dynamic and adaptive behavior, and thus need special attention to be used in hard real-time systems. A lot of progress has been achieved in the last ten years to statically predict worst-case execution times (WCETs) of tasks on architectures with caches. However, cache-aware WCET analysis techniques are not always applicable due to the lack of documentation of hardware manuals concerning the cache replacement policies. Moreover, they tend to be pessimistic with some cache replacement policies (e.g. random replacement policies). Lastly, caches are sources of timing anomalies in dynamically scheduled processors (a cache miss may in some cases result in a shorter execution time than a hit). To reconcile performance and predictability of caches, we propose in this paper algorithms for software control of instruction caches. The proposed algorithms statically divide the code of tasks into regions, for which the cache contents are statically selected. At run-time, at every transition between regions, the cache contents computed off-line are loaded into the cache and the cache replacement policy is disabled (the cache is locked). Experimental results provided in the paper show that with an appropriate selection of regions and cache contents, the worst-case performance of applications with locked instruction caches is competitive with the worst-case performance of unlocked caches.

Patent
02 Jun 2006
TL;DR: In this article, a high data availability write-caching storage controller has a volatile memory with a write cache for caching write cache data, a non-volatile memory, a capacitor pack for supplying power for backing up the write cache to the nonvolatile memories in response to a loss of main power, and a CPU that determines whether reducing an operating voltage of the capacitor pack to a new value would cause the battery pack to be storing less energy than required for back up the current size write cache.
Abstract: A high data availability write-caching storage controller has a volatile memory with a write cache for caching write cache data, a non-volatile memory, a capacitor pack for supplying power for backing up the write cache to the non-volatile memory in response to a loss of main power, and a CPU that determines whether reducing an operating voltage of the capacitor pack to a new value would cause the capacitor pack to be storing less energy than required for backing up the current size write cache to the non-volatile memory. If so, the CPU reduces the size of the write cache prior to reducing the operating voltage. The CPU estimates the capacity of the capacitor pack to store the required energy based on a history of operational temperature and voltage readings of the capacitor pack, such as on an accumulated normalized running time and warranted lifetime of the capacitor pack.

Patent
29 Nov 2006
TL;DR: In this paper, the state of the cache during a cache preload operation is monitored to avoid the need to restart the cache pre-load operation from the beginning, and any data that has been preloaded into a cache prior to a failure may be retained after a failover occurs.
Abstract: An apparatus, program product and method monitor the state of a cache during a cache preload operation in a clustered computer system such that the monitored state can be used after a failover to potentially avoid the need to restart the cache preload operation from the beginning. In particular, by monitoring the state of the cache during a cache preload operation, any data that has been preloaded into a cache prior to a failure may be retained after a failover occurs, thus enabling the cache preload operation to continue from the point at which it was interrupted as a result of the failure.

Proceedings ArticleDOI
26 Mar 2006
TL;DR: This paper discusses the design choices when building thread-shared code caches and enumerates the difficulties of thread-local storage, synchronization, trace building, in-cache lookup tables, and cache eviction, and presents efficient solutions to these problems that both scale well and do not require thread suspension.
Abstract: Software code caches are increasingly being used to amortize the runtime overhead of dynamic optimizers, simulators, emulators, dynamic translators, dynamic compilers, and other tools. Despite the now-widespread use of code caches, techniques for efficiently sharing them across multiple threads have not been fully explored. Some systems simply do not support threads, while others resort to thread-private code caches. Although thread-private caches are much simpler to manage, synchronize, and provide scratch space for, they simply do not scale when applied to many-threaded programs. Thread-shared code caches are needed to target server applications, which employ hundreds of worker threads all performing similar tasks. Yet, those systems that do share their code caches often have brute-force, inefficient solutions to the challenges of concurrent code cache access: a single global lock on runtime system code and suspension of all threads for any cache management action. This limits the possibilities for cache design and has performance problems with applications that require frequent cache invalidations to maintain cache consistency. In this paper, we discuss the design choices when building thread-shared code caches and enumerate the difficulties of thread-local storage, synchronization, trace building, in-cache lookup tables, and cache eviction. We present efficient solutions to these problems that both scale well and do not require thread suspension. We evaluate our results in DynamoRIO, an industrial-strength dynamic binary translation system, on real-world server applications. On these applications our thread-shared caches use an order of magnitude less memory and improve throughput by up to four times compared to thread-private caches.

Patent
27 Dec 2006
TL;DR: In this article, a permanent cache list of files not to be removed from a cache is determined in response to a user selection of an application to be added to the cache by adding a file to cache list if the file is a static dependency of the application or if a file has a high probability of being used in the future by the application.
Abstract: In some embodiments a permanent cache list of files not to be removed from a cache is determined in response to a user selection of an application to be added to the cache. The determination is made by adding a file to the cache list if the file is a static dependency of the application, or if a file has a high probability of being used in the future by the application. Other embodiments are described and claimed.

Patent
31 Aug 2006
TL;DR: In this article, a method and apparatus for providing shared switch and cache memory are described, consisting of a message switch module, a cache controller module, and shared switch/cache memory.
Abstract: A method and apparatus are described to provide shared switch and cache memory. The apparatus may comprise a message switch module, a cache controller module, and shared switch and cache memory to provide shared memory to the message switch module and to the cache controller module. The cache controller module may comprise pointer memory to store a plurality of pointers, each pointer pointing to a location in the shared switch and cache memory (e.g., pointing to a message header partition in the shared switch and cache memory). If there is a corresponding pointer, a memory read response may be sent to the requesting agent. If there is no corresponding pointer, a write data request may be sent to a corresponding destination agent and, in response to receiving the requested data, a pointer to the stored data in the pointer memory may be provided.

Patent
Charles E. Narad
07 Dec 2006
TL;DR: In this paper, the authors discuss how data shared between two memory accessing agents may be stored in a shared partition of the shared cache, and how data accessed by one of the accessing agents can be stored in one or more private partitions of the shared cache.
Abstract: Some of the embodiments discussed herein may utilize partitions within a shared cache in various computing environments. In an embodiment, data shared between two memory accessing agents may be stored in a shared partition of the shared cache. Additionally, data accessed by one of the memory accessing agents may be stored in one or more private partitions of the shared cache.

Patent
11 Dec 2006
TL;DR: In this article, a cache memory preprocessor consists of a command inputter, which receives a multiple-way cache memory processing command from the processor, and a command implementer, which performs background processing upon multiple ways of the cache memory.
Abstract: A cache memory preprocessor prepares a cache memory for use by a processor. The processor accesses a main memory via a cache memory, which serves as a data cache for the main memory. The cache memory preprocessor consists of a command inputter, which receives a multiple-way cache memory processing command from the processor, and a command implementer. The command implementer performs background processing upon multiple ways of the cache memory in order to implement the cache memory processing command received by the command inputter.

Patent
27 Mar 2006
TL;DR: In this paper, the authors propose a method for optimisation of the management of a server cache for dynamic pages which may be consulted by client terminals with differing characteristics which requires the provision of discrete versions of a dynamic page in the cache.
Abstract: The invention relates to a method for optimisation of the management of a server cache for dynamic pages which may be consulted by client terminals with differing characteristics which requires the provision of discrete versions (10) of a dynamic page in the cache. According to the method, when a terminal requests (11) a dynamic page, a verification step (12) for the presence of at least one version of the dynamic page in the cache is carried out, such that if the verification is positive the following complementary steps are carried out: procurement (13) of a set of characteristics specific to the type of client terminal, determination (14) of a subset of necessary characteristics from amongst the specific characteristics for the reproduction of the dynamic page on a client terminal, search (15), among the version(s) of the dynamic page in the cache for a suitable version using the subset of necessary characteristics and allocation (16) of the suitable version to the client terminal.

Journal ArticleDOI
26 Jun 2006
TL;DR: This paper is the first to propose an analytical model which predicts the performance of cache replacement policies, based on probability theory, and relies solely on the statistical properties of the application, without relying on heuristics or rules of thumb.
Abstract: Due to the increasing gap between CPU and memory speed, cache performance plays an increasingly critical role in determining the overall performance of microprocessor systems. One of the important factors that affect cache performance is the cache replacement policy. Despite its importance, current analytical cache performance models ignore the impact of cache replacement policies on cache performance. To the best of our knowledge, this paper is the first to propose an analytical model which predicts the performance of cache replacement policies. The input to our model is a simple circular sequence profiling of each application, which requires very little storage overhead. The output of the model is the predicted miss rates of an application under different replacement policies. The model is based on probability theory and utilizes Markov processes to compute the miss probability of each cache access. The model makes realistic assumptions and relies solely on the statistical properties of the application, without relying on heuristics or rules of thumb. The model's run time is less than 0.1 seconds, much lower than that of trace simulations. We validate the model by comparing the predicted miss rates of seventeen SPEC2000 and NAS benchmark applications against miss rates obtained by detailed execution-driven simulations, across a range of different cache sizes, associativities, and four replacement policies, and show that the model is very accurate. The model's average prediction error is 1.41%, and there are only 14 out of 952 validation points in which the prediction errors are larger than 10%.
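
As one small example of the kind of probabilistic reasoning involved (not the model proposed in the paper): under random replacement in an A-way set, each miss to the set evicts a given resident block with probability 1/A, so a block re-referenced after d intervening misses to its set hits with probability

```latex
% Illustration only; not the paper's model.
P(\text{hit} \mid d \text{ intervening misses to the set}) = \left(1 - \tfrac{1}{A}\right)^{d}
```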

Patent
14 Apr 2006
TL;DR: In this article, a plurality of processor cores, each comprising a computation unit and a memory, are configured as a cache for memory external to the processor cores and at least some of the processors are configured to transmit a message over the interconnection network to access a cache of another processor core.
Abstract: An apparatus comprises a plurality of processor cores, each comprising a computation unit and a memory. The apparatus further comprises an interconnection network to transmit data among the processor cores. At least some of the memories are configured as a cache for memory external to the processor cores, and at least some of the processor cores are configured to transmit a message over the interconnection network to access a cache of another processor core.

Journal ArticleDOI
TL;DR: This paper presents a cache-oblivious algorithm for matrix multiplication that uses a block-recursive structure and an element ordering based on Peano curves, which leads to asymptotically optimal spatial and temporal locality of data access.
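
A sketch of the cache-oblivious part of the idea: recursive quadrant splitting ensures that at some depth the sub-blocks fit in every cache level without knowing the cache parameters. This is the standard divide-and-conquer formulation; the paper's contribution is additionally ordering the recursive calls along a Peano curve, which is not reproduced here:

```python
# Cache-oblivious recursive matrix multiplication (standard formulation).

import numpy as np

def matmul_recursive(A, B, C, cutoff=32):
    """C += A @ B, all square with the same power-of-two size."""
    n = A.shape[0]
    if n <= cutoff:                         # small enough: multiply directly
        C += A @ B
        return
    h = n // 2
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):                # C_ij += A_ik * B_kj on quadrants
                matmul_recursive(A[i*h:(i+1)*h, k*h:(k+1)*h],
                                 B[k*h:(k+1)*h, j*h:(j+1)*h],
                                 C[i*h:(i+1)*h, j*h:(j+1)*h],
                                 cutoff)

if __name__ == "__main__":
    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    C = np.zeros((n, n))
    matmul_recursive(A, B, C)
    print(np.allclose(C, A @ B))            # -> True
```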

Proceedings ArticleDOI
24 Jan 2006
TL;DR: A method is given to rapidly find the L1 cache miss rate of an application, and an energy model and an execution time model are developed to find the best cache configuration for the given embedded application.
Abstract: Modern embedded systems execute a single application or a class of applications repeatedly. A newly emerging methodology of designing embedded systems utilizes configurable processors where the cache size, associativity, and line size can be chosen by the designer. In this paper, a method is given to rapidly find the L1 cache miss rate of an application. An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. Using benchmarks from Mediabench, we find that our method is on average 45 times faster than Dinero IV at exploring the design space, while still having 100% accuracy.

Patent
Marcus L. Kornegay, Ngan N. Pham
22 Nov 2006
TL;DR: In this article, a method for cache management of lines of binary code that are stored in the cache is described, where a cache manager log can identify a line, or lines, to be evicted based on data by accessing the cache directory.
Abstract: A method for cache management is disclosed. The method can assign or determine identifiers for lines of binary code that are, or will be, stored in cache. The method can create a cache directory that utilizes the identifier to keep an eviction count and/or a reload count for cached lines. Thus, each time a line is entered into, or evicted from, the cache, the cache eviction log can be amended accordingly. When a processor receives or creates an instruction that requests that a line be evicted from cache, a cache manager can identify a line or lines of binary code to be evicted by accessing the cache directory, and then the line(s) can be evicted.
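
A hedged sketch of the bookkeeping this summary describes: each cached line of binary code gets an identifier, and a directory tracks eviction and reload counts that a cache manager can consult when choosing a victim. Field names and the victim-selection rule are illustrative assumptions:

```python
# Cache directory tracking per-line eviction and reload counts.

class CacheDirectory:
    def __init__(self):
        self.entries = {}   # line_id -> {"resident", "evictions", "reloads"}

    def insert(self, line_id):
        e = self.entries.setdefault(
            line_id, {"resident": False, "evictions": 0, "reloads": 0})
        if e["evictions"] > 0:        # the line has been in the cache before
            e["reloads"] += 1
        e["resident"] = True

    def evict(self, line_id):
        e = self.entries[line_id]
        e["resident"] = False
        e["evictions"] += 1

    def pick_victim(self):
        # Example rule: evict the resident line that has been reloaded least,
        # i.e. the one the processor seems to need least often.
        resident = {k: v for k, v in self.entries.items() if v["resident"]}
        return min(resident, key=lambda k: resident[k]["reloads"])

if __name__ == "__main__":
    d = CacheDirectory()
    for line in ["f1", "f2", "f3"]:
        d.insert(line)
    d.evict("f1"); d.insert("f1")          # f1 gets reloaded once
    print(d.pick_victim())                 # -> "f2" (a never-reloaded line)
```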