
Showing papers on "Cache coloring published in 2006"


Proceedings ArticleDOI
09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
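
As a rough illustration of the partitioning step only (the paper's utility monitors are hardware circuits; the miss curves and way counts below are invented inputs), a greedy allocator can hand out cache ways to whichever application saves the most misses from one additional way:

```python
# Minimal sketch of utility-based way partitioning (illustrative only).
# Each application supplies misses_at[w]: estimated misses if given w ways,
# which UCP derives in hardware from its monitoring circuits; here the
# curves are made-up inputs.

def greedy_partition(miss_curves, total_ways):
    """Assign cache ways to applications by greatest marginal miss reduction."""
    n = len(miss_curves)
    alloc = [1] * n                      # give everyone at least one way
    for _ in range(total_ways - n):
        # utility of one more way = misses saved by going from alloc[i] to alloc[i]+1
        gains = [miss_curves[i][alloc[i]] - miss_curves[i][alloc[i] + 1]
                 for i in range(n)]
        winner = max(range(n), key=lambda i: gains[i])
        alloc[winner] += 1
    return alloc

if __name__ == "__main__":
    # misses_at[w] for w = 0..16 ways (hypothetical, non-increasing curves)
    app_a = [1000 - 50 * w for w in range(17)]           # keeps benefiting
    app_b = [1000 - 200 * min(w, 4) for w in range(17)]  # saturates at 4 ways
    print(greedy_partition([app_a, app_b], total_ways=16))  # -> [12, 4]
```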

1,083 citations


Proceedings ArticleDOI
09 Dec 2006
TL;DR: This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors that can provide a differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees.
Abstract: This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect to overcome non-uniform cache access latency for good program performance and to reduce on-chip network traffic and related power consumption. Unlike previously studied hardware-based private and shared cache designs implementing a "fixed" caching policy, the proposed OS-microarchitecture approach is flexible; it can easily implement a wide spectrum of L2 caching policies without complex hardware support. Furthermore, our approach can provide a differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees. We discuss key design issues of the proposed approach and present preliminary experimental results showing the promise of our approach.
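
The following sketch illustrates the general idea of steering data to L2 cache slices through OS page allocation. The slice-index function (low bits of the physical frame number) and the two placement policies are simplifying assumptions, not the paper's exact mechanism:

```python
# Sketch of OS-level page allocation steering data to L2 cache slices.
# Assumption: the slice that caches a page is determined by low bits of the
# physical frame number (pfn % NUM_SLICES); the real mapping and policy
# knobs in the paper may differ.

NUM_SLICES = 16          # one L2 slice per tile (illustrative)

class FrameAllocator:
    def __init__(self, num_frames):
        # free lists bucketed by the cache slice each frame maps to
        self.free_by_slice = {s: [] for s in range(NUM_SLICES)}
        for pfn in range(num_frames):
            self.free_by_slice[pfn % NUM_SLICES].append(pfn)

    def alloc_page(self, preferred_slice):
        """Return a free frame mapping to the preferred slice if possible."""
        for s in [preferred_slice] + list(range(NUM_SLICES)):
            if self.free_by_slice[s]:
                return self.free_by_slice[s].pop()
        raise MemoryError("out of frames")

def slice_for_private_data(core_id):
    # "private" flavor of policy: keep a core's pages in its local slice
    return core_id % NUM_SLICES

def slice_for_shared_data(vpage):
    # "shared" flavor of policy: spread pages across all slices
    return vpage % NUM_SLICES

if __name__ == "__main__":
    alloc = FrameAllocator(num_frames=1024)
    pfn = alloc.alloc_page(slice_for_private_data(core_id=3))
    print("core 3 private page placed in slice", pfn % NUM_SLICES)
```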

377 citations


Journal ArticleDOI
01 May 2006
TL;DR: This paper presents CMP cooperative caching, a unified framework to manage a CMP's aggregate on-chip cache resources by forming an aggregate "shared" cache through cooperation among private caches; the approach performs robustly over a range of system/cache sizes and memory latencies.
Abstract: This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper. Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache, and a local cache hit rate similar to that of using private caches. Cooperative caching performs robustly over a range of system/cache sizes and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.

377 citations


Journal ArticleDOI
01 May 2006
TL;DR: Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%; a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR) is proposed to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls.
Abstract: Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses - some misses occur in isolation while some occur in parallel with other misses. Isolated misses are more costly to performance than parallel misses. However, traditional cache replacement is not aware of the MLP-dependent cost differential between different misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement by using a runtime technique to compute the MLP-based cost for each cache miss. It then describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR), to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.
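
A minimal sketch of the flavor of MLP-aware victim selection, assuming a cost of roughly 1/k for a miss serviced alongside k-1 others and a simple linear combination of recency and cost; the paper's actual LIN/SBAR hardware differs:

```python
# Illustrative sketch of MLP-aware victim selection: combine recency with an
# MLP-based cost per block (cheap parallel misses vs. expensive isolated
# misses). The cost values, scaling factor, and update rule are assumptions.

def mlp_cost(outstanding_misses_during_miss):
    # A miss serviced alongside k-1 others shares the stall: cost ~ 1/k.
    return 1.0 / max(1, outstanding_misses_during_miss)

def choose_victim(cache_set, cost_weight=4.0):
    """cache_set: list of dicts with 'tag', 'last_use' (higher = more recent),
    and 'cost' (MLP-based cost of re-fetching the block)."""
    # Evict the block with the lowest combined score: old AND cheap to re-fetch.
    def score(block):
        return block["last_use"] + cost_weight * block["cost"]
    return min(cache_set, key=score)

if __name__ == "__main__":
    s = [
        {"tag": 0xA, "last_use": 10, "cost": mlp_cost(4)},  # parallel miss, cheap
        {"tag": 0xB, "last_use": 12, "cost": mlp_cost(1)},  # isolated miss, costly
        {"tag": 0xC, "last_use": 30, "cost": mlp_cost(2)},
    ]
    print(hex(choose_victim(s)["tag"]))   # -> 0xa: old and cheap to re-fetch
```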

316 citations


Journal ArticleDOI
TL;DR: This research proposes a dynamic allocation methodology for global and stack data and program code that accounts for changing program requirements at runtime, has no software-caching tags, requires no runtime checks, has extremely low overheads, and yields 100% predictable memory access times.
Abstract: In this research, we propose a highly predictable, low overhead, and, yet, dynamic, memory-allocation strategy for embedded systems with scratch pad memory. A scratch pad is a fast compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus cache and by its significantly lower overheads in energy consumption, area, and overall runtime, even with a simple allocation scheme. Scratch pad allocation methods are primarily of two types. First, software-caching schemes emulate the workings of a hardware cache in software. Instructions are inserted before each load/store to check the software-maintained cache tags. Such methods incur large overheads in runtime, code size, energy consumption, and SRAM space for tags and deliver poor real-time guarantees just like hardware caches. A second category of algorithms partitions variables at compile-time into the two banks. However, a drawback of such static allocation schemes is that they do not account for dynamic program behavior. It is easy to see why a data allocation that never changes at runtime cannot achieve the full locality benefits of a cache. We propose a dynamic allocation methodology for global and stack data and program code that (i) accounts for changing program requirements at runtime, (ii) has no software-caching tags, (iii) requires no runtime checks, (iv) has extremely low overheads, and (v) yields 100% predictable memory access times. In this method, data that is about to be accessed frequently is copied into the scratch pad using compiler-inserted code at fixed and infrequent points in the program. Earlier data is evicted if necessary. When compared to a provably optimal static allocation, results show that our scheme reduces runtime by up to 39.8% and energy by up to 31.3%, on average, for our benchmarks, depending on the SRAM size used. The actual gain depends on the SRAM size, but our results show that close to the maximum benefit in runtime and energy is achieved for a substantial range of small SRAM sizes commonly found in embedded systems. Our comparison with a direct mapped cache shows that our method performs roughly as well as a cached architecture.

240 citations


Proceedings ArticleDOI
16 Sep 2006
TL;DR: This paper designs architectural support for the OS to efficiently manage shared caches with a wide variety of policies and demonstrates that the scheme can support a wide range of policies, including policies that provide passive performance differentiation, reactive fairness by miss-rate equalization, and reactive performance differentiation.
Abstract: The role of the operating system (OS) in managing shared resources such as CPU time, memory, peripherals, and even energy is well motivated and understood [23]. Unfortunately, one key resource — lower-level shared cache in chip multi-processors — is commonly managed purely in hardware by rudimentary replacement policies such as least-recently-used (LRU). The rigid nature of the hardware cache management policy poses a serious problem since there is no single best cache management policy across all sharing scenarios. For example, the cache management policy for a scenario where applications from a single organization are running under "best effort" performance expectation is likely to be different from the policy for a scenario where applications from competing business entities (say, at a third party data center) are running under a minimum service level expectation. When it comes to managing shared caches, there is an inherent tension between flexibility and performance. On one hand, managing the shared cache in the OS offers immense policy flexibility since it may be implemented in software. Unfortunately, it is prohibitively expensive in terms of performance for the OS to be involved in managing temporally fine-grain events such as cache allocation. On the other hand, sophisticated hardware-only cache management techniques to achieve fair sharing or throughput maximization have been proposed. But they offer no policy flexibility. This paper addresses this problem by designing architectural support for the OS to efficiently manage shared caches with a wide variety of policies. Our scheme consists of a hardware cache quota management mechanism, an OS interface and a set of OS level quota orchestration policies. The hardware mechanism guarantees that OS-specified quotas are enforced in shared caches, thus eliminating the need for (and the performance penalty of) temporally fine-grained OS intervention. The OS retains policy flexibility since it can tune the quotas during regularly scheduled OS interventions. We demonstrate that our scheme can support a wide range of policies including policies that provide (a) passive performance differentiation, (b) reactive fairness by miss-rate equalization, and (c) reactive performance differentiation.
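
A hypothetical sketch of the division of labor described above: hardware enforces per-application way quotas on every allocation, while the OS only adjusts quotas at coarse-grained intervals. The interface names and the simple per-set enforcement rule are inventions for illustration, not the paper's design:

```python
# Hypothetical quota-enforced cache set: the OS calls set_quota() infrequently;
# the per-access enforcement below stands in for the hardware mechanism.

class QuotaEnforcedCacheSet:
    def __init__(self, assoc, quotas):
        self.assoc = assoc
        self.quotas = quotas                 # app_id -> max ways in this set
        self.blocks = []                     # list of (app_id, tag), LRU order

    def set_quota(self, app_id, ways):       # OS-level knob, tuned at intervals
        self.quotas[app_id] = ways

    def access(self, app_id, tag):
        if (app_id, tag) in self.blocks:     # hit: move to MRU position
            self.blocks.remove((app_id, tag))
            self.blocks.append((app_id, tag))
            return "hit"
        # miss: if the requester is at its quota, it victimizes its own LRU block
        owned = [b for b in self.blocks if b[0] == app_id]
        if len(owned) >= self.quotas.get(app_id, 0):
            self.blocks.remove(owned[0])
        elif len(self.blocks) >= self.assoc:
            self.blocks.pop(0)               # otherwise evict the global LRU block
        self.blocks.append((app_id, tag))
        return "miss"

if __name__ == "__main__":
    s = QuotaEnforcedCacheSet(assoc=8, quotas={"A": 6, "B": 2})
    for t in range(8):
        s.access("B", t)                     # B can never hold more than 2 ways
    print(sum(1 for b in s.blocks if b[0] == "B"))   # -> 2
```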

215 citations


Proceedings ArticleDOI
16 Sep 2006
TL;DR: It is found that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
Abstract: As chip multiprocessors (CMPs) become increasingly mainstream, architects have likewise become more interested in how best to share a cache hierarchy among multiple simultaneous threads of execution. The complexity of this problem is exacerbated as the number of simultaneous threads grows from two or four to the tens or hundreds. However, there is no consensus in the architectural community on what "best" means in this context. Some papers in the literature seek to equalize each thread's performance loss due to sharing, while others emphasize maximizing overall system performance. Furthermore, the specific effect of these goals varies depending on the metric used to define "performance". In this paper we label equal performance targets as Communist cache policies and overall performance targets as Utilitarian cache policies. We compare both of these models to the most common current model of a free-for-all cache (a Capitalist policy). We consider various performance metrics, including miss rates, bandwidth usage, and IPC, including both absolute and relative values of each metric. Using analytical models and behavioral cache simulation, we find that the optimal partitioning of a shared cache can vary greatly as different but reasonable definitions of optimality are applied. We also find that, although Communist and Utilitarian targets are generally compatible, each policy has workloads for which it provides poor overall performance or poor fairness, respectively. Finally, we find that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
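
A toy illustration of how the two optimization targets can disagree, using made-up per-thread IPC-versus-ways curves; the paper's metrics and workloads are far richer:

```python
# "Utilitarian" = maximize total IPC; "Communist" = minimize the worst
# per-thread slowdown relative to owning the whole cache. Curves are invented.

from itertools import product

def best_partition(ipc_curves, total_ways, objective):
    n = len(ipc_curves)
    alone = [curve[total_ways] for curve in ipc_curves]   # IPC with whole cache
    best, best_score = None, None
    for split in product(range(1, total_ways + 1), repeat=n):
        if sum(split) != total_ways:          # only exact partitions, >=1 way each
            continue
        ipcs = [ipc_curves[i][split[i]] for i in range(n)]
        if objective == "utilitarian":
            score = sum(ipcs)                 # throughput
        else:                                 # "communist": equalize slowdowns
            score = -max(alone[i] / ipcs[i] for i in range(n))
        if best_score is None or score > best_score:
            best, best_score = split, score
    return best

if __name__ == "__main__":
    ways = 8
    thread_a = [0.2 + 0.10 * w for w in range(ways + 1)]  # gains a lot per way
    thread_b = [0.1 + 0.05 * w for w in range(ways + 1)]  # gains less per way
    curves = [thread_a, thread_b]
    print("utilitarian:", best_partition(curves, ways, "utilitarian"))  # (7, 1)
    print("communist:  ", best_partition(curves, ways, "communist"))    # (4, 4)
```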

198 citations


Proceedings ArticleDOI
22 Oct 2006
TL;DR: The client request behavior in web servers allows the proposed architecture to show that the primary drawbacks of flash memory (endurance and long write latencies) can easily be overcome.
Abstract: We propose an architecture that uses NAND flash memory to reduce main memory power in web server platforms. Our architecture uses a two level file buffer cache composed of a relatively small DRAM, which includes a primary file buffer cache, and a flash memory secondary file buffer cache. Compared to a conventional DRAM-only architecture, our architecture consumes orders of magnitude less idle power while remaining cost effective. This is a result of using flash memory, which consumes orders of magnitude less idle power than DRAM and is twice as dense. The client request behavior in web servers allows us to show that the primary drawbacks of flash memory (endurance and long write latencies) can easily be overcome. In fact, the wear-level aware management techniques that we propose are not heavily used.
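
A rough sketch of the two-level file buffer cache idea, with a small DRAM primary level backed by a larger flash secondary level; capacities, the LRU policy, and the demotion path are illustrative assumptions rather than the paper's configuration:

```python
# Two-level file buffer cache sketch: DRAM primary, flash secondary.
# DRAM victims are demoted to flash; flash hits are promoted back to DRAM.

from collections import OrderedDict

class TwoLevelFileCache:
    def __init__(self, dram_blocks, flash_blocks):
        self.dram = OrderedDict()    # block_id -> data, LRU order
        self.flash = OrderedDict()
        self.dram_cap = dram_blocks
        self.flash_cap = flash_blocks

    def read(self, block_id, fetch_from_disk):
        if block_id in self.dram:                 # fast path: DRAM hit
            self.dram.move_to_end(block_id)
            return self.dram[block_id]
        if block_id in self.flash:                # flash hit: promote to DRAM
            data = self.flash.pop(block_id)
        else:                                     # miss: go to disk
            data = fetch_from_disk(block_id)
        self._insert_dram(block_id, data)
        return data

    def _insert_dram(self, block_id, data):
        self.dram[block_id] = data
        if len(self.dram) > self.dram_cap:
            victim, vdata = self.dram.popitem(last=False)
            self._insert_flash(victim, vdata)     # demote DRAM victim to flash

    def _insert_flash(self, block_id, data):
        self.flash[block_id] = data
        if len(self.flash) > self.flash_cap:
            self.flash.popitem(last=False)        # drop flash LRU entry

if __name__ == "__main__":
    cache = TwoLevelFileCache(dram_blocks=2, flash_blocks=8)

    def disk(b):
        return f"contents-of-{b}"

    for b in [1, 2, 3, 1]:        # block 1 is demoted to flash, then re-read
        cache.read(b, disk)
    print(sorted(cache.dram), sorted(cache.flash))
```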

152 citations


Proceedings ArticleDOI
22 Oct 2006
TL;DR: Several optimizations are examined on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5 and the heterogeneous multicore design of the Cell processor, including both an implicit cache-oblivious approach and a cache-aware algorithm blocked to match the cache structure.
Abstract: Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.
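
A minimal sketch of the cache-aware blocking referred to above, for a 2D 5-point stencil; the tile size is an assumption, whereas the paper tunes blocking to each machine's cache structure:

```python
# Cache-blocked 2D 5-point stencil sweep: traverse the grid in tiles sized to
# fit in cache so each loaded block is reused before eviction.

import numpy as np

def stencil_sweep_blocked(src, dst, tile=64):
    n, m = src.shape
    for ii in range(1, n - 1, tile):              # loop over cache-sized tiles
        for jj in range(1, m - 1, tile):
            for i in range(ii, min(ii + tile, n - 1)):
                for j in range(jj, min(jj + tile, m - 1)):
                    dst[i, j] = 0.25 * (src[i - 1, j] + src[i + 1, j] +
                                        src[i, j - 1] + src[i, j + 1])

if __name__ == "__main__":
    a = np.random.rand(256, 256)
    b = np.zeros_like(a)
    stencil_sweep_blocked(a, b)
```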

150 citations


Patent
24 Apr 2006
TL;DR: In this paper, a method and system of managing data access in a shared memory cache of a processor are described; the method includes probing one or more memory addresses that map to a subset of the shared memory cache.
Abstract: A method and system of managing data access in a shared memory cache of a processor are disclosed. The method includes probing one or more memory addresses that map to a subset of the shared memory cache and sensing a plurality of events in the one or more memory addresses. Cache utilization information is then obtained by reading a hardware performance counter of the processor. The hardware performance counter is incremented based on the occurrence of the plurality of events. Based upon the cache utilization information, an occurrence of one of the plurality of events is reduced.

140 citations


Proceedings ArticleDOI
27 Feb 2006
TL;DR: This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads and shows that a shared 32 MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32 MB private cache configuration by several orders of magnitude.
Abstract: With the continuing growth in the amount of genetic data, members of the bioinformatics community are developing a variety of data-mining applications to understand the data and discover meaningful information. These applications are important in defining the design and performance decisions of future high performance microprocessors. This paper presents a detailed data-sharing analysis and chip-multiprocessor (CMP) cache study of several multithreaded data-mining bioinformatics workloads. For a CMP with a three-level cache hierarchy, we model the last-level of the cache hierarchy as either multiple private caches or a single cache shared amongst different cores of the CMP. Our experiments show that the bioinformatics workloads exhibit significant data-sharing - 50-95% of the data cache is shared by the different threads of the workload. Furthermore, regardless of the amount of data cache shared, for some workloads, as many as 98% of the accesses to the last-level cache are to shared data cache lines. Additionally, the amount of data-sharing exhibited by the workloads is a function of the total cache size available - the larger the data cache the better the sharing behavior. Thus, partitioning the available last-level cache silicon area into multiple private caches can cause applications to lose their inherent data-sharing behavior. For the workloads in this study, a shared 32 MB last-level cache is able to capture a tremendous amount of data-sharing and outperform a 32 MB private cache configuration by several orders of magnitude. Specifically, with shared last-level caches, the bandwidth demands beyond the last-level cache can be reduced by factors of 3-625 when compared to private last-level caches.

Patent
06 Apr 2006
TL;DR: In this paper, the authors present a system that facilitates selectively unmarking load-marked cache lines during transactional program execution, where load-marked cache lines are monitored during the transactional execution to detect interfering accesses from other threads.
Abstract: One embodiment of the present invention provides a system that facilitates selectively unmarking load-marked cache lines during transactional program execution, wherein load-marked cache lines are monitored during transactional execution to detect interfering accesses from other threads. During operation, the system encounters a release instruction during transactional execution of a block of instructions. In response to the release instruction, the system modifies the state of cache lines, which are specially load-marked to indicate they can be released from monitoring, to account for the release instruction being encountered. In doing so, the system can potentially cause the specially load-marked cache lines to become unmarked. In a variation on this embodiment, upon encountering a commit-and-start-new-transaction instruction, the system modifies load-marked cache lines to account for the commit-and-start-new-transaction instruction being encountered. In doing so, the system causes normally load-marked cache lines to become unmarked, while other specially load-marked cache lines may remain load-marked past the commit-and-start-new-transaction instruction.

Journal ArticleDOI
TL;DR: It is claimed that there is a sufficient number of good policies for proxies with different characteristics, such as proxies with a small cache, limited bandwidth, and limited processing power, and the article suggests policies for different types of proxies.
Abstract: Research involving Web cache replacement policy has been active for at least a decade. In this article we would like to claim that there is a sufficient number of good policies, and further proposals would only produce minute improvements. We argue that the focus should be fitness for purpose rather than proposing any new policies. Up to now, almost all policies were purported to perform better than others, creating confusion as to which policy should be used. Actually, a policy only performs well in certain environments. Therefore, the goal of this article is to identify the appropriate policies for proxies with different characteristics, such as proxies with a small cache, limited bandwidth, and limited processing power, as well as to suggest policies for different types of proxies, such as ISP-level and root-level proxies.

Journal ArticleDOI
01 Mar 2006
TL;DR: Simulation results indicate that the proposed aggregate caching mechanism and a broadcast-based Simple Search algorithm can significantly improve Imanet performance in terms of throughput and average number of hops to access data items.
Abstract: Internet-based mobile ad hoc network (Imanet) is an emerging technique that combines a wired network (e.g. Internet) and a mobile ad hoc network (Manet) for developing a ubiquitous communication infrastructure. However, to fulfill users' demand to access various kinds of information, an Imanet has several limitations such as limited accessibility to the wired Internet, insufficient wireless bandwidth, and longer message latency. In this paper, we address the issues involved in information search and access in Imanets. An aggregate caching mechanism and a broadcast-based Simple Search (SS) algorithm are proposed for improving the information accessibility and reducing average communication latency in Imanets. As a part of the aggregate cache, a cache admission control policy and a cache replacement policy, called Time and Distance Sensitive (TDS) replacement, are developed to reduce the cache miss ratio and improve the information accessibility. We evaluate the impact of caching, cache management, and the number of access points that are connected to the Internet, through extensive simulation. The simulation results indicate that the proposed aggregate caching mechanism can significantly improve Imanet performance in terms of throughput and average number of hops to access data items.
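
A heavily hedged sketch of what a "time and distance sensitive" eviction choice might look like: drop items that are stale and whose source is only a few hops away, and keep fresh items from distant sources. The scoring function below is an assumption for illustration; the paper defines its own admission and replacement rules:

```python
# Illustrative victim selection weighing item age against hop distance to the
# item's source. The exact weighting is invented, not the paper's TDS formula.

import time

def tds_victim(cache_entries, now=None):
    """cache_entries: dict item_id -> {'fetched_at': seconds, 'hops': int}."""
    now = now if now is not None else time.time()

    def keep_value(entry):
        age = now - entry["fetched_at"]
        # low value = old item that is cheap to re-fetch (few hops away)
        return entry["hops"] / (1.0 + age)

    return min(cache_entries, key=lambda k: keep_value(cache_entries[k]))

if __name__ == "__main__":
    entries = {
        "song.mp3":  {"fetched_at": 0.0,   "hops": 2},   # old, nearby source
        "news.html": {"fetched_at": 900.0, "hops": 6},   # fresh, far away
    }
    print(tds_victim(entries, now=1000.0))               # -> song.mp3
```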

Patent
13 Dec 2006
TL;DR: In this article, the temporary storage address of data following access commands from the upper-level device is the volatile cache memory, which is backed by a non-volatile memory that can continue to retain data irrespective of whether or not power is supplied.
Abstract: A storage system comprises a volatile cache memory and a non-volatile memory, which is a type of memory that can continue to retain data irrespective of whether or not power is supplied. The temporary storage address of data following access commands from the upper-level device shall be the volatile cache memory. If power is not supplied from the primary power source to the volatile cache memory, power supplied from a battery is used to copy data stored in the volatile cache memory to the non-volatile memory.

Proceedings ArticleDOI
09 Dec 2006
TL;DR: A novel and general scheme by which any two cache management algorithms can be combined, adaptively switching between them while closely tracking the locality characteristics of a given program; the scheme is inspired by recent work in virtual memory management at the operating system level.
Abstract: We present and evaluate the idea of adaptive processor cache management. Specifically, we describe a novel and general scheme by which we can combine any two cache management algorithms (e.g., LRU, LFU, FIFO, Random) and adaptively switch between them, closely tracking the locality characteristics of a given program. The scheme is inspired by recent work in virtual memory management at the operating system level, which has shown that it is possible to adapt over two replacement policies to provide an aggregate policy that always performs within a constant factor of the better component policy. A hardware implementation of adaptivity requires very simple logic but duplicate tag structures. To reduce the overhead, we use partial tags, which achieve good performance with a small hardware cost. In particular, adapting between LRU and LFU replacement policies on an 8-way 512KB L2 cache yields a 12.7% improvement in average CPI on applications that exhibit a non-negligible L2 miss ratio. Our approach increases total cache storage by 4.0%, but it still provides slightly better performance than a conventional 10-way set-associative 640KB cache which requires 25% more storage.
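
A simplified sketch of the control idea: two shadow tag directories simulate LRU and LFU over the same access stream, and the real cache follows whichever policy has fewer (decayed) misses. The paper's implementation uses partial tags and hardware counters; the decay factor and data structures below are assumptions:

```python
# Adapting between two replacement policies via per-policy shadow directories.

class ShadowSet:
    def __init__(self, assoc, policy):
        self.assoc, self.policy = assoc, policy
        self.tags, self.freq, self.stamp, self.clock = [], {}, {}, 0
        self.misses = 0.0

    def access(self, tag):
        self.clock += 1
        self.freq[tag] = self.freq.get(tag, 0) + 1
        self.stamp[tag] = self.clock
        if tag in self.tags:
            return
        self.misses += 1
        if len(self.tags) >= self.assoc:
            if self.policy == "LRU":
                victim = min(self.tags, key=lambda t: self.stamp[t])
            else:                                   # "LFU"
                victim = min(self.tags, key=lambda t: self.freq[t])
            self.tags.remove(victim)
        self.tags.append(tag)

class AdaptiveSet:
    def __init__(self, assoc, decay=0.99):
        self.shadow = {"LRU": ShadowSet(assoc, "LRU"),
                       "LFU": ShadowSet(assoc, "LFU")}
        self.decay = decay

    def winner(self):
        return min(self.shadow, key=lambda p: self.shadow[p].misses)

    def access(self, tag):
        for s in self.shadow.values():
            s.misses *= self.decay                  # age old miss counts
            s.access(tag)
        return self.winner()                        # policy the real cache follows

if __name__ == "__main__":
    s = AdaptiveSet(assoc=4)
    # a scan pattern mixed with a small hot set: LFU wins on this stream
    for tag in ([1, 2, 3, 4] * 5 + list(range(10, 30))) * 3:
        chosen = s.access(tag)
    print("currently following:", chosen)
```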

Proceedings ArticleDOI
05 Jul 2006
TL;DR: Experimental results provided in the paper show that with an appropriate selection of regions and cache contents, the worst-case performance of applications with locked instruction caches is competitive with the worst-case performance of unlocked caches.
Abstract: Cache memories have been extensively used to bridge the gap between high speed processors and relatively slower main memories. However, they are sources of predictability problems because of their dynamic and adaptive behavior, and thus need special attention to be used in hard real-time systems. A lot of progress has been achieved in the last ten years to statically predict worst-case execution times (WCETs) of tasks on architectures with caches. However, cache-aware WCET analysis techniques are not always applicable due to the lack of documentation of hardware manuals concerning the cache replacement policies. Moreover, they tend to be pessimistic with some cache replacement policies (e.g. random replacement policies). Lastly, caches are sources of timing anomalies in dynamically scheduled processors (a cache miss may in some cases result in a shorter execution time than a hit). To reconcile performance and predictability of caches, we propose in this paper algorithms for software control of instruction caches. The proposed algorithms statically divide the code of tasks into regions, for which the cache contents are statically selected. At run-time, at every transition between regions, the cache contents computed off-line are loaded into the cache and the cache replacement policy is disabled (the cache is locked). Experimental results provided in the paper show that with an appropriate selection of regions and cache contents, the worst-case performance of applications with locked instruction caches is competitive with the worst-case performance of unlocked caches.

Patent
02 Jun 2006
TL;DR: In this article, a high data availability write-caching storage controller has a volatile memory with a write cache for caching write cache data, a non-volatile memory, a capacitor pack for supplying power for backing up the write cache to the nonvolatile memories in response to a loss of main power, and a CPU that determines whether reducing an operating voltage of the capacitor pack to a new value would cause the battery pack to be storing less energy than required for back up the current size write cache.
Abstract: A high data availability write-caching storage controller has a volatile memory with a write cache for caching write cache data, a non-volatile memory, a capacitor pack for supplying power for backing up the write cache to the non-volatile memory in response to a loss of main power, and a CPU that determines whether reducing an operating voltage of the capacitor pack to a new value would cause the capacitor pack to be storing less energy than required for backing up the current size write cache to the non-volatile memory. If so, the CPU reduces the size of the write cache prior to reducing the operating voltage. The CPU estimates the capacity of the capacitor pack to store the required energy based on a history of operational temperature and voltage readings of the capacitor pack, such as on an accumulated normalized running time and warranted lifetime of the capacitor pack.

Patent
29 Nov 2006
TL;DR: In this paper, the state of the cache during a cache preload operation is monitored to avoid the need to restart the cache pre-load operation from the beginning, and any data that has been preloaded into a cache prior to a failure may be retained after a failover occurs.
Abstract: An apparatus, program product and method monitor the state of a cache during a cache preload operation in a clustered computer system such that the monitored state can be used after a failover to potentially avoid the need to restart the cache preload operation from the beginning. In particular, by monitoring the state of the cache during a cache preload operation, any data that has been preloaded into a cache prior to a failure may be retained after a failover occurs, thus enabling the cache preload operation to continue from the point at which it was interrupted as a result of the failure.

Proceedings ArticleDOI
26 Mar 2006
TL;DR: This paper discusses the design choices when building thread-shared code caches and enumerates the difficulties of thread-local storage, synchronization, trace building, in-cache lookup tables, and cache eviction, and presents efficient solutions to these problems that both scale well and do not require thread suspension.
Abstract: Software code caches are increasingly being used to amortize the runtime overhead of dynamic optimizers, simulators, emulators, dynamic translators, dynamic compilers, and other tools. Despite the now-widespread use of code caches, techniques for efficiently sharing them across multiple threads have not been fully explored. Some systems simply do not support threads, while others resort to thread-private code caches. Although thread-private caches are much simpler to manage, synchronize, and provide scratch space for, they simply do not scale when applied to many-threaded programs. Thread-shared code caches are needed to target server applications, which employ hundreds of worker threads all performing similar tasks. Yet, those systems that do share their code caches often have brute-force, inefficient solutions to the challenges of concurrent code cache access: a single global lock on runtime system code and suspension of all threads for any cache management action. This limits the possibilities for cache design and has performance problems with applications that require frequent cache invalidations to maintain cache consistency. In this paper, we discuss the design choices when building thread-shared code caches and enumerate the difficulties of thread-local storage, synchronization, trace building, in-cache lookup tables, and cache eviction. We present efficient solutions to these problems that both scale well and do not require thread suspension. We evaluate our results in DynamoRIO, an industrial-strength dynamic binary translation system, on real-world server applications. On these applications our thread-shared caches use an order of magnitude less memory and improve throughput by up to four times compared to thread-private caches.

Patent
27 Dec 2006
TL;DR: In this article, a permanent cache list of files not to be removed from a cache is determined in response to a user selection of an application to be added to the cache by adding a file to cache list if the file is a static dependency of the application or if a file has a high probability of being used in the future by the application.
Abstract: In some embodiments a permanent cache list of files not to be removed from a cache is determined in response to a user selection of an application to be added to the cache. The determination is made by adding a file to the cache list if the file is a static dependency of the application, or if a file has a high probability of being used in the future by the application. Other embodiments are described and claimed.

Patent
31 Aug 2006
TL;DR: In this article, a method and apparatus for providing shared switch and cache memory are described, consisting of a message switch module, a cache controller module, and shared switch/cache memory.
Abstract: A method and apparatus are described to provide shared switch and cache memory. The apparatus may comprise a message switch module, a cache controller module, and shared switch and cache memory to provide shared memory to the message switch module and to the cache controller module. The cache controller module may comprise pointer memory to store a plurality of pointers, each pointer pointing to a location in the shared switch and cache memory (e.g., pointing to a message header partition in the shared switch and cache memory). If there is a corresponding pointer, a memory read response may be sent to the requesting agent. If there is no corresponding pointer, a write data request may be sent to a corresponding destination agent and, in response to receiving the requested data, a pointer to the stored data in the pointer memory may be provided.

Patent
Charles E. Narad
07 Dec 2006
TL;DR: In this paper, the authors discuss how data shared between two memory accessing agents may be stored in a shared partition of the shared cache, and how data accessed by one of the accessing agents can be stored in one or more private partitions of the shared cache.
Abstract: Some of the embodiments discussed herein may utilize partitions within a shared cache in various computing environments. In an embodiment, data shared between two memory accessing agents may be stored in a shared partition of the shared cache. Additionally, data accessed by one of the memory accessing agents may be stored in one or more private partitions of the shared cache.

Patent
11 Dec 2006
TL;DR: In this article, a cache memory preprocessor consists of a command inputter, which receives a multiple-way cache memory processing command from the processor, and a command implementer, which performs background processing upon multiple ways of the cache memory.
Abstract: A cache memory preprocessor prepares a cache memory for use by a processor. The processor accesses a main memory via a cache memory, which serves as a data cache for the main memory. The cache memory preprocessor consists of a command inputter, which receives a multiple-way cache memory processing command from the processor, and a command implementer. The command implementer performs background processing upon multiple ways of the cache memory in order to implement the cache memory processing command received by the command inputter.

Patent
27 Mar 2006
TL;DR: In this paper, the authors propose a method for optimisation of the management of a server cache for dynamic pages which may be consulted by client terminals with differing characteristics which requires the provision of discrete versions of a dynamic page in the cache.
Abstract: The invention relates to a method for optimisation of the management of a server cache for dynamic pages which may be consulted by client terminals with differing characteristics which requires the provision of discrete versions (10) of a dynamic page in the cache. According to the method, when a terminal requests (11) a dynamic page, a verification step (12) for the presence of at least one version of the dynamic page in the cache is carried out, such that if the verification is positive the following complementary steps are carried out: procurement (13) of a set of characteristics specific to the type of client terminal, determination (14) of a subset of necessary characteristics from amongst the specific characteristics for the reproduction of the dynamic page on a client terminal, search (15), among the version(s) of the dynamic page in the cache for a suitable version using the subset of necessary characteristics and allocation (16) of the suitable version to the client terminal.

Journal ArticleDOI
26 Jun 2006
TL;DR: This paper is the first to propose an analytical model which predicts the performance of cache replacement policies, based on probability theory, and relies solely on the statistical properties of the application, without relying on heuristics or rules of thumb.
Abstract: Due to the increasing gap between CPU and memory speed, cache performance plays an increasingly critical role in determining the overall performance of microprocessor systems. One of the important factors that affect cache performance is the cache replacement policy. Despite its importance, current analytical cache performance models ignore the impact of cache replacement policies on cache performance. To the best of our knowledge, this paper is the first to propose an analytical model which predicts the performance of cache replacement policies. The input to our model is a simple circular sequence profiling of each application, which requires very little storage overhead. The output of the model is the predicted miss rates of an application under different replacement policies. The model is based on probability theory and utilizes Markov processes to compute the miss probability of each cache access. The model makes realistic assumptions and relies solely on the statistical properties of the application, without relying on heuristics or rules of thumb. The model's run time is less than 0.1 seconds, much lower than that of trace simulations. We validate the model by comparing the predicted miss rates of seventeen SPEC2000 and NAS benchmark applications against miss rates obtained by detailed execution-driven simulations, across a range of different cache sizes, associativities, and four replacement policies, and show that the model is very accurate. The model's average prediction error is 1.41%, and there are only 14 out of 952 validation points in which the prediction errors are larger than 10%.
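
As one small example of the kind of probabilistic reasoning involved (not the model proposed in the paper): under random replacement in an A-way set, each miss to the set evicts a given resident block with probability 1/A, so a block re-referenced after d intervening misses to its set hits with probability

```latex
% Illustration only; not the paper's model.
P(\text{hit} \mid d \text{ intervening misses to the set}) = \left(1 - \tfrac{1}{A}\right)^{d}
```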

Patent
14 Apr 2006
TL;DR: In this article, a plurality of processor cores, each comprising a computation unit and a memory, are configured as a cache for memory external to the processor cores and at least some of the processors are configured to transmit a message over the interconnection network to access a cache of another processor core.
Abstract: An apparatus comprises a plurality of processor cores, each comprising a computation unit and a memory. The apparatus further comprises an interconnection network to transmit data among the processor cores. At least some of the memories are configured as a cache for memory external to the processor cores, and at least some of the processor cores are configured to transmit a message over the interconnection network to access a cache of another processor core.

Journal ArticleDOI
TL;DR: This paper presents a cache-oblivious algorithm for matrix multiplication that uses a block-recursive structure and an element ordering based on Peano curves, which leads to asymptotically optimal spatial and temporal locality of data access.
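
A sketch of the cache-oblivious part of the idea: recursive quadrant splitting ensures that at some depth the sub-blocks fit in every cache level without knowing the cache parameters. This is the standard divide-and-conquer formulation; the paper's contribution is additionally ordering the recursive calls along a Peano curve, which is not reproduced here:

```python
# Cache-oblivious recursive matrix multiplication (standard formulation).

import numpy as np

def matmul_recursive(A, B, C, cutoff=32):
    """C += A @ B, all square with the same power-of-two size."""
    n = A.shape[0]
    if n <= cutoff:                         # small enough: multiply directly
        C += A @ B
        return
    h = n // 2
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):                # C_ij += A_ik * B_kj on quadrants
                matmul_recursive(A[i*h:(i+1)*h, k*h:(k+1)*h],
                                 B[k*h:(k+1)*h, j*h:(j+1)*h],
                                 C[i*h:(i+1)*h, j*h:(j+1)*h],
                                 cutoff)

if __name__ == "__main__":
    n = 256
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    C = np.zeros((n, n))
    matmul_recursive(A, B, C)
    print(np.allclose(C, A @ B))            # -> True
```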

Proceedings ArticleDOI
24 Jan 2006
TL;DR: A method is given to rapidly find the L1 cache miss rate of an application, and an energy model and an execution time model are developed to find the best cache configuration for the given embedded application.
Abstract: Modern embedded systems execute a single application or a class of applications repeatedly. A newly emerging methodology of designing embedded systems utilizes configurable processors where the cache size, associativity, and line size can be chosen by the designer. In this paper, a method is given to rapidly find the L1 cache miss rate of an application. An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. Using benchmarks from Mediabench, we find that our method is on average 45 times faster than Dinero IV at exploring the design space, while still having 100% accuracy.

Patent
Marcus L. Kornegay, Ngan N. Pham
22 Nov 2006
TL;DR: In this article, a method for cache management of lines of binary code that are stored in the cache is described, where a cache manager log can identify a line, or lines, to be evicted based on data by accessing the cache directory.
Abstract: A method for cache management is disclosed. The method can assign or determine identifiers for lines of binary code that are, or will be, stored in cache. The method can create a cache directory that utilizes the identifier to keep an eviction count and/or a reload count for cached lines. Thus, each time a line is entered into, or evicted from, the cache, the cache eviction log can be amended accordingly. When a processor receives or creates an instruction that requests that a line be evicted from cache, a cache manager can identify a line or lines of binary code to be evicted by accessing the cache directory, and then the line(s) can be evicted.
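
A hedged sketch of the bookkeeping this summary describes: each cached line of binary code gets an identifier, and a directory tracks eviction and reload counts that a cache manager can consult when choosing a victim. Field names and the victim-selection rule are illustrative assumptions:

```python
# Cache directory tracking per-line eviction and reload counts.

class CacheDirectory:
    def __init__(self):
        self.entries = {}   # line_id -> {"resident", "evictions", "reloads"}

    def insert(self, line_id):
        e = self.entries.setdefault(
            line_id, {"resident": False, "evictions": 0, "reloads": 0})
        if e["evictions"] > 0:        # the line has been in the cache before
            e["reloads"] += 1
        e["resident"] = True

    def evict(self, line_id):
        e = self.entries[line_id]
        e["resident"] = False
        e["evictions"] += 1

    def pick_victim(self):
        # Example rule: evict the resident line that has been reloaded least,
        # i.e. the one the processor seems to need least often.
        resident = {k: v for k, v in self.entries.items() if v["resident"]}
        return min(resident, key=lambda k: resident[k]["reloads"])

if __name__ == "__main__":
    d = CacheDirectory()
    for line in ["f1", "f2", "f3"]:
        d.insert(line)
    d.evict("f1"); d.insert("f1")          # f1 gets reloaded once
    print(d.pick_victim())                 # -> "f2" (a never-reloaded line)
```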