
Showing papers on "Cache coloring published in 2010"


Journal ArticleDOI
13 Mar 2010
TL;DR: This study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling, and it identifies a classification scheme that addresses not only contention for cache space but also contention for other shared resources, such as the memory controller, memory bus and prefetching hardware.
Abstract: Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software page coloring to mitigate this problem. Our goal is to investigate how and to what extent contention for shared resources can be mitigated via thread scheduling. Scheduling is an attractive tool, because it does not require extra hardware and is relatively easy to integrate into the system. Our study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling. The most difficult part of the problem is to find a classification scheme for threads that would determine how they affect each other when competing for shared resources. We provide a comprehensive analysis of such classification schemes using a newly proposed methodology that enables us to evaluate these schemes separately from the scheduling algorithm itself and to compare them to the optimal. As a result of this analysis we discovered a classification scheme that addresses not only contention for cache space, but also contention for other shared resources, such as the memory controller, memory bus and prefetching hardware. To show the applicability of our analysis we design a new scheduling algorithm, which we prototype at user level, and demonstrate that it performs within 2% of the optimal. We also conclude that the highest impact of contention-aware scheduling techniques is not in improving performance of a workload as a whole but in improving quality of service or performance isolation for individual applications.
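To make the classification idea concrete, here is a toy Python sketch (not the authors' algorithm): it assumes each thread is summarized by a single miss-rate score and simply co-schedules memory-intensive threads with memory-light ones on each shared cache, a common balancing heuristic in this line of work. The thread names and miss rates are invented for illustration.

# Toy sketch of contention-aware thread placement (not the paper's algorithm).
# Assumption: each thread is characterized by a single "miss rate" score,
# and the machine has several shared caches, each serving a group of cores.

def place_threads(threads, num_shared_caches):
    """threads: list of (name, miss_rate). Returns cache_id -> [names].

    Sort by miss rate, then pair the most memory-intensive thread with the
    least intensive one, so that co-runners on a shared cache are balanced.
    """
    ranked = sorted(threads, key=lambda t: t[1], reverse=True)
    placement = {c: [] for c in range(num_shared_caches)}
    lo, hi = 0, len(ranked) - 1
    cache = 0
    while lo <= hi:
        placement[cache % num_shared_caches].append(ranked[lo][0])      # heavy thread
        if lo != hi:
            placement[cache % num_shared_caches].append(ranked[hi][0])  # light thread
        lo += 1
        hi -= 1
        cache += 1
    return placement

if __name__ == "__main__":
    workload = [("mcf", 0.09), ("milc", 0.07), ("gcc", 0.02), ("povray", 0.001)]
    print(place_threads(workload, num_shared_caches=2))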

532 citations


Journal ArticleDOI
TL;DR: Power Systems™ continue strong with the seventh-generation Power chip: a balanced multi-core design with eDRAM technology and SMT4 delivers greater than 4X the performance in the same power envelope as the previous generation.
Abstract: The Power7 is IBM's first eight-core processor, with each core capable of four-way simultaneous-multithreading operation. Its key architectural features include an advanced memory hierarchy with three levels of on-chip cache; embedded-DRAM devices used in the highest level of the cache; and a new memory interface. This balanced multicore design scales from 1 to 32 sockets in commercial and scientific environments.

259 citations


Proceedings ArticleDOI
04 Dec 2010
TL;DR: The zcache is presented, a cache design that allows much higher associativity than the number of physical ways, and it is shown that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
Abstract: The ever-increasing importance of main memory latency and bandwidth is pushing CMPs towards caches with higher capacity and associativity. Associativity is typically improved by increasing the number of ways. This reduces conflict misses, but increases hit latency and energy, placing a stringent trade-off on cache design. We present the zcache, a cache design that allows much higher associativity than the number of physical ways (e.g., a 64-associative cache with 4 ways). The zcache draws on previous research on skew-associative caches and cuckoo hashing. Hits, the common case, require a single lookup, incurring the latency and energy costs of a cache with a very low number of ways. On a miss, additional tag lookups happen off the critical path, yielding an arbitrarily large number of replacement candidates for the incoming block. Unlike conventional designs, the zcache provides associativity by increasing the number of replacement candidates, but not the number of cache ways. To understand the implications of this approach, we develop a general analysis framework that allows us to compare associativity across different cache designs (e.g., a set-associative cache and a zcache) by representing associativity as a probability distribution. We use this framework to show that for zcaches, associativity depends only on the number of replacement candidates, and is independent of other factors (such as the number of cache ways or the workload). We also show that, for the same number of replacement candidates, the associativity of a zcache is superior to that of a set-associative cache for most workloads. Finally, we perform detailed simulations of multithreaded and multiprogrammed workloads on a large-scale CMP with a zcache as the last-level cache. We show that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
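The candidate-expansion idea can be illustrated with a small Python sketch. This is not the zcache implementation: it ignores the cuckoo-style relocation of blocks along the walk and the paper's analysis framework, and the hash functions and sizes are arbitrary assumptions. It only shows how hashing each way differently lets a miss collect far more replacement candidates than there are ways.

# Illustrative sketch of zcache-style candidate expansion (not the actual design):
# each way uses a different hash of the address, and on a miss the candidate set
# grows by asking, for each resident block already found, where *it* could move.
import hashlib

NUM_WAYS = 4
SETS_PER_WAY = 256

def way_hash(addr, way):
    h = hashlib.blake2b(f"{way}:{addr}".encode(), digest_size=4).digest()
    return int.from_bytes(h, "little") % SETS_PER_WAY

def replacement_candidates(miss_addr, cache, max_candidates=16):
    """cache[way][index] holds a resident address (or None).
    Returns a list of (way, index, resident_addr) candidates."""
    frontier = [miss_addr]
    seen_slots, candidates = set(), []
    while frontier and len(candidates) < max_candidates:
        addr = frontier.pop(0)
        for way in range(NUM_WAYS):
            slot = (way, way_hash(addr, way))
            if slot in seen_slots:
                continue
            seen_slots.add(slot)
            resident = cache[way][slot[1]]
            candidates.append((way, slot[1], resident))
            if resident is not None:
                frontier.append(resident)   # its alternative positions become candidates too
            if len(candidates) >= max_candidates:
                break
    return candidates

cache = [[None] * SETS_PER_WAY for _ in range(NUM_WAYS)]
for a in range(0x100, 0x200, 0x8):           # populate a few resident lines
    w = (a >> 3) % NUM_WAYS
    cache[w][way_hash(a, w)] = a
print(len(replacement_candidates(0xABCD, cache)))

With 4 ways and a 16-candidate walk, the sketch mirrors the flavor of "higher associativity than physical ways" that the abstract describes.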

203 citations


Patent
08 Sep 2010
TL;DR: In this paper, an apparatus, system, and method are disclosed for caching data using a solid-state storage device, where metadata indicates what data in the cache is valid, as well as information about what data from the nonvolatile cache has been stored in the backing store.
Abstract: An apparatus, system, and method are disclosed for caching data using a solid-state storage device. The solid-state storage device maintains metadata pertaining to cache operations performed on the solid-state storage device, as well as storage operations of the solid-state storage device. The metadata indicates what data in the cache is valid, as well as information about what data in the nonvolatile cache has been stored in the backing store. A backup engine works through units in the nonvolatile cache device and backs up the valid data to the backing store. During grooming operations, the groomer determines whether the data is valid and whether the data is discardable. Data that is both valid and discardable may be removed during the grooming operation. The groomer may also determine whether the data is cold in determining whether to remove the data from the cache device. The cache device may present to clients a logical space that is the same size as the backing store. The cache device may be transparent to the clients.

196 citations


Patent
07 Jul 2010
TL;DR: In this paper, a security module on a computing device applies security rules to examine content in a network cache and identify suspicious cache content, using rules such as one that determines whether the cache content is associated with a modified time set in the future and one that determines whether the cache content was created in a low-security environment.
Abstract: A security module on a computing device applies security rules to examine content in a network cache and identify suspicious cache content. Cache content is identified as suspicious according to security rules, such as a rule determining whether the cache content is associated with a modified time set in the future, and a rule determining whether the cache content was created in a low-security environment. The security module may establish an out-of-band connection with the websites from which the cache content originated through a high-security access network to receive responses from the websites, and use the responses to determine whether the cache content is suspicious cache content. Suspicious cache content is removed from the network cache to prevent the suspicious cache content from carrying out malicious activities.

168 citations


Journal ArticleDOI
TL;DR: This work presents a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular, and reduces the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation.
Abstract: Microprocessor designers have been torn between tight constraints on the amount of on-chip cache memory and the high latency of off-chip memory, such as dynamic random access memory. Accessing off-chip memory generally takes an order of magnitude more time than accessing on-chip cache, and two orders of magnitude more time than executing an instruction. Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. It is not possible to determine whether compression at levels of the memory hierarchy closest to the processor is beneficial without understanding its costs. Furthermore, as we show in this paper, raw compression ratio is not always the most important metric. In this work, we present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation. Experiments comparing our work to previous work are described.
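As rough intuition for dictionary-style word compression of a cache line (zero words, dictionary hits, full literals), here is a toy Python sketch. It is not the paper's hardware algorithm: the codes, bit costs, and dictionary policy are illustrative assumptions, and a real design would also handle partial (byte-level) matches and pack two compressed lines into one physical line as described above.

# Toy model of dictionary-based cache-line compression for intuition only.
def compress_line(words, dict_size=8):
    """words: list of 32-bit ints (one cache line). Returns (encoded, bits)."""
    index_bits = (dict_size - 1).bit_length()     # bits needed to name a dictionary entry
    dictionary, encoded, bits = [], [], 0
    for w in words:
        if w == 0:
            encoded.append(("zero",))
            bits += 2                              # short code for the common zero word
        elif w in dictionary:
            encoded.append(("match", dictionary.index(w)))
            bits += 2 + index_bits                 # code + dictionary index
        else:
            encoded.append(("literal", w))
            bits += 2 + 32                         # code + full word
            dictionary.append(w)
            if len(dictionary) > dict_size:
                dictionary.pop(0)                  # simple FIFO dictionary replacement
    return encoded, bits

line = [0, 0xDEADBEEF, 0xDEADBEEF, 0, 0x12345678, 0, 0, 0xDEADBEEF]
enc, bits = compress_line(line)
print(f"compressed to {bits} bits from {32 * len(line)}")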

161 citations


Journal ArticleDOI
TL;DR: This study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling, and it identifies a classification scheme that addresses not only contention for cache space but also contention for other shared resources, such as the memory controller, memory bus and prefetching hardware.
Abstract: Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software page coloring to mitigate this problem. Our goal is to investigate how and to what extent contention for shared resources can be mitigated via thread scheduling. Scheduling is an attractive tool, because it does not require extra hardware and is relatively easy to integrate into the system. Our study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling. The most difficult part of the problem is to find a classification scheme for threads that would determine how they affect each other when competing for shared resources. We provide a comprehensive analysis of such classification schemes using a newly proposed methodology that enables us to evaluate these schemes separately from the scheduling algorithm itself and to compare them to the optimal. As a result of this analysis we discovered a classification scheme that addresses not only contention for cache space, but also contention for other shared resources, such as the memory controller, memory bus and prefetching hardware. To show the applicability of our analysis we design a new scheduling algorithm, which we prototype at user level, and demonstrate that it performs within 2% of the optimal. We also conclude that the highest impact of contention-aware scheduling techniques is not in improving performance of a workload as a whole but in improving quality of service or performance isolation for individual applications and in optimizing system energy consumption.

158 citations


Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper demonstrates that performance limiting effects of highly-threaded architectures can be overcome, and shows that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved.
Abstract: In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs, and are mostly oblivious to the main memory. In this paper, we demonstrate that the era of many-core architectures has created new main memory bottlenecks, and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes, we propose a Virtual Write Queue which dramatically expands the memory controller's visibility of processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this paper demonstrates that performance limiting effects of highly-threaded architectures can be overcome. We show that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved. Through full-system cycle-accurate simulations of SPEC cpu2006, we demonstrate that the proposed Virtual Write Queue achieves an average 10.9% system-level throughput improvement on memory-intensive workloads, along with an overall reduction of 8.7% in memory power across the whole suite.
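The flavor of memory-aware writeback scheduling can be sketched in a few lines of Python: group dirty last-level-cache lines by the DRAM bank and row they map to, and drain the largest group together so the memory controller can exploit row-buffer locality. This is not the paper's Virtual Write Queue mechanism, and the address-to-DRAM bit mapping below is an assumed example.

# Sketch of memory-aware writeback batching; the bit-field mapping is assumed.
from collections import defaultdict

LINE_BITS, COL_BITS, BANK_BITS = 6, 7, 3     # assumed mapping: | row | bank | column | line offset |

def dram_coordinates(addr):
    x = addr >> (LINE_BITS + COL_BITS)       # drop line offset and column bits
    bank = x & ((1 << BANK_BITS) - 1)
    row = x >> BANK_BITS
    return bank, row

def pick_writeback_batch(dirty_addrs, min_batch=4):
    """Group dirty line addresses by (bank, row); return the largest group
    if it is big enough to be worth opening the DRAM row for."""
    if not dirty_addrs:
        return None, []
    groups = defaultdict(list)
    for a in dirty_addrs:
        groups[dram_coordinates(a)].append(a)
    coord, batch = max(groups.items(), key=lambda kv: len(kv[1]))
    return (coord, batch) if len(batch) >= min_batch else (None, [])

dirty = [0x10000 + i * 0x40 for i in range(6)] + [0x9000000, 0xA000000]
print(pick_writeback_batch(dirty))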

143 citations


Proceedings ArticleDOI
08 Mar 2010
TL;DR: This paper models the timing, energy, endurance, and area of PCM caches and integrates them into a PCM cache simulator to evaluate the proposed techniques, showing that the design can achieve 8% energy savings and a lifetime of 3.8 years compared with a baseline PCM cache.
Abstract: Phase change memory (PCM) is one of the most promising emerging non-volatile random access memory technologies. Implementing a cache memory using PCM provides many benefits such as high density, non-volatility, low leakage power, and high immunity to soft errors. However, its disadvantages, such as high write latency, high write energy, and limited write endurance, prevent it from being used as a drop-in replacement for an SRAM cache. In this paper, we study a set of techniques to design an energy- and endurance-aware PCM cache. We also model the timing, energy, endurance, and area of PCM caches and integrate them into a PCM cache simulator to evaluate the techniques. Experiments show that our PCM cache design can achieve 8% energy savings and a lifetime of 3.8 years, compared with a baseline PCM cache having a lifetime of less than an hour.

140 citations


Patent
25 Mar 2010
TL;DR: In this article, a method for optimising the distribution of data objects between caches in a cache domain of a resource limited network is described, where object information including the request frequency of each requested data object and the locations of the caches at which the requests were received is collated and stored.
Abstract: There is described a method for optimising the distribution of data objects between caches in a cache domain of a resource limited network. User requests for data objects are received at caches in the cache domain. A notification is sent from each cache at which a request is received to a cache manager. The notification reports the user request and identifies the requested data object. At the cache manager, object information including the request frequency of each requested data object and the locations of the caches at which the requests were received is collated and stored. At the cache manager, objects for distribution within the cache domain are identified on the basis of the object information. Instructions are sent from the cache manager to the caches to distribute data objects stored in those caches between themselves. The objects are classified into classes according to popularity, the classes including a high popularity class comprising objects which should be distributed to all caches in the cache domain, a medium popularity class comprising objects which should be distributed to a subset of the caches in the cache domain, and a low popularity class comprising objects which should not be distributed.
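A minimal Python sketch of the popularity classification follows, assuming two fixed request-count thresholds; the patent itself leaves the classification policy and thresholds open.

# Sketch of popularity-based object classification for a cache domain.
def classify_objects(request_counts, high_threshold, low_threshold):
    """request_counts: dict object_id -> number of requests seen by the
    cache manager. Returns dict object_id -> 'high' | 'medium' | 'low'."""
    classes = {}
    for obj, count in request_counts.items():
        if count >= high_threshold:
            classes[obj] = "high"      # distribute to every cache in the domain
        elif count >= low_threshold:
            classes[obj] = "medium"    # distribute to a subset of the caches
        else:
            classes[obj] = "low"       # do not distribute beyond where requested
    return classes

print(classify_objects({"a": 120, "b": 17, "c": 2}, high_threshold=100, low_threshold=10))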

137 citations


Patent
03 Mar 2010
TL;DR: In this article, the authors describe a system for implementing a plurality of virtual machines on a physical machine, each of which includes a private cache, and a portion of each of the private caches is used to create a shared cache maintained by a hypervisor.
Abstract: This disclosure describes, generally, methods and systems for implementing transcendent page caching. The method includes establishing a plurality of virtual machines on a physical machine. Each of the plurality of virtual machines includes a private cache, and a portion of each of the private caches is used to create a shared cache maintained by a hypervisor. The method further includes delaying the removal of the at least one of the stored memory pages, storing the at least one of the stored memory pages in the shared cache, and requesting, by one of the plurality of virtual machines, the at least one of the stored memory pages from the shared cache. Further, the method includes determining that the at least one of the stored memory pages is stored in the shared cache, and transferring the at least one of the stored shared memory pages to the one of the plurality of virtual machines.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
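For readers unfamiliar with the cache-oblivious style the paper builds on, here is a classic example, a recursive matrix transpose, in Python. It is not one of the paper's new algorithms; it only illustrates how divide-and-conquer reaches cache-friendly subproblem sizes without ever naming a cache size.

# Classic cache-oblivious recursive transpose: recursion keeps subproblems
# small enough to fit in some level of cache, with no cache-size parameter.
def transpose(A, B, r0, r1, c0, c1, base=16):
    """Write the transpose of A[r0:r1, c0:c1] into B (lists of lists)."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        for i in range(r0, r1):
            for j in range(c0, c1):
                B[j][i] = A[i][j]
    elif rows >= cols:                      # split the longer dimension
        mid = r0 + rows // 2
        transpose(A, B, r0, mid, c0, c1, base)
        transpose(A, B, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose(A, B, r0, r1, c0, mid, base)
        transpose(A, B, r0, r1, mid, c1, base)

n, m = 100, 37
A = [[i * m + j for j in range(m)] for i in range(n)]
B = [[0] * n for _ in range(m)]
transpose(A, B, 0, n, 0, m)
assert all(B[j][i] == A[i][j] for i in range(n) for j in range(m))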

Proceedings ArticleDOI
09 Jan 2010
TL;DR: A systematic measurement of the influence of CMP cache sharing on two kinds of commodity CMP machines, using the recently released PARSEC benchmark suite and considering a number of potentially important factors at the program, OS, and architecture levels, shows some surprising results.
Abstract: Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between the current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to a 36% performance increase when the threads are placed on cores appropriately.

Patent
12 Aug 2010
TL;DR: In this paper, a distributed data cache is used to expedite the retrieval of data for application execution by a server in a content delivery network, where the distributed cache is distributed across computer-readable storage media included in a plurality of servers.
Abstract: A distributed data cache included in a content delivery network expedites retrieval of data for application execution by a server in a content delivery network. The distributed data cache is distributed across computer-readable storage media included in a plurality of servers in the content delivery network. When an application generates a query for data, a server in the content delivery network determines whether the distributed data cache includes data associated with the query. If data associated with the query is stored in the distributed data cache, the data is retrieved from the distributed data cache. If the distributed data cache does not include data associated with the query, the data is retrieved from a database and the query and associated data are stored in the distributed data cache to expedite subsequent retrieval of the data when the application issues the same query.
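A minimal sketch of the lookup path might look like the Python below; the class, the hash-based placement, and the database_lookup callback are illustrative assumptions, not details from the patent.

# Sketch of a distributed query cache: look up the query's owning server,
# serve from its local store if present, otherwise fall back to the database
# and cache the result for subsequent identical queries.
import hashlib

class DistributedQueryCache:
    def __init__(self, servers):
        self.servers = servers                  # server_id -> dict (local store)

    def _owner(self, query):
        digest = hashlib.sha1(query.encode()).hexdigest()
        ids = sorted(self.servers)
        return ids[int(digest, 16) % len(ids)]  # simple placement, not true consistent hashing

    def get(self, query, database_lookup):
        store = self.servers[self._owner(query)]
        if query in store:
            return store[query]                 # served from the distributed cache
        result = database_lookup(query)         # fall back to the database
        store[query] = result                   # cache query and data for next time
        return result

cache = DistributedQueryCache({"edge-1": {}, "edge-2": {}})
print(cache.get("SELECT name FROM users WHERE id=7", lambda q: {"name": "Ada"}))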

Proceedings ArticleDOI
28 Mar 2010
TL;DR: CAMP estimates the performance degradation due to cache contention of processes running on CMPs and provides an automated way to obtain process-dependent characteristics, such as reuse distance histograms, without offline simulation, operating system modification, or additional hardware.
Abstract: The ongoing move to chip multiprocessors (CMPs) permits greater sharing of last-level cache by processor cores but this sharing aggravates the cache contention problem, potentially undermining performance improvements. Accurately modeling the impact of inter-process cache contention on performance and power consumption is required for optimized process assignment. However, techniques based on exhaustive consideration of process-to-processor mappings and cycle-accurate simulation are inefficient or intractable for CMPs, which often permit a large number of potential assignments. This paper proposes CAMP, a fast and accurate shared cache aware performance model for multi-core processors. CAMP estimates the performance degradation due to cache contention of processes running on CMPs. It uses reuse distance histograms, cache access frequencies, and the relationship between the throughput and cache miss rate of each process to predict its effective cache size when running concurrently and sharing cache with other processes, allowing instruction throughput estimation. We also provide an automated way to obtain process-dependent characteristics, such as reuse distance histograms, without offline simulation, operating system (OS) modification, or additional hardware. We tested the accuracy of CAMP using 55 different combinations of 10 SPEC CPU2000 benchmarks on a dual-core CMP machine. The average throughput prediction error was 1.57%.
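The basic calculation CAMP builds on can be sketched in Python: derive a reuse distance histogram from an address trace and estimate the miss rate of an idealized fully associative LRU cache of a given size. CAMP itself collects such histograms online and feeds them into a contention model; this sketch shows only the underlying bookkeeping.

# Reuse distance histogram and a simple LRU miss-rate estimate (illustrative).
from collections import OrderedDict

def reuse_distance_histogram(trace):
    stack, hist = OrderedDict(), {}
    for addr in trace:
        if addr in stack:
            distance = list(stack).index(addr)   # distinct addresses since last use
            del stack[addr]
        else:
            distance = None                      # cold reference
        hist[distance] = hist.get(distance, 0) + 1
        stack[addr] = True
        stack.move_to_end(addr, last=False)      # most recently used at the front
    return hist

def estimated_miss_rate(hist, cache_blocks):
    total = sum(hist.values())
    misses = sum(n for d, n in hist.items() if d is None or d >= cache_blocks)
    return misses / total

trace = [1, 2, 3, 1, 2, 4, 1, 5, 2, 1]
hist = reuse_distance_histogram(trace)
print(hist, estimated_miss_rate(hist, cache_blocks=3))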

Patent
15 Dec 2010
TL;DR: In this paper, a multi-tiered cache manager and methods for managing multi-tier cache are described, which causes cached data to be initially stored in RAM elements and selects portions of the cached data stored in the RAM elements to be moved to the flash elements.
Abstract: A multi-tiered cache manager and methods for managing multi-tiered cache are described. The multi-tiered cache manager causes cached data to be initially stored in RAM elements and selects portions of the cached data stored in the RAM elements to be moved to flash elements. Each flash element is organized as a plurality of write blocks having a block size, and a predefined maximum number of writes is permitted to each write block. The portions of the cached data may be selected based on a maximum write rate calculated from the maximum number of writes allowed for the flash device and a specified lifetime of the cache system.
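The write-rate budget mentioned above reduces to simple arithmetic; here is a hedged Python sketch with example parameter values (the numbers are illustrative, not from the patent).

# Maximum sustained write rate implied by flash endurance and a target lifetime.
def max_write_rate(num_write_blocks, block_size_bytes, max_writes_per_block,
                   target_lifetime_seconds):
    total_write_budget = num_write_blocks * block_size_bytes * max_writes_per_block
    return total_write_budget / target_lifetime_seconds   # bytes per second

rate = max_write_rate(
    num_write_blocks=65536,
    block_size_bytes=2 * 1024 * 1024,              # 2 MiB erase blocks (assumed)
    max_writes_per_block=3000,                     # MLC-class endurance (assumed)
    target_lifetime_seconds=5 * 365 * 24 * 3600,   # 5-year target lifetime
)
print(f"flash tier write budget ≈ {rate / 1e6:.1f} MB/s")

The cache manager can then throttle or filter RAM-to-flash migrations so that the long-run write rate stays under this budget.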

Patent
15 Feb 2010
TL;DR: In this article, a flash memory device includes at least one mapping block, at least one modification block, and at least one cache block; each cache page includes data fields that store, in order, location information for the data that has been written to the pages of the modification block.
Abstract: A computing system is provided. A flash memory device includes at least one mapping block, at least one modification block, and at least one cache block. A processor is configured to perform: receiving a write command with a write logical address and predetermined data; in response to a page of the mapping block corresponding to the write logical address having been used, loading the content of a cache page from the cache block corresponding to the modification block, according to the write logical address, into a random access memory device; reading the content of the cache page stored in the random access memory device in order, to obtain location information of an empty page of the modification block; and writing the predetermined data to the empty page according to the location information. Each cache page includes data fields that store, in order, location information for the data that has been written to the pages of the modification block.

Journal ArticleDOI
TL;DR: The design and implementation of cooperative cache in wireless P2P networks are presented, and a novel asymmetric cooperative cache approach is proposed, where the data requests are transmitted to the cache layer on every node, but the data replies are only transmitted toThe cache layer at the intermediate nodes that need to cache the data.
Abstract: Some recent studies have shown that cooperative cache can improve the system performance in wireless P2P networks such as ad hoc networks and mesh networks. However, all these studies are at a very high level, leaving many design and implementation issues unanswered. In this paper, we present our design and implementation of cooperative cache in wireless P2P networks, and propose solutions to find the best place to cache the data. We propose a novel asymmetric cooperative cache approach, where the data requests are transmitted to the cache layer on every node, but the data replies are only transmitted to the cache layer at the intermediate nodes that need to cache the data. This solution not only reduces the overhead of copying data between the user space and the kernel space, it also allows data pipelines to reduce the end-to-end delay. We also study the effects of different MAC layers, such as 802.11-based ad hoc networks and multi-interface-multichannel-based mesh networks, on the performance of cooperative cache. Our results show that the asymmetric approach outperforms the symmetric approach in traditional 802.11-based ad hoc networks by removing most of the processing overhead. In mesh networks, the asymmetric approach can significantly reduce the data access delay compared to the symmetric approach due to data pipelines.

Patent
20 Dec 2010
TL;DR: Cache management techniques for a content distribution network (CDN), for example, a video on demand (VOD) system supporting user requests and delivery of video content, are described in this paper.
Abstract: Cache management techniques are described for a content distribution network (CDN), for example, a video on demand (VOD) system supporting user requests and delivery of video content. A preferred cache size may be calculated for one or more cache devices in the CDN, for example, based on a maximum cache memory size, a bandwidth availability associated with the CDN, and a title dispersion calculation determined by the user requests within the CDN. After establishing the cache with a set of assets (e.g., video content), an asset replacement algorithm may be executed at one or more cache devices in the CDN. When a determination is made that a new asset should be added to a full cache, a multi-factor comparative analysis may be performed on the assets currently residing in the cache, comparing the popularity and size of assets and combinations of assets, along with other factors, to determine which assets should be replaced in the cache device.
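A toy Python sketch of a multi-factor replacement decision of this general kind follows; the popularity-per-gigabyte score and age weighting are illustrative assumptions, not the patented analysis.

# Score assets by request popularity per unit of space minus an age penalty,
# and evict the lowest-scoring assets until the new asset fits.
def choose_evictions(assets, new_asset, cache_capacity_gb, age_weight=0.01):
    """assets: list of dicts with 'id', 'size_gb', 'requests', 'days_since_hit'."""
    used = sum(a["size_gb"] for a in assets)
    needed = new_asset["size_gb"] - (cache_capacity_gb - used)
    if needed <= 0:
        return []                                    # fits without evicting anything
    def score(a):
        return a["requests"] / a["size_gb"] - age_weight * a["days_since_hit"]
    evictions, freed = [], 0.0
    for a in sorted(assets, key=score):              # least valuable first
        if freed >= needed:
            break
        evictions.append(a["id"])
        freed += a["size_gb"]
    return evictions

assets = [
    {"id": "movie-a", "size_gb": 4.0, "requests": 220, "days_since_hit": 1},
    {"id": "movie-b", "size_gb": 8.0, "requests": 15, "days_since_hit": 30},
    {"id": "movie-c", "size_gb": 2.0, "requests": 5, "days_since_hit": 60},
]
print(choose_evictions(assets, {"id": "movie-d", "size_gb": 6.0}, cache_capacity_gb=16.0))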

Proceedings ArticleDOI
13 Apr 2010
TL;DR: DProf helps programmers understand cache miss costs by attributing misses to data types instead of code, and introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads.
Abstract: Effective use of CPU data caches is critical to good performance, but poor cache use patterns are often hard to spot using existing execution profiling tools. Typical profilers attribute costs to specific code locations. The costs due to frequent cache misses on a given piece of data, however, may be spread over instructions throughout the application. The resulting individually small costs at a large number of instructions can easily appear insignificant in a code profiler's output. DProf helps programmers understand cache miss costs by attributing misses to data types instead of code. Associating cache misses with data helps programmers locate data structures that experience misses in many places in the application's code. DProf introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads. We present two case studies of using DProf to find and fix cache performance bottlenecks in Linux. The improvements provide a 16-57% throughput improvement on a range of memcached and Apache workloads.

Patent
Scott Howard Davis1
26 Apr 2010
TL;DR: In this article, the content aware cache filter component of the hypervisor of the server is implemented in an I/O virtualization layer that sits above a file system layer, such that any file system protocol may be implemented in the file system level.
Abstract: A server supporting the implementation of virtual machines includes a local memory used for caching, such as a solid state device drive. During I/O intensive processes, such as a boot storm, a “content aware” cache filter component of the hypervisor of the server first accesses a cache structure in a content cache device to determine whether data blocks have been stored in the cache structure prior to requesting the data blocks from a networked disk array via a standard I/O stack of the hypervisor. The content aware cache filter component is implemented in an I/O virtualization layer of the standard I/O stack that sits above a file system layer of the standard I/O stack, such that any file system protocol may be implemented in the file system layer.

Proceedings ArticleDOI
11 Sep 2010
TL;DR: Increasing cache efficiency can improve performance by reducing the miss rate, or alternatively, improve power and energy by allowing a smaller cache with the same miss rate.
Abstract: Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing the miss rate, or alternatively, improve power and energy by allowing a smaller cache with the same miss rate. This paper proposes using predicted dead blocks to hold blocks evicted from other sets. When these evicted blocks are referenced again, the access can be satisfied from the other set, avoiding a costly access to main memory. The pool of predicted dead blocks can be thought of as a virtual victim cache. For a set of memory-intensive single-threaded workloads, a virtual victim cache in a 16-way set-associative 2MB L2 cache reduces misses by 26%, yields a geometric mean speedup of 12.1% and improves cache efficiency by 27% on average, where cache efficiency is defined as the average time during which cache blocks contain live information. This virtual victim cache yields a lower average miss rate than a fully-associative LRU cache of the same capacity. For a set of multi-core workloads, the virtual victim cache improves throughput performance by 4% over LRU while improving cache efficiency by 62%. Alternatively, a 1.7MB virtual victim cache achieves about the same performance as a larger 2MB L2 cache, reducing the number of SRAM cells required by 16%, thus maintaining performance while reducing power and area.

Proceedings ArticleDOI
01 Apr 2010
TL;DR: It is shown that bus-based networks with snooping protocols can significantly lower energy consumption and simplify network/protocol design and verification, with no loss in performance and it is believed that buses can be a viable and attractive option for future on-chip networks.
Abstract: It is expected that future on-chip networks for many-core processors will impose huge overheads in terms of energy, delay, complexity, verification effort, and area. There is a common belief that the bandwidth necessary for future applications can only be provided by employing packet-switched networks with complex routers and a scalable directory-based coherence protocol. We posit that such a scheme might likely be overkill in a well designed system, in addition to being expensive in terms of power because of a large number of power-hungry routers. We show that bus-based networks with snooping protocols can significantly lower energy consumption and simplify network/protocol design and verification, with no loss in performance. We achieve these characteristics by dividing the chip into multiple segments, each having its own broadcast bus, with these buses further connected by a central bus. This helps eliminate expensive routers, but suffers from the energy overhead of long wires. We propose the use of multiple Bloom filters to effectively track data presence in the cache and restrict bus broadcasts to a subset of segments, significantly reducing energy consumption. We further show that the use of OS page coloring helps maximize locality and improves the effectiveness of the Bloom filters. We also employ low-swing wiring to further reduce the energy overheads of the links. Performance can also be improved at relatively low costs by utilizing more of the abundant metal budgets on-chip and employing multiple address-interleaved buses rather than multiple routers. Thus, with the combination of all the above innovations, we extend the scalability of buses and believe that buses can be a viable and attractive option for future on-chip networks. We show energy reductions of up to 31X on average compared to many state-of-the-art packet-switched networks.
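The Bloom-filter idea is straightforward to sketch in Python: each bus segment keeps a filter of the block addresses cached within it, and a snoop is broadcast only to segments whose filters may contain the address. Filter sizes and hash choices below are arbitrary assumptions; false positives only cause extra broadcasts, never missed ones.

# Per-segment Bloom filters used to restrict snoop broadcasts (illustrative sizes).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=4096, num_hashes=3):
        self.bits = bytearray(num_bits)
        self.num_bits, self.num_hashes = num_bits, num_hashes

    def _positions(self, addr):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{i}:{addr}".encode(), digest_size=4).digest()
            yield int.from_bytes(h, "little") % self.num_bits

    def insert(self, addr):
        for p in self._positions(addr):
            self.bits[p] = 1

    def may_contain(self, addr):
        return all(self.bits[p] for p in self._positions(addr))

segment_filters = [BloomFilter() for _ in range(4)]
segment_filters[2].insert(0xBEEF00)                 # a core in segment 2 caches this block
targets = [s for s, f in enumerate(segment_filters) if f.may_contain(0xBEEF00)]
print("broadcast only to segments:", targets)       # most likely just [2]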

Patent
Wenchu Cen1
19 May 2010
TL;DR: In this article, the cache cluster is configurable in an active cluster configuration mode, where the plurality of cache service nodes are all in working state and a master cache service node is selected among the plurality.
Abstract: Processing cache data includes sending a cache processing request to a master cache service node in a cache cluster that includes a plurality of cache service nodes, the cache cluster being configurable in an active cluster configuration mode wherein the plurality of cache service nodes are all in working state and a master cache service node is selected among the plurality of cache service nodes, or in a standby cluster configuration mode, wherein the master cache service node is the only node among the plurality of cache service nodes that is in working state. It further includes waiting for a response from the master cache service node, determining whether the master cache service node has failed; and in the event that the master cache service node has failed, selecting a backup cache service node.

Patent
24 Jan 2010
TL;DR: A community-based mobile search cache as mentioned in this paper provides various techniques for maximizing the number of query results served from a local query cache, thereby significantly limiting the need to connect to the Internet or cloud using 3G or other wireless links to service search queries.
Abstract: A “Community-Based Mobile Search Cache” provides various techniques for maximizing the number of query results served from a local “query cache”, thereby significantly limiting the need to connect to the Internet or cloud using 3G or other wireless links to service search queries. The query cache is constructed remotely and downloaded to mobile devices. Contents of the query cache are determined by mining popular queries from mobile search logs, either globally or based on queries of one or more groups or subgroups of users. In various embodiments, searching and browsing behaviors of individual users are evaluated to customize the query cache for particular users or user groups. The content of web pages related to popular queries may also be included in the query cache. This allows cached web pages to be displayed without first displaying cached search results when a corresponding search result has a sufficiently high click-through probability.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: An optimal algorithm and a heuristic approach are proposed that use the temporal reuse profile to determine the most beneficial memory blocks to be locked in the cache, providing significant improvement over the state-of-the-art locking algorithm in terms of both performance and efficiency.
Abstract: The performance of most embedded systems is critically dependent on the average memory access latency. Improving the cache hit rate can have significant positive impact on the performance of an application. Modern embedded processors often feature cache locking mechanisms that allow memory blocks to be locked in the cache under software control. Cache locking was primarily designed to offer timing predictability for hard real-time applications. Hence, compiler optimization techniques have focused on employing cache locking to improve the worst-case execution time. However, cache locking can be quite effective in improving the average-case execution time of general embedded applications as well. In this paper, we explore static instruction cache locking to improve average-case program performance. We introduce the temporal reuse profile to accurately and efficiently model the cost and benefit of locking memory blocks in the cache. We propose an optimal algorithm and a heuristic approach that use the temporal reuse profile to determine the most beneficial memory blocks to be locked in the cache. Experimental results show that our locking heuristic achieves close to optimal results and can improve the cache miss rate by up to 24% across a suite of real-world benchmarks. Moreover, our heuristic provides significant improvement compared to the state-of-the-art locking algorithm in terms of both performance and efficiency.
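As a rough illustration of profile-guided lock selection (not the paper's optimal algorithm or its exact cost model), here is a greedy Python sketch: rank blocks by estimated hits gained from locking minus hits lost by reserving a way, and lock the most profitable ones. The block names and numbers are invented.

# Greedy lock selection from a (hypothetical) reuse profile.
def select_locked_blocks(reuse_profile, evicted_reuse_penalty, num_lockable_ways):
    """reuse_profile: dict block -> hits it would get if always resident.
    evicted_reuse_penalty: dict block -> hits lost elsewhere by reserving a way for it."""
    def benefit(block):
        return reuse_profile[block] - evicted_reuse_penalty.get(block, 0)
    ranked = sorted(reuse_profile, key=benefit, reverse=True)
    return [b for b in ranked[:num_lockable_ways] if benefit(b) > 0]

profile = {"hot_loop": 5000, "lut_table": 1200, "init_code": 40}
penalty = {"hot_loop": 300, "lut_table": 900, "init_code": 10}
print(select_locked_blocks(profile, penalty, num_lockable_ways=2))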

Patent
17 Sep 2010
TL;DR: In this article, a cache device may comprise a non-volatile storage device configured to perform cache functions for a backing store, and a modified cache policy may be implemented for the cache device in response to the risk of data loss exceeding a threshold risk level.
Abstract: Apparatuses, systems, and methods are disclosed for implementing a cache policy. A method may include determining a risk of data loss on a cache device. The cache device may comprise a non-volatile storage device configured to perform cache functions for a backing store. The cache device may implement a cache policy. A method may include determining that the risk of data loss on the cache device exceeds a threshold risk level. A method may include implementing a modified cache policy for the cache device in response to the risk of data loss exceeding the threshold risk level. The modified cache policy may reduce the risk of data loss below the threshold level.

Patent
06 Dec 2010
TL;DR: In this article, the authors propose a migration policy to determine how long to cache the data in the write cache before migrating the data to the SSD, using one or more migration triggers that cause the contents of the write cache to be flushed to the SSD.
Abstract: A hybrid storage device uses a write cache such as a hard disk drive, for example, to cache data to a solid state drive (SSD). Data is logged sequentially to the write cache and later migrated to the SSD. The SSD is a primary storage that stores data permanently. The write cache is a persistent durable cache that may store data of disk write operations temporarily in a log structured fashion. A migration policy may be used to determine how long to cache the data in the write cache before migrating the data to the SSD. The migration policy may be implemented using one or more migration triggers that cause the contents of the write cache to be flushed to the SSD. Migration triggers may include a timeout trigger, a read threshold trigger, and a migration size trigger, for example.
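A small Python sketch of the trigger logic described above; the class name, trigger thresholds, and default values are illustrative assumptions rather than the patent's specifics.

# Migration triggers for a log-structured write cache in front of an SSD.
import time

class WriteCacheMigrator:
    def __init__(self, timeout_s=30.0, read_threshold=8, size_threshold_mb=256):
        self.timeout_s = timeout_s
        self.read_threshold = read_threshold          # reads served from the write cache
        self.size_threshold_mb = size_threshold_mb
        self.last_flush = time.monotonic()
        self.cache_reads = 0
        self.logged_mb = 0.0

    def note_write(self, size_mb):
        self.logged_mb += size_mb                     # data appended to the log

    def note_read_hit(self):
        self.cache_reads += 1                         # read hit in the write cache

    def should_migrate(self):
        return (time.monotonic() - self.last_flush >= self.timeout_s   # timeout trigger
                or self.cache_reads >= self.read_threshold             # read threshold trigger
                or self.logged_mb >= self.size_threshold_mb)           # migration size trigger

    def migrate(self, flush_to_ssd):
        flush_to_ssd()                                # drain the log to the SSD
        self.last_flush = time.monotonic()
        self.cache_reads, self.logged_mb = 0, 0.0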

Proceedings ArticleDOI
08 Mar 2010
TL;DR: The three proposed methods, i.e., the spatial, temporal, and hybrid methods, bring about effective usage of the scratch-pad memory space, achieve energy reduction in the instruction memory subsystem, and are applicable to a real-time environment.
Abstract: Scratch-pad memory has been employed as a partial or entire replacement for cache memory due to its better energy efficiency. In this paper, we propose scratch-pad memory management techniques for priority-based preemptive multi-task systems. Our techniques are applicable to a real-time environment. The three methods that we propose, i.e., the spatial, temporal, and hybrid methods, bring about effective usage of the scratch-pad memory space and achieve energy reduction in the instruction memory subsystem. We formulate each method as an integer programming problem that simultaneously determines (1) partitioning of scratch-pad memory space for the tasks, and (2) allocation of program code to scratch-pad memory space for each task. It is remarkable that periods and priorities of tasks are considered in the formulas. Additionally, we implement an RTOS-hardware cooperative support mechanism for runtime code allocation to the scratch-pad memory space. We have conducted experiments with a fully functional real-time operating system. The experimental results with four task sets have demonstrated the effectiveness of our techniques. Up to 73% energy reduction compared to a standard method was achieved.

Patent
01 Dec 2010
TL;DR: In this article, the authors propose an embodiment of a virtual address cache memory including a TLB virtual page memory, a data memory, and a cache state memory for the cache data stored in the data memory.
Abstract: An embodiment provides a virtual address cache memory including: a TLB virtual page memory configured to, when a rewrite to a TLB occurs, rewrite entry data; a data memory configured to hold cache data using a virtual page tag or a page offset as a cache index; a cache state memory configured to hold a cache state for the cache data stored in the data memory, in association with the cache index; a first physical address memory configured to, when the rewrite to the TLB occurs, rewrite a held physical address; and a second physical address memory configured to, when the cache data is written to the data memory after the occurrence of the rewrite to the TLB, rewrite a held physical address.