
Showing papers on "Cache algorithms published in 2010"


Proceedings ArticleDOI
14 Mar 2010
TL;DR: This paper develops light-weight cooperative cache management algorithms aimed at maximizing the traffic volume served from cache and minimizing the bandwidth cost, and establishes that the performance of the proposed algorithms is guaranteed to be within a constant factor of the globally optimal performance.
Abstract: The delivery of video content is expected to gain huge momentum, fueled by the popularity of user-generated clips, growth of VoD libraries, and wide-spread deployment of IPTV services with features such as CatchUp/PauseLive TV and NPVR capabilities. The 'time-shifted' nature of these personalized applications defies the broadcast paradigm underlying conventional TV networks, and increases the overall bandwidth demands by orders of magnitude. Caching strategies provide an effective mechanism for mitigating these massive bandwidth requirements by replicating the most popular content closer to the network edge, rather than storing it in a central site. The reduction in the traffic load lessens the required transport capacity and capital expense, and alleviates performance bottlenecks. In the present paper, we develop light-weight cooperative cache management algorithms aimed at maximizing the traffic volume served from cache and minimizing the bandwidth cost. As a canonical scenario, we focus on a cluster of distributed caches, either connected directly or via a parent node, and formulate the content placement problem as a linear program in order to benchmark the globally optimal performance. Under certain symmetry assumptions, the optimal solution of the linear program is shown to have a rather simple structure. Besides being interesting in its own right, the optimal structure offers valuable guidance for the design of low-complexity cache management and replacement algorithms. We establish that the performance of the proposed algorithms is guaranteed to be within a constant factor of the globally optimal performance, with far more benign worst-case ratios than in prior work, even in asymmetric scenarios. Numerical experiments for typical popularity distributions reveal that the actual performance is far better than the worst-case conditions indicate.
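
As a rough illustration of the simple placement structure such schemes aim for, the sketch below (Python; tier split and names are invented, and this is not the paper's linear program) replicates the hottest titles in every edge cache, stores the next tier exactly once across the cluster, and leaves the long tail to the parent site:

    def place_content(titles, num_caches, cache_slots):
        """Toy content placement for a cluster of cooperating edge caches.

        titles: list of (title_id, popularity) sorted by descending popularity.
        Returns {cache_id: set(title_id)}.  Illustrative heuristic only: the
        hottest titles are replicated everywhere, the next tier is stored
        exactly once and striped round-robin across the cluster.
        """
        placement = {c: set() for c in range(num_caches)}

        # Tier 1: replicate the top titles in every cache (serves most requests locally).
        replicate_count = cache_slots // 2          # assumed split, not from the paper
        replicated = [t for t, _ in titles[:replicate_count]]
        for c in placement:
            placement[c].update(replicated)

        # Tier 2: stripe the next titles once across the cluster (fetched from a peer on miss).
        striped_budget = num_caches * (cache_slots - replicate_count)
        striped = titles[replicate_count:replicate_count + striped_budget]
        for i, (t, _) in enumerate(striped):
            placement[i % num_caches].add(t)

        # Remaining titles are served from the parent/central site.
        return placement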

727 citations


Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper proposes Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant; both require only 2 bits per cache block and easily integrate into existing LRU approximations found in modern processors.
Abstract: Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working-set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP) that is scan-resistant and Dynamic RRIP (DRRIP) that is both scan-resistant and thrash-resistant. Both RRIP policies require only 2 bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that both SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10% respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9% respectively. We also show that RRIP outperforms LFU, the state-of-the-art scan-resistant replacement algorithm to date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.
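
A minimal sketch of the RRIP idea, assuming 2-bit re-reference prediction values (RRPV) per block: insertions predict a long re-reference interval, hits predict a near-immediate one, and the victim is a block whose RRPV has aged to the maximum. DRRIP's bimodal insertion and set dueling are omitted:

    class SRRIPSet:
        """Sketch of one cache set under SRRIP with 2-bit RRPVs (0..3)."""
        MAX_RRPV = 3

        def __init__(self, ways):
            self.blocks = {}          # tag -> rrpv
            self.ways = ways

        def access(self, tag):
            if tag in self.blocks:            # hit: predict near-immediate re-reference
                self.blocks[tag] = 0
                return True
            if len(self.blocks) >= self.ways: # miss in a full set: find a victim
                while not any(v == self.MAX_RRPV for v in self.blocks.values()):
                    for t in self.blocks:     # age every block until one is 'distant'
                        self.blocks[t] += 1
                victim = next(t for t, v in self.blocks.items() if v == self.MAX_RRPV)
                del self.blocks[victim]
            self.blocks[tag] = self.MAX_RRPV - 1   # insert with a 'long' prediction
            return False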

715 citations


Journal ArticleDOI
TL;DR: Power Systems™ continue strong with the 7th-generation Power chip: a balanced multi-core design with eDRAM technology and SMT4 delivers greater than 4X the performance of the previous generation in the same power envelope.
Abstract: The Power7 is IBM's first eight-core processor, with each core capable of four-way simultaneous-multithreading operation. Its key architectural features include an advanced memory hierarchy with three levels of on-chip cache; embedded-DRAM devices used in the highest level of the cache; and a new memory interface. This balanced multicore design scales from 1 to 32 sockets in commercial and scientific environments.

259 citations


Proceedings ArticleDOI
04 Dec 2010
TL;DR: The zcache is presented, a cache design that allows much higher associativity than the number of physical ways, and it is shown that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
Abstract: The ever-increasing importance of main memory latency and bandwidth is pushing CMPs towards caches with higher capacity and associativity. Associativity is typically improved by increasing the number of ways. This reduces conflict misses, but increases hit latency and energy, placing a stringent trade-off on cache design. We present the zcache, a cache design that allows much higher associativity than the number of physical ways (e.g. a 64-associative cache with 4 ways). The zcache draws on previous research on skew-associative caches and cuckoo hashing. Hits, the common case, require a single lookup, incurring the latency and energy costs of a cache with a very low number of ways. On a miss, additional tag lookups happen off the critical path, yielding an arbitrarily large number of replacement candidates for the incoming block. Unlike conventional designs, the zcache provides associativity by increasing the number of replacement candidates, but not the number of cache ways. To understand the implications of this approach, we develop a general analysis framework that allows us to compare associativity across different cache designs (e.g. a set-associative cache and a zcache) by representing associativity as a probability distribution. We use this framework to show that for zcaches, associativity depends only on the number of replacement candidates, and is independent of other factors (such as the number of cache ways or the workload). We also show that, for the same number of replacement candidates, the associativity of a zcache is superior to that of a set-associative cache for most workloads. Finally, we perform detailed simulations of multithreaded and multiprogrammed workloads on a large-scale CMP with zcache as the last-level cache. We show that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
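
The sketch below illustrates how a zcache-style design can enlarge the set of replacement candidates without adding ways: the incoming block's W hashed positions give the first candidates, and re-hashing those candidates (a cuckoo-style walk) yields more. The hash functions and walk depth are placeholders, not the hardware design:

    import hashlib

    def _h(key, way, num_sets):
        """One hash function per way (illustrative, not the hardware hashes)."""
        digest = hashlib.blake2b(f"{way}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % num_sets

    def replacement_candidates(addr, cache, ways, num_sets, levels=2):
        """Collect zcache-style replacement candidates for an incoming block.

        cache: dict mapping (way, set_index) -> resident address (or None).
        Level 1 candidates are the blocks in the incoming address's W positions;
        further levels add the blocks living where those candidates could be
        relocated (a cuckoo-hashing walk).  Walk depth and policy are simplified.
        """
        frontier = [addr]
        candidates = []
        for _ in range(levels):
            next_frontier = []
            for a in frontier:
                for w in range(ways):
                    resident = cache.get((w, _h(a, w, num_sets)))
                    if resident is not None and resident != addr:
                        candidates.append(resident)
                        next_frontier.append(resident)
            frontier = next_frontier
        return candidates   # the replacement policy then picks a victim among these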

203 citations


Patent
08 Sep 2010
TL;DR: In this paper, an apparatus, system, and method are disclosed for caching data using a solid-state storage device, where metadata indicates what data in the cache is valid and what data from the nonvolatile cache has been stored in the backing store.
Abstract: An apparatus, system, and method are disclosed for caching data using a solid-state storage device. The solid-state storage device maintains metadata pertaining to cache operations performed on the solid-state storage device, as well as storage operations of the solid-state storage device. The metadata indicates what data in the cache is valid, as well as information about what data in the nonvolatile cache has been stored in the backing store. A backup engine works through units in the nonvolatile cache device and backs up the valid data to the backing store. During grooming operations, the groomer determines whether the data is valid and whether the data is discardable. Data that is both valid and discardable may be removed during the grooming operation. The groomer may also determine whether the data is cold in determining whether to remove the data from the cache device. The cache device may present to clients a logical space that is the same size as the backing store. The cache device may be transparent to the clients.

196 citations


Book ChapterDOI
17 Aug 2010
TL;DR: This work improves instruction cache data analysis techniques with a framework based on vector quantization and hidden Markov models, capable of carrying out efficient automated attacks using live I-cache timing data and presents general software countermeasures that can be employed at the kernel and/or compiler level.
Abstract: We improve instruction cache data analysis techniques with a framework based on vector quantization and hidden Markov models. As a result, we are capable of carrying out efficient automated attacks using live I-cache timing data. Using this analysis technique, we run an I-cache attack on OpenSSL's DSA implementation and recover keys using lattice methods. Previous I-cache attacks were proof-of-concept: we present results of an actual attack in a real-world setting, proving these attacks to be realistic. We also present general software countermeasures, along with their performance impact, that are not algorithm specific and can be employed at the kernel and/or compiler level.

192 citations


Patent
07 Jul 2010
TL;DR: In this paper, a security module on a computing device applies security rules to examine content in a network cache and identify suspicious cache content; the rules include one that determines whether the cache content has a modified time set in the future and one that determines whether the cache content was created in a low-security environment.
Abstract: A security module on a computing device applies security rules to examine content in a network cache and identify suspicious cache content. Cache content is identified as suspicious according to security rules, such as a rule determining whether the cache content is associated with a modified time set in the future, and a rule determining whether the cache content was created in a low-security environment. The security module may establish an out-of-band connection with the websites from which the cache content originated through a high-security access network to receive responses from the websites, and use the responses to determine whether the cache content is suspicious cache content. Suspicious cache content is removed from the network cache to prevent the suspicious cache content from carrying out malicious activities.

168 citations


Journal ArticleDOI
TL;DR: This work presents a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular, and reduces the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation.
Abstract: Microprocessor designers have been torn between tight constraints on the amount of on-chip cache memory and the high latency of off-chip memory, such as dynamic random access memory. Accessing off-chip memory generally takes an order of magnitude more time than accessing on-chip cache, and two orders of magnitude more time than executing an instruction. Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. It is not possible to determine whether compression at levels of the memory hierarchy closest to the processor is beneficial without understanding its costs. Furthermore, as we show in this paper, raw compression ratio is not always the most important metric. In this work, we present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation. Experiments comparing our work to previous work are described.
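
For flavor, here is a toy word-level dictionary compressor in a similar spirit (all-zero words, full dictionary matches, and partial upper-half matches); the patterns and dictionary policy are invented and do not reproduce the paper's encoding:

    def compress_line(words, dictionary_size=16):
        """Toy dictionary compression of one cache line (list of 32-bit words).

        Each word is encoded as a (pattern, payload) pair:
          'zzzz' - all-zero word, no payload
          'mmmm' - full match against a dictionary entry, payload = index
          'mmxx' - upper half matches a dictionary entry, payload = (index, low 16 bits)
          'xxxx' - no match, payload = the raw word (also inserted into the dictionary)
        """
        dictionary, encoded = [], []
        for w in words:
            if w == 0:
                encoded.append(("zzzz", None))
                continue
            if w in dictionary:
                encoded.append(("mmmm", dictionary.index(w)))
                continue
            partial = next((i for i, d in enumerate(dictionary)
                            if d >> 16 == w >> 16), None)
            if partial is not None:
                encoded.append(("mmxx", (partial, w & 0xFFFF)))
            else:
                encoded.append(("xxxx", w))
            if len(dictionary) < dictionary_size:
                dictionary.append(w)
        return encoded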

161 citations


Patent
06 Oct 2010
TL;DR: In this article, the server embeds incremental information in media fragments, eliminating the need for a typical control channel, and provides uniform media fragment responses to media fragment requests, thereby allowing existing Internet cache infrastructure to cache streaming media data.
Abstract: A low latency streaming system provides a stateless protocol between a client and server with reduced latency. The server embeds incremental information in media fragments that eliminates the usage of a typical control channel. In addition, the server provides uniform media fragment responses to media fragment requests, thereby allowing existing Internet cache infrastructure to cache streaming media data. Each fragment has a distinguished Uniform Resource Locator (URL) that allows the fragment to be identified and cached by both Internet cache servers and the client's browser cache. The system reduces latency using various techniques, such as sending fragments that contain less than a full group of pictures (GOP), encoding media without dependencies on subsequent frames, and by allowing clients to request subsequent frames with only information about previous frames.

146 citations


Patent
25 Mar 2010
TL;DR: In this article, a method for optimising the distribution of data objects between caches in a cache domain of a resource limited network is described, where object information including the request frequency of each requested data object and the locations of the caches at which the requests were received is collated and stored.
Abstract: There is described a method for optimising the distribution of data objects between caches in a cache domain of a resource limited network. User requests for data objects are received at caches in the cache domain. A notification is sent from each cache at which a request is received to a cache manager. The notification reports the user request and identifies the requested data object. At the cache manager, object information including the request frequency of each requested data object and the locations of the caches at which the requests were received is collated and stored. At the cache manager, objects for distribution within the cache domain are identified on the basis of the object information. Instructions are sent from the cache manager to the caches to distribute data objects stored in those caches between themselves. The objects are classified into classes according to popularity, the classes including a high popularity class comprising objects which should be distributed to all caches in the cache domain, a medium popularity class comprising objects which should be distributed to a subset of the caches in the cache domain, and a low popularity class comprising objects which should not be distributed.
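
A minimal sketch of the cache manager's classification step, assuming invented request-count thresholds for the three popularity classes:

    from collections import Counter

    def classify_objects(notifications, high_threshold=100, medium_threshold=10):
        """Classify requested objects by aggregate popularity (thresholds are invented).

        notifications: iterable of (cache_id, object_id) request reports.
        Returns three sets: objects to replicate to all caches, objects to
        distribute to a subset of caches, and objects not to distribute.
        """
        freq = Counter(obj for _, obj in notifications)
        high   = {o for o, n in freq.items() if n >= high_threshold}
        medium = {o for o, n in freq.items() if medium_threshold <= n < high_threshold}
        low    = set(freq) - high - medium
        return high, medium, low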

137 citations


Proceedings ArticleDOI
04 Dec 2010
TL;DR: This work proposes Temporal Locality Aware (TLA) cache management policies to allow an inclusive LLC to be aware of the temporal locality of lines in the core caches and shows that these policies improve inclusive cache performance without requiring any additional hardware structures.
Abstract: Inclusive caches are commonly used by processors to simplify cache coherence. However, the trade-off has been lower performance compared to non-inclusive and exclusive caches. Contrary to conventional wisdom, we show that the limited performance of inclusive caches is mostly due to inclusion victims—lines that are evicted from the core caches to satisfy the inclusion property—and not the reduced cache capacity of the hierarchy due to the duplication of data. These inclusion victims are incorrectly chosen for replacement because the last-level cache (LLC) is unaware of the temporal locality of lines in the core caches. We propose Temporal Locality Aware (TLA) cache management policies to allow an inclusive LLC to be aware of the temporal locality of lines in the core caches. We propose three TLA policies: Temporal Locality Hints (TLH), Early Core Invalidation (ECI), and Query Based Selection (QBS). All three policies improve inclusive cache performance without requiring any additional hardware structures. In fact, QBS performs similarly to a non-inclusive cache hierarchy.
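
A minimal sketch of the Query Based Selection idea: before evicting a line, the inclusive LLC asks the core caches whether it is still in use and skips it if so, avoiding inclusion victims:

    def choose_llc_victim(candidates_lru_order, core_caches):
        """Sketch of Query Based Selection for an inclusive LLC.

        candidates_lru_order: LLC lines in the set, least-recently-used first.
        core_caches: iterable of sets of line addresses resident in each core cache.
        The LLC skips candidates still cached by a core (they would become
        inclusion victims) and falls back to plain LRU if every line is cached.
        """
        for line in candidates_lru_order:
            if not any(line in core for core in core_caches):
                return line          # safe to evict: no core is still using it
        return candidates_lru_order[0]   # all lines cached above: fall back to LRU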

Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper proposes SieveStore, a new, cost-effective architecture that enables the use of solid-state media to selectively filter access to storage ensembles and achieves significantly higher hit ratios while using only 1/7th the number of SSD drives.
Abstract: Emerging solid-state storage media can significantly improve storage performance and energy. However, the high cost-per-byte of solid-state media has hindered wide-spread adoption in servers. This paper proposes a new, cost-effective architecture - SieveStore - which enables the use of solid-state media to significantly filter access to storage ensembles. Our paper makes three key contributions. First, we make a case for highly-selective, storage-ensemble-level disk-block caching based on the highly-skewed block popularity distribution and based on the dynamic nature of the popular block set. Second, we identify the problem of allocation-writes and show that selective cache allocation to reduce allocation-writes - sieving - is fundamental to enable efficient ensemble-level disk-caching. Third, we propose two practical variants of SieveStore. Based on week-long block access traces from a storage ensemble of 13 servers, we find that the two components (sieving and ensemble-level caching) each contribute to SieveStore's cost-effectiveness. Compared to unsieved, ensemble-level disk-caches, SieveStore achieves significantly higher hit ratios (35%-50% more, on average) while using only 1/7th the number of SSD drives. Further, ensemble-level caching is strictly better in cost-performance compared to per-server caching.
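
A minimal sketch of sieving, assuming an invented access-count threshold: a block is admitted to the SSD cache only after it has proved popular, so allocation writes are spent on blocks likely to stay hot:

    from collections import Counter

    class SieveCache:
        """Sketch of access-count sieving in front of an SSD block cache.

        A block is only allocated into the SSD cache after it has been seen
        `threshold` times; the threshold and eviction policy are illustrative.
        """
        def __init__(self, capacity, threshold=8):
            self.capacity = capacity
            self.threshold = threshold
            self.counts = Counter()
            self.cache = {}                    # block_id -> data

        def read(self, block_id, fetch_from_disk):
            if block_id in self.cache:
                return self.cache[block_id]    # SSD hit
            data = fetch_from_disk(block_id)   # ensemble/disk access
            self.counts[block_id] += 1
            if self.counts[block_id] >= self.threshold:   # sieve: allocate popular blocks only
                if len(self.cache) >= self.capacity:
                    victim = min(self.cache, key=lambda b: self.counts[b])
                    del self.cache[victim]
                self.cache[block_id] = data
            return data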

Patent
30 Jul 2010
TL;DR: In this paper, an apparatus, system, and method for redundant write caching are described; the authors do not specify a hardware implementation of the redundancy mechanism, only a plurality of modules, including a write request module, a first cache write module, and a second cache write module.
Abstract: An apparatus, system, and method are disclosed for redundant write caching. The apparatus, system, and method are provided with a plurality of modules including a write request module, a first cache write module, a second cache write module, and a trim module. The write request module detects a write request to store data on a storage device. The first cache write module writes data of the write request to a first cache. The second cache write module writes the data to a second cache. The trim module trims the data from one of the first cache and the second cache in response to an indicator that the storage device stores the data. The data remains available in the other of the first cache and the second cache to service read requests.

Patent
Kenneth P. Der1
28 Sep 2010
TL;DR: In this article, the authors propose a technique to protect host data by storing the block of data as a dirty cache block in a local cache of the local computerized node and performing a set of external caching operations to cache the set of sub-blocks.
Abstract: A technique protects host data. The technique involves receiving, at a local computerized node, a block of data from a host computer, the block of data including data sub-blocks. The technique further involves storing the block of data, as a dirty cache block, in a local cache of the local computerized node. The technique further involves performing a set of external caching operations to cache a set of sub-blocks in a set of external computerized nodes in communication with the local computerized node. Each external caching operation caches a respective sub-block of the set of sub-blocks in a cache of a respective external computerized node. The set of sub-blocks includes (i) the data sub-blocks of the block of data from the host and (ii) a set of checksums derived from the data sub-blocks of the block of data from the host.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
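
The flavor of such algorithms can be seen in a cache-oblivious, divide-and-conquer matrix transpose; the two recursive calls at each level touch disjoint outputs and could be spawned in parallel (plain recursion stands in for spawn/sync here, and the base size is a tuning constant, not a cache parameter):

    def co_transpose(src, dst, r0, r1, c0, c1, base=16):
        """Cache-oblivious transpose: dst[c][r] = src[r][c] on rows [r0, r1) x cols [c0, c1).

        The recursion splits the larger dimension in half, so sub-problems
        eventually fit in every level of the cache hierarchy without the
        algorithm knowing any cache size.
        """
        rows, cols = r1 - r0, c1 - c0
        if rows <= base and cols <= base:
            for r in range(r0, r1):
                for c in range(c0, c1):
                    dst[c][r] = src[r][c]
        elif rows >= cols:
            mid = r0 + rows // 2
            co_transpose(src, dst, r0, mid, c0, c1, base)   # could run in parallel
            co_transpose(src, dst, mid, r1, c0, c1, base)   # with this call
        else:
            mid = c0 + cols // 2
            co_transpose(src, dst, r0, r1, c0, mid, base)
            co_transpose(src, dst, r0, r1, mid, c1, base)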

Proceedings ArticleDOI
09 Jan 2010
TL;DR: A systematic measurement of the influence of CMP cache sharing on two kinds of commodity CMP machines, using the recently released PARSEC benchmark suite and considering a number of potentially important factors at the program, OS, and architecture levels, shows some surprising results.
Abstract: Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch of current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to a 36% performance increase when the threads are placed on cores appropriately.

Patent
12 Aug 2010
TL;DR: In this paper, a distributed data cache is used to expedite the retrieval of data for application execution by a server in a content delivery network, where the distributed cache is distributed across computer-readable storage media included in a plurality of servers.
Abstract: A distributed data cache included in a content delivery network expedites retrieval of data for application execution by a server in a content delivery network. The distributed data cache is distributed across computer-readable storage media included in a plurality of servers in the content delivery network. When an application generates a query for data, a server in the content delivery network determines whether the distributed data cache includes data associated with the query. If data associated with the query is stored in the distributed data cache, the data is retrieved from the distributed data cache. If the distributed data cache does not include data associated with the query, the data is retrieved from a database and the query and associated data are stored in the distributed data cache to expedite subsequent retrieval of the data when the application issues the same query.

Proceedings ArticleDOI
28 Mar 2010
TL;DR: CAMP estimates the performance degradation due to cache contention of processes running on CMPs and provides an automated way to obtain process-dependent characteristics, such as reuse distance histograms, without offline simulation, operating system modification, or additional hardware.
Abstract: The ongoing move to chip multiprocessors (CMPs) permits greater sharing of last-level cache by processor cores, but this sharing aggravates the cache contention problem, potentially undermining performance improvements. Accurately modeling the impact of inter-process cache contention on performance and power consumption is required for optimized process assignment. However, techniques based on exhaustive consideration of process-to-processor mappings and cycle-accurate simulation are inefficient or intractable for CMPs, which often permit a large number of potential assignments. This paper proposes CAMP, a fast and accurate shared cache aware performance model for multi-core processors. CAMP estimates the performance degradation due to cache contention of processes running on CMPs. It uses reuse distance histograms, cache access frequencies, and the relationship between the throughput and cache miss rate of each process to predict its effective cache size when running concurrently and sharing cache with other processes, allowing instruction throughput estimation. We also provide an automated way to obtain process-dependent characteristics, such as reuse distance histograms, without offline simulation, operating system (OS) modification, or additional hardware. We tested the accuracy of CAMP using 55 different combinations of 10 SPEC CPU2000 benchmarks on a dual-core CMP machine. The average throughput prediction error was 1.57%.
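
The core estimation step can be sketched as follows: given a process's reuse distance histogram and the effective cache size it is predicted to receive under contention, references whose reuse distance exceeds that size are counted as misses (how CAMP derives the effective size from co-runners is not reproduced here):

    def predicted_miss_rate(reuse_histogram, effective_cache_lines):
        """Estimate a process's miss rate from its reuse distance histogram.

        reuse_histogram: dict {reuse_distance_in_lines: reference_count}, with
        distance None for cold/compulsory references.
        References whose reuse distance meets or exceeds the effective cache
        size available to the process are predicted to miss.
        """
        total = sum(reuse_histogram.values())
        misses = sum(count for dist, count in reuse_histogram.items()
                     if dist is None or dist >= effective_cache_lines)
        return misses / total if total else 0.0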

Patent
15 Dec 2010
TL;DR: In this paper, a multi-tiered cache manager and methods for managing multi-tier cache are described, which causes cached data to be initially stored in RAM elements and selects portions of the cached data stored in the RAM elements to be moved to the flash elements.
Abstract: A multi-tiered cache manager and methods for managing multi-tiered cache are described. Multi-tiered cache manager causes cached data to be initially stored in the RAM elements and selects portions of the cached data stored in the RAM elements to be moved to the flash elements. Each flash element is organized as a plurality of write blocks having a block size and wherein a predefined maximum number of writes is permitted to each write block. The portions of the cached data may be selected based on a maximum write rate calculated from the maximum number of writes allowed for the flash device and a specified lifetime of the cache system.
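
A rough version of the endurance arithmetic behind such a write-rate cap, with invented parameter names, might look like this:

    def max_flash_write_rate(num_blocks, block_size_bytes, writes_per_block, lifetime_seconds):
        """Rough endurance budget for a flash cache tier (parameter names invented).

        Total bytes that may ever be written = blocks * block size * allowed writes
        per block; dividing by the intended lifetime gives a sustained write-rate
        ceiling the cache manager should stay under when migrating data from RAM.
        """
        total_write_budget = num_blocks * block_size_bytes * writes_per_block
        return total_write_budget / lifetime_seconds      # bytes per second

    # e.g. 100,000 blocks of 256 KiB with 10,000 allowed writes each, over 5 years:
    # max_flash_write_rate(100_000, 256 * 1024, 10_000, 5 * 365 * 24 * 3600)
    # is roughly 1.7 MB/s sustained.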

Patent
15 Feb 2010
TL;DR: In this article, a flash memory device includes at least one mapping block, one modification block, and one cache block; each cache page includes data fields that store location information corresponding to the data that has been written, in order, to the pages of the modification block.
Abstract: A computing system is provided. A flash memory device includes at least one mapping block, at least one modification block, and at least one cache block. A processor is configured to perform: receiving a write command with a write logical address and predetermined data; in response to a page of the mapping block corresponding to the write logical address having already been used, loading the content of a cache page from the cache block corresponding to the modification block into a random access memory device according to the write logical address; reading the content of the cache page stored in the random access memory device in order to obtain location information of an empty page of the modification block; and writing the predetermined data to the empty page according to the location information. Each cache page includes data fields that store location information corresponding to the data that has been written, in order, to the pages of the modification block.

Proceedings ArticleDOI
04 Oct 2010
TL;DR: Adding TxCache to an application increased the throughput of a web application by up to 5.2×, only slightly less than a non-transactional cache, showing that consistency does not have to come at the price of performance.
Abstract: Distributed in-memory application data caches like memcached are a popular solution for scaling database-driven web sites. These systems are easy to add to existing deployments, and increase performance significantly by reducing load on both the database and application servers. Unfortunately, such caches do not integrate well with the database or the application. They cannot maintain transactional consistency across the entire system, violating the isolation properties of the underlying database. They leave the application responsible for locating data in the cache and keeping it up to date, a frequent source of application complexity and programming errors. Addressing both of these problems, we introduce a transactional cache, TxCache, with a simple programming model. TxCache ensures that any data seen within a transaction, whether it comes from the cache or the database, reflects a slightly stale but consistent snapshot of the database. TxCache makes it easy to add caching to an application by simply designating functions as cacheable; it automatically caches their results, and invalidates the cached data as the underlying database changes. Our experiments found that adding TxCache increased the throughput of a web application by up to 5.2×, only slightly less than a non-transactional cache, showing that consistency does not have to come at the price of performance.
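
A toy version of the "cacheable function" programming model (not TxCache's actual protocol, which additionally guarantees a consistent, slightly stale snapshot across a whole transaction) might look like this:

    import functools

    class ToyFunctionCache:
        """Toy 'cacheable function' layer.

        Results are memoized together with the database version they were
        computed against; bumping the version on writes invalidates them.
        """
        def __init__(self):
            self.db_version = 0
            self._store = {}

        def invalidate(self):
            self.db_version += 1          # call on database writes

        def cacheable(self, fn):
            @functools.wraps(fn)
            def wrapper(*args):
                key = (fn.__name__, args)
                hit = self._store.get(key)
                if hit is not None and hit[0] == self.db_version:
                    return hit[1]
                result = fn(*args)
                self._store[key] = (self.db_version, result)
                return result
            return wrapper

    # cache = ToyFunctionCache()
    # @cache.cacheable
    # def top_stories(user_id): ...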

Journal ArticleDOI
TL;DR: The design and implementation of cooperative cache in wireless P2P networks are presented, and a novel asymmetric cooperative cache approach is proposed, where the data requests are transmitted to the cache layer on every node, but the data replies are only transmitted to the cache layer at the intermediate nodes that need to cache the data.
Abstract: Some recent studies have shown that cooperative cache can improve the system performance in wireless P2P networks such as ad hoc networks and mesh networks. However, all these studies are at a very high level, leaving many design and implementation issues unanswered. In this paper, we present our design and implementation of cooperative cache in wireless P2P networks, and propose solutions to find the best place to cache the data. We propose a novel asymmetric cooperative cache approach, where the data requests are transmitted to the cache layer on every node, but the data replies are only transmitted to the cache layer at the intermediate nodes that need to cache the data. This solution not only reduces the overhead of copying data between the user space and the kernel space, it also allows data pipelines to reduce the end-to-end delay. We also study the effects of different MAC layers, such as 802.11-based ad hoc networks and multi-interface-multichannel-based mesh networks, on the performance of cooperative cache. Our results show that the asymmetric approach outperforms the symmetric approach in traditional 802.11-based ad hoc networks by removing most of the processing overhead. In mesh networks, the asymmetric approach can significantly reduce the data access delay compared to the symmetric approach due to data pipelines.

Patent
20 Dec 2010
TL;DR: Cache management techniques for a content distribution network (CDN), for example, a video on demand (VOD) system supporting user requests and delivery of video content, are described in this paper.
Abstract: Cache management techniques are described for a content distribution network (CDN), for example, a video on demand (VOD) system supporting user requests and delivery of video content. A preferred cache size may be calculated for one or more cache devices in the CDN, for example, based on a maximum cache memory size, a bandwidth availability associated with the CDN, and a title dispersion calculation determined by the user requests within the CDN. After establishing the cache with a set of assets (e.g., video content), an asset replacement algorithm may be executed at one or more cache devices in the CDN. When a determination is made that a new asset should be added to a full cache, a multi-factor comparative analysis may be performed on the assets currently residing in the cache, comparing the popularity and size of assets and combinations of assets, along with other factors, to determine which assets should be replaced in the cache device.
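
A toy version of such a multi-factor replacement decision, using an invented popularity-per-byte score in place of the patent's full analysis, might look like this:

    def admit_asset(cache, new_id, new_meta):
        """Toy replacement decision for a full VOD cache (the score is invented).

        cache: dict {asset_id: {'size': bytes, 'popularity': request_count}}, assumed full.
        The newcomer is admitted only if evicting the lowest-value resident assets
        (popularity per byte) frees enough space without displacing anything more
        valuable than the newcomer itself.
        """
        def value(meta):
            return meta['popularity'] / meta['size']

        victims, freed = [], 0
        for asset_id in sorted(cache, key=lambda a: value(cache[a])):
            if freed >= new_meta['size']:
                break
            if value(cache[asset_id]) >= value(new_meta):
                return False                 # would displace something more valuable
            victims.append(asset_id)
            freed += cache[asset_id]['size']
        if freed < new_meta['size']:
            return False                     # cannot free enough space with cheaper assets
        for v in victims:
            del cache[v]
        cache[new_id] = new_meta
        return True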

Proceedings ArticleDOI
28 Jun 2010
TL;DR: An integrated timing analysis framework that captures timing effects of both shared cache and shared bus is provided, and a cycle-accurate simulation infrastructure is developed to evaluate the precision of the analysis.
Abstract: Timing analysis of concurrent programs running on multi-core platforms is currently an important problem. The key to solving this problem is to accurately model the timing effects of shared resources in multi-cores, namely shared cache and bus. In this paper, we provide an integrated timing analysis framework that captures timing effects of both shared cache and shared bus. We also develop a cycle-accurate simulation infrastructure to evaluate the precision of our analysis. Experimental results from a large fragment of in-orbit spacecraft software show that our analysis produces around 20% over-estimation relative to simulation results.

Proceedings ArticleDOI
13 Apr 2010
TL;DR: DProf helps programmers understand cache miss costs by attributing misses to data types instead of code, and introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads.
Abstract: Effective use of CPU data caches is critical to good performance, but poor cache use patterns are often hard to spot using existing execution profiling tools. Typical profilers attribute costs to specific code locations. The costs due to frequent cache misses on a given piece of data, however, may be spread over instructions throughout the application. The resulting individually small costs at a large number of instructions can easily appear insignificant in a code profiler's output. DProf helps programmers understand cache miss costs by attributing misses to data types instead of code. Associating cache misses with data helps programmers locate data structures that experience misses in many places in the application's code. DProf introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads. We present two case studies of using DProf to find and fix cache performance bottlenecks in Linux. The improvements provide a 16-57% throughput improvement on a range of memcached and Apache workloads.
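
The data-profile view can be sketched as a simple aggregation of sampled miss events by the data type of the address they touch; resolving an address to a type (which DProf does via allocation tracking and debug information) is left as a caller-supplied stub:

    from collections import Counter

    def data_profile(miss_samples, resolve_type):
        """Build a DProf-style 'data profile': cache misses aggregated by data type.

        miss_samples: iterable of (address, instruction_pointer) miss events,
        e.g. from hardware performance-counter sampling.
        resolve_type: callback mapping an address to the name of the data type
        allocated there (a stub standing in for DProf's address-to-type machinery).
        Returns data types sorted by how many misses they account for.
        """
        counts = Counter(resolve_type(addr) for addr, _ip in miss_samples)
        return counts.most_common()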

Proceedings ArticleDOI
11 Sep 2010
TL;DR: An important challenge in multicore processors is the maintenance of cache coherence in a scalable manner, and directory-based protocols save bandwidth and achieve scalability by associating information about sharer cores with every cache block.
Abstract: An important challenge in multicore processors is the maintenance of cache coherence in a scalable manner. Directory-based protocols save bandwidth and achieve scalability by associating information about sharer cores with every cache block. As the number of cores and cache sizes increase, the directory itself adds significant area and energy overheads. In this paper, we propose SPACE, a directory design based on recognizing and representing the subset of sharing patterns present in an application. SPACE takes advantage of the observation that many memory locations in an application are accessed by the same set of processors, resulting in a few sharing patterns that occur frequently. The sharing pattern of a cache block is the bit vector representing the processors that share the block. SPACE decouples the sharing pattern from each cache block and holds them in a separate directory table. Multiple cache lines that have the same sharing pattern point to a common entry in the directory table. In addition, when the table capacity is exceeded, patterns that are similar to each other are dynamically collated into a single entry. Our results show that overall, SPACE is within 2% of the performance of a conventional directory. When compared to coarse vector directories, dynamically collating similar patterns eliminates more false sharers. Our experimentation also reveals that a small directory table (256-512 entries) can handle the access patterns in many applications, with the SPACE directory table size being O(P) and requiring a pointer per cache line whose size is O(log2 P). Specifically, SPACE requires ≈44% of the area of a conventional directory at 16 processors and 25% at 32 processors.
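
A sketch of the decoupling idea: cache blocks hold only a pointer into a small table of sharing bit-vectors, identical patterns share one entry, and reference counts allow entries to be reclaimed (collating similar patterns on overflow is omitted):

    class SharingPatternTable:
        """Sketch of a SPACE-style decoupled sharing-pattern directory table."""

        def __init__(self, num_entries):
            self.free = list(range(num_entries))
            self.index_of = {}     # sharing pattern (frozenset of sharer ids) -> entry index
            self.pattern_at = {}   # entry index -> sharing pattern
            self.refcount = {}     # entry index -> number of cache blocks pointing at it

        def intern(self, sharers):
            """Return the table index for this sharing pattern, allocating it if new."""
            pattern = frozenset(sharers)
            idx = self.index_of.get(pattern)
            if idx is None:
                if not self.free:
                    raise RuntimeError("table full: the real design collates similar patterns")
                idx = self.free.pop()
                self.index_of[pattern] = idx
                self.pattern_at[idx] = pattern
                self.refcount[idx] = 0
            self.refcount[idx] += 1
            return idx

        def release(self, idx):
            """Drop one block's reference; free the entry when no block uses it."""
            self.refcount[idx] -= 1
            if self.refcount[idx] == 0:
                del self.index_of[self.pattern_at.pop(idx)]
                del self.refcount[idx]
                self.free.append(idx)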

Patent
Scott Howard Davis1
26 Apr 2010
TL;DR: In this article, the content aware cache filter component of the hypervisor of the server is implemented in an I/O virtualization layer that sits above a file system layer, such that any file system protocol may be implemented in the file system layer.
Abstract: A server supporting the implementation of virtual machines includes a local memory used for caching, such as a solid state device drive. During I/O intensive processes, such as a boot storm, a “content aware” cache filter component of the hypervisor of the server first accesses a cache structure in a content cache device to determine whether data blocks have been stored in the cache structure prior to requesting the data blocks from a networked disk array via a standard I/O stack of the hypervisor. The content aware cache filter component is implemented in an I/O virtualization layer of the standard I/O stack that sits above a file system layer of the standard I/O stack, such that any file system protocol may be implemented in the file system layer.

Proceedings ArticleDOI
11 Sep 2010
TL;DR: Increasing cache efficiency can improve performance by reducing miss rate, or, alternatively, improve power and energy by allowing a smaller cache with the same miss rate.
Abstract: Caches mitigate the long memory latency that limits the performance of modern processors. However, caches can be quite inefficient. On average, a cache block in a 2MB L2 cache is dead 59% of the time, i.e., it will not be referenced again before it is evicted. Increasing cache efficiency can improve performance by reducing miss rate, or, alternatively, improve power and energy by allowing a smaller cache with the same miss rate. This paper proposes using predicted dead blocks to hold blocks evicted from other sets. When these evicted blocks are referenced again, the access can be satisfied from the other set, avoiding a costly access to main memory. The pool of predicted dead blocks can be thought of as a virtual victim cache. For a set of memory-intensive single-threaded workloads, a virtual victim cache in a 16-way set associative 2MB L2 cache reduces misses by 26%, yields a geometric mean speedup of 12.1% and improves cache efficiency by 27% on average, where cache efficiency is defined as the average time during which cache blocks contain live information. This virtual victim cache yields a lower average miss rate than a fully-associative LRU cache of the same capacity. For a set of multi-core workloads, the virtual victim cache improves throughput performance by 4% over LRU while improving cache efficiency by 62%. Alternatively, a 1.7MB virtual victim cache achieves about the same performance as a larger 2MB L2 cache, reducing the number of SRAM cells required by 16%, thus maintaining performance while reducing power and area.
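
A heavily simplified sketch of the virtual victim cache idea: evictions from one set are parked in predicted-dead frames of a partner set, which is probed on a miss before going to memory. The dead-block predictor is a caller-supplied predicate here (the paper uses a trace-based predictor), and the set pairing is a fixed XOR:

    class VirtualVictimCache:
        """Sketch of using predicted-dead blocks in a partner set as a victim cache."""

        def __init__(self, num_sets, ways, is_dead):
            self.sets = [dict() for _ in range(num_sets)]   # set index -> {tag: data}
            self.ways = ways
            self.is_dead = is_dead          # predicate: (set_idx, tag) -> bool
            self.num_sets = num_sets

        def partner(self, set_idx):
            return set_idx ^ (self.num_sets // 2)           # simple fixed pairing

        def lookup(self, set_idx, tag):
            if tag in self.sets[set_idx]:
                return self.sets[set_idx][tag]              # normal hit
            buddy = self.sets[self.partner(set_idx)]
            if tag in buddy:
                # Hit in the virtual victim cache; the block would normally be
                # reinstalled in its home set by the caller.
                return buddy.pop(tag)
            return None                                     # miss: go to memory

        def evict_into_partner(self, set_idx, tag, data):
            buddy_idx = self.partner(set_idx)
            buddy = self.sets[buddy_idx]
            if len(buddy) < self.ways:                      # free frame available
                buddy[tag] = data
                return True
            for victim_tag in list(buddy):
                if self.is_dead(buddy_idx, victim_tag):     # overwrite a predicted-dead block
                    del buddy[victim_tag]
                    buddy[tag] = data
                    return True
            return False                                    # no dead frame: drop the victim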

Patent
Wenchu Cen1
19 May 2010
TL;DR: In this article, the cache cluster is configurable in an active cluster configuration mode, where the plurality of cache service nodes are all in working state and a master cache service node is selected among the plurality.
Abstract: Processing cache data includes sending a cache processing request to a master cache service node in a cache cluster that includes a plurality of cache service nodes, the cache cluster being configurable in an active cluster configuration mode wherein the plurality of cache service nodes are all in working state and a master cache service node is selected among the plurality of cache service nodes, or in a standby cluster configuration mode, wherein the master cache service node is the only node among the plurality of cache service nodes that is in working state. It further includes waiting for a response from the master cache service node, determining whether the master cache service node has failed; and in the event that the master cache service node has failed, selecting a backup cache service node.

Patent
24 Jan 2010
TL;DR: A community-based mobile search cache as mentioned in this paper provides various techniques for maximizing the number of query results served from a local query cache, thereby significantly limiting the need to connect to the Internet or cloud using 3G or other wireless links to service search queries.
Abstract: A “Community-Based Mobile Search Cache” provides various techniques for maximizing the number of query results served from a local “query cache”, thereby significantly limiting the need to connect to the Internet or cloud using 3G or other wireless links to service search queries. The query cache is constructed remotely and downloaded to mobile devices. Contents of the query cache are determined by mining popular queries from mobile search logs, either globally or based on queries of one or more groups or subgroups of users. In various embodiments, searching and browsing behaviors of individual users are evaluated to customize the query cache for particular users or user groups. The content of web pages related to popular queries may also be included in the query cache. This allows cached web pages to be displayed without first displaying cached search results when a corresponding search result has a sufficiently high click-through probability.