
Showing papers on "Cache invalidation published in 2007"


Proceedings ArticleDOI
09 Jun 2007
TL;DR: A Dynamic Insertion Policy (DIP) is proposed to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses, and shows that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
Abstract: The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU position without receiving any cache hits, resulting in inefficient use of cache space. Cache performance can be improved if some fraction of the working set is retained in the cache so that at least that fraction of the working set can contribute to cache hits. We show that simple changes to the insertion policy can significantly reduce cache misses for memory-intensive workloads. We propose the LRU Insertion Policy (LIP), which places the incoming line in the LRU position instead of the MRU position. LIP protects the cache from thrashing and results in close to optimal hit rate for applications that have a cyclic reference pattern. We also propose the Bimodal Insertion Policy (BIP) as an enhancement of LIP that adapts to changes in the working set while maintaining the thrashing protection of LIP. We finally propose a Dynamic Insertion Policy (DIP) to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses. The proposed insertion policies do not require any change to the existing cache structure, are trivial to implement, and have a storage requirement of less than two bytes. We show that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
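
As a rough illustration of the insertion policies above, the following sketch models a single set with list-based LRU and switches between MRU insertion (traditional LRU), LRU insertion (LIP), and bimodal insertion (BIP). Class and parameter names are illustrative, not taken from the paper.

```python
import random

class InsertionPolicySet:
    """One cache set with LRU replacement and a configurable insertion policy."""

    def __init__(self, ways, policy="LRU", bip_epsilon=1 / 32):
        self.ways = ways
        self.policy = policy            # "LRU", "LIP", or "BIP"
        self.bip_epsilon = bip_epsilon  # BIP: probability of inserting at MRU
        self.lines = []                 # index 0 = MRU, last index = LRU

    def access(self, tag):
        if tag in self.lines:                  # hit: promote to MRU
            self.lines.remove(tag)
            self.lines.insert(0, tag)
            return True
        if len(self.lines) == self.ways:       # miss: evict the LRU line
            self.lines.pop()
        if self.policy == "LIP" or (
            self.policy == "BIP" and random.random() >= self.bip_epsilon
        ):
            self.lines.append(tag)             # insert at the LRU position
        else:
            self.lines.insert(0, tag)          # traditional MRU insertion
        return False
```

DIP would then choose at run time between BIP and traditional LRU insertion, depending on which of the two currently incurs fewer misses.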

722 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: The results show that the new cache designs with built-in security can defend against cache-based side channel attacks in general, rather than only specific attacks on a given cryptographic algorithm, with very little performance degradation and hardware cost.
Abstract: Software cache-based side channel attacks are a serious new class of threats for computers. Unlike physical side channel attacks that mostly target embedded cryptographic devices, cache-based side channel attacks can also undermine general purpose systems. The attacks are easy to perform, effective on most platforms, and do not require special instruments or excessive computation power. In recently demonstrated attacks on software implementations of ciphers like AES and RSA, the full key can be recovered by an unprivileged user program performing simple timing measurements based on cache misses. We first analyze these attacks, identifying cache interference as their root cause. We identify two basic mitigation approaches: the partition-based approach eliminates cache interference, whereas the randomization-based approach randomizes cache interference so that zero information can be inferred. We present new security-aware cache designs, the Partition-Locked cache (PLcache) and Random Permutation cache (RPcache), analyze and prove their security, and evaluate their performance. Our results show that our new cache designs with built-in security can defend against cache-based side channel attacks in general, rather than only specific attacks on a given cryptographic algorithm, with very little performance degradation and hardware cost.
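
A toy illustration of the randomization idea behind RPcache: the mapping from address to cache set is routed through a per-context random permutation table, so an attacker observing evictions in its own sets learns little about which sets the victim touches. The names and structure below are assumptions for illustration, not the paper's hardware design.

```python
import random

class PermutedIndexCache:
    """Toy model: per-context random permutation of set indices (RPcache-like)."""

    def __init__(self, num_sets, line_bytes=64):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.perm_tables = {}   # one secret permutation per protected context

    def _table(self, context_id):
        if context_id not in self.perm_tables:
            table = list(range(self.num_sets))
            random.shuffle(table)            # secret, per-context permutation
            self.perm_tables[context_id] = table
        return self.perm_tables[context_id]

    def set_index(self, context_id, address):
        natural_index = (address // self.line_bytes) % self.num_sets
        return self._table(context_id)[natural_index]
```

In the actual RPcache design the permutation lives in hardware and can be changed dynamically when interference is detected; the sketch only shows why the attacker's view of set indices is decorrelated from the victim's addresses.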

594 citations


Proceedings ArticleDOI
10 Feb 2007
TL;DR: This paper proposes a hardware transactional memory system called LogTM Signature Edition (LogTM-SE), which uses signatures to summarize a transaction's read- and write-sets and detects conflicts on coherence requests (eager conflict detection), and allows cache victimization, unbounded nesting, thread context switching and migration, and paging.
Abstract: This paper proposes a hardware transactional memory (HTM) system called LogTM Signature Edition (LogTM-SE). LogTM-SE uses signatures to summarize a transaction's read- and write-sets and detects conflicts on coherence requests (eager conflict detection). Transactions update memory "in place" after saving the old value in a per-thread memory log (eager version management). Finally, a transaction commits locally by clearing its signature, resetting the log pointer, etc., while aborts must undo the log. LogTM-SE achieves two key benefits. First, signatures and logs can be implemented without changes to highly-optimized cache arrays because LogTM-SE never moves cached data, changes a block's cache state, or flash-clears bits in the cache. Second, transactions are more easily virtualized because signatures and logs are software accessible, allowing the operating system and runtime to save and restore this state. In particular, LogTM-SE allows cache victimization, unbounded nesting (both open and closed), thread context switching and migration, and paging.
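
A compact sketch of how per-thread signatures can summarize read- and write-sets and detect conflicts on incoming coherence requests; a plain Bloom filter stands in for whatever hash structure the hardware uses, so this is illustrative only.

```python
import hashlib

class TransactionSignature:
    """Bloom-filter summary of a transaction's read- and write-sets."""

    def __init__(self, bits=1024, hashes=4):
        self.bits = bits
        self.hashes = hashes
        self.read_sig = 0
        self.write_sig = 0

    def _positions(self, block_addr):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{block_addr}".encode()).digest()
            yield int.from_bytes(h[:4], "little") % self.bits

    def _add(self, sig, block_addr):
        for p in self._positions(block_addr):
            sig |= 1 << p
        return sig

    def record_read(self, addr):
        self.read_sig = self._add(self.read_sig, addr)

    def record_write(self, addr):
        self.write_sig = self._add(self.write_sig, addr)

    def _maybe_in(self, sig, addr):
        return all(sig & (1 << p) for p in self._positions(addr))

    def conflicts_with(self, addr, remote_is_write):
        # Eager detection: a remote write conflicts with our reads or writes,
        # a remote read conflicts only with our writes. Bloom filters allow
        # false positives (spurious aborts) but never false negatives.
        if remote_is_write:
            return self._maybe_in(self.read_sig, addr) or self._maybe_in(self.write_sig, addr)
        return self._maybe_in(self.write_sig, addr)
```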

384 citations


Journal ArticleDOI
TL;DR: It is demonstrated that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied.
Abstract: We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.

319 citations


Proceedings ArticleDOI
21 Mar 2007
TL;DR: The design and implementation of a scheme to schedule threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs) available in today's processing units are described and reductions in cross-chip cache accesses are demonstrated.
Abstract: The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses. In this paper we describe the design and implementation of a scheme to schedule threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs) available in today's processing units. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way Power5 SMP-CMP-SMT multi-processor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we are able to demonstrate reductions in cross-chip cache accesses of up to 70%. These reductions lead to application-reported performance improvements of up to 7%.
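
A hedged sketch of the overall idea: sample cross-thread sharing per thread pair (here simply a matrix assumed to be filled by a PMU-driven sampler), then greedily co-locate the heaviest-sharing pairs on the same chip. The clustering heuristic and names are assumptions, not the paper's exact algorithm.

```python
def assign_threads_to_chips(threads, sharing, num_chips, cores_per_chip):
    """Greedy placement of threads onto chips.

    sharing[(a, b)] holds a sampled cross-thread sharing intensity, e.g. the
    number of remote-cache accesses observed between threads a and b.
    """
    chips = [[] for _ in range(num_chips)]
    placed = {}

    def least_loaded():
        return min(range(num_chips), key=lambda c: len(chips[c]))

    # Visit thread pairs from most to least observed sharing.
    for (a, b), _ in sorted(sharing.items(), key=lambda kv: -kv[1]):
        for thread, partner in ((a, b), (b, a)):
            if thread in placed:
                continue
            # Prefer the partner's chip when the partner is placed and has room.
            if partner in placed and len(chips[placed[partner]]) < cores_per_chip:
                target = placed[partner]
            else:
                target = least_loaded()
            chips[target].append(thread)
            placed[thread] = target

    # Threads with no observed sharing go wherever the load is lowest.
    for thread in threads:
        if thread not in placed:
            target = least_loaded()
            chips[target].append(thread)
            placed[thread] = target

    return chips
```

The real scheme works inside the OS scheduler and builds its sharing estimates online from PMU samples; the sketch only shows the placement step.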

289 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: For workloads that can benefit from cache partitioning, CCP achieves up to 60%, and on average 12%, better performance than the exhaustive search of optimal static partitions, and provides the best results on almost all evaluation metrics for different cache sizes.
Abstract: This paper presents Cooperative Cache Partitioning (CCP) to allocate cache resources among threads concurrently running on CMPs. Unlike cache partitioning schemes that use a single spatial partition repeatedly throughout a stable program phase, CCP resolves cache contention with multiple time-sharing partitions. Time-sharing cache resources among partitions allows each thrashing thread to speed up dramatically in at least one partition by unfairly shrinking other threads' capacity allocations, while improving fairness by giving different partitions an equal chance to execute. Quality-of-Service (QoS) is guaranteed over the long term by orchestrating the shrink and expansion of each thread's capacity across partitions to bound the average slowdown. Time-sharing based cache partitioning is further integrated with CMP cooperative caching [6] to exploit the benefits of LRU-based latency optimizations, which leads to a simplified partitioning algorithm and better performance for workloads that do not benefit from cache partitioning. We evaluate the effectiveness of CCP by simulating a 4-core CMP running all combinations of 7 representative SPEC2000 benchmarks. For workloads that can benefit from cache partitioning, CCP achieves up to 60%, and on average 12%, better performance than the exhaustive search of optimal static partitions. Overall, CCP provides the best results on almost all evaluation metrics for different cache sizes.
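
A rough sketch of the time-sharing idea: instead of one static way-partition, cycle through several partitions, each of which temporarily expands one thrashing thread at the expense of the others, so every such thread periodically gets a favorable interval. The partition construction below is a simplification with invented names, not CCP's actual algorithm.

```python
from itertools import cycle

def build_time_sharing_partitions(threads, total_ways, thrashing, min_ways=1):
    """One partition per thrashing thread; the favored thread gets all spare ways."""
    partitions = []
    for favored in thrashing:
        part = {t: min_ways for t in threads}
        part[favored] = total_ways - min_ways * (len(threads) - 1)
        partitions.append(part)
    return partitions

# Example: 4 threads share 16 ways, and threads "A" and "B" thrash.
parts = build_time_sharing_partitions(["A", "B", "C", "D"], 16, ["A", "B"])
schedule = cycle(parts)             # apply the next partition at each epoch
epoch_partition = next(schedule)    # e.g. {"A": 13, "B": 1, "C": 1, "D": 1}
```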

280 citations


Patent
06 Dec 2007
TL;DR: In this article, an apparatus, system, and method for solid-state storage as cache for high-capacity, non-volatile storage are described. The system is organized around a cache front-end module and a cache back-end module.
Abstract: An apparatus, system, and method are disclosed for solid-state storage as cache for high-capacity, non-volatile storage. The apparatus, system, and method are provided with a plurality of modules including a cache front-end module and a cache back-end module. The cache front-end module manages data transfers associated with a storage request. The data transfers between a requesting device and solid-state storage function as cache for one or more HCNV storage devices, and the data transfers may include one or more of data, metadata, and metadata indexes. The solid-state storage may include an array of non-volatile, solid-state data storage elements. The cache back-end module manages data transfers between the solid-state storage and the one or more HCNV storage devices.

238 citations


Patent
23 Mar 2007
TL;DR: A cache includes an object cache layer and a byte cache layer, each configured to store information to storage devices included in the cache appliance; an application proxy layer may also be included.
Abstract: A cache includes an object cache layer and a byte cache layer, each configured to store information to storage devices included in the cache appliance. An application proxy layer may also be included. In addition, the object cache layer may be configured to identify content that should not be cached by the byte cache layer, which itself may be configured to compress contents of the object cache layer. In some cases the contents of the byte cache layer may be stored as objects within the object cache.
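
A loose sketch of the layering described above: an object-level lookup is tried first, backed by a byte/chunk-level store that keeps its contents compressed, with the option of marking some objects as not byte-cacheable. Class and method names are invented for illustration.

```python
import zlib

class ByteCacheLayer:
    """Chunk-level store; contents are kept compressed."""
    def __init__(self):
        self.chunks = {}

    def put(self, key, data):
        self.chunks[key] = zlib.compress(data)

    def get(self, key):
        blob = self.chunks.get(key)
        return zlib.decompress(blob) if blob is not None else None

class ObjectCacheLayer:
    """Whole-object store that can exclude objects from byte caching."""
    def __init__(self, byte_cache):
        self.byte_cache = byte_cache
        self.objects = {}
        self.no_byte_cache = set()   # e.g. content already compressed upstream

    def put(self, name, data, byte_cacheable=True):
        self.objects[name] = data
        if byte_cacheable:
            self.byte_cache.put(name, data)   # object also stored as a byte-cache entry
        else:
            self.no_byte_cache.add(name)

    def get(self, name):
        if name in self.objects:
            return self.objects[name]
        return self.byte_cache.get(name)
```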

208 citations


Journal ArticleDOI
TL;DR: This work presents the first quantitative, analytical results for the predictability of replacement policies, and introduces three metrics, evict, fill, and mls that capture aspects of cache-state predictability.
Abstract: Hard real-time systems must obey strict timing constraints. Therefore, one needs to derive guarantees on the worst-case execution times of a system's tasks. In this context, predictable behavior of system components is crucial for the derivation of tight and thus useful bounds. This paper presents results about the predictability of common cache replacement policies. To this end, we introduce three metrics, evict, fill, and mls that capture aspects of cache-state predictability. A thorough analysis of the LRU, FIFO, MRU, and PLRU policies yields the respective values under these metrics. To the best of our knowledge, this work presents the first quantitative, analytical results for the predictability of replacement policies. Our results support empirical evidence in static cache analysis.

202 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
Abstract: In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3--1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

174 citations


Patent
22 Feb 2007
TL;DR: In this article, cache coherency circuitry ensures that data accessed by each processing unit is up-to-date and has snoop indication circuitry whose content is derived from the already-provided segment filtering data.
Abstract: Each of plural processing units has a cache, and each cache has indication circuitry containing segment filtering data. The indication circuitry responds to an address specified by an access request from an associated processing unit to reference the segment filtering data to indicate whether the data is either definitely not stored or is potentially stored in that segment. Cache coherency circuitry ensures that data accessed by each processing unit is up-to-date and has snoop indication circuitry whose content is derived from the already-provided segment filtering data. For certain access requests, the cache coherency circuitry initiates a coherency operation during which the snoop indication circuitry determines whether any of the caches requires a snoop operation. For each cache that does, the cache coherency circuitry issues a notification to that cache identifying the snoop operation to be performed.

Patent
16 Jan 2007
TL;DR: In this article, the authors present an apparatus for caching data in a network, with the apparatus including a proxy cache configured to receive request for an object from a client and to fetch data blocks from a server.
Abstract: In one embodiment, the invention provides an apparatus for caching data in a network, with the apparatus including a proxy cache configured to receive request for an object from a client and to fetch data blocks from a server. The proxy cache may be configured to cache the data blocks in a hierarchical relationship within the object. The object may be, for example, a data file or a directory. The data blocks that are cached in the proxy cache define an active data set which is based upon a request from a client.

Proceedings ArticleDOI
10 Feb 2007
TL;DR: This work proposes a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically and shows that this scheme outperforms a private and shared cache organization as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighbor core partitions.
Abstract: The significant speed-gap between processor and memory and the limited chip memory bandwidth make last-level cache performance crucial for future chip multiprocessors. To use the capacity of shared last-level caches efficiently and to allow for a short access time, proposed non-uniform cache architectures (NUCAs) are organized into per-core partitions. If a core runs out of cache space, blocks are typically relocated to nearby partitions, thus managing the cache as a shared cache. This uncontrolled sharing of all resources may unfortunately result in pollution that degrades performance. We propose a novel non-uniform cache architecture in which the amount of cache space that can be shared among the cores is controlled dynamically. The adaptive scheme continuously estimates the effect of increasing or decreasing the shared partition size on the overall performance. We show that our scheme outperforms a private and shared cache organization as well as a hybrid NUCA organization in which blocks in a local partition can spill over to neighboring core partitions.

Proceedings ArticleDOI
09 Jun 2007
TL;DR: This paper proposes OneTM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache, and introduces the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible.
Abstract: Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory to unbounded transactional memory, a crucial step toward transactions becoming a general-purpose primitive. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging, and as a result, many proposed implementations are complex. This paper explores a different approach. First, we introduce the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible. Second, we propose OneTM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache. These mechanisms work synergistically to provide a simple and fast unbounded transactional memory system. The permissions-only cache efficiently maintains the coherence permissions (but not data) for blocks read or written transactionally that have been evicted from the processor's caches. By holding coherence permissions for these blocks, the regular cache coherence protocol can be used to detect transactional conflicts using only a few bits of on-chip storage per overflowed cache block. OneTM allows only one overflowed transaction at a time, relying on the permissions-only cache to ensure that overflow is infrequent. We present two implementations. In OneTM-Serialized, an overflowed transaction simply stalls all other threads in the application. In OneTM-Concurrent, non-overflowed transactions and non-transactional code can execute concurrently with the overflowed transaction, providing more concurrency while retaining OneTM's core simplifying assumption.

Proceedings ArticleDOI
01 Oct 2007
TL;DR: This work proposes to directly predict reuse-distances via instruction-based (PC) prediction and use this information for cache level optimizations, and evaluates the reuse-distance based replacement policy of the L2 cache using a subset of the most memory intensive SPEC2000.
Abstract: Several cache management techniques have been proposed that indirectly try to base their decisions on cacheline reuse-distance, like Cache Decay, which is a postdiction of reuse-distances: if a cacheline has not been accessed for some "decay interval", we know that its reuse-distance is at least as large as this decay interval. In this work, we propose to directly predict reuse-distances via instruction-based (PC) prediction and use this information for cache level optimizations. In this paper, we choose as our target for optimization the replacement policy of the L2 cache, because the gap between the LRU and the theoretical optimal replacement algorithm is comparatively large for L2 caches. This indicates that, in many situations, there is ample room for improvement. We evaluate our reuse-distance based replacement policy using a subset of the most memory intensive SPEC2000 benchmarks, and our results show significant benefits across the board.
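
A simplified sketch of instruction-based (PC) reuse-distance prediction: record, per PC, how far apart accesses to the lines it touches tend to be, then evict the line whose predicted next use lies furthest in the future. This is a simplification of the paper's mechanism, with invented names.

```python
from collections import defaultdict

class PCReuseDistancePredictor:
    def __init__(self, alpha=0.25):
        self.alpha = alpha                           # smoothing for the running estimate
        self.predicted = defaultdict(lambda: None)   # PC -> predicted reuse distance
        self.last_access = {}                        # line address -> (PC, access time)
        self.now = 0

    def observe(self, pc, line_addr):
        """Call on every L2 access; trains the per-PC predictor."""
        self.now += 1
        if line_addr in self.last_access:
            prev_pc, prev_time = self.last_access[line_addr]
            distance = self.now - prev_time          # observed reuse distance
            old = self.predicted[prev_pc]
            self.predicted[prev_pc] = (
                distance if old is None else (1 - self.alpha) * old + self.alpha * distance
            )
        self.last_access[line_addr] = (pc, self.now)

    def predicted_next_use(self, pc, last_touch_time):
        d = self.predicted[pc]
        return float("inf") if d is None else last_touch_time + d

def choose_victim(set_lines, predictor):
    """set_lines: list of (line_addr, last_pc, last_touch_time) for one cache set."""
    return max(set_lines,
               key=lambda line: predictor.predicted_next_use(line[1], line[2]))[0]
```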

Patent
15 Feb 2007
TL;DR: In this paper, a server computer identifies a cached document and its associated cache update history in response to a request or in anticipation of a request from a client computer, and analyzes the document's cached update history to determine if the cached document is de facto fresh.
Abstract: A server computer identifies a cached document and its associated cache update history in response to a request or in anticipation of a request from a client computer. The server computer analyzes the document's cache update history to determine if the cached document is de facto fresh. If the cached document is de facto fresh, the server computer then transmits the cached document to the client computer. Independently, the server computer also fetches an instance of the document from another source like a web host and updates the document's cache update history using the fetched instance of the document.
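
A hedged sketch of the "de facto fresh" test: if the document's recorded update history shows it has not actually changed across several recent fetches and the newest fetch is reasonably recent, the cached copy is served immediately while a fresh instance is fetched independently. The thresholds and field names are made up for illustration.

```python
import time

def is_de_facto_fresh(history, min_unchanged_fetches=3, max_age_seconds=3600):
    """history: list of (fetch_time, content_fingerprint) pairs, newest last.

    Treat the entry as de facto fresh if the last few fetched instances were
    identical and the newest one is not too old.
    """
    if len(history) < min_unchanged_fetches:
        return False
    recent = history[-min_unchanged_fetches:]
    unchanged = len({fingerprint for _, fingerprint in recent}) == 1
    fresh_enough = (time.time() - recent[-1][0]) <= max_age_seconds
    return unchanged and fresh_enough

# Serving path (sketch): if is_de_facto_fresh(entry_history) holds, the server
# replies with the cached document immediately and, independently, fetches a
# new instance from the origin to extend the entry's update history.
```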

Journal ArticleDOI
TL;DR: In GroCoca, a GROup-based COoperative CAching scheme, a family of algorithms is proposed to discover and maintain all TCGs dynamically and two cooperative cache management protocols are designed to control data replicas and improve data accessibility in TCGs.
Abstract: In a mobile cooperative caching environment, we observe the need for cooperating peers to cache useful data items together, so as to improve cache hits from peers. This is achieved by capturing the data requirements of individual peers in conjunction with their mobility pattern, which we realize via a GROup-based COoperative CAching scheme (GroCoca). In GroCoca, we define a tightly-coupled group (TCG) as a collection of peers that possess a similar mobility pattern and display similar data affinity. A family of algorithms is proposed to discover and maintain all TCGs dynamically. Furthermore, two cooperative cache management protocols, namely cooperative cache admission control and replacement, are designed to control data replicas and improve data accessibility in TCGs. A cache signature scheme is also adopted in GroCoca in order to provide information for the mobile clients to determine whether their TCG members are likely caching their desired data items and to perform cooperative cache replacement. Experimental results show that GroCoca outperforms the conventional caching scheme and standard COoperative CAching scheme (COCA) in terms of access latency and global cache hit ratio. However, GroCoca generally incurs higher power consumption.

Proceedings ArticleDOI
01 Dec 2007
TL;DR: The tagless hit instruction cache (TH-IC) is proposed, a technique for completely eliminating the performance penalty associated with filter caches, as well as a further reduction in energy consumption due to not having to access the tag array on cache hits.
Abstract: Very small instruction caches have been shown to greatly reduce fetch energy. However, for many applications the use of a small filter cache can lead to an unacceptable increase in execution time. In this paper, we propose the Tagless Hit Instruction Cache (TH-IC), a technique for completely eliminating the performance penalty associated with filter caches, as well as a further reduction in energy consumption due to not having to access the tag array on cache hits. Using a few metadata bits per line, we are able to more efficiently track the cache contents and guarantee when hits will occur in our small TH-IC. When a hit is not guaranteed, we can instead fetch directly from the L1 instruction cache, eliminating any additional cycles due to a TH-IC miss. Experimental results show that the overall processor energy consumption can be significantly reduced due to the faster application running time and the elimination of tag comparisons for most of the accesses.

Patent
30 Jul 2007
TL;DR: In this paper, a mechanism for selectively disabling and enabling read caching based on past performance of the cache and current read/write requests is proposed to improve overall performance by using an autonomic algorithm to disable read caching.
Abstract: A mechanism for selectively disabling and enabling read caching based on past performance of the cache and current read/write requests. The system improves overall performance by using an autonomic algorithm to disable read caching for regions of backend disk storage (i.e., the backstore) that have had historically low cache hit ratios. As a result, more cache becomes available for workloads with larger hit ratios, and less time and fewer machine cycles are spent searching the cache for data that is unlikely to be there.
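
A small sketch of the autonomic toggle described above: track hit ratios per backstore region and stop read-caching regions whose history stays below a threshold. The thresholds, window size, and region granularity are assumptions, not values from the patent.

```python
class RegionReadCachePolicy:
    """Enable/disable read caching per backstore region from observed hit ratios."""

    def __init__(self, disable_below=0.05, enable_above=0.20, window=1000):
        self.disable_below = disable_below
        self.enable_above = enable_above
        self.window = window
        self.stats = {}        # region -> (hits, lookups) in the current window
        self.disabled = set()

    def record_lookup(self, region, was_hit):
        hits, lookups = self.stats.get(region, (0, 0))
        self.stats[region] = (hits + int(was_hit), lookups + 1)
        self._reconsider(region)

    def _reconsider(self, region):
        hits, lookups = self.stats[region]
        if lookups < self.window:
            return
        ratio = hits / lookups
        if ratio < self.disable_below:
            self.disabled.add(region)        # stop polluting the cache for this region
        elif ratio > self.enable_above:
            self.disabled.discard(region)    # region has become cache-friendly again
        self.stats[region] = (0, 0)          # start a fresh measurement window
        # A real system would also periodically re-probe disabled regions
        # (e.g. via shadow statistics); that is omitted here.

    def should_cache_read(self, region):
        return region not in self.disabled
```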

Patent
25 May 2007
TL;DR: In this article, a multicore processor comprises a plurality of cache memories, each associated with one cache memory, and each of the cache memories is configured to maintain at least a portion of cache memory in which each cache line is dynamically managed as either local to the associated processor core or shared among multiple processor cores.
Abstract: A multicore processor comprises a plurality of cache memories, and a plurality of processor cores, each associated with one of the cache memories. Each of at least some of the cache memories is configured to maintain at least a portion of the cache memory in which each cache line is dynamically managed as either local to the associated processor core or shared among multiple processor cores.

Proceedings ArticleDOI
09 Jun 2007
TL;DR: This work extends the widely-used CACTI cache modeling tool to take network design parameters into account and proposes novel cache access optimizations that introduce heterogeneity within the inter-bank network to alleviate the interconnect delay bottleneck.
Abstract: The ever increasing sizes of on-chip caches and the growing domination of wire delay necessitate significant changes to cache hierarchy design methodologies. Many recent proposals advocate splitting the cache into a large number of banks and employing a network-on-chip (NoC) to allow fast access to nearby banks (referred to as Non-Uniform Cache Architectures--NUCA). Most studies on NUCA organizations have assumed a generic NoC and focused on logical policies for cache block placement, movement, and search. Since wire/router delay and power are major limiting factors in modern processors, this work focuses on interconnect design and its influence on NUCA performance and power. We extend the widely-used CACTI cache modeling tool to take network design parameters into account. With these overheads appropriately accounted for, the optimal cache organization is typically very different from that assumed in prior NUCA studies. To alleviate the interconnect delay bottleneck, we propose novel cache access optimizations that introduce heterogeneity within the inter-bank network. The careful consideration of interconnect choices for a large cache results in a 51% performance improvement over a baseline generic NoC, and the introduction of heterogeneity within the network yields an additional 11-15% performance improvement.

Journal ArticleDOI
TL;DR: The 16-way set associative, single-ported 16-MB cache for the Dual-Core Intel Xeon Processor 7100 Series uses a 0.624 μm² cell in a 65-nm 8-metal technology to minimize both leakage and dynamic power.
Abstract: The 16-way set associative, single-ported 16-MB cache for the Dual-Core Intel Xeon Processor 7100 Series uses a 0.624 μm² cell in a 65-nm 8-metal technology. Low power techniques are implemented in the L3 cache to minimize both leakage and dynamic power. Sleep transistors are used in the SRAM array and peripherals, reducing the cache leakage by more than 2X. Only 0.8% of the cache is powered up for a cache access. Dynamic cache line disable (Intel Cache Safe Technology) with a history buffer protects the cache from latent defects and infant mortality failures.

Proceedings ArticleDOI
01 Dec 2007
TL;DR: This work proposes a novel replacement strategy that mimics the replacement decisions of OPT and can cover 40% of the gap between OPT and LRU for a 2MB cache resulting in 7% overall speedup.
Abstract: The inherent temporal locality in memory accesses is filtered out by the L1 cache. As a consequence, an L2 cache with LRU replacement incurs significantly higher misses than the optimal replacement policy (OPT). We propose to narrow this gap through a novel replacement strategy that mimics the replacement decisions of OPT. The L2 cache is logically divided into two components, a Shepherd Cache (SC) with a simple FIFO replacement and a Main Cache (MC) with an emulation of optimal replacement. The SC plays the dual role of caching lines and guiding the replacement decisions in MC. Our proposed organization can cover 40% of the gap between OPT and LRU for a 2MB cache resulting in 7% overall speedup. Comparison with the dynamic insertion policy, a victim buffer, a V-Way cache and an LRU based fully associative cache demonstrates that our scheme performs better than all these strategies.

Proceedings Article
13 Feb 2007
TL;DR: Karma is presented, a global non-centralized, dynamic and informed management policy for multiple levels of cache that leverages application hints to make informed allocation and replacement decisions in all cache levels, preserving exclusive caching and adjusting to changes in access patterns.
Abstract: Multilevel caching, common in many storage configurations, introduces new challenges to traditional cache management: data must be kept in the appropriate cache and replication avoided across the various cache levels. Some existing solutions focus on avoiding replication across the levels of the hierarchy, working well without information about temporal locality (information missing at all but the highest level of the hierarchy). Others use application hints to influence cache contents. We present Karma, a global non-centralized, dynamic and informed management policy for multiple levels of cache. Karma leverages application hints to make informed allocation and replacement decisions in all cache levels, preserving exclusive caching and adjusting to changes in access patterns. We show the superiority of Karma through comparison to existing solutions including LRU, 2Q, ARC, MultiQ, LRU-SP, and Demote, demonstrating better cache performance than all other solutions and up to 85% better performance than LRU on representative workloads.

Patent
Holger Karn, Sven Miller
11 Apr 2007
TL;DR: In this paper, a computer-implemented method is described to collect cache-efficiency-indicator values of at least one cache fragment during operation of a database system over a period of time.
Abstract: A computer-implemented method is disclosed. The method includes collecting cache-efficiency-indicator values of at least one cache fragment during operation of a database system over a period of time, and providing approximation-function-parameter values for the collected cache-efficiency-indicator values, where an approximation function represents the relation between a cache-efficiency indicator and the size of the respective cache fragment. The method continues by providing a set of workload windows based on the approximation-function-parameter values, and then providing workload-window information for the set of workload windows, the workload-window information including at least one approximation-function-parameter value representing each determined workload window. The method further includes storing the workload-window information for a comparison based on current cache-efficiency-indicator values and the workload-window information.

Proceedings ArticleDOI
10 Feb 2007
TL;DR: This work proposes line distillation (LDIS), a technique that retains only the used words and evicts the unused words in a cache line, and proposes distill cache, a cache organization to utilize the capacity created by LDIS.
Abstract: Caches are organized at a line-size granularity to exploit spatial locality. However, when spatial locality is low, many words in the cache line are not used. Unused words occupy cache space but do not contribute to cache hits. Filtering these words can allow the cache to store more cache lines. We show that unused words in a cache line are unlikely to be accessed in the less recent part of the LRU stack. We propose line distillation (LDIS), a technique that retains only the used words and evicts the unused words in a cache line. We also propose the distill cache, a cache organization to utilize the capacity created by LDIS. Our experiments with 16 memory-intensive benchmarks show that LDIS reduces the average misses for a 1MB 8-way L2 cache by 30% and improves the average IPC by 12%.
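
A rough model of the distillation step, assuming per-word used bits are tracked while the line is resident: on eviction, only the words that were actually touched are kept in a compact form. Sizes and structure are illustrative.

```python
WORDS_PER_LINE = 8

class CacheLine:
    def __init__(self, tag, words):
        self.tag = tag
        self.words = list(words)                 # full line data
        self.used = [False] * WORDS_PER_LINE     # per-word footprint bits

    def touch(self, word_index):
        self.used[word_index] = True
        return self.words[word_index]

def distill_on_eviction(line):
    """Return a compact word-organized entry holding only the used words."""
    kept = {i: line.words[i] for i, used in enumerate(line.used) if used}
    return {"tag": line.tag, "words": kept}      # unused words are dropped

# Several distilled entries can share the space of one full line, which is the
# extra capacity the distill cache organization exploits.
```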

Patent
21 Aug 2007
TL;DR: In this paper, a facility for determining whether to consistency-check a cache entry is described, where the facility randomly or pseudorandomly selects a value in a range, and if the selected value satisfies a predetermined consistency-checking threshold within the range, the facility consistency-checks the entry, and may decide to propagate this knowledge to other cache managers.
Abstract: A facility for determining whether to consistency-check a cache entry is described. The facility randomly or pseudorandomly selects a value in a range. If the selected value satisfies a predetermined consistency-checking threshold within the range, the facility consistency-checks the entry, and may decide to propagate this knowledge to other cache managers. If, on the other hand, the selected value does not satisfy the consistency-checking threshold, the facility determines not to consistency-check the entry.
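
The described selection reduces to a simple probabilistic gate; a minimal sketch follows, with illustrative parameter names.

```python
import random

def should_consistency_check(threshold, range_max=1.0):
    """Consistency-check an entry iff a random draw in [0, range_max) falls
    below the configured threshold, i.e. with probability threshold/range_max."""
    return random.uniform(0, range_max) < threshold

# Example: check roughly 2% of cache reads; on a check that finds the entry
# stale, a cache manager could also propagate that knowledge to its peers.
if should_consistency_check(threshold=0.02):
    pass  # revalidate the entry with the origin and notify other caches if desired
```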

Journal ArticleDOI
TL;DR: A proposed relay-peer-based cache consistency protocol offers a generic and flexible method for carrying out cache invalidation in mobile wireless environments.
Abstract: The trend toward wireless communications and advances in mobile technologies are increasing consumer demand for ubiquitous access to Internet-based information and services. A 3D framework provides a basis for designing, analyzing, and evaluating strategies to address data consistency issues in mobile wireless environments. A proposed relay-peer-based cache consistency protocol offers a generic and flexible method for carrying out cache invalidation.

Patent
26 Jul 2007
TL;DR: In this paper, the cache of user devices is encouraged to maintain more current data prior to requesting web pages that would invoke the download of the more recent data, by modifying markup language files that are downloaded to the user device.
Abstract: The delivery of content is improved from a user's perspective by encouraging the cache of user devices to maintain more current data prior to requesting web pages that would invoke the download of that data. A cache simulation of the cache in the user devices is maintained. When objects within the cache simulator are expected to require updating, actions are taken to encourage the user device to update the cache. The actions include modifying markup language files that are downloaded to the user device to invoke the update. For instance, artificial URLs that request the updated objects, embedded scripts, user selection objects, or the like are used to invoke an update process.

Proceedings ArticleDOI
30 Sep 2007
TL;DR: Techniques exploring the lockdown of instruction caches at compile-time to minimize WCETs are presented, which explicitly take the worst-case execution path into account during each step of the optimization procedure.
Abstract: Caches are notorious for their unpredictability. It is difficult or even impossible to predict if a memory access results in a definite cache hit or miss. This unpredictability is highly undesired for real-time systems. The Worst-Case Execution Time (WCET) of software running on an embedded processor is one of the most important metrics during real-time system design. The WCET depends to a large extent on the total amount of time spent for memory accesses. In the presence of caches, WCET analysis must always assume a memory access to be a cache miss if it cannot be guaranteed that it is a hit. Hence, WCETs for cached systems are imprecise due to the overestimation caused by the caches. Modern caches can be controlled by software. The software can load parts of its code or of its data into the cache and lock the cache afterwards. Cache locking prevents the cache's contents from being flushed by deactivating the replacement. A locked cache is highly predictable and leads to very precise WCET estimates, because the uncertainty caused by the replacement strategy is eliminated completely. This paper presents techniques exploring the lockdown of instruction caches at compile-time to minimize WCETs. In contrast to the current state of the art in the area of cache locking, our techniques explicitly take the worst-case execution path into account during each step of the optimization procedure. This way, we can make sure that those parts of the code that lead to the highest WCET reduction are always the ones locked in the I-cache. The results demonstrate that WCET reductions from 54% up to 73% can be achieved with an acceptable amount of CPU seconds required for the optimization and WCET analyses themselves.
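
The optimization loop described above can be sketched as a greedy procedure: repeatedly re-run WCET analysis, consider the blocks on the current worst-case path, and lock the one promising the largest WCET reduction until the lockable capacity is exhausted. The helper functions stand in for a WCET analyzer and are assumptions, not the paper's actual tooling.

```python
def lock_icache_contents(program, cache_capacity_blocks, analyze_wcet,
                         blocks_on_wcep, wcet_if_locked):
    """Greedy compile-time selection of instruction blocks to lock.

    analyze_wcet(program, locked)      -> WCET estimate with `locked` blocks locked
    blocks_on_wcep(program, locked)    -> blocks on the current worst-case path
    wcet_if_locked(program, locked, b) -> WCET estimate if block b is additionally locked
    """
    locked = set()
    current_wcet = analyze_wcet(program, locked)

    while len(locked) < cache_capacity_blocks:
        candidates = [b for b in blocks_on_wcep(program, locked) if b not in locked]
        if not candidates:
            break
        best = min(candidates, key=lambda b: wcet_if_locked(program, locked, b))
        best_wcet = wcet_if_locked(program, locked, best)
        if best_wcet >= current_wcet:
            break                        # no further reduction possible
        locked.add(best)
        current_wcet = best_wcet         # the worst-case path may now have changed

    return locked, current_wcet
```

Re-evaluating the worst-case path after every locking decision is what distinguishes this style of approach from locking schemes that optimize a fixed path once.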