
Showing papers on "Cache invalidation" published in 2014


Journal ArticleDOI
TL;DR: This paper proposes a novel coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared with previously known schemes, and argues that the performance of the proposed scheme is within a constant factor of the information-theoretic optimum for all values of the problem parameters.
Abstract: Caching is a technique to reduce peak traffic rates by prefetching popular content into memories at the end users. Conventionally, these memories are used to deliver requested content in part from a locally cached copy rather than through the network. The gain offered by this approach, which we term local caching gain, depends on the local cache size (i.e., the memory available at each individual user). In this paper, we introduce and exploit a second, global, caching gain not utilized by conventional caching schemes. This gain depends on the aggregate global cache size (i.e., the cumulative memory available at all users), even though there is no cooperation among the users. To evaluate and isolate these two gains, we introduce an information-theoretic formulation of the caching problem focusing on its basic structure. For this setting, we propose a novel coded caching scheme that exploits both local and global caching gains, leading to a multiplicative improvement in the peak rate compared with previously known schemes. In particular, the improvement can be on the order of the number of users in the network. In addition, we argue that the performance of the proposed scheme is within a constant factor of the information-theoretic optimum for all values of the problem parameters.
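For intuition, the standard two-user, two-file illustration of this global caching gain (a textbook example, not text taken from the paper) works as follows. Let there be $N = 2$ files $A, B$ of $F$ bits each, $K = 2$ users, and a cache of size $M = 1$ file at each user. Split each file into halves, $A = (A_1, A_2)$ and $B = (B_1, B_2)$, and in the placement phase let user 1 cache $Z_1 = (A_1, B_1)$ and user 2 cache $Z_2 = (A_2, B_2)$. If user 1 then requests $A$ and user 2 requests $B$, the server broadcasts the single coded message
$$X = A_2 \oplus B_1$$
of size $F/2$ bits; user 1 recovers $A_2 = X \oplus B_1$ and user 2 recovers $B_1 = X \oplus A_2$ from their caches, so the shared link carries half a file instead of the full file required by uncoded delivery, even though neither user cached the other's request in full.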

1,857 citations


Proceedings ArticleDOI
10 Jun 2014
TL;DR: In this article, the authors studied the optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity, where the cache content placement is optimized based on the demand history.
Abstract: Optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity is studied. The sBS has a large cache memory and provides content-level selective offloading by delivering high data rate contents to users in its coverage area. The goal of the sBS content controller (CC) is to store the most popular contents in the sBS cache memory such that the maximum amount of data can be fetched directly from the sBS, not relying on the limited backhaul resources during peak traffic periods. If the popularity profile is known in advance, the problem reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known by the CC, and it can only observe the instantaneous demand for the cached content. Hence, the cache content placement is optimised based on the demand history. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while exploiting the limited cache capacity in the best way possible. Three algorithms are studied for this cache content placement problem, leading to different exploitation-exploration trade-offs. We provide extensive numerical simulations in order to study the time-evolution of these algorithms, and the impact of the system parameters, such as the number of files, the number of users, the cache size, and the skewness of the popularity profile, on the performance. It is shown that the proposed algorithms quickly learn the popularity profile for a wide range of system parameters.
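As a rough illustration of the learning loop sketched above, here is a minimal epsilon-greedy variant in Python. The refresh-interval structure follows the description in the abstract, but the file names, the epsilon parameter and the Zipf-like demand model are illustrative assumptions, not the paper's three algorithms.

import random
from collections import defaultdict

def place_cache(files, cache_size, demand_history, epsilon=0.1):
    """Choose the files to cache for the next refresh interval."""
    # estimated popularity = mean demand observed in intervals when the file was cached
    est = {f: sum(demand_history[f]) / len(demand_history[f]) if demand_history[f] else 0.0
           for f in files}
    if random.random() < epsilon:
        return set(random.sample(files, cache_size))                    # explore
    return set(sorted(files, key=est.get, reverse=True)[:cache_size])   # exploit

def run_interval(files, cache_size, demand_history, popularity, requests=1000):
    """Simulate one refresh interval; only demand for cached files is observable."""
    cached = place_cache(files, cache_size, demand_history)
    counts = defaultdict(int)
    for _ in range(requests):
        f = random.choices(files, weights=[popularity[f] for f in files])[0]
        if f in cached:
            counts[f] += 1
    for f in cached:
        demand_history[f].append(counts[f])
    return cached

# usage sketch: the true popularity is unknown to the content controller
files = [f"file{i}" for i in range(20)]
popularity = {f: 1.0 / (i + 1) for i, f in enumerate(files)}
history = defaultdict(list)
for _ in range(50):
    run_interval(files, cache_size=5, demand_history=history, popularity=popularity)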

322 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: A novel random fill cache architecture is proposed that replaces demand fetch with random cache fill within a configurable neighborhood window and shows that it provides information-theoretic security against reuse based attacks.
Abstract: Correctly functioning caches have been shown to leak critical secrets like encryption keys, through various types of cache side-channel attacks. This nullifies the security provided by strong encryption and allows confidentiality breaches, impersonation attacks and fake services. Hence, future cache designs must consider security, ideally without degrading performance and power efficiency. We introduce a new classification of cache side channel attacks: contention based attacks and reuse based attacks. Previous secure cache designs target only contention based attacks, and we show that they cannot defend against reuse based attacks. We show the surprising insight that the fundamental demand fetch policy of a cache is a security vulnerability that causes the success of reuse based attacks. We propose a novel random fill cache architecture that replaces demand fetch with random cache fill within a configurable neighborhood window. We show that our random fill cache does not degrade performance, and in fact, improves the performance for some types of applications. We also show that it provides information-theoretic security against reuse based attacks.
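A toy software model of the fill policy described above, assuming a direct-mapped cache and a symmetric window of +/- w lines around the missing address; the window size and direct-mapped organization are simplifying assumptions, not the paper's exact design.

import random

class RandomFillCache:
    """On a miss the demanded line is returned to the requester, but the cache
    is filled with a line chosen uniformly at random from the neighborhood
    [miss - w, miss + w], decoupling cache contents from the access pattern."""

    def __init__(self, num_sets, line_size=64, window=8):
        self.num_sets = num_sets
        self.line_size = line_size
        self.window = window
        self.sets = {}                    # set index -> line number currently cached

    def access(self, addr):
        line = addr // self.line_size
        if self.sets.get(line % self.num_sets) == line:
            return "hit"
        # random fill: cache a neighbor of the missing line instead of the line itself
        fill = line + random.randint(-self.window, self.window)
        self.sets[fill % self.num_sets] = fill
        return "miss"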

217 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities and employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads.
Abstract: Recent research advocates large die-stacked DRAM caches in many core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates and the efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads, while predicting and fetching only the useful blocks within each page to minimize the off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming the state-of-the art page-based designs that require impractical SRAM-based tags of around 50MB.

162 citations


Proceedings ArticleDOI
10 Jun 2014
TL;DR: It is proved that the optimal online scheme has approximately the same performance as the optimal offline scheme, in which the cache contents can be updated based on the entire set of popular files before each new request.
Abstract: We consider a basic content distribution scenario consisting of a single origin server connected through a shared bottleneck link to a number of users each equipped with a cache of finite memory. The users issue a sequence of content requests from a set of popular files, and the goal is to operate the caches as well as the server such that these requests are satisfied with the minimum number of bits sent over the shared link. Assuming a basic Markov model for renewing the set of popular files, we characterize approximately the optimal long-term average rate of the shared link. We further prove that the optimal online scheme has approximately the same performance as the optimal offline scheme, in which the cache contents can be updated based on the entire set of popular files before each new request. To support these theoretical results, we propose an online coded caching scheme termed coded least-recently sent (LRS) and simulate it for a demand time series derived from the dataset made available by Netflix for the Netflix Prize. For this time series, we show that the proposed coded LRS algorithm significantly outperforms the popular least-recently used (LRU) caching algorithm.

155 citations


Proceedings ArticleDOI
08 Jul 2014
TL;DR: In this paper, the authors consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache, and show that caching only the most popular files can be highly suboptimal.
Abstract: We consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache. Knowing the popularity distribution of the files, the goal is to optimally populate the caches, such as to minimize the expected load of the shared link. For a single cache, it is well known that storing the most popular files is optimal in this setting. However, we show here that this is no longer the case for multiple caches. Indeed, caching only the most popular files can be highly suboptimal. Instead, a fundamentally different approach is needed, in which the cache contents are used as side information for coded communication over the shared link. We propose such a coded caching scheme and prove that it is close to optimal.

145 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: A specialized cache management policy for GPGPUs is proposed that coordinates cache bypassing with warp throttling to dynamically control the number of active warps, along with a simple predictor that estimates the optimal number of active warps able to take full advantage of the cache space and on-chip resources.
Abstract: With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch of the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy-efficiency. The massive number of memory requests generated by GPUs causes cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.
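A much-simplified model of a reuse-distance-based bypass decision of the kind mentioned above (single stream of accesses, LRU stand-in cache); the threshold rule is an illustrative assumption, and the paper's coordination with warp throttling is not modeled here.

from collections import OrderedDict

class BypassingCache:
    """Blocks whose observed reuse distance exceeds the cache capacity are not
    inserted on a miss, so resident blocks are protected from contention."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()      # block -> None, maintained in LRU order
        self.last_seen = {}             # block -> time of previous access
        self.clock = 0

    def access(self, block):
        self.clock += 1
        reuse = self.clock - self.last_seen[block] if block in self.last_seen else None
        self.last_seen[block] = self.clock
        if block in self.cache:
            self.cache.move_to_end(block)
            return "hit"
        if reuse is not None and reuse > self.capacity:
            return "miss-bypassed"      # unlikely to be reused while resident
        self.cache[block] = None
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
        return "miss-filled"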

142 citations


Patent
14 Jul 2014
TL;DR: In this article, cache optimization techniques are employed to organize resources within caches such that the most requested content (e.g., the most popular content) is more readily available, and the resources propagate through a cache server hierarchy associated with the service provider.
Abstract: Resource management techniques, such as cache optimization, are employed to organize resources within caches such that the most requested content (e.g., the most popular content) is more readily available. A service provider utilizes content expiration data as indicative of resource popularity. As resources are requested, the resources propagate through a cache server hierarchy associated with the service provider. More frequently requested resources are maintained at edge cache servers based on shorter expiration data that is reset with each repeated request. Less frequently requested resources are maintained at higher levels of a cache server hierarchy based on longer expiration data associated with cache servers higher on the hierarchy.

136 citations


Proceedings ArticleDOI
11 Aug 2014
TL;DR: In this paper, a hierarchical content delivery network with two layers of caches is considered, and a new caching scheme that combines two basic approaches is proposed to provide coded multicasting opportunities within each layer and across multiple layers.
Abstract: Caching of popular content during off-peak hours is a strategy to reduce network loads during peak hours. Recent work has shown significant benefits of designing such caching strategies not only to locally deliver part of the content, but also to provide coded multicasting opportunities even among users with different demands. Exploiting both of these gains was shown to be approximately optimal for caching systems with a single layer of caches. Motivated by practical scenarios, we consider, in this paper, a hierarchical content delivery network with two layers of caches. We propose a new caching scheme that combines two basic approaches. The first approach provides coded multicasting opportunities within each layer; the second approach provides coded multicasting opportunities across multiple layers. By striking the right balance between these two approaches, we show that the proposed scheme achieves the optimal communication rates to within a constant multiplicative and additive gap. We further show that there is no tension between the rates in each of the two layers up to the aforementioned gap. Thus, both the layers can simultaneously operate at approximately the minimum rate.

118 citations


Journal ArticleDOI
TL;DR: A Time-To-Live (TTL) based caching model is introduced that assigns a timer to each content stored in the cache and redraws it every time the content is requested (at each hit/miss).
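A minimal sketch of that TTL model, with the timer redrawn from a caller-supplied distribution at every request; the exponential timer used in the example is an arbitrary choice, not part of the model itself.

import random

class TTLCache:
    """Each content gets a timer; the timer is redrawn at every request
    (hit or miss) and the content is evicted once its timer expires."""

    def __init__(self, draw_ttl):
        self.draw_ttl = draw_ttl        # callable returning a fresh timer value
        self.expiry = {}                # content -> expiration time

    def request(self, content, now):
        hit = self.expiry.get(content, float("-inf")) > now
        self.expiry[content] = now + self.draw_ttl()   # redraw on every request
        return hit

    def expire(self, now):
        for c in [c for c, t in self.expiry.items() if t <= now]:
            del self.expiry[c]

# example: exponentially distributed timers with mean 10 time units
cache = TTLCache(lambda: random.expovariate(1 / 10))
print(cache.request("videoA", now=0.0))   # False: first request is a miss
print(cache.request("videoA", now=1.0))   # True if the drawn timer exceeded 1.0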

106 citations


Proceedings ArticleDOI
24 Aug 2014
TL;DR: It is established that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance, and a small SRAM tag cache with a simple technique to throttle the number of sets prefetched is proposed, which can satisfy over 60% of DRAM cache tag accesses on average.
Abstract: 3D-stacking technology has enabled the option of embedding a large DRAM onto the processor. Prior works have proposed to use this as a DRAM cache. Because of its large size (a DRAM cache can be in the order of hundreds of megabytes), the total size of the tags associated with it can also be quite large (in the order of tens of megabytes). The large size of the tags has created a problem. Should we maintain the tags in the DRAM and pay the cost of a costly tag access in the critical path? Or should we maintain the tags in the faster SRAM by paying the area cost of a large SRAM for this purpose? Prior works have primarily chosen the former and proposed a variety of techniques for reducing the cost of a DRAM tag access. In this paper, we first establish (with the help of a study) that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance. Motivated by this study, we ask if it is possible to maintain tags in SRAM without incurring high area overhead. Our key idea is simple. We propose to cache the tags in a small SRAM tag cache - we show that there is enough spatial and temporal locality amongst tag accesses to merit this idea. We propose the ATCache which is a small SRAM tag cache. Similar to a conventional cache, the ATCache caches recently accessed tags to exploit temporal locality; it exploits spatial locality by prefetching tags from nearby cache sets. In order to avoid the high miss latency and cache pollution caused by excessive prefetching, we use a simple technique to throttle the number of sets prefetched. Our proposed ATCache (which consumes 0.4% of overall tag size) can satisfy over 60% of DRAM cache tag accesses on average.

Posted Content
TL;DR: In this article, the authors derived an order-optimal scheme which judiciously shares cache memory among files with different popularities, and derived new information-theoretic lower bounds, which use a sliding-window entropy inequality.
Abstract: To address the exponentially rising demand for wireless content, use of caching is emerging as a potential solution. It has been recently established that joint design of content delivery and storage (coded caching) can significantly improve performance over conventional caching. Coded caching is well suited to emerging heterogeneous wireless architectures which consist of a dense deployment of local-coverage wireless access points (APs) with high data rates, along with sparsely-distributed, large-coverage macro-cell base stations (BS). This enables design of coded caching-and-delivery schemes that equip APs with storage, and place content in them in a way that creates coded-multicast opportunities for combining with macro-cell broadcast to satisfy users even with different demands. Such coded-caching schemes have been shown to be order-optimal with respect to the BS transmission rate, for a system with single-level content, i.e., one where all content is uniformly popular. In this work, we consider a system with non-uniform popularity content which is divided into multiple levels, based on varying degrees of popularity. The main contribution of this work is the derivation of an order-optimal scheme which judiciously shares cache memory among files with different popularities. To show order-optimality we derive new information-theoretic lower bounds, which use a sliding-window entropy inequality, effectively creating a non-cutset bound. We also extend the ideas to when users can access multiple caches along with the broadcast. Finally we consider two extreme cases of user distribution across caches for the multi-level popularity model: a single user per cache (single-user setup) versus a large number of users per cache (multi-user setup), and demonstrate a dichotomy in the order-optimal strategies for these two extreme cases.

Proceedings ArticleDOI
11 Aug 2014
TL;DR: Optimal cache content placement is studied in a wireless infostation network (WIN), which models a limited coverage wireless network with a large cache memory, formulated as a multi-armed bandit problem with switching cost, and an algorithm to solve it is presented.
Abstract: Optimal cache content placement is studied in a wireless infostation network (WIN), which models a limited coverage wireless network with a large cache memory. WIN provides content-level selective offloading by delivering high data rate contents stored in its cache memory to the users through a broadband connection. The goal of the WIN central controller (CC) is to store the most popular content in the cache memory of the WIN such that the maximum amount of data can be fetched directly from the cache rather than being downloaded from the core network. If the popularity profile of the available set of contents is known in advance, the optimization of the cache content reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known, and only the instantaneous demands for those contents stored in the cache can be observed. Hence, the cache content placement is optimised based on the demand history, and on the cost associated to placing each content in the cache. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while at the same time exploiting the limited cache capacity in the best way possible. This problem is formulated as a multi-armed bandit problem with switching cost, and an algorithm to solve it is presented. The performance of the algorithm is measured in terms of regret, which is proven to be logarithmic and sub-linear uniformly over time for a specific and a general case, respectively.

Proceedings ArticleDOI
19 Mar 2014
TL;DR: By adaptively controlling the rate at which the client downloads video segments from the cache, the approach can reduce bitrate oscillations, prevent sudden rate changes, provide traffic savings, and improve the quality of experience of clients.
Abstract: Video streaming is a major source of Internet traffic today and usage continues to grow at a rapid rate. To cope with this new and massive source of traffic, ISPs use methods such as caching to reduce the amount of traffic traversing their networks and serve customers better. However, the presence of a standard cache server in the video transfer path may result in bitrate oscillations and sudden rate changes for Dynamic Adaptive Streaming over HTTP (DASH) clients. In this paper, we investigate the interactions between a client and a cache that result in these problems, and propose an approach to solve them. By adaptively controlling the rate at which the client downloads video segments from the cache, we can ensure that clients will get smooth video. We verify our results using simulation and show that, compared to a standard cache, our approach (1) can reduce bitrate oscillations and (2) prevents sudden rate changes, and, compared to a no-cache scenario, (3) provides traffic savings and (4) improves the quality of experience of clients.

Proceedings ArticleDOI
01 Feb 2014
TL;DR: A Read-Write Partitioning (RWP) policy is proposed that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests.
Abstract: Cache read misses stall the processor if there are no independent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few exceptions, cache lines that serve loads are more critical for performance than cache lines that serve only stores. Unfortunately, traditional cache management mechanisms do not take into account this disparity between read-write criticality. This paper proposes a Read-Write Partitioning (RWP) policy that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests. We show that exploiting the differences in read-write criticality provides better performance over prior cache management mechanisms. For a single-core system, RWP provides 5% average speedup across the entire SPEC CPU2006 suite, and 14% average speedup for cache-sensitive benchmarks, over the baseline LRU replacement policy. We also show that RWP can perform within 3% of a new yet complex instruction-address-based technique, Read Reference Predictor (RRP), that bypasses cache lines which are unlikely to receive any read requests, while requiring only 5.4% of RRP's state overhead. On a 4-core system, our RWP mechanism improves system throughput by 6% over the baseline and outperforms three other state-of-the-art mechanisms we evaluate.
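A heavily simplified sketch of the partitioning idea: clean and dirty LRU lists whose target sizes are periodically nudged toward whichever partition is serving more read hits. The adaptation rule and epoch-based feedback are illustrative assumptions, not the paper's mechanism.

from collections import OrderedDict

class ReadWritePartitionedCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.clean = OrderedDict()          # clean lines in LRU order
        self.dirty = OrderedDict()          # dirty lines in LRU order
        self.target_clean = capacity // 2
        self.read_hits = {"clean": 0, "dirty": 0}

    def _evict_if_needed(self):
        while len(self.clean) + len(self.dirty) > self.capacity:
            # evict from the partition that exceeds its target size
            if len(self.clean) > self.target_clean or not self.dirty:
                self.clean.popitem(last=False)
            else:
                self.dirty.popitem(last=False)

    def access(self, line, is_write):
        part = "dirty" if line in self.dirty else "clean" if line in self.clean else None
        if part is not None:
            lru = self.dirty if part == "dirty" else self.clean
            lru.move_to_end(line)
            if not is_write:
                self.read_hits[part] += 1
            elif part == "clean":           # a write turns a clean line dirty
                del self.clean[line]
                self.dirty[line] = None
            return True
        (self.dirty if is_write else self.clean)[line] = None
        self._evict_if_needed()
        return False

    def adapt(self, step=1):
        # per epoch: grow the partition that served more read hits
        if self.read_hits["clean"] > self.read_hits["dirty"]:
            self.target_clean = min(self.capacity, self.target_clean + step)
        elif self.read_hits["dirty"] > self.read_hits["clean"]:
            self.target_clean = max(0, self.target_clean - step)
        self.read_hits = {"clean": 0, "dirty": 0}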

Patent
30 Apr 2014
TL;DR: In this article, the authors present a method, a system and a server of removing a distributed caching object from a cache server by comparing an active period of a located cache server with an expiration period associated with an object, thus saving the other cache servers from wasting resources to perform removal operations.
Abstract: The present disclosure discloses a method, a system and a server of removing a distributed caching object. In one embodiment, the method receives a removal request, where the removal request includes an identifier of an object. The method may further apply consistent Hashing to the identifier of the object to obtain a Hash result value of the identifier, locates a corresponding cache server based on the Hash result value and renders the corresponding cache server to be a present cache server. In some embodiments, the method determines whether the present cache server is in an active status and has an active period greater than an expiration period associated with the object. Additionally, in response to determining that the present cache server is in an active status and has an active period greater than the expiration period associated with the object, the method removes the object from the present cache server. By comparing an active period of a located cache server with an expiration period associated with an object, the exemplary embodiments precisely locate a cache server that includes the object to be removed and perform a removal operation, thus saving the other cache servers from wasting resources to perform removal operations and hence improving the overall performance of the distributed cache system.
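A minimal sketch of the removal flow described above: a consistent-hash ring locates the one cache server that can hold the object, and the object is removed only when that server is active and its active period exceeds the object's expiration period. Class and field names are illustrative, not taken from the patent.

import bisect
import hashlib
import time

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class CacheServer:
    def __init__(self, name, started_at):
        self.name, self.started_at = name, started_at
        self.active = True
        self.objects = {}

class HashRing:
    def __init__(self, servers):
        self.ring = sorted(((_h(s.name), s) for s in servers), key=lambda t: t[0])
        self.points = [h for h, _ in self.ring]

    def locate(self, object_id):
        """Consistent hashing: the first server clockwise from the object's hash."""
        return self.ring[bisect.bisect(self.points, _h(object_id)) % len(self.ring)][1]

def remove_object(ring, object_id, expiration_period, now=None):
    now = time.time() if now is None else now
    server = ring.locate(object_id)
    # remove only when the located server is active and has been active longer
    # than the object's expiration period; other servers are never touched
    if server.active and (now - server.started_at) > expiration_period:
        server.objects.pop(object_id, None)
        return True
    return False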

Proceedings ArticleDOI
13 Dec 2014
TL;DR: The Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance, is proposed using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account.
Abstract: Cache compression seeks the benefits of a larger cache with the area and power of a smaller cache. Ideally, a compressed cache increases effective capacity by tightly compacting compressed blocks, has low tag and metadata overheads, and allows fast lookups. Previous compressed cache designs, however, fail to achieve all these goals. In this paper, we propose the Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance. SCC tracks super blocks to reduce tag overhead, compacts blocks into a variable number of sub-blocks to reduce internal fragmentation, but retains a direct tag-data mapping to find blocks quickly and eliminate extra metadata (i.e., no backward pointers). SCC does this using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account. In our experiments, SCC provides on average 8% (up to 22%) higher performance, and on average 6% (up to 20%) lower total energy, achieving the benefits of the recent Decoupled Compressed Cache [26] with a factor of 4 lower area overhead and lower design complexity.

Proceedings ArticleDOI
10 Jun 2014
TL;DR: This work proposes cache deduplication that effectively increases last-level cache capacity by detecting duplicate data blocks and storing only one copy of the data in a way that can be accessed through multiple physical addresses.
Abstract: Caches are essential to the performance of modern microprocessors. Much recent work on last-level caches has focused on exploiting reference locality to improve efficiency. However, value redundancy is another source of potential improvement. We find that many blocks in the working set of typical benchmark programs have the same values. We propose cache deduplication that effectively increases last-level cache capacity. Rather than exploit specific value redundancy with compression, as in previous work, our scheme detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses. We find that typical benchmarks exhibit significant value redundancy, far beyond the zero-content blocks one would expect in any program. Our deduplicated cache effectively increases capacity by an average of 112% compared to an 8MB last-level cache while reducing the physical area by 12.2%, yielding an average performance improvement of 15.2%.
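A toy model of a deduplicated cache: blocks are stored once, keyed by a fingerprint of their contents, and multiple physical addresses map to the same stored copy. The two-level address-map/data-store split and the SHA-256 fingerprint are illustrative assumptions; a real design would also verify the full block contents on a fingerprint match rather than trust the hash alone.

import hashlib

class DedupCache:
    """addr_map: physical block address -> content fingerprint.
    data_store: fingerprint -> (block bytes, reference count)."""

    def __init__(self):
        self.addr_map = {}
        self.data_store = {}

    @staticmethod
    def _fingerprint(block: bytes) -> str:
        return hashlib.sha256(block).hexdigest()

    def write(self, addr: int, block: bytes):
        self.evict(addr)                         # drop any previous mapping for addr
        fp = self._fingerprint(block)
        data, refs = self.data_store.get(fp, (block, 0))
        self.data_store[fp] = (data, refs + 1)   # duplicate blocks share one copy
        self.addr_map[addr] = fp

    def read(self, addr: int):
        fp = self.addr_map.get(addr)
        return None if fp is None else self.data_store[fp][0]

    def evict(self, addr: int):
        fp = self.addr_map.pop(addr, None)
        if fp is not None:
            data, refs = self.data_store[fp]
            if refs <= 1:
                del self.data_store[fp]
            else:
                self.data_store[fp] = (data, refs - 1)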

Patent
17 Mar 2014
TL;DR: In this article, a hybrid storage system is described having a mixture of different types of storage devices comprising rotational drives, flash devices, SDRAM, and SRAM, where the rotational drives are used as the main storage, providing the lowest cost per unit of storage memory.
Abstract: A hybrid storage system is described having a mixture of different types of storage devices comprising rotational drives, flash devices, SDRAM, and SRAM. The rotational drives are used as the main storage, providing lowest cost per unit of storage memory. Flash memory is used as a higher-level cache for rotational drives. Methods for managing multiple levels of cache for this storage system are provided, having a very fast Level 1 cache which consists of volatile memory (SRAM or SDRAM), and a non-volatile Level 2 cache using an array of flash devices. It describes a method of distributing the data across the rotational drives to make caching more efficient. It also describes efficient techniques for flushing data from L1 cache and L2 cache to the rotational drives, taking advantage of concurrent flash device operations, concurrent rotational drive operations, and maximizing sequential access types in the rotational drives rather than random accesses which are relatively slower. Methods provided here may be extended for systems that have more than two cache levels.

Proceedings ArticleDOI
02 Jun 2014
TL;DR: A Hierarchical Adaptive Replacement Cache (H-ARC) policy is proposed that considers all four factors of a page's status: dirty, clean, recency, and frequency, addressing the challenge of designing a policy that also increases the cache hit ratio for better system performance.
Abstract: With the rapid development of new types of nonvolatile memory (NVM), one of these technologies may replace DRAM as the main memory in the near future. Some drawbacks of DRAM, such as data loss due to power failure or a system crash, can be remedied by NVM's non-volatile nature. In the meantime, solid state drives (SSDs) are becoming widely deployed as storage devices for faster random access speed compared with traditional hard disk drives (HDDs). For applications demanding higher reliability and better performance, using NVM as the main memory and SSDs as storage devices becomes a promising architecture. Although SSDs have better performance than HDDs, SSDs cannot support in-place updates (i.e., an erase operation has to be performed before a page can be updated) and suffer from a low endurance problem in which each unit will wear out after a certain number of erase operations. In an NVM based main memory, any updated pages called dirty pages can be kept longer without the urgent need to be flushed to SSDs. This difference opens an opportunity to design new cache policies that help extend the lifespan of SSDs by wisely choosing cache eviction victims to decrease storage write traffic. However, it is very challenging to design a policy that can also increase the cache hit ratio for better system performance. Most existing DRAM-based cache policies have mainly concentrated on the recency or frequency status of a page. On the other hand, most existing NVM-based cache policies have mainly focused on the dirty or clean status of a page. In this paper, by extending the concept of the Adaptive Replacement Cache (ARC), we propose a Hierarchical Adaptive Replacement Cache (H-ARC) policy that considers all four factors of a page's status: dirty, clean, recency, and frequency. Specifically, at the higher level, H-ARC adaptively splits the whole cache space into a dirty-page cache and a clean-page cache. At the lower level, inside the dirty-page cache and the clean-page cache, H-ARC splits them into a recency-page cache and a frequency-page cache separately. During the page eviction process, all parts of the cache will be balanced towards their desired sizes.

Patent
Sanjeev N. Trika1
19 Feb 2014
TL;DR: In this paper, the authors propose a method and system to allow power fail-safe write-back or write-through caching of data in a persistent storage device into one or more cache lines of a caching device.
Abstract: A method and system to allow power fail-safe write-back or write-through caching of data in a persistent storage device into one or more cache lines of a caching device. No metadata associated with any of the cache lines is written atomically into the caching device when the data in the storage device is cached. As such, specialized cache hardware to allow atomic writing of metadata during the caching of data is not required.

Proceedings ArticleDOI
28 Jul 2014
TL;DR: A cache partitioning algorithm is presented that is optimal with respect to task set schedulability and compared to state-of-the-art pre-emption cost analysis based on benchmark code and on a large number of synthetic task sets.
Abstract: In hard real-time systems, cache partitioning is often suggested as a means of increasing the predictability of caches in pre-emptively scheduled systems: when a task is assigned its own cache partition, inter-task cache eviction is avoided, and timing verification is reduced to the standard worst case execution time (WCET) analysis used in non-pre-emptive systems. The downside of cache partitioning is the potential increase in execution times. In this paper, we evaluate cache partitioning for hard real-time systems in terms of overall schedulability. To this end, we examine the sensitivity of task execution times to the size of the cache partition allocated and present a cache partitioning algorithm that is optimal with respect to task set schedulability. We then evaluate the performance of cache partitioning compared to state-of-the-art pre-emption cost analysis based on benchmark code and on a large number of synthetic task sets. This allows us to derive general conclusions about the usability of cache partitioning and identify task set and system parameters that influence the relative effectiveness of cache partitioning.

Proceedings ArticleDOI
24 Mar 2014
TL;DR: This paper proves that one of the previously published formulae for the probability of a cache hit is optimal with respect to the limited information that it uses, and introduces a simple exhaustive method that computes a precise pWCET distribution, albeit at the cost of exponential complexity.
Abstract: In this paper, we investigate Static Probabilistic Timing Analysis (SPTA) for single processor systems that use a cache with an evict-on-miss random replacement policy. We show that previously published formulae for the probability of a cache hit can produce results that are optimistic and unsound when used to compute probabilistic Worst-Case Execution Time (pWCET) distributions. We investigate the correctness, optimality, and precision of different approaches to SPTA. We prove that one of the previously published formulae for the probability of a cache hit is optimal with respect to the limited information that it uses. We improve upon this formulation by using extra information about cache contention. To investigate the precision of various approaches to SPTA, we introduce a simple exhaustive method that computes a precise pWCET distribution, albeit at the cost of exponential complexity. Further, we integrate this precise approach, applied to small numbers of frequently accessed memory blocks, with imprecise analysis of other memory blocks, to form a combined approach that improves precision, without significantly increasing its complexity. The performance of the various approaches is compared on benchmark programs.
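For context, a hedged recollection of the baseline model in this line of work (not a formula quoted from the paper): with an N-way evict-on-miss random replacement cache, each miss evicts the block of interest with probability 1/N, so treating the k intervening accesses between a block's use and its reuse as independent eviction opportunities gives the naive estimate
$$P(\text{hit} \mid \text{reuse distance } k) \approx \left(\frac{N-1}{N}\right)^{k}.$$
It is exactly the dependence between accesses that such per-access arguments ignore which can make the resulting pWCET distributions optimistic, as the abstract notes.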

Patent
10 Feb 2014
TL;DR: In this paper, the authors describe a computer system for both online transaction processing and online analytical processing, comprising of a processor coupled to a database, the database comprising the database, a main store (116) for storing records, a differential buffer (114) for receiving and buffering added or deleted or modified records, and a cache store (112) for caching a result of a query against the schema; and the cache controller being configured for: storing the result of the query in the cache store; receiving an analytical request; and determining, in response to the received request, an
Abstract: The invention relates to a computer system for both online transaction processing and online analytical processing, comprising: a processor coupled to a database, the database comprising: a main store (116) for storing records, a differential buffer (114) for receiving and buffering added or deleted or modified records, the differential buffer being coupled to the main store, a schema comprising records stored in the main store and records stored in the differential buffer, and a cache store (112) for caching a result of a query against the schema; and a cache controller (106) executable by the processor and communicatively coupled to the database, the cache controller being configured for: storing the result of the query in the cache store; receiving an analytical request; and determining, in response to the received request, an up-to-date result of the query by (216): accessing the cache store to obtain the cached result; determining the records of the schema that have been added or deleted or modified since the step of storing the cached result in the cache store on the basis of the records stored in the differential buffer; and incrementally deriving the up-to-date result from the cached result and from the records determined in the previous step.
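A minimal sketch of the cached-result maintenance described above, for a simple SUM-style aggregate: the cached result is combined with only the records that entered the differential buffer since the result was stored. The table, query, and field names are invented for illustration, and only inserts are handled; deletions or modifications would need compensating records.

class CachedAggregate:
    """Cache an aggregate over the main store and refresh it incrementally
    from the differential buffer instead of re-scanning the main store."""

    def __init__(self, main_store, predicate, value_of):
        self.predicate = predicate      # which records the query counts
        self.value_of = value_of        # which field the query sums
        self.cached = sum(value_of(r) for r in main_store if predicate(r))
        self.delta_seen = 0             # how many buffered records are already folded in

    def up_to_date_result(self, differential_buffer):
        # fold in only the records appended since the cached result was stored
        new_records = differential_buffer[self.delta_seen:]
        self.cached += sum(self.value_of(r) for r in new_records if self.predicate(r))
        self.delta_seen = len(differential_buffer)
        return self.cached

# usage sketch: sum of 'amount' for records in region 'EU'
main_store = [{"region": "EU", "amount": 10}, {"region": "US", "amount": 7}]
diff_buffer = []
q = CachedAggregate(main_store, lambda r: r["region"] == "EU", lambda r: r["amount"])
diff_buffer.append({"region": "EU", "amount": 5})   # an OLTP insert lands in the buffer
assert q.up_to_date_result(diff_buffer) == 15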

Proceedings ArticleDOI
19 Jun 2014
TL;DR: This work presents a distributed proactive caching approach that exploits user mobility information to decide where to proactively cache data to support seamless mobility, while efficiently utilizing cache storage using a congestion pricing scheme.
Abstract: We present a distributed proactive caching approach that exploits user mobility information to decide where to proactively cache data to support seamless mobility, while efficiently utilizing cache storage using a congestion pricing scheme. The proposed approach is applicable to the case where objects have different sizes and to a two-level cache hierarchy, for both of which the proactive caching problem is hard. Our evaluation results show how various system parameters influence the delay gains of the proposed approach, which achieves robust and good performance relative to an oracle and an optimal scheme for a flat cache structure.

Patent
11 Apr 2014
TL;DR: In this paper, a data structure for maintaining a cache supporting compression and cache-wide deduplication is presented, where a first mapping is generated from short-length signatures to a storage location and a quantized length measure on a cache storage device; unused contiguous regions on the cache device are allocated.
Abstract: Systems and methods for generating and storing a data structure for maintaining cache supporting compression and cache-wide deduplication, including generating data structures with fixed size memory regions configured to hold multiple signatures as keys, wherein the number of the fixed size memory regions is bounded. A first mapping is generated from short-length signatures to a storage location and a quantized length measure on a cache storage device; and unused contiguous regions on the cache device are allocated. Metadata and cache page content is retrieved using a single input/output operation; a correctness of a full value of hash functions of uncompressed cache page content is validated; a second mapping is generated from short-length signatures to entries in the first mapping; and verification of whether the cached page content corresponds to a full-length original logical block address using the metadata is performed.

Proceedings ArticleDOI
13 Dec 2014
TL;DR: Futility Scaling (FS) is proposed, a novel replacement-based cache partitioning scheme that can precisely partition the whole cache while still maintaining high associativity even with a large number of partitions.
Abstract: As shared last level caches are widely used in many-core CMPs to boost system performance, partitioning a large shared cache among multiple concurrently running applications becomes increasingly important in order to reduce destructive interference. However, while recent works start to show the promise of using replacement-based partitioning schemes, such existing schemes either suffer from severe associativity degradation when the number of partitions is high, or lack the ability to precisely partition the whole cache, which leads to decreased resource efficiency. In this paper, we propose Futility Scaling (FS), a novel replacement-based cache partitioning scheme that can precisely partition the whole cache while still maintaining high associativity even with a large number of partitions. The futility of a cache line represents the uselessness of this line to application performance and can be ranked in different ways by various policies, e.g., LRU and LFU. The idea of FS is to control the size of a partition by properly scaling the futility of its cache lines. We study the properties of FS on both associativity and sizing in an analytical framework, and present a feedback-based implementation of FS that incurs little overhead in practice. Simulation results show that FS improves performance over previously proposed Vantage and Prism by up to 6.0% and 13.7%, respectively.
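A simplified software model of the futility-scaling idea: each partition has a scaling factor, the replacement victim is the line with the largest scaled futility (LRU age stands in for futility here), and a feedback loop raises the factor of partitions that exceed their target size. The specific feedback rule and constants are illustrative assumptions, not the paper's implementation.

class FutilityScaledCache:
    def __init__(self, capacity, partitions, targets):
        self.capacity = capacity
        self.scale = {p: 1.0 for p in partitions}   # per-partition futility scaling factor
        self.target = dict(targets)                 # partition -> target number of lines
        self.lines = {}                             # line -> (partition, last_use time)
        self.clock = 0

    def access(self, line, partition):
        self.clock += 1
        if line in self.lines:
            self.lines[line] = (partition, self.clock)
            return True
        if len(self.lines) >= self.capacity:
            self._evict()
        self.lines[line] = (partition, self.clock)
        return False

    def _evict(self):
        # futility of a line = its LRU age; the victim has the largest scaled futility
        victim = max(self.lines,
                     key=lambda l: self.scale[self.lines[l][0]] * (self.clock - self.lines[l][1]))
        del self.lines[victim]

    def adapt(self, step=0.05):
        # feedback: scale up partitions over their target size, scale down those under it
        sizes = {p: 0 for p in self.scale}
        for p, _ in self.lines.values():
            sizes[p] += 1
        for p in self.scale:
            if sizes[p] > self.target[p]:
                self.scale[p] *= (1 + step)
            elif sizes[p] < self.target[p]:
                self.scale[p] = max(1e-6, self.scale[p] * (1 - step))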

Proceedings ArticleDOI
13 Dec 2014
TL;DR: The proposed Bi-Modal Cache is able to make judicious use of the available DRAM cache capacity as well as reduce the off-chip memory bandwidth consumption by leveraging the tremendous internal bandwidth and capacity that stacked DRAM organizations provide.
Abstract: In this paper, we present Bi-Modal Cache - a flexible stacked DRAM cache organization which simultaneously achieves several objectives: (i) improved cache hit ratio, (ii) moving the tag storage overhead to DRAM, (iii) lower cache hit latency than tags-in-SRAM, and (iv) reduction in off-chip bandwidth wastage. The Bi-Modal Cache addresses the miss rate versus off-chip bandwidth dilemma by organizing the data in a bi-modal fashion - blocks with high spatial locality are organized as large blocks and those with little spatial locality as small blocks. By adaptively selecting the right granularity of storage for individual blocks at run-time, the proposed DRAM cache organization is able to make judicious use of the available DRAM cache capacity as well as reduce the off-chip memory bandwidth consumption. The Bi-Modal Cache improves cache hit latency despite moving the metadata to DRAM by means of a small SRAM based Way Locator. Further by leveraging the tremendous internal bandwidth and capacity that stacked DRAM organizations provide, the Bi-Modal Cache enables efficient concurrent accesses to tags and data to reduce hit time. Through detailed simulations, we demonstrate that the Bi-Modal Cache achieves overall performance improvement (in terms of Average Normalized Turnaround Time (ANTT)) of 10.8%, 13.8% and 14.0% in 4-core, 8-core and 16-core workloads respectively.

Patent
26 Dec 2014
TL;DR: In this article, a hardware/software co-optimization for inter-VM communication for NFVs and other producer-consumer workloads is presented, which includes multi-core processors with multi-level cache hierarchies including an L1 and L2 cache for each core and a shared last-level cache (LLC).
Abstract: Methods and apparatus implementing Hardware/Software co-optimization to improve performance and energy for inter-VM communication for NFVs and other producer-consumer workloads. The apparatus include multi-core processors with multi-level cache hierarchies including an L1 and L2 cache for each core and a shared last-level cache (LLC). One or more machine-level instructions are provided for proactively demoting cachelines from lower cache levels to higher cache levels, including demoting cachelines from L1/L2 caches to an LLC. Techniques are also provided for implementing hardware/software co-optimization in multi-socket NUMA architecture systems, wherein cachelines may be selectively demoted and pushed to an LLC in a remote socket. In addition, techniques are disclosed for implementing early snooping in multi-socket systems to reduce latency when accessing cachelines on remote sockets.

Proceedings ArticleDOI
02 Jun 2014
TL;DR: A novel framework for optimal cache management in ICNs which jointly considers caching strategy and content routing via linear network coding and an efficient network coding based cache management (NCCM) algorithm to obtain a near-optimal caching and routing solution for ICNs is proposed.
Abstract: The increasing demand for media-rich content has driven many efforts to redesign the Internet architecture. As one of the major candidates, information-centric network (ICN) has attracted significant attention, where in-network cache is a key component in different ICN architectures. In this paper, we propose a novel framework for optimal cache management in ICNs which jointly considers caching strategy and content routing. Specifically, we propose a cache management framework for ICNs based on software-defined networking (SDN) where a controller is responsible for determining the optimal caching strategy and content routing via linear network coding (LNC). Under the proposed cache management framework, we formally formulate the problem of minimizing the network bandwidth cost by jointly considering caching strategy and content routing with LNC. We develop an efficient network coding based cache management (NCCM) algorithm to obtain a near-optimal caching and routing solution for ICNs. We further develop a lower bound of the problem and conduct extensive experiments to compare the performance of the NCCM algorithm with the lower bound. Simulation results validate the effectiveness of the NCCM algorithm and framework.