
Showing papers on "Cache coloring" published in 2014


Proceedings Article
20 Aug 2014
TL;DR: This paper presents FLUSH+RELOAD, a cache side-channel attack technique that exploits a weakness in the Intel X86 processors to monitor access to memory lines in shared pages and recovers 96.7% of the bits of the secret key by observing a single signature or decryption round.
Abstract: Sharing memory pages between non-trusting processes is a common method of reducing the memory footprint of multi-tenanted systems. In this paper we demonstrate that, due to a weakness in the Intel X86 processors, page sharing exposes processes to information leaks. We present FLUSH+RELOAD, a cache side-channel attack technique that exploits this weakness to monitor access to memory lines in shared pages. Unlike previous cache side-channel attacks, FLUSH+RELOAD targets the Last-Level Cache (i.e., L3 on processors with three cache levels). Consequently, the attack program and the victim do not need to share the execution core. We demonstrate the efficacy of the FLUSH+RELOAD attack by using it to extract the private encryption keys from a victim program running GnuPG 1.4.13. We tested the attack both between two unrelated processes in a single operating system and between processes running in separate virtual machines. On average, the attack is able to recover 96.7% of the bits of the secret key by observing a single signature or decryption round.
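
The probe primitive behind FLUSH+RELOAD is compact enough to sketch. The following is a minimal illustration of one probe round (not the paper's code), assuming an x86 compiler with SSE2 intrinsics; THRESHOLD is a machine-specific cycle count the attacker must calibrate, and line points into a page shared with the victim.

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush */
#include <x86intrin.h>   /* __rdtscp   */

#define THRESHOLD 100    /* hit/miss cutoff in cycles; calibrate per machine */

/* One FLUSH+RELOAD round on a memory line shared with the victim.
 * Returns 1 if the victim touched the line since the previous round. */
static int probe(const void *line)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);        /* timestamp before the access */
    *(volatile const char *)line;        /* RELOAD: timed load          */
    uint64_t dt = __rdtscp(&aux) - t0;
    _mm_clflush(line);                   /* FLUSH: evict for next round */
    return dt < THRESHOLD;               /* fast reload => cache hit =>
                                            the victim accessed it      */
}
```

Real implementations add explicit fencing around the timed load; that detail is omitted in this sketch.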

1,001 citations


Proceedings ArticleDOI
10 Jun 2014
TL;DR: In this article, the authors studied the optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity, where the cache content placement is optimized based on the demand history.
Abstract: Optimal cache content placement in a wireless small cell base station (sBS) with limited backhaul capacity is studied. The sBS has a large cache memory and provides content-level selective offloading by delivering high data rate contents to users in its coverage area. The goal of the sBS content controller (CC) is to store the most popular contents in the sBS cache memory such that the maximum amount of data can be fetched directly from the sBS, not relying on the limited backhaul resources during peak traffic periods. If the popularity profile is known in advance, the problem reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known by the CC, and it can only observe the instantaneous demand for the cached content. Hence, the cache content placement is optimised based on the demand history. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while exploiting the limited cache capacity in the best way possible. Three algorithms are studied for this cache content placement problem, leading to different exploitation-exploration trade-offs. We provide extensive numerical simulations in order to study the time-evolution of these algorithms, and the impact of the system parameters, such as the number of files, the number of users, the cache size, and the skewness of the popularity profile, on the performance. It is shown that the proposed algorithms quickly learn the popularity profile for a wide range of system parameters.
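
The paper compares three specific algorithms, which are not reproduced here; purely to illustrate the exploration-exploitation trade-off it studies, the sketch below implements a simple epsilon-greedy refresh, with NUM_FILES, CACHE_SLOTS, and the epsilon parameter all assumed for illustration.

```c
#include <stdlib.h>

#define NUM_FILES   1000   /* catalog size (assumed)       */
#define CACHE_SLOTS 50     /* files the sBS cache can hold */

static unsigned demand[NUM_FILES];    /* demands observed so far */
static int      cached[CACHE_SLOTS];  /* current cache content   */

/* Periodic refresh: mostly exploit the empirical popularity, but with
 * probability epsilon per slot cache a random file so that its demand
 * can be observed during the next interval. */
void refresh_cache(double epsilon)
{
    int idx[NUM_FILES];
    for (int i = 0; i < NUM_FILES; i++) idx[i] = i;

    /* Partial selection sort: move the CACHE_SLOTS files with the
     * highest observed demand to the front of idx[]. */
    for (int i = 0; i < CACHE_SLOTS; i++) {
        int best = i;
        for (int j = i + 1; j < NUM_FILES; j++)
            if (demand[idx[j]] > demand[idx[best]]) best = j;
        int t = idx[i]; idx[i] = idx[best]; idx[best] = t;
    }

    for (int i = 0; i < CACHE_SLOTS; i++) {
        if ((double)rand() / RAND_MAX < epsilon)
            cached[i] = rand() % NUM_FILES;   /* explore */
        else
            cached[i] = idx[i];               /* exploit */
    }
}
```

Decaying epsilon over time lets the empirical counts converge on the true popularity profile, which is the flavour of learning behaviour the paper evaluates.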

322 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: A novel random fill cache architecture is proposed that replaces demand fetch with random cache fill within a configurable neighborhood window; it is shown to provide information-theoretic security against reuse-based attacks.
Abstract: Correctly functioning caches have been shown to leak critical secrets like encryption keys through various types of cache side-channel attacks. This nullifies the security provided by strong encryption and allows confidentiality breaches, impersonation attacks and fake services. Hence, future cache designs must consider security, ideally without degrading performance and power efficiency. We introduce a new classification of cache side-channel attacks: contention-based attacks and reuse-based attacks. Previous secure cache designs target only contention-based attacks, and we show that they cannot defend against reuse-based attacks. We show the surprising insight that the fundamental demand fetch policy of a cache is a security vulnerability that causes the success of reuse-based attacks. We propose a novel random fill cache architecture that replaces demand fetch with random cache fill within a configurable neighborhood window. We show that our random fill cache does not degrade performance, and in fact, improves the performance for some types of applications. We also show that it provides information-theoretic security against reuse-based attacks.
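
A schematic of the fill policy itself (not the paper's hardware design), where fetch_from_memory, send_to_core and cache_insert are hypothetical controller hooks and W is the configurable window:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cache-controller hooks (assumed for illustration). */
uint64_t fetch_from_memory(uint64_t line);
void     send_to_core(uint64_t data);
void     cache_insert(uint64_t line, uint64_t data);

#define W 8   /* neighborhood window size (example value) */

/* Random fill: the demanded line is returned to the core but NOT
 * cached; instead a uniformly random line from [miss-W, miss+W] is
 * filled, so cache state no longer encodes which line was demanded. */
void handle_miss(uint64_t miss_line)
{
    send_to_core(fetch_from_memory(miss_line));    /* no demand fill */

    int64_t offset = (int64_t)(rand() % (2 * W + 1)) - W;
    cache_insert(miss_line + offset,
                 fetch_from_memory(miss_line + offset));
}
```

Decoupling fills from demand accesses in this way is what removes the reuse-based channel: an observer of cache state learns only that some line in the window was filled, not which line the victim accessed.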

217 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities and employs large, page-sized cache allocation units to achieve high hit rates and reduction in tag overheads.
Abstract: Recent research advocates large die-stacked DRAM caches in many-core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates and the efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduced tag overheads, while predicting and fetching only the useful blocks within each page to minimize the off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming the state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.

162 citations


Proceedings ArticleDOI
19 Jun 2014
TL;DR: This paper proposes the memory request prioritization buffer (MRPB), a hardware structure that improves caching efficiency of massively parallel workloads by applying two prioritization methods-request reordering and cache bypassing-to memory requests before they access a cache.
Abstract: Massively parallel, throughput-oriented systems such as graphics processing units (GPUs) offer high performance for a broad range of programs. They are, however, complex to program, especially because of their intricate memory hierarchies with multiple address spaces. In response, modern GPUs have widely adopted caches, hoping to provide smoother reductions in memory access traffic and latency. Unfortunately, GPU caches often have mixed or unpredictable performance impact due to cache contention that results from the high thread counts in GPUs. We propose the memory request prioritization buffer (MRPB) to ease GPU programming and improve GPU performance. This hardware structure improves caching efficiency of massively parallel workloads by applying two prioritization methods—request reordering and cache bypassing—to memory requests before they access a cache. MRPB then releases requests into the cache in a more cache-friendly order. The result is drastically reduced cache contention and improved use of the limited per-thread cache capacity. For a simulated 16KB L1 cache, MRPB improves the average performance of the entire PolyBench and Rodinia suites by 2.65× and 1.27× respectively, outperforming a state-of-the-art GPU cache management technique.

158 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: This paper proposes CAMEO, a hardware-based Cache-like Memory Organization that makes stacked DRAM visible as part of the memory address space while exploiting data locality at a fine granularity, along with a low-overhead Line Location Table (LLT) that tracks the physical location of all data lines.
Abstract: This paper analyzes the trade-offs in architecting stacked DRAM either as part of main memory or as a hardware-managed cache. Using stacked DRAM as part of main memory increases the effective capacity, but obtaining high performance from such a system requires Operating System (OS) support to migrate data at a page granularity. Using stacked DRAM as a hardware cache has the advantages of being transparent to the OS and performing data management at a line granularity, but suffers from reduced main memory capacity, because the stacked DRAM cache is not part of the memory address space. Ideally, we want the stacked DRAM to contribute towards the capacity of main memory, and still maintain the hardware-based fine granularity of a cache. We propose CAMEO, a hardware-based Cache-like Memory Organization that not only makes stacked DRAM visible as part of the memory address space but also exploits data locality on a fine-grained basis. CAMEO retains recently accessed data lines in stacked DRAM and swaps out the victim line to off-chip memory. Since CAMEO can change the physical location of a line dynamically, we propose a low-overhead Line Location Table (LLT) that tracks the physical location of all data lines. We also propose an accurate Line Location Predictor (LLP) to avoid the serialization of the LLT look-up and memory access. We evaluate a system that has 4GB stacked memory and 12GB off-chip memory. Using stacked DRAM as a cache improves performance by 50%, using it as part of main memory improves performance by 33%, whereas CAMEO improves performance by 78%. Our proposed design is very close to an idealized memory system that uses the 4GB stacked DRAM as a hardware-managed cache and also increases the main memory capacity by an additional 4GB.
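
To make the LLT mechanism concrete, here is a schematic sketch under the paper's 4GB stacked / 12GB off-chip ratio, so that each stacked-DRAM slot serves a congruence group of four line locations; swap_data is a hypothetical data-movement hook, not part of the paper.

```c
#include <stdint.h>

/* Hypothetical hook that physically swaps the stacked-DRAM resident
 * line of group g with the off-chip copy of the requested line. */
void swap_data(uint64_t g, int resident_way, int requested_way);

#define GROUP 4   /* 4GB stacked : 12GB off-chip => 1 fast + 3 slow slots */

/* LLT entry: loc[i] is the current physical slot of line i within its
 * congruence group; slot 0 is the stacked-DRAM location. */
typedef struct { uint8_t loc[GROUP]; } llt_entry;

void access_line(llt_entry *llt, uint64_t g, int way)
{
    llt_entry *e = &llt[g];
    if (e->loc[way] == 0)
        return;                        /* already in stacked DRAM: fast hit */

    for (int i = 0; i < GROUP; i++) {
        if (e->loc[i] == 0) {          /* current stacked-DRAM resident */
            swap_data(g, i, way);      /* victim moves off-chip, into   */
            e->loc[i]  = e->loc[way];  /* the requested line's old slot */
            e->loc[way] = 0;           /* requested line is now fast    */
            return;
        }
    }
}
```

Because every swap changes a line's physical location, the LLT lookup sits on the critical path; the paper's Line Location Predictor exists precisely to start the likely memory access before the LLT confirms it.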

145 citations


Proceedings ArticleDOI
08 Jul 2014
TL;DR: In this paper, the authors consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache and show that caching only the most popular files can be highly suboptimal.
Abstract: We consider a network consisting of a file server connected through a shared link to a number of users, each equipped with a cache. Knowing the popularity distribution of the files, the goal is to optimally populate the caches, such as to minimize the expected load of the shared link. For a single cache, it is well known that storing the most popular files is optimal in this setting. However, we show here that this is no longer the case for multiple caches. Indeed, caching only the most popular files can be highly suboptimal. Instead, a fundamentally different approach is needed, in which the cache contents are used as side information for coded communication over the shared link. We propose such a coded caching scheme and prove that it is close to optimal.
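
For context, the uniform-popularity centralized coded caching scheme of Maddah-Ali and Niesen, on which this line of work builds, achieves the delivery rate below for N files, K users, and a per-user cache of M files; the contribution here is handling a known, non-uniform popularity distribution, for which this exact expression does not directly apply.

```latex
% Uncoded delivery costs K(1 - M/N) file transmissions; coded delivery
% divides this by the global coded-multicast gain (1 + KM/N):
R_{\text{coded}}(M) \;=\; K\left(1-\frac{M}{N}\right)\cdot\frac{1}{1+\frac{KM}{N}}
```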

145 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: A specialized cache management policy for GPGPUs is proposed that is coordinated with warp throttling to dynamically control the active number of warps and a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources.
Abstract: With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch of the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy-efficiency. The massive amount of memory requests generated by GPUs cause cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.

142 citations


Proceedings ArticleDOI
24 Aug 2014
TL;DR: COLORIS is a memory management framework that supports both static and dynamic cache partitioning using page coloring; it monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates from exceeding application-specific ranges.
Abstract: Shared caches in multicore processors are subject to contention from co-running threads. The resultant interference can lead to highly-variable performance for individual applications. This is particularly problematic for real-time applications, requiring predictable timing guarantees. Previous work has applied page coloring techniques to partition a shared cache, so that conflict misses are minimized amongst co-running workloads. However, prior page coloring techniques have not addressed the problem of partitioning a cache on over-committed processors where there are more executable threads than cores. Similarly, page coloring techniques have not proven efficient at adapting the cache partition sizes for threads with varying memory demands. This paper presents a memory management framework called COLORIS, which provides support for both static and dynamic cache partitioning using page coloring. COLORIS supports novel policies to reconfigure the assignment of page colors amongst application threads in over-committed systems. For quality-of-service (QoS), COLORIS monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates from exceeding application-specific ranges. This paper presents the design and evaluation of COLORIS as applied to Linux. We show the efficiency and effectiveness of COLORIS in coloring memory pages for a set of SPEC CPU2006 workloads, thereby enhancing performance isolation over existing page coloring techniques.
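
The page-color computation such frameworks rest on is small enough to show. Below is a minimal sketch with assumed example parameters (4KB pages, 64B lines, a 16-way 8MB shared cache), not COLORIS's actual configuration.

```c
#include <stdint.h>

/* Example geometry (assumed): 4KB pages, 64B lines, 16-way 8MB shared
 * cache => 8192 sets. The set-index bits that lie above the page
 * offset determine which slice of the cache a page can occupy. */
#define PAGE_SHIFT  12
#define LINE_SHIFT  6
#define CACHE_SETS  (8u * 1024 * 1024 / 64 / 16)               /* 8192 */
#define NUM_COLORS  (CACHE_SETS >> (PAGE_SHIFT - LINE_SHIFT))  /* 128  */

/* All lines of a physical page share the same color, i.e., the same
 * 1/NUM_COLORS slice of cache sets. */
static inline unsigned page_color(uint64_t paddr)
{
    return (unsigned)((paddr >> PAGE_SHIFT) % NUM_COLORS);
}
```

An allocator that hands a task only pages from its assigned colors confines that task to the matching slice of cache sets; COLORIS's static and dynamic policies decide how colors are assigned and re-assigned among threads.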

133 citations


Journal ArticleDOI
TL;DR: A two-timescale joint optimization of power and cache control is proposed to support real-time video streaming; the solution is shown to be asymptotically optimal for high SNR and small timeslot duration.
Abstract: We propose a cache-enabled opportunistic cooperative MIMO (CoMP) framework for wireless video streaming. By caching a portion of the video files at the relays (RS) using a novel MDS-coded random cache scheme, the base station (BS) and RSs opportunistically employ CoMP to achieve spatial multiplexing gain without expensive payload backhaul. We study a two timescale joint optimization of power and cache control to support real-time video streaming. The cache control is to create more CoMP opportunities and is adaptive to the long-term popularity of the video files. The power control is to guarantee the QoS requirements and is adaptive to the channel state information (CSI), the cache state at the RS and the queue state information (QSI) at the users. The joint problem is decomposed into an inner power control problem and an outer cache control problem. We first derive a closed-form power control policy from an approximated Bellman equation. Based on this, we transform the outer problem into a convex stochastic optimization problem and propose a stochastic subgradient algorithm to solve it. Finally, the proposed solution is shown to be asymptotically optimal for high SNR and small timeslot duration. Its superior performance over various baselines is verified by simulations.

125 citations


Proceedings ArticleDOI
19 Jun 2014
TL;DR: This work extends reuse distance to GPUs by modelling the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, including conditional and non-uniform latencies, cache associativity, miss-status holding-registers, and warp divergence.
Abstract: As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator.
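
As background, the sequential metric being extended is easy to state in code. The sketch below is a quadratic, teaching-style computation of classical reuse distance; the paper's GPU-specific extensions (warp scheduling, MSHRs, associativity, divergence) are deliberately not modelled here.

```c
#include <stdint.h>

/* Reuse distance of access i: the number of DISTINCT addresses
 * referenced between this access and the previous access to the same
 * address, or -1 for a cold (first) access. Teaching code; production
 * tools use a balanced tree for O(n log n) over the whole trace. */
long reuse_distance(const uint64_t *trace, long i)
{
    long prev = -1;
    for (long j = i - 1; j >= 0; j--)        /* previous access to trace[i] */
        if (trace[j] == trace[i]) { prev = j; break; }
    if (prev < 0) return -1;                 /* cold access */

    long distinct = 0;
    for (long j = prev + 1; j < i; j++) {    /* count each address in the  */
        int seen = 0;                        /* window exactly once        */
        for (long k = prev + 1; k < j; k++)
            if (trace[k] == trace[j]) { seen = 1; break; }
        if (!seen) distinct++;
    }
    return distinct;
}
```

For a fully associative LRU cache of C lines, an access hits exactly when its reuse distance is below C; the paper's contribution is making this predictive connection hold on GPUs.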

Proceedings ArticleDOI
24 Aug 2014
TL;DR: It is established that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance, and a simple technique is proposed to throttle the number of sets prefetched; the resulting ATCache can satisfy over 60% of DRAM cache tag accesses on average.
Abstract: 3D-stacking technology has enabled the option of embedding a large DRAM onto the processor. Prior works have proposed to use this as a DRAM cache. Because of its large size (a DRAM cache can be in the order of hundreds of megabytes), the total size of the tags associated with it can also be quite large (in the order of tens of megabytes). The large size of the tags has created a problem. Should we maintain the tags in the DRAM and pay the cost of a costly tag access in the critical path? Or should we maintain the tags in the faster SRAM by paying the area cost of a large SRAM for this purpose? Prior works have primarily chosen the former and proposed a variety of techniques for reducing the cost of a DRAM tag access. In this paper, we first establish (with the help of a study) that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance. Motivated by this study, we ask if it is possible to maintain tags in SRAM without incurring high area overhead. Our key idea is simple. We propose to cache the tags in a small SRAM tag cache - we show that there is enough spatial and temporal locality amongst tag accesses to merit this idea. We propose the ATCache, a small SRAM tag cache. Similar to a conventional cache, the ATCache caches recently accessed tags to exploit temporal locality; it exploits spatial locality by prefetching tags from nearby cache sets. In order to avoid the high miss latency and cache pollution caused by excessive prefetching, we use a simple technique to throttle the number of sets prefetched. Our proposed ATCache (which consumes 0.4% of overall tag size) can satisfy over 60% of DRAM cache tag accesses on average.

Posted Content
TL;DR: In this article, the authors derived an order-optimal scheme which judiciously shares cache memory among files with different popularities, and derived new information-theoretic lower bounds, which use a sliding-window entropy inequality.
Abstract: To address the exponentially rising demand for wireless content, use of caching is emerging as a potential solution. It has been recently established that joint design of content delivery and storage (coded caching) can significantly improve performance over conventional caching. Coded caching is well suited to emerging heterogeneous wireless architectures which consist of a dense deployment of local-coverage wireless access points (APs) with high data rates, along with sparsely-distributed, large-coverage macro-cell base stations (BS). This enables design of coded caching-and-delivery schemes that equip APs with storage, and place content in them in a way that creates coded-multicast opportunities for combining with macro-cell broadcast to satisfy users even with different demands. Such coded-caching schemes have been shown to be order-optimal with respect to the BS transmission rate, for a system with single-level content, i.e., one where all content is uniformly popular. In this work, we consider a system with non-uniform popularity content which is divided into multiple levels, based on varying degrees of popularity. The main contribution of this work is the derivation of an order-optimal scheme which judiciously shares cache memory among files with different popularities. To show order-optimality we derive new information-theoretic lower bounds, which use a sliding-window entropy inequality, effectively creating a non-cutset bound. We also extend the ideas to when users can access multiple caches along with the broadcast. Finally we consider two extreme cases of user distribution across caches for the multi-level popularity model: a single user per cache (single-user setup) versus a large number of users per cache (multi-user setup), and demonstrate a dichotomy in the order-optimal strategies for these two extreme cases.

Proceedings ArticleDOI
03 Nov 2014
TL;DR: The results show that while these defences were highly effective a few processor generations ago, the trend towards imprecise events in modern microarchitectures weakens the defences and introduces new channels, demonstrating the necessity of careful empirical analysis of timing channels.
Abstract: Storage channels can be provably eliminated in well-designed, high-assurance kernels. Timing channels remain the last mile for confidentiality and are still beyond the reach of formal analysis, so must be dealt with empirically. We perform such an analysis, collecting a large data set (2,000 hours of observations) for two representative timing channels, the locally-exploitable cache channel and a remote exploit of OpenSSL execution timing, on the verified seL4 microkernel. We also evaluate the effectiveness, in bandwidth reduction, of a number of black-box mitigation techniques (cache colouring, instruction-based scheduling and deterministic delivery of server responses) across a number of hardware platforms. Our (somewhat unexpected) results show that while these defences were highly effective a few processor generations ago, the trend towards imprecise events in modern microarchitectures weakens the defences and introduces new channels. This demonstrates the necessity of careful empirical analysis of timing channels.

Proceedings ArticleDOI
11 Aug 2014
TL;DR: Optimal cache content placement is studied in a wireless infostation network (WIN), which models a limited coverage wireless network with a large cache memory, formulated as a multi-armed bandit problem with switching cost, and an algorithm to solve it is presented.
Abstract: Optimal cache content placement is studied in a wireless infostation network (WIN), which models a limited coverage wireless network with a large cache memory. WIN provides content-level selective offloading by delivering high data rate contents stored in its cache memory to the users through a broadband connection. The goal of the WIN central controller (CC) is to store the most popular content in the cache memory of the WIN such that the maximum amount of data can be fetched directly from the cache rather than being downloaded from the core network. If the popularity profile of the available set of contents is known in advance, the optimization of the cache content reduces to a knapsack problem. However, it is assumed in this work that the popularity profile of the files is not known, and only the instantaneous demands for those contents stored in the cache can be observed. Hence, the cache content placement is optimised based on the demand history, and on the cost associated to placing each content in the cache. By refreshing the cache content at regular time intervals, the CC tries to learn the popularity profile, while at the same time exploiting the limited cache capacity in the best way possible. This problem is formulated as a multi-armed bandit problem with switching cost, and an algorithm to solve it is presented. The performance of the algorithm is measured in terms of regret, which is proven to be logarithmic and sub-linear uniformly over time for a specific and a general case, respectively.

Proceedings ArticleDOI
08 Jul 2014
TL;DR: This paper solves the cache deployment optimization (CaDeOp) problem of determining how much server, energy, and bandwidth resources to provision in each cache AS, i.e., each AS chosen for cache deployment, as a mixed integer program (MIP).
Abstract: Content delivery networks (CDNs) deploy globally distributed systems of caches in a large number of autonomous systems (ASes). It is important for a CDN operator to satisfy the performance requirements of end users, while minimizing the cache deployment cost. In this paper, we study the cache deployment optimization (CaDeOp) problem of determining how much server, energy, and bandwidth resources to provision in each cache AS, i.e., each AS chosen for cache deployment. The CaDeOp objective is to minimize the total cost incurred by the CDN, subject to meeting the end-user performance requirements. We formulate the CaDeOp problem as a mixed integer program (MIP) and solve it for realistic AS-level topologies, traffic demands, and non-linear energy and bandwidth costs. We also evaluate the sensitivity of the results to our parametric assumptions. When the end-user performance requirements become more stringent, the CDN footprint rapidly expands, requiring cache deployments in additional ASes and geographical regions. Also, the CDN cost increases several times, with the cost balance shifting toward bandwidth and energy costs. On the other hand, the traffic distribution among the cache ASes stays relatively even, with the top 20% of the cache ASes serving around 30% of the overall traffic.

Proceedings ArticleDOI
01 Feb 2014
TL;DR: A Read-Write Partitioning (RWP) policy is proposed that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests.
Abstract: Cache read misses stall the processor if there are no independent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few exceptions, cache lines that serve loads are more critical for performance than cache lines that serve only stores. Unfortunately, traditional cache management mechanisms do not take into account this disparity between read-write criticality. This paper proposes a Read-Write Partitioning (RWP) policy that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests. We show that exploiting the differences in read-write criticality provides better performance over prior cache management mechanisms. For a single-core system, RWP provides 5% average speedup across the entire SPEC CPU2006 suite, and 14% average speedup for cache-sensitive benchmarks, over the baseline LRU replacement policy. We also show that RWP can perform within 3% of a new yet complex instruction-address-based technique, Read Reference Predictor (RRP), that bypasses cache lines which are unlikely to receive any read requests, while requiring only 5.4% of RRP's state overhead. On a 4-core system, our RWP mechanism improves system throughput by 6% over the baseline and outperforms three other state-of-the-art mechanisms we evaluate.

Patent
30 Apr 2014
TL;DR: In this article, the authors present a method, a system and a server of removing a distributed caching object from a cache server by comparing an active period of a located cache server with an expiration period associated with an object, thus saving the other cache servers from wasting resources to perform removal operations.
Abstract: The present disclosure discloses a method, a system and a server of removing a distributed caching object. In one embodiment, the method receives a removal request, where the removal request includes an identifier of an object. The method may further apply consistent Hashing to the identifier of the object to obtain a Hash result value of the identifier, locate a corresponding cache server based on the Hash result value and render the corresponding cache server to be a present cache server. In some embodiments, the method determines whether the present cache server is in an active status and has an active period greater than an expiration period associated with the object. Additionally, in response to determining that the present cache server is in an active status and has an active period greater than the expiration period associated with the object, the method removes the object from the present cache server. By comparing an active period of a located cache server with an expiration period associated with an object, the exemplary embodiments precisely locate a cache server that includes the object to be removed and perform a removal operation, thus saving the other cache servers from wasting resources to perform removal operations and hence improving the overall performance of the distributed cache system.
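
The locate step of the method is standard consistent hashing. A generic illustration (not the patent's code), with FNV-1a standing in for whatever hash function the real system uses:

```c
#include <stdint.h>

static uint64_t fnv1a(const char *s)           /* stand-in hash */
{
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

typedef struct { uint64_t point; int server; } ring_entry;

/* ring[] is sorted ascending by point. The object belongs to the first
 * server point clockwise from its hash, wrapping around to ring[0]. */
int locate_server(const ring_entry *ring, int n, const char *object_id)
{
    uint64_t h = fnv1a(object_id);
    for (int i = 0; i < n; i++)
        if (ring[i].point >= h) return ring[i].server;
    return ring[0].server;                     /* wrap around the ring */
}
```

The removal check described above then compares the located server's active period with the object's expiration period: only a server that has been active longer than the object could have lived can actually hold it, so the other servers are spared needless removal work.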

Journal ArticleDOI
14 Jun 2014
TL;DR: STAG is proposed, a high-density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM), which inherently offers unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire.
Abstract: General-purpose Graphics Processing Units (GPGPUs) are widely used for executing massively parallel workloads from various application domains. Feeding data to the hundreds to thousands of cores that current GPGPUs integrate places great demands on the memory hierarchy, fueling an ever-increasing demand for on-chip memory. In this work, we propose STAG, a high-density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM). DWMs inherently offer unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire, which logically resembles a bit-serial tape. However, this structure also leads to a unique challenge that the bits must be sequentially accessed by performing "shift" operations, resulting in variable and potentially higher access latencies. To address this challenge, STAG utilizes a number of architectural techniques: (i) a hybrid cache organization that employs different DWM bit-cells to realize the different memory arrays within the GPGPU cache hierarchy, (ii) a clustered, bit-interleaved organization, in which the bits in a cache block are spread across a cluster of DWM tapes, allowing parallel access, (iii) tape head management policies that predictively configure DWM arrays to reduce the expected number of shift operations for subsequent accesses, and (iv) a shift-aware promotion buffer (SaPB), in which accesses to the DWM cache are predicted based on intra-warp locality, and locations that would incur a large shift penalty are promoted to a smaller buffer. Over a wide range of benchmarks from the Rodinia, ISPASS and Parboil suites, STAG achieves significant benefits in performance (12.1% over SRAM and 5.8% over STT-MRAM) and energy (3.3X over SRAM and 2.6X over STT-MRAM).

Proceedings ArticleDOI
13 Dec 2014
TL;DR: The Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance, is proposed using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account.
Abstract: Cache compression seeks the benefits of a larger cache with the area and power of a smaller cache. Ideally, a compressed cache increases effective capacity by tightly compacting compressed blocks, has low tag and metadata overheads, and allows fast lookups. Previous compressed cache designs, however, fail to achieve all these goals. In this paper, we propose the Skewed Compressed Cache (SCC), a new hardware compressed cache that lowers overheads and increases performance. SCC tracks super-blocks to reduce tag overhead, compacts blocks into a variable number of sub-blocks to reduce internal fragmentation, but retains a direct tag-data mapping to find blocks quickly and eliminate extra metadata (i.e., no backward pointers). SCC achieves this using novel sparse super-block tags and a skewed associative mapping that takes compressed size into account. In our experiments, SCC provides on average 8% (up to 22%) higher performance, and on average 6% (up to 20%) lower total energy, achieving the benefits of the recent Decoupled Compressed Cache [26] with a factor of 4 lower area overhead and lower design complexity.

Proceedings ArticleDOI
10 Jun 2014
TL;DR: This work proposes cache deduplication that effectively increases last-level cache capacity and detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses.
Abstract: Caches are essential to the performance of modern microprocessors. Much recent work on last-level caches has focused on exploiting reference locality to improve efficiency. However, value redundancy is another source of potential improvement. We find that many blocks in the working set of typical benchmark programs have the same values. We propose cache deduplication that effectively increases last-level cache capacity. Rather than exploit specific value redundancy with compression, as in previous work, our scheme detects duplicate data blocks and stores only one copy of the data in a way that can be accessed through multiple physical addresses. We find that typical benchmarks exhibit significant value redundancy, far beyond the zero-content blocks one would expect in any program. Our deduplicated cache effectively increases capacity by an average of 112% compared to an 8MB last-level cache while reducing the physical area by 12.2%, yielding an average performance improvement of 15.2%.
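
A toy software rendering of the idea (the paper's design is a hardware structure with eviction and tag remapping, none of which is modelled here): hash the block contents, store one copy, and share it via a reference count.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK 64                     /* cache-block size in bytes       */
#define TABLE 4096                   /* toy table size (power of two);  */
                                     /* no eviction: sketch only        */
typedef struct {
    uint8_t  data[BLOCK];            /* the single stored copy          */
    unsigned refs;                   /* how many tags point at it       */
} dedup_entry;

static dedup_entry table[TABLE];

static uint64_t hash_block(const uint8_t *b)   /* FNV-1a over contents */
{
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < BLOCK; i++) { h ^= b[i]; h *= 1099511628211ULL; }
    return h;
}

/* Insert by content: identical blocks share one entry. Linear probing;
 * assumes the table never fills (toy code, not the paper's design). */
dedup_entry *dedup_insert(const uint8_t *block)
{
    uint64_t i = hash_block(block) & (TABLE - 1);
    for (;;) {
        dedup_entry *e = &table[i];
        if (e->refs == 0) {                        /* free slot       */
            memcpy(e->data, block, BLOCK);
            e->refs = 1;
            return e;
        }
        if (memcmp(e->data, block, BLOCK) == 0) {  /* duplicate found */
            e->refs++;
            return e;
        }
        i = (i + 1) & (TABLE - 1);                 /* probe onward    */
    }
}
```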

Patent
17 Mar 2014
TL;DR: In this article, a hybrid storage system is described having a mixture of different types of storage devices comprising rotational drives, flash devices, SDRAM, and SRAM, which is used as the main storage, providing lowest cost per unit of storage memory.
Abstract: A hybrid storage system is described having a mixture of different types of storage devices comprising rotational drives, flash devices, SDRAM, and SRAM. The rotational drives are used as the main storage, providing the lowest cost per unit of storage memory. Flash memory is used as a higher-level cache for the rotational drives. Methods for managing multiple levels of cache for this storage system are provided, having a very fast Level 1 cache which consists of volatile memory (SRAM or SDRAM), and a non-volatile Level 2 cache using an array of flash devices. It describes a method of distributing the data across the rotational drives to make caching more efficient. It also describes efficient techniques for flushing data from the L1 cache and L2 cache to the rotational drives, taking advantage of concurrent flash device operations, concurrent rotational drive operations, and maximizing sequential access types in the rotational drives rather than random accesses, which are relatively slower. Methods provided here may be extended for systems that have more than two cache levels.

Proceedings ArticleDOI
02 Jun 2014
TL;DR: A Hierarchical Adaptive Replacement Cache (H-ARC) policy is proposed that considers all four factors of a page's status: dirty, clean, recency, and frequency, addressing the challenge of reducing storage write traffic while also increasing the cache hit ratio for better system performance.
Abstract: With the rapid development of new types of nonvolatile memory (NVM), one of these technologies may replace DRAM as the main memory in the near future. Some drawbacks of DRAM, such as data loss due to power failure or a system crash can be remedied by NVM's non-volatile nature. In the meantime, solid state drives (SSDs) are becoming widely deployed as storage devices for faster random access speed compared with traditional hard disk drives (HDDs). For applications demanding higher reliability and better performance, using NVM as the main memory and SSDs as storage devices becomes a promising architecture. Although SSDs have better performance than HDDs, SSDs cannot support in-place updates (i.e., an erase operation has to be performed before a page can be updated) and suffer from a low endurance problem that each unit will wear out after a certain number of erase operations. In an NVM based main memory, any updated pages called dirty pages can be kept longer without the urgent need to be flushed to SSDs. This difference opens an opportunity to design new cache policies that help extend the lifespan of SSDs by wisely choosing cache eviction victims to decrease storage write traffic. However, it is very challenging to design a policy that can also increase the cache hit ratio for better system performance. Most existing DRAM-based cache policies have mainly concentrated on the recency or frequency status of a page. On the other hand, most existing NVM-based cache policies have mainly focused on the dirty or clean status of a page. In this paper, by extending the concept of the Adaptive Replacement Cache (ARC), we propose a Hierarchical Adaptive Replacement Cache (H-ARC) policy that considers all four factors of a page's status: dirty, clean, recency, and frequency. Specifically, at the higher level, H-ARC adaptively splits the whole cache space into a dirty-page cache and a clean-page cache. At the lower level, inside the dirty-page cache and the clean-page cache, H-ARC splits them into a recency-page cache and a frequency-page cache separately. During the page eviction process, all parts of the cache will be balanced towards their desired sizes.

Proceedings ArticleDOI
28 Jul 2014
TL;DR: A cache partitioning algorithm is presented that is optimal with respect to task set schedulability and compared to state-of-the-art pre-emption cost analysis based on benchmark code and on a large number of synthetic task sets.
Abstract: In hard real-time systems, cache partitioning is often suggested as a means of increasing the predictability of caches in pre-emptively scheduled systems: when a task is assigned its own cache partition, inter-task cache eviction is avoided, and timing verification is reduced to the standard worst case execution time (WCET) analysis used in non-pre-emptive systems. The downside of cache partitioning is the potential increase in execution times. In this paper, we evaluate cache partitioning for hard real-time systems in terms of overall schedulability. To this end, we examine the sensitivity of task execution times to the size of the cache partition allocated and present a cache partitioning algorithm that is optimal with respect to task set schedulability. We then evaluate the performance of cache partitioning compared to state-of-the-art pre-emption cost analysis based on benchmark code and on a large number of synthetic task sets. This allows us to derive general conclusions about the usability of cache partitioning and identify task set and system parameters that influence the relative effectiveness of cache partitioning.

Proceedings ArticleDOI
24 Mar 2014
TL;DR: This paper proves that one of the previously published formulae for the probability of a cache hit is optimal with respect to the limited information that it uses, and introduces a simple exhaustive method that computes a precise pWCET distribution, albeit at the cost of exponential complexity.
Abstract: In this paper, we investigate Static Probabilistic Timing Analysis (SPTA) for single processor systems that use a cache with an evict-on-miss random replacement policy. We show that previously published formulae for the probability of a cache hit can produce results that are optimistic and unsound when used to compute probabilistic Worst-Case Execution Time (pWCET) distributions. We investigate the correctness, optimality, and precision of different approaches to SPTA. We prove that one of the previously published formulae for the probability of a cache hit is optimal with respect to the limited information that it uses. We improve upon this formulation by using extra information about cache contention. To investigate the precision of various approaches to SPTA, we introduce a simple exhaustive method that computes a precise pWCET distribution, albeit at the cost of exponential complexity. Further, we integrate this precise approach, applied to small numbers of frequently accessed memory blocks, with imprecise analysis of other memory blocks, to form a combined approach that improves precision, without significantly increasing its complexity. The performance of the various approaches are compared on benchmark programs.
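
The probability model at issue is simple to state. In an N-way cache with evict-on-miss random replacement, every miss evicts one of the N resident lines uniformly at random, so a cached block survives k intervening misses with the probability below; the hard part, which the paper examines, is soundly bounding k and composing such per-access probabilities into a pWCET distribution when the underlying hit/miss events are not independent.

```latex
% N-way cache, evict-on-miss random replacement: a block survives one
% miss with probability (N-1)/N, hence k intervening misses with
\Pr[\text{hit on re-access}] \;=\; \left(\frac{N-1}{N}\right)^{k}
```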

Patent
Chen Sijia, Xing Jin, Volvet Zhang, Rui Zhang, Mengkang Wang
08 May 2014
TL;DR: A managed cache system is described in which a cache memory receives storage units via an uplink from a transmitting client, each storage unit including a decodable video unit and having a priority, and enables downloading of the storage units via a plurality of downlinks to receiving clients; a controller processor purges the cache memory of one of the storage units when, among other conditions, the one storage unit is not being downloaded to any of the receiving clients.
Abstract: In one embodiment, a managed cache system includes a cache memory to receive storage units via an uplink from a transmitting client, each storage unit including a decodable video unit, each storage unit having a priority, and enable downloading of the storage units via a plurality of downlinks to receiving clients, and a controller processor to purge the cache memory of one of the storage units when all of the following conditions are satisfied: the one storage unit is not being downloaded to any of the receiving clients, the one storage unit is not currently subject to a purging exclusion, and another one of the storage units now residing in the cache, having a higher priority than the priority of the one storage unit, arrived in the cache after the one storage unit. Related apparatus and methods are also described.

Proceedings ArticleDOI
19 Jun 2014
TL;DR: This work presents a distributed proactive caching approach that exploits user mobility information to decide where to proactively cache data to support seamless mobility, while efficiently utilizing cache storage using a congestion pricing scheme.
Abstract: We present a distributed proactive caching approach that exploits user mobility information to decide where to proactively cache data to support seamless mobility, while efficiently utilizing cache storage using a congestion pricing scheme. The proposed approach is applicable to the case where objects have different sizes and to a two-level cache hierarchy, for both of which the proactive caching problem is hard. Our evaluation results show how various system parameters influence the delay gains of the proposed approach, which achieves robust and good performance relative to an oracle and an optimal scheme for a flat cache structure.

Patent
11 Apr 2014
TL;DR: In this paper, a data structure for maintaining cache supporting compression and cache-wide deduplication is presented, where a first mapping is generated from short-length signatures to a storage location and a quantized length measure on a cache storage device; unused contiguous regions on the cache device are allocated.
Abstract: Systems and methods for generating and storing a data structure for maintaining cache supporting compression and cache-wide deduplication, including generating data structures with fixed size memory regions configured to hold multiple signatures as keys, wherein the number of the fixed size memory regions is bounded. A first mapping is generated from short-length signatures to a storage location and a quantized length measure on a cache storage device; and unused contiguous regions on the cache device are allocated. Metadata and cache page content is retrieved using a single input/output operation; a correctness of a full value of hash functions of uncompressed cache page content is validated; a second mapping is generated from short-length signatures to entries in the first mapping; and verification of whether the cached page content corresponds to a full-length original logical block address using the metadata is performed.

Patent
12 Mar 2014
TL;DR: In this article, a technique for concurrently accessing a data set includes initializing a shared cache with a column data store configured to store an expected data set in columns and creating a memory map for accessing the physical memory location in the shared cache.
Abstract: A technique for concurrently accessing a data set includes initializing a shared cache with a column data store configured to store an expected data set in columns and creating a memory map for accessing the physical memory location in the shared cache. Other operations include mapping the applications' data access requests to the shared cache with the memory map. One advantage of the disclosed technique is that only one instance of the expected data set is stored in memory, so each application is not required to create additional instances of the expected data set in the application's memory address space. Therefore, larger expected data sets may be entirely stored in memory without limiting the number of applications running concurrently.
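
The single-instance sharing the patent describes maps naturally onto POSIX shared memory. A minimal sketch (not the patent's implementation), where the object name "/colstore" is assumed for illustration:

```c
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one shared column store into this process's address space; all
 * applications mapping the same object read one physical copy, so no
 * process needs its own instance of the data set. */
double *map_shared_columns(size_t bytes)
{
    int fd = shm_open("/colstore", O_RDONLY, 0);   /* assumed name */
    if (fd < 0) return NULL;
    void *p = mmap(NULL, bytes, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping outlives the fd */
    return p == MAP_FAILED ? NULL : (double *)p;
}
```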

Proceedings ArticleDOI
13 Dec 2014
TL;DR: Futility Scaling (FS) is proposed, a novel replacement-based cache partitioning scheme that can precisely partition the whole cache while still maintaining high associativity even with a large number of partitions.
Abstract: As shared last-level caches are widely used in many-core CMPs to boost system performance, partitioning a large shared cache among multiple concurrently running applications becomes increasingly important in order to reduce destructive interference. However, while recent works start to show the promise of using replacement-based partitioning schemes, such existing schemes either suffer from severe associativity degradation when the number of partitions is high, or lack the ability to precisely partition the whole cache, which leads to decreased resource efficiency. In this paper, we propose Futility Scaling (FS), a novel replacement-based cache partitioning scheme that can precisely partition the whole cache while still maintaining high associativity even with a large number of partitions. The futility of a cache line represents the uselessness of this line to application performance and can be ranked in different ways by various policies, e.g., LRU and LFU. The idea of FS is to control the size of a partition by properly scaling the futility of its cache lines. We study the properties of FS on both associativity and sizing in an analytical framework, and present a feedback-based implementation of FS that incurs little overhead in practice. Simulation results show that FS improves performance over previously proposed Vantage and Prism by up to 6.0% and 13.7%, respectively.