Proceedings ArticleDOI

POSTER: Variable Sized Cache-Block Compaction

01 Sep 2019-pp 471-472
TL;DR: Experimental results reveal that VSCC outperforms state-of-the-art techniques from the performance and energy point of view while keeping the storage overheads within acceptable limits.
Abstract: Data blocks compressed to different sizes can be stored together inside a single cache-block to increase space utilization. However, the lack of a common size offset makes it challenging to locate individual blocks without additional tag overhead. We propose Variable Sized Cache-Block Compaction (VSCC) that allows us to store variable sized compressed blocks together and locate them inside a cache-block by using their compression encodings – available inside the tag metadata. We introduce a novel read/write scheme and a new BDI compression encoding, which reduce the necessary operations by 50%. Experimental results reveal that VSCC outperforms state-of-the-art techniques from the performance and energy point of view while keeping the storage overheads within acceptable limits.
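The location problem the abstract describes can be made concrete with a minimal sketch: once per-block compression encodings are available from the tag metadata, a block's byte offset inside the packed cache-block is just the sum of the compressed sizes of the blocks before it. The encoding names and sizes below are illustrative assumptions, not the paper's actual encoding table.

```python
# Illustrative only: hypothetical BDI-style encodings mapped to
# compressed sizes in bytes (NOT the paper's actual Table I).
ENCODING_SIZE = {"B8D1": 16, "B4D1": 20, "B8D2": 24, "UNCOMP": 64}

def block_offset(encodings, i):
    """Byte offset of block i inside a packed cache-block, derived
    purely from the compression encodings of the preceding blocks."""
    return sum(ENCODING_SIZE[e] for e in encodings[:i])

print(block_offset(["B8D1", "B4D1", "B8D2"], 2))  # -> 36 (16 + 20)
```

Because the offset derives entirely from encodings already held in the tag metadata, no extra per-block offset field is needed, which is how the abstract's "without additional tag overhead" goal can be met.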
References
Journal ArticleDOI
TL;DR: The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.
Abstract: The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.

4,039 citations


"POSTER: Variable Sized Cache-Block ..." refers methods in this paper

  • ...We consider benchmarks from SPEC CPU 2006 [5] and CRONO [6] running on GEM5 simulator [7]....

Proceedings ArticleDOI
19 Sep 2012
TL;DR: There is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.
Abstract: Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.
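The simple technique this paper goes on to propose, base-delta-immediate (BDI) compression, rests on the observation that the words in a cache block often differ from a common base value by only small deltas. A minimal sketch of the compressibility test (simplified: real BDI also tries an implicit zero base and several base/delta widths):

```python
def bdi_compressible(words, delta_bytes):
    """True if every word fits as base + a signed delta of delta_bytes."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)  # signed delta range
    return all(-limit <= w - base < limit for w in words)

# Nearby pointer values compress with 1-byte deltas:
print(bdi_compressible([0x7000, 0x7004, 0x7010, 0x7020], 1))  # True
# Widely scattered values do not:
print(bdi_compressible([0x7000, 0x9000, 0x1000, 0xF000], 1))  # False
```

When the test passes, the block can be stored as one base plus narrow deltas, shrinking a 64-byte block to a fraction of its size with only add/subtract hardware on the access path.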

348 citations


"POSTER: Variable Sized Cache-Block ..." refers methods in this paper

  • ...Both DCC and YACC use BDI [4] compression scheme....

  • ...New BDI encoding: To avoid fetching multiple compression sizes for determining the position of a block, we introduce a new base-delta-immediate (BDI) encoding (refer to Table I)....

Proceedings ArticleDOI
04 Oct 2015
TL;DR: CRONO as discussed by the authors is a benchmark suite composed of multi-threaded graph algorithms for shared memory multicore processors, which can be used to identify, analyze and develop novel architectural methods to mitigate their efficiency bottlenecks.
Abstract: Algorithms operating on a graph setting are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on evolving multicore processors. Previous parallel benchmark suites for shared-memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial and media processing. However, these suites lack graph applications that must be evaluated in the context of architectural design-space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multi-threaded graph algorithms for shared-memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator as well as a real multicore machine setup. CRONO uses both synthetic and real-world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns, and incur fine-grain communication between threads. Energy overheads also occur due to nondeterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.

93 citations


"POSTER: Variable Sized Cache-Block ..." refers methods in this paper

  • ...We consider benchmarks from SPEC CPU 2006 [5] and CRONO [6] running on GEM5 simulator [7]....

Proceedings ArticleDOI
07 Dec 2013
TL;DR: The Decoupled Compressed Cache (DCC) is proposed, which exploits spatial locality to improve both the performance and energy-efficiency of cache compression and nearly doubles the benefits of previous compressed caches with similar area overhead.
Abstract: In multicore processor systems, last-level caches (LLCs) play a crucial role in reducing system energy by i) filtering out expensive accesses to main memory and ii) reducing the time spent executing in high-power states. Cache compression can increase effective cache capacity and reduce misses, improve performance, and potentially reduce system energy. However, previous compressed cache designs have demonstrated only limited benefits due to internal fragmentation and limited tags. In this paper, we propose the Decoupled Compressed Cache (DCC), which exploits spatial locality to improve both the performance and energy-efficiency of cache compression. DCC uses decoupled super-blocks and non-contiguous sub-block allocation to decrease tag overhead without increasing internal fragmentation. Non-contiguous sub-blocks also eliminate the need for energy-expensive re-compaction when a block's size changes. Compared to earlier compressed caches, DCC increases normalized effective capacity to a maximum of 4 and an average of 2.2 for a wide range of workloads. A further optimized Co-DCC (Co-Compacted DCC) design improves the average normalized effective capacity to 2.6 by co-compacting the compressed blocks in a super-block. Our simulations show that DCC nearly doubles the benefits of previous compressed caches with similar area overhead. We also demonstrate a practical DCC design based on a recent commercial LLC design.

83 citations


"POSTER: Variable Sized Cache-Block ..." refers background or methods in this paper

  • ...Though DCCS8 achieves better effective cache capacity, it incurs high storage overhead....

  • ...Decoupled compressed cache (DCC) [1] stored variable-sized blocks together with the help of additional tag structures, while yet another compressed cache (YACC) [2] compromised on the overall performance to eliminate the storage overhead....

  • ...A longer run-time and additional accesses increase the overall energy consumption of DCC. Owing to a significantly lower miss-rate, VSCC outperforms YACC as well....

  • ...Lack of any additional tag structure allows VSCC to outperform DCC in terms of IPC and energy, in spite of DCC having lower miss-rate....

  • ...Despite its lower miss-rate, DCC suffers from lower IPC due to a longer run-time – attributed to the accesses made to its additional tag structure....

Proceedings ArticleDOI
01 Apr 1994
TL;DR: The decoupled sectored cache introduced in this paper will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
Abstract: Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting of several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Maintaining a low tag array size is a major issue in many cache designs (e.g., L2 caches). Using a sectored cache is a design trade-off between a low size of the tag array, which is possible with a large line size, and a low memory traffic, which requires a small line size. This technique has been used in many cache designs, including small on-chip microprocessor caches and large external second-level caches. Unfortunately, as on some applications the miss ratio on a sectored cache is significantly higher than the miss ratio on a non-sectored cache (factors higher than two are commonly observed), a significant part of the potential performance may be wasted in miss penalties. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache we introduce in this paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as the tag volume in a traditional sectored cache, but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
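The sectoring idea is easy to see in an address breakdown: one tag covers a whole sector of lines, so the line-within-sector bits select a per-line valid/dirty entry rather than a separate tag. The geometry below (64 B lines, 4 lines per sector, 1024 sets) is an illustrative assumption, not the paper's configuration.

```python
LINE_BYTES = 64        # assumed line size
LINES_PER_SECTOR = 4   # one address tag covers 4 neighbouring lines
NUM_SETS = 1024        # assumed number of cache sets

def decompose(addr):
    """Split an address into (tag, set, line-in-sector, byte offset)."""
    offset = addr % LINE_BYTES
    line_in_sector = (addr // LINE_BYTES) % LINES_PER_SECTOR
    set_index = (addr // (LINE_BYTES * LINES_PER_SECTOR)) % NUM_SETS
    tag = addr // (LINE_BYTES * LINES_PER_SECTOR * NUM_SETS)
    return tag, set_index, line_in_sector, offset

print(decompose(0x10040))  # -> (0, 256, 1, 0)
```

All four lines of a sector share one tag entry, so the tag array holds a quarter as many tags as a per-line-tagged cache of the same capacity, which is the hardware saving sectoring trades against its higher miss ratio.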

81 citations


"POSTER: Variable Sized Cache-Block ..." refers background in this paper

  • ...Recent compaction techniques [1], [2] have adopted sectoring [3] – which uses a single tag to track multiple neighbouring blocks (collectively known as sectors) with some additional metadata....
