Proceedings Article

Reducing traffic generated by conflict misses in caches

TLDR
Experimental results show that the BCC and SCC caches reduce the amount of traffic significantly in many cases, and that overall they incur the same number of cache misses as the direct-mapped cache.
Abstract
Off-chip memory accesses are a major source of power consumption in embedded processors. To reduce the amount of traffic between the processor and off-chip memory, and to hide memory latency, nearly all embedded processors have a cache on the same die as the processor core. Because small caches dissipate less power and are cheaper than large caches, a small cache is preferable to a large one. Furthermore, because set-associative caches consume more power than direct-mapped caches, a direct-mapped cache is preferable to a set-associative one. Small, direct-mapped caches, however, generally incur many conflict misses. In this paper we propose and evaluate a structure called the Conflict Detection Table (CDT). This table can be used to determine whether a memory access is expected to hit the cache. If a hit is expected but a miss occurs, a conflict is detected and appropriate action can be taken. In addition, we propose two cache structures that employ this technique: the Bypass in Case of Conflict (BCC) cache and the Sub-block in Case of Conflict (SCC) cache. When a conflict is detected, the BCC cache bypasses the cache, whereas the SCC cache fetches only a sub-block of the missing cache block. Experimental results using several embedded workloads show that the BCC and SCC caches reduce the amount of traffic significantly in many cases. Furthermore, overall they incur the same number of cache misses as the direct-mapped cache. This shows that the BCC and SCC caches reduce power consumption with a negligible reduction in performance.
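
The abstract does not say how the CDT is organized internally, so what follows is only a minimal Python sketch under stated assumptions. It models a hypothetical CDT as a fully-associative LRU tag store with as many entries as the cache has blocks, and treats an access as "expected to hit" when its block is present there (the textbook definition of a conflict miss). A direct-mapped cache then handles a miss that the CDT flags as a conflict according to one of three policies: `"dm"` treats it as an ordinary miss and fetches the whole block, `"bcc"` bypasses the cache and fetches only the requested word, and `"scc"` allocates the block but fetches only the sub-block containing the requested word. The names `Stats`, `ConflictDetectionTable`, `DirectMappedCache`, the policy strings, the 1 KB / 32-byte block / 8-byte sub-block / 4-byte word geometry, and the read-only trace (writes are ignored) are all illustrative assumptions, not the paper's configuration.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Stats:
    misses: int = 0
    conflicts: int = 0      # misses the (hypothetical) CDT flagged as conflicts
    traffic_bytes: int = 0  # bytes fetched from off-chip memory


class ConflictDetectionTable:
    """Hypothetical CDT (assumption): a fully-associative LRU tag store with as
    many entries as the cache has blocks. A block present here is 'expected to hit'."""

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        self.lru = OrderedDict()  # block address -> None, least recently used first

    def expect_hit(self, block: int) -> bool:
        hit = block in self.lru
        if hit:
            self.lru.move_to_end(block)  # mark as most recently used
        else:
            self.lru[block] = None
            if len(self.lru) > self.num_blocks:
                self.lru.popitem(last=False)  # evict the LRU entry
        return hit


class DirectMappedCache:
    def __init__(self, size=1024, block=32, sub_block=8, word=4, policy="dm"):
        assert policy in ("dm", "bcc", "scc") and block % sub_block == 0
        self.block, self.sub, self.word, self.policy = block, sub_block, word, policy
        self.num_sets = size // block
        self.subs_per_block = block // sub_block
        self.tags = [None] * self.num_sets
        self.valid = [[False] * self.subs_per_block for _ in range(self.num_sets)]
        self.cdt = ConflictDetectionTable(self.num_sets)
        self.stats = Stats()

    def access(self, addr: int) -> bool:
        """Simulate one read access; return True on a hit."""
        blk = addr // self.block               # block address
        idx = blk % self.num_sets              # direct-mapped set index
        sub = (addr % self.block) // self.sub  # sub-block within the block
        expected = self.cdt.expect_hit(blk)    # consult and update the CDT
        if self.tags[idx] == blk:
            if self.valid[idx][sub]:
                return True
            # Tag matches but this sub-block is missing (only after an SCC fill).
            self.stats.misses += 1
            self.valid[idx][sub] = True
            self.stats.traffic_bytes += self.sub
            return False
        self.stats.misses += 1
        if expected:  # a hit was expected but a miss occurred: conflict detected
            self.stats.conflicts += 1
            if self.policy == "bcc":
                # Bypass: fetch only the requested word, leave the cache untouched.
                self.stats.traffic_bytes += self.word
                return False
            if self.policy == "scc":
                # Allocate the block but fetch only the needed sub-block.
                self.tags[idx] = blk
                self.valid[idx] = [i == sub for i in range(self.subs_per_block)]
                self.stats.traffic_bytes += self.sub
                return False
        # Ordinary miss (or plain direct-mapped policy): fetch the whole block.
        self.tags[idx] = blk
        self.valid[idx] = [True] * self.subs_per_block
        self.stats.traffic_bytes += self.block
        return False
```

On a toy trace that alternates between two addresses mapping to the same set (for example `[0, 1024] * 10` with the default geometry), this sketch fetches 640 bytes under `"dm"`, 208 bytes under `"scc"`, and 100 bytes under `"bcc"`. These numbers describe only the sketch on that artificial trace, not the paper's measured results.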

Citations
Patent

Reducing layout conflicts among code units with caller-callee relationships

TL;DR: A code placement technique is presented that organizes code units to at least reduce layout conflicts among caller/callee code units; the code preparation environment identifies the code units of a code representation whose memory mappings overlap with those of their counterpart code units.
Proceedings Article

Dynamic techniques to reduce memory traffic in embedded systems

TL;DR: This paper measures how much traffic is generated by small, direct-mapped caches and what the minimal amount of traffic is, which yields an upper bound on the amount of traffic that can be saved by utilizing the on-chip memory more effectively.
Proceedings Article

Cache Miss-Aware Dynamic Stack Allocation

TL;DR: Experimental results show that dynamic stack allocation reduces cache misses by 4% to 42% across various benchmarks, at relatively low power cost and with no extra delay.

Energy reduction techniques for caches and multiprocessors

TL;DR: This dissertation studies several techniques that aim to reduce energy consumption in processors by decreasing the amount of data transferred between a processor and external memory, while improving or at least maintaining performance.
Journal Article

Data Cache System based on the Selective Bank Algorithm for Embedded System

TL;DR: This paper presents a high performance and low power cache structure with a bank selection mechanism that enhances exploitation of spatial and temporal locality and reduces conflict misses and cache pollution at the same time.