
Chapter 6
TUNING CACHES TO APPLICATIONS FOR LOW-ENERGY EMBEDDED SYSTEMS
Ann Gordon-Ross¹, Chuanjun Zhang², Frank Vahid¹,³, and Nikil Dutt⁴

¹Department of Computer Science and Engineering, University of California, Riverside; ²Department of Electrical Engineering, University of California, Riverside; ³Also with the Center for Embedded Computer Systems at UC Irvine; ⁴Center for Embedded Computer Systems, School of Information and Computer Science, University of California, Irvine.
Abstract: The power consumed by the memory hierarchy of a microprocessor can contribute as much as 50% of the total microprocessor system power, and is thus a good candidate for power and energy optimizations. We discuss four methods for tuning a microprocessor's cache subsystem to the needs of any executing application for low-energy embedded systems. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune a configurable level-one cache's total size, associativity, and line size to an executing application. We extend the single-level cache tuning heuristic to a two-level cache using a methodology applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We show that a victim buffer can be very effective as a configurable parameter in a memory hierarchy. We reduce the static energy dissipation of an on-chip data cache by compressing the frequent values that widely exist in a data cache memory.

Key words: Cache; configurable; architecture tuning; low power; low energy; embedded systems; on-chip CAD; dynamic optimization; cache hierarchy; cache exploration; cache optimization; victim buffer; frequent value.
1. INTRODUCTION
The power consumed by the memory hierarchy of a microprocessor can contribute to 50% or more of total microprocessor system power¹. Such a large contributor to power is a good candidate for power and energy optimization. The design of the caches in a memory hierarchy plays a major role in the memory hierarchy's power and performance.

Tuning cache design parameters to the needs of a particular application or program region can save energy. Cache design parameters include: cache size, meaning the total number of bytes of data storage; cache associativity, meaning the number of tag and data ways read simultaneously per cache access; cache line size, meaning the number of bytes in a block when moving data between the cache and the next memory level; and victim buffer use, meaning a small fully-associative buffer storing recently evicted cache data lines. Every application has different cache requirements that cannot be efficiently satisfied with one predetermined cache configuration. For instance, different applications have vastly different spatial and temporal locality and thus have different requirements² with respect to cache size, cache line size, cache associativity, victim buffer configuration, etc. In addition to tunable cache parameters, the frequent values that widely exist in the data caches of some applications can enable data encoding within the cache for reduced power consumption. We define cache tuning as the task of choosing the best configuration of cache design parameters for a particular application, or for a particular phase of an application, such that performance, power, and/or energy are optimized.
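To make these tuning knobs concrete, the sketch below gathers them into a single structure. The type and field names are our own illustration for this chapter's parameters, not an interface defined by the chapter itself.

```c
/* A minimal sketch (illustrative names, not the chapter's interface) of
 * the tunable cache design parameters defined above. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t size_bytes;     /* total data storage, e.g. 2048, 4096, 8192 */
    uint8_t  associativity;  /* tag/data ways read per access, e.g. 1, 2, 4 */
    uint16_t line_bytes;     /* bytes per block moved from the next level */
    bool     victim_buffer;  /* small fully-associative buffer of evicted lines */
} cache_config_t;

/* Cache tuning, as defined above, is then: choose the cache_config_t that
 * optimizes performance, power, and/or energy for a given application. */
```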
New technologies enable cache tuning. Core-based processors allow a designer to choose a particular cache configuration³⁻⁷. Some processor designs allow caches to be configured during system reset or even during runtime²,⁸,⁹.
Manual tuning of the cache is hard. A single-level cache may have many tens of different cache configurations, and interdependent multi-level caches may have thousands of cache configurations. The configuration space gets even larger if other dependent configurable architecture parameters are considered, such as bus and processor parameters. Exhaustively searching the space may be too slow even if fully automated. With possible average energy savings of over 40% through tuning²,¹⁰, we sought to develop automated cache tuning methods.
In this chapter, we discuss four methods of cache tuning for energy savings. We discuss an in-system method for automatically, transparently, and dynamically tuning a level-one cache; an automatic tuning methodology for two-level caches applicable to both a simulation-based exploration environment and a hardware-based prototyping environment; a configurable victim buffer; and a data cache that encodes frequent data values.

2. BACKGROUND – TUNABLE CACHE PARAMETERS
Many methods exist for configuring a single level of cache to a particular application during design time and in-system during runtime. Cache configuration can be specified during design time for many commercial soft cores from MIPS⁶, ARM⁵, and Arc⁴, and for environments such as Tensilica's Xtensa processor generator⁷ and Altera's Nios embedded processor system³.
Configurable cache hardware also exists to assist in cache configuration. Motorola's M*CORE⁹ processors offer way configuration, which allows the ways of a unified data/instruction cache to be individually specified as either data or instruction ways. Additionally, ways may be shut down entirely. Way shut-down is further explored by Albonesi⁸ to reduce dynamic power by an average of 40%. An adaptive cache line size methodology is proposed by Veidenbaum et al.¹¹ to reduce memory traffic by more than 50%.
Exhaustive search methods may be used to find optimal cache configurations, but the time required for an exhaustive search is often prohibitive. Several tools do exist for assisting designers in tuning a single level of cache. Platune¹² is a framework for tuning configurable system-on-a-chip (SOC) platforms. Platune offers many configurable parameters beyond just cache parameters, and prunes the search space by isolating interdependent parameters from independent parameters. The level-one cache parameters, being interdependent, are explored exhaustively.
Heuristic methods exist to prune the search space of the configurable cache. Palesi et al.¹³ improve upon the exhaustive search used in Platune by using a genetic algorithm to produce comparable results in less time. Zhang et al.¹⁴ present a cache configuration exploration methodology wherein a cache exploration component searches configurations in order of their impact on energy, and produces a list of Pareto-optimal points representing reasonable tradeoffs in energy and performance. Ghosh et al.¹⁵ use an analytical model to efficiently explore cache size and associativity and directly compute a cache configuration that meets the designers' performance constraints.
Few methods exist for tuning multiple levels of a cache hierarchy. Balasubramonian et al.¹⁰ propose a hardware-based cache configuration management algorithm to improve memory hierarchy performance while considering energy consumption. An average reduction in memory hierarchy energy of 43% can be achieved with a configurable level-two and level-three cache hierarchy coupled with a conventional level-one cache.

3. A SELF-TUNING LEVEL ONE CACHE ARCHITECTURE
Tuning a cache to a particular application can be a cumbersome task left to designers even with the advent of recent computer-aided design (CAD) tuning aids. Large configuration spaces may take a designer weeks or months to explore, and with a short time-to-market, lengthy tuning iterations may not be feasible. We propose to move the CAD environment on-chip, eliminating designer effort for cache tuning. We introduce on-chip hardware implementing an efficient heuristic that automatically, transparently, and dynamically tunes the cache to the executing program to reduce energy¹⁶.
3.1 Configurable Cache Architecture
The on-chip hardware tunes four cache parameters in the level-one cache: cache line size (64, 32, or 16 bytes), cache size (8, 4, or 2 Kbytes), associativity (4, 2, or 1 way), and cache way prediction (on or off). Way prediction is a method for reducing set-associative cache energy, in which one way is initially accessed, and the other ways are accessed only upon a miss in the predicted way.
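A hedged sketch of this tradeoff follows; the per-way energy and the prediction hit rate are free inputs here, not values from the chapter.

```c
#include <stdbool.h>

/* Model of the way-prediction tradeoff described above: with prediction
 * off, all n_ways tag/data ways are read in parallel; with prediction
 * on, one way is read first, and the remaining ways are read only when
 * the predicted way misses. Energy inputs are illustrative. */
double way_access_energy(int n_ways, bool way_pred,
                         double pred_hit_rate, double e_per_way)
{
    if (!way_pred)
        return (double)n_ways * e_per_way;   /* parallel read of all ways */

    /* One way on the first probe, plus the other n_ways - 1 ways on the
     * fraction of accesses where the predicted way does not hit. */
    return e_per_way +
           (1.0 - pred_hit_rate) * (double)(n_ways - 1) * e_per_way;
}
```

Note that a misprediction also costs an extra access cycle, so way prediction trades a small performance loss for dynamic energy savings.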
[Figure 6-1. Self-tuning cache architecture: a microprocessor with a level-one instruction cache (I$), a level-one data cache (D$), a cache tuner, and off-chip memory.]
The exploration space is quite large, necessitating an efficient exploration heuristic implemented with specialized tuning hardware, as illustrated in Figure 6-1. The tuning phase may be activated during a special software-selected tuning mode, during startup of a task, whenever a program phase change is detected, or at fixed time intervals. The choice of approach is orthogonal to the design of the self-tuning architecture itself.
The cache architecture supports a certain range of configurations². The base level-one cache of 8 Kbytes consists of four banks that can operate as four ways. A special configuration register allows the ways to be concatenated to form either a direct-mapped or 2-way set-associative 8 Kbyte cache. The configuration register may also be configured to shut down ways, resulting in a 4 Kbyte direct-mapped or 2-way set-associative cache, or a 2 Kbyte direct-mapped cache. Specifically, due to the bank layout for way shut-down, 2 Kbyte 2- or 4-way set-associative and 4 Kbyte 4-way set-associative caches are not possible using the configurable cache hardware.
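The sketch below encodes these bank-layout constraints; the register fields are a hypothetical encoding, since the excerpt does not give the actual bit layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the configuration register described above:
 * four 2-Kbyte banks that can be concatenated into fewer, larger ways
 * or shut down entirely. */
typedef struct {
    uint8_t ways;          /* ways formed from the active banks: 1, 2, or 4 */
    uint8_t banks_active;  /* 4 banks = 8 Kbytes, 2 = 4 Kbytes, 1 = 2 Kbytes */
} cache_cfg_t;

/* Legal configurations per the bank layout: 8 Kbyte 1/2/4-way,
 * 4 Kbyte 1/2-way, and 2 Kbyte direct-mapped only. */
static bool cfg_is_legal(cache_cfg_t c)
{
    switch (c.banks_active) {
    case 4: return c.ways == 1 || c.ways == 2 || c.ways == 4;
    case 2: return c.ways == 1 || c.ways == 2;
    case 1: return c.ways == 1;
    default: return false;
    }
}
```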
3.2 Heuristic Development Through Analysis
A naïve tuning approach would simply try all possible combinations of
configurable parameters in an arbitrary order. For each configuration, the
miss rate can be measured and used to estimate the energy consumption of
the particular cache configuration. After all configurations are executed, the
approach would simply choose the configuration with the lowest energy
consumption. However, such an exhaustive method may involve the
inspection of too many configurations. Therefore, we wish to develop a
cache tuning heuristic that minimizes the number of configurations explored.
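The sketch below illustrates this naive exhaustive approach. measure_energy() stands in for the run-and-estimate step (executing the application once per configuration and estimating energy from the measured miss rate); it is a placeholder, not a real API.

```c
#include <stdint.h>

/* Placeholder: runs the application under one configuration and returns
 * an energy estimate derived from the measured miss rate. */
extern double measure_energy(uint32_t size, uint16_t line, uint8_t ways);

/* Naive exhaustive search over the level-one parameter ranges above. */
void exhaustive_search(uint32_t *best_size, uint16_t *best_line,
                       uint8_t *best_ways)
{
    static const uint32_t sizes[]  = {2048, 4096, 8192};
    static const uint16_t lines[]  = {16, 32, 64};
    static const uint8_t  assocs[] = {1, 2, 4};
    double best = -1.0;

    for (int s = 0; s < 3; s++)
        for (int l = 0; l < 3; l++)
            for (int a = 0; a < 3; a++) {
                double e = measure_energy(sizes[s], lines[l], assocs[a]);
                if (best < 0.0 || e < best) {
                    best = e;
                    *best_size = sizes[s];
                    *best_line = lines[l];
                    *best_ways = assocs[a];
                }
            }
    /* 3 x 3 x 3 = 27 application runs, before way prediction is even
     * considered, and ignoring which combinations the banks allow. */
}
```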
When developing a good heuristic, the parameter (cache size, line size, associativity, or way prediction) with the largest impact on performance and energy would likely be the best parameter to search first. We analyzed each parameter to determine its impact on miss rate and energy by fixing three parameters and varying the remaining one.
We observed that varying the cache size had the largest average impact on energy and miss rate: changing the cache size can impact the energy by a factor of two or more. From our analysis, we developed a search heuristic that first determines the best cache size, then the best line size, then the best associativity, and finally, if the best associativity is greater than one, determines whether or not to use way prediction.
3.3 Search Heuristic
The heuristic developed based on the importance of the parameters is summarized below (a code sketch of the search follows the list):
1. Begin with a 2 Kbyte, direct-mapped cache with a 16 byte line size. Increase the cache size to 4 Kbytes. If the increase in cache size causes a decrease in energy consumption, increase the cache size to 8 Kbytes. Choose the cache size with the best energy consumption.
2. For the best cache size determined in step 1, increase the line size from 16 bytes to 32 bytes. If the increase in line size causes a decrease in energy consumption, increase the line size to 64 bytes. Choose the line size with the best energy consumption.
3. For the best cache size determined in step 1 and the best line size determined in step 2, increase the associativity to 2 ways. If the increase in associativity causes a decrease in energy consumption, increase the associativity to 4 ways. Choose the associativity with the best energy consumption.
4. For the best configuration determined in steps 1 through 3, if the best associativity is greater than one, determine whether enabling way prediction decreases energy consumption, and use way prediction if it does.
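A minimal sketch of this greedy search appears below. It tunes one parameter at a time in impact order and stops increasing a parameter as soon as energy stops improving; measure_energy_wp() is a placeholder for the run-and-estimate step, and the bank-layout legality checks from Section 3.1 are omitted for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder: runs the application under one configuration (including
 * the way-prediction setting) and returns an energy estimate. */
extern double measure_energy_wp(uint32_t size, uint16_t line,
                                uint8_t ways, bool way_pred);

void heuristic_search(uint32_t *sz, uint16_t *ln, uint8_t *ways, bool *wp)
{
    const uint32_t sizes[]  = {2048, 4096, 8192};
    const uint16_t lines[]  = {16, 32, 64};
    const uint8_t  assocs[] = {1, 2, 4};

    *sz = sizes[0]; *ln = lines[0]; *ways = assocs[0]; *wp = false;
    double best = measure_energy_wp(*sz, *ln, *ways, *wp);

    for (int i = 1; i < 3; i++) {            /* step 1: cache size */
        double e = measure_energy_wp(sizes[i], *ln, *ways, *wp);
        if (e >= best) break;
        best = e; *sz = sizes[i];
    }
    for (int i = 1; i < 3; i++) {            /* step 2: line size */
        double e = measure_energy_wp(*sz, lines[i], *ways, *wp);
        if (e >= best) break;
        best = e; *ln = lines[i];
    }
    for (int i = 1; i < 3; i++) {            /* step 3: associativity */
        double e = measure_energy_wp(*sz, *ln, assocs[i], *wp);
        if (e >= best) break;
        best = e; *ways = assocs[i];
    }
    if (*ways > 1 &&                         /* step 4: way prediction */
        measure_energy_wp(*sz, *ln, *ways, true) < best)
        *wp = true;
}
```

Because each ascent stops at the first energy increase, this sketch visits at most 8 of the 27+ configurations, consistent with the chapter's report of 5.8 configurations searched on average.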

References (partial)
- C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the International Symposium on Microarchitecture (MICRO), 1997.
- N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the International Symposium on Computer Architecture (ISCA), 1990.
- S. J. E. Wilton and N. P. Jouppi. CACTI: an enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 1996.
- D. H. Albonesi. Selective cache ways: on-demand cache resource allocation. In Proceedings of the International Symposium on Microarchitecture (MICRO), 1999.
- D. Burger and T. M. Austin. Evaluating future microprocessors: the SimpleScalar tool set. Technical report, University of Wisconsin-Madison, 1996.
Frequently Asked Questions
Q1. What are the contributions in this paper?
The authors present four methods for tuning a microprocessor's cache subsystem to any executing application for low-energy embedded systems: on-chip hardware implementing an efficient heuristic that automatically, transparently, and dynamically tunes a configurable level-one cache's total size, associativity, and line size; an extension of the heuristic to two-level caches; a victim buffer as a configurable parameter in a memory hierarchy; and a data cache that compresses frequent values.

Additional excerpts from the chapter:
- The basic intuition behind the authors' two-level heuristic is that interlacing the exploration allows for better modeling and tuning of the interdependencies between the different levels of the cache hierarchy.
- The target architecture for the two-level cache tuning heuristic contains separate level-one instruction and data caches and separate level-two instruction and data caches.
- The authors extended the heuristic described in Section 3.3 to a two-level cache by tuning the level-one cache while holding the level-two cache at the smallest size, then tuning the level-two cache using the same heuristic.
- The search heuristic is quite effective: it searches on average only 5.8 configurations, compared to 27 configurations for an exhaustive approach.
- The authors simulated numerous Powerstone⁹ and MediaBench¹⁸ benchmarks using SimpleScalar¹⁹, a cycle-accurate simulator that includes a MIPS-like microprocessor model, to obtain the number of cache accesses and cache misses for each benchmark and configuration explored.
- It took over one month of continual simulation time on an UltraSPARC compute server to generate the data for the nine benchmarks.
- The authors obtained the power consumed by their cache tuner through simulation of a synthesized version of the cache tuner written in VHDL.
- Direct-mapped caches are popular in embedded microprocessor architecture due to their simplicity and good hit rates for many applications.
- The FV cache was proposed based on the observation that a small number of distinct frequently occurring data values often occupy a large portion of program memory data spaces and therefore account for a large portion of memory accesses²⁷.
- Instead of synthesizing the FVs on-chip, a register file may be used to store FVs so that they can be rewritten on each activation of a different program.
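A hedged sketch of the frequent-value (FV) encoding idea from the last two excerpts follows; the table size and lookup are illustrative, not the chapter's design.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_FV 32                    /* e.g., 32 frequent values -> 5-bit codes */

/* Table of frequently occurring data values; per the excerpt above, this
 * could be a register file rewritten on each activation of a program. */
static uint32_t fv_table[N_FV];

/* If v is a frequent value, return true and its short code; a matching
 * word can then be stored compressed, letting most of its cache cells
 * sit in a low-leakage state to reduce static energy. Otherwise the
 * word is stored uncompressed. */
static bool fv_encode(uint32_t v, uint8_t *code)
{
    for (uint8_t i = 0; i < N_FV; i++) {
        if (fv_table[i] == v) {
            *code = i;
            return true;
        }
    }
    return false;
}
```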