
Chapter 6
TUNING CACHES TO APPLICATIONS FOR LOW-ENERGY EMBEDDED SYSTEMS
Ann Gordon-Ross¹, Chuanjun Zhang², Frank Vahid¹,³, and Nikil Dutt⁴

¹Department of Computer Science and Engineering, University of California, Riverside; ²Department of Electrical Engineering, University of California, Riverside; ³Also with the Center for Embedded Computer Systems at UC Irvine; ⁴Center for Embedded Computer Systems, School of Information and Computer Science, University of California, Irvine.
Abstract: The power consumed by the memory hierarchy of a microprocessor can contribute as much as 50% of the total microprocessor system power, and is thus a good candidate for power and energy optimizations. We discuss four methods for tuning a microprocessor's cache subsystem to the needs of any executing application for low-energy embedded systems. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune a configurable level-one cache's total size, associativity, and line size to an executing application. We extend the single-level cache tuning heuristic to a two-level cache using a methodology applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We show that a victim buffer can be very effective as a configurable parameter in a memory hierarchy. We reduce the static energy dissipation of an on-chip data cache by compressing the frequent values that widely exist in a data cache memory.

Key words: Cache; configurable; architecture tuning; low power; low energy; embedded systems; on-chip CAD; dynamic optimization; cache hierarchy; cache exploration; cache optimization; victim buffer; frequent value.
1. INTRODUCTION
The power consumed by the memory hierarchy of a microprocessor can contribute to 50% or more of total microprocessor system power¹. Such a large contributor to power is a good candidate for power and energy optimization. The design of the caches in a memory hierarchy plays a major role in the memory hierarchy's power and performance.

Tuning cache design parameters to the needs of a particular application or program region can save energy. Cache design parameters include: cache size, meaning the total number of bytes of data storage; cache associativity, meaning the number of tag and data ways read simultaneously per cache access; cache line size, meaning the number of bytes in a block when moving data between the cache and the next memory level; and victim buffer use, meaning a small fully-associative buffer storing recently evicted cache data lines. Every application has different cache requirements that cannot be efficiently satisfied with one predetermined cache configuration. For instance, different applications have vastly different spatial and temporal locality and thus have different requirements² with respect to cache size, cache line size, cache associativity, victim buffer configuration, etc. In addition to tunable cache parameters, the frequent values that widely exist in the data caches of some applications can enable data encoding within the cache for reduced power consumption. We define cache tuning as the task of choosing the best configuration of cache design parameters for a particular application, or for a particular phase of an application, such that performance, power, and/or energy are optimized.
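To make these tuning knobs concrete, the sketch below gathers them into a single structure. The type and field names are our own illustration for this chapter's parameters, not an interface defined by the chapter itself.

```c
/* A minimal sketch (illustrative names, not the chapter's interface) of
 * the tunable cache design parameters defined above. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t size_bytes;     /* total data storage, e.g. 2048, 4096, 8192 */
    uint8_t  associativity;  /* tag/data ways read per access, e.g. 1, 2, 4 */
    uint16_t line_bytes;     /* bytes per block moved from the next level */
    bool     victim_buffer;  /* small fully-associative buffer of evicted lines */
} cache_config_t;

/* Cache tuning, as defined above, is then: choose the cache_config_t that
 * optimizes performance, power, and/or energy for a given application. */
```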
New technologies enable cache tuning. Core-based processors allow a designer to choose a particular cache configuration³⁻⁷. Some processor designs allow caches to be configured during system reset or even during runtime²,⁸,⁹.
Manual tuning of the cache is hard. A single-level cache may have many tens of different cache configurations, and interdependent multi-level caches may have thousands of cache configurations. The configuration space gets even larger if other dependent configurable architecture parameters are considered, such as bus and processor parameters. Exhaustively searching the space may be too slow even if fully automated. With possible average energy savings of over 40% through tuning²,¹⁰, we sought to develop automated cache tuning methods.
In this chapter, we discuss four methods of cache tuning for energy savings. We discuss an in-system method for automatically, transparently, and dynamically tuning a level-one cache; an automatic tuning methodology for two-level caches applicable to both a simulation-based exploration environment and a hardware-based prototyping environment; a configurable victim buffer; and a data cache that encodes frequent data values.

2. BACKGROUND – TUNABLE CACHE PARAMETERS
Many methods exist for configuring a single level of cache to a particular application during design time and in-system during runtime. Cache configuration can be specified during design time for many commercial soft cores from MIPS⁶, ARM⁵, and Arc⁴, and for environments such as Tensilica's Xtensa processor generator⁷ and Altera's Nios embedded processor system³.
Configurable cache hardware also exists to assist in cache configuration. Motorola's M*CORE⁹ processors offer way configuration, which allows the ways of a unified data/instruction cache to be individually specified as either data or instruction ways. Additionally, ways may be shut down entirely. Way shut-down is further explored by Albonesi⁸ to reduce dynamic power by an average of 40%. An adaptive cache line size methodology is proposed by Veidenbaum et al.¹¹ to reduce memory traffic by more than 50%.
Exhaustive search methods may be used to find optimal cache configurations, but the time required for an exhaustive search is often prohibitive. Several tools do exist for assisting designers in tuning a single level of cache. Platune¹² is a framework for tuning configurable system-on-a-chip (SOC) platforms. Platune offers many configurable parameters beyond just cache parameters, and prunes the search space by isolating interdependent parameters from independent parameters. The level-one cache parameters, being interdependent, are explored exhaustively.
Heuristic methods exist to prune the search space of the configurable cache. Palesi et al.¹³ improve upon the exhaustive search used in Platune by using a genetic algorithm to produce comparable results in less time. Zhang et al.¹⁴ present a cache configuration exploration methodology wherein a cache exploration component searches configurations in order of their impact on energy, and produces a list of Pareto-optimal points representing reasonable tradeoffs in energy and performance. Ghosh et al.¹⁵ use an analytical model to efficiently explore cache size and associativity and directly compute a cache configuration that meets the designers' performance constraints.
Few methods exist for tuning multiple levels of a cache hierarchy. Balasubramonian et al.¹⁰ propose a hardware-based cache configuration management algorithm to improve memory hierarchy performance while considering energy consumption. An average reduction in memory hierarchy energy of 43% can be achieved with a configurable level-two and level-three cache hierarchy coupled with a conventional level-one cache.

3. A SELF-TUNING LEVEL ONE CACHE ARCHITECTURE
Tuning a cache to a particular application can be a cumbersome task left to designers even with the advent of recent computer-aided design (CAD) tuning aids. Large configuration spaces may take a designer weeks or months to explore, and with a short time-to-market, lengthy tuning iterations may not be feasible. We propose to move the CAD environment on-chip, eliminating designer effort for cache tuning. We introduce on-chip hardware implementing an efficient heuristic that automatically, transparently, and dynamically tunes the cache to the executing program to reduce energy¹⁶.
3.1 Configurable Cache Architecture
The on-chip hardware tunes four cache parameters in the level-one cache: cache line size (64, 32, or 16 bytes), cache size (8, 4, or 2 Kbytes), associativity (4, 2, or 1 way), and cache way prediction (on or off). Way prediction is a method for reducing set-associative cache energy, in which one way is initially accessed, and the other ways are accessed only upon a miss in the predicted way.
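A hedged sketch of this tradeoff follows; the per-way energy and the prediction hit rate are free inputs here, not values from the chapter.

```c
#include <stdbool.h>

/* Model of the way-prediction tradeoff described above: with prediction
 * off, all n_ways tag/data ways are read in parallel; with prediction
 * on, one way is read first, and the remaining ways are read only when
 * the predicted way misses. Energy inputs are illustrative. */
double way_access_energy(int n_ways, bool way_pred,
                         double pred_hit_rate, double e_per_way)
{
    if (!way_pred)
        return (double)n_ways * e_per_way;   /* parallel read of all ways */

    /* One way on the first probe, plus the other n_ways - 1 ways on the
     * fraction of accesses where the predicted way does not hit. */
    return e_per_way +
           (1.0 - pred_hit_rate) * (double)(n_ways - 1) * e_per_way;
}
```

Note that a misprediction also costs an extra access cycle, so way prediction trades a small performance loss for dynamic energy savings.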
[Figure 6-1. Self-tuning cache architecture: a microprocessor with a level-one instruction cache (I$), a level-one data cache (D$), a cache tuner, and off-chip memory.]
The exploration space is quite large, necessitating an efficient exploration heuristic implemented with specialized tuning hardware, as illustrated in Figure 6-1. The tuning phase may be activated during a special software-selected tuning mode, during startup of a task, whenever a program phase change is detected, or at fixed time intervals. The choice of approach is orthogonal to the design of the self-tuning architecture itself.
The cache architecture supports a certain range of configurations². The base level-one cache of 8 Kbytes consists of four banks that can operate as four ways. A special configuration register allows the ways to be concatenated to form either a direct-mapped or 2-way set-associative 8 Kbyte cache. The configuration register may also be configured to shut down ways, resulting in a 4 Kbyte direct-mapped or 2-way set-associative cache, or a 2 Kbyte direct-mapped cache. Specifically, due to the bank layout for way shut-down, 2 Kbyte 2- or 4-way set-associative and 4 Kbyte 4-way set-associative caches are not possible using the configurable cache hardware.
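The sketch below encodes these bank-layout constraints; the register fields are a hypothetical encoding, since the excerpt does not give the actual bit layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the configuration register described above:
 * four 2-Kbyte banks that can be concatenated into fewer, larger ways
 * or shut down entirely. */
typedef struct {
    uint8_t ways;          /* ways formed from the active banks: 1, 2, or 4 */
    uint8_t banks_active;  /* 4 banks = 8 Kbytes, 2 = 4 Kbytes, 1 = 2 Kbytes */
} cache_cfg_t;

/* Legal configurations per the bank layout: 8 Kbyte 1/2/4-way,
 * 4 Kbyte 1/2-way, and 2 Kbyte direct-mapped only. */
static bool cfg_is_legal(cache_cfg_t c)
{
    switch (c.banks_active) {
    case 4: return c.ways == 1 || c.ways == 2 || c.ways == 4;
    case 2: return c.ways == 1 || c.ways == 2;
    case 1: return c.ways == 1;
    default: return false;
    }
}
```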
3.2 Heuristic Development Through Analysis
A naïve tuning approach would simply try all possible combinations of
configurable parameters in an arbitrary order. For each configuration, the
miss rate can be measured and used to estimate the energy consumption of
the particular cache configuration. After all configurations are executed, the
approach would simply choose the configuration with the lowest energy
consumption. However, such an exhaustive method may involve the
inspection of too many configurations. Therefore, we wish to develop a
cache tuning heuristic that minimizes the number of configurations explored.
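The sketch below illustrates this naive exhaustive approach. measure_energy() stands in for the run-and-estimate step (executing the application once per configuration and estimating energy from the measured miss rate); it is a placeholder, not a real API.

```c
#include <stdint.h>

/* Placeholder: runs the application under one configuration and returns
 * an energy estimate derived from the measured miss rate. */
extern double measure_energy(uint32_t size, uint16_t line, uint8_t ways);

/* Naive exhaustive search over the level-one parameter ranges above. */
void exhaustive_search(uint32_t *best_size, uint16_t *best_line,
                       uint8_t *best_ways)
{
    static const uint32_t sizes[]  = {2048, 4096, 8192};
    static const uint16_t lines[]  = {16, 32, 64};
    static const uint8_t  assocs[] = {1, 2, 4};
    double best = -1.0;

    for (int s = 0; s < 3; s++)
        for (int l = 0; l < 3; l++)
            for (int a = 0; a < 3; a++) {
                double e = measure_energy(sizes[s], lines[l], assocs[a]);
                if (best < 0.0 || e < best) {
                    best = e;
                    *best_size = sizes[s];
                    *best_line = lines[l];
                    *best_ways = assocs[a];
                }
            }
    /* 3 x 3 x 3 = 27 application runs, before way prediction is even
     * considered, and ignoring which combinations the banks allow. */
}
```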
When developing a good heuristic, the parameter (cache size, line size, associativity, or way prediction) with the largest impact on performance and energy would likely be the best parameter to search first. We analyzed each parameter to determine its impact on miss rate and energy by fixing three parameters and varying the remaining one.
We observed that varying the cache size had the largest average impact on energy and miss rate: changing the cache size can impact the energy by a factor of two or more. From our analysis, we developed a search heuristic that first determines the best cache size, then the best line size, then the best associativity, and finally, if the best associativity is greater than one, determines whether or not to use way prediction.
3.3 Search Heuristic
The heuristic developed based on the importance of the parameters is summarized below (a code sketch of the search follows the list):
1. Begin with a 2 Kbyte, direct-mapped cache with a 16 byte line size. Increase the cache size to 4 Kbytes. If the increase in cache size causes a decrease in energy consumption, increase the cache size to 8 Kbytes. Choose the cache size with the best energy consumption.
2. For the best cache size determined in step 1, increase the line size from 16 bytes to 32 bytes. If the increase in line size causes a decrease in energy consumption, increase the line size to 64 bytes. Choose the line size with the best energy consumption.
3. For the best cache size determined in step 1 and the best line size determined in step 2, increase the associativity to 2 ways. If the increase in associativity causes a decrease in energy consumption, increase the associativity to 4 ways. Choose the associativity with the best energy consumption.
4. For the best configuration determined in steps 1 through 3, if the best associativity is greater than one, determine whether enabling way prediction decreases energy consumption, and use way prediction if it does.
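A minimal sketch of this greedy search appears below. It tunes one parameter at a time in impact order and stops increasing a parameter as soon as energy stops improving; measure_energy_wp() is a placeholder for the run-and-estimate step, and the bank-layout legality checks from Section 3.1 are omitted for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder: runs the application under one configuration (including
 * the way-prediction setting) and returns an energy estimate. */
extern double measure_energy_wp(uint32_t size, uint16_t line,
                                uint8_t ways, bool way_pred);

void heuristic_search(uint32_t *sz, uint16_t *ln, uint8_t *ways, bool *wp)
{
    const uint32_t sizes[]  = {2048, 4096, 8192};
    const uint16_t lines[]  = {16, 32, 64};
    const uint8_t  assocs[] = {1, 2, 4};

    *sz = sizes[0]; *ln = lines[0]; *ways = assocs[0]; *wp = false;
    double best = measure_energy_wp(*sz, *ln, *ways, *wp);

    for (int i = 1; i < 3; i++) {            /* step 1: cache size */
        double e = measure_energy_wp(sizes[i], *ln, *ways, *wp);
        if (e >= best) break;
        best = e; *sz = sizes[i];
    }
    for (int i = 1; i < 3; i++) {            /* step 2: line size */
        double e = measure_energy_wp(*sz, lines[i], *ways, *wp);
        if (e >= best) break;
        best = e; *ln = lines[i];
    }
    for (int i = 1; i < 3; i++) {            /* step 3: associativity */
        double e = measure_energy_wp(*sz, *ln, assocs[i], *wp);
        if (e >= best) break;
        best = e; *ways = assocs[i];
    }
    if (*ways > 1 &&                         /* step 4: way prediction */
        measure_energy_wp(*sz, *ln, *ways, true) < best)
        *wp = true;
}
```

Because each ascent stops at the first energy increase, this sketch visits at most 8 of the 27+ configurations, consistent with the chapter's report of 5.8 configurations searched on average.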

References (partial)
- C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the International Symposium on Microarchitecture (MICRO), 1997.
- N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the International Symposium on Computer Architecture (ISCA), 1990.
- S. J. E. Wilton and N. P. Jouppi. CACTI: an enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 1996.
- D. H. Albonesi. Selective cache ways: on-demand cache resource allocation. In Proceedings of the International Symposium on Microarchitecture (MICRO), 1999.
- D. Burger and T. M. Austin. Evaluating future microprocessors: the SimpleScalar tool set. Technical report, University of Wisconsin-Madison, 1996.
Frequently Asked Questions
Q1. What are the contributions in this paper?
The authors present four methods for tuning a microprocessor's cache subsystem to any executing application for low-energy embedded systems: on-chip hardware implementing an efficient heuristic that automatically, transparently, and dynamically tunes a configurable level-one cache's total size, associativity, and line size; an extension of the heuristic to two-level caches; a victim buffer as a configurable parameter in a memory hierarchy; and a data cache that compresses frequent values.

Additional excerpts from the chapter:
- The basic intuition behind the authors' two-level heuristic is that interlacing the exploration allows for better modeling and tuning of the interdependencies between the different levels of the cache hierarchy.
- The target architecture for the two-level cache tuning heuristic contains separate level-one instruction and data caches and separate level-two instruction and data caches.
- The authors extended the heuristic described in Section 3.3 to a two-level cache by tuning the level-one cache while holding the level-two cache at the smallest size, then tuning the level-two cache using the same heuristic.
- The search heuristic is quite effective: it searches on average only 5.8 configurations, compared to 27 configurations for an exhaustive approach.
- The authors simulated numerous Powerstone⁹ and MediaBench¹⁸ benchmarks using SimpleScalar¹⁹, a cycle-accurate simulator that includes a MIPS-like microprocessor model, to obtain the number of cache accesses and cache misses for each benchmark and configuration explored.
- It took over one month of continual simulation time on an UltraSPARC compute server to generate the data for the nine benchmarks.
- The authors obtained the power consumed by their cache tuner through simulation of a synthesized version of the cache tuner written in VHDL.
- Direct-mapped caches are popular in embedded microprocessor architecture due to their simplicity and good hit rates for many applications.
- The FV cache was proposed based on the observation that a small number of distinct frequently occurring data values often occupy a large portion of program memory data spaces and therefore account for a large portion of memory accesses²⁷.
- Instead of synthesizing the FVs on-chip, a register file may be used to store FVs so that they can be rewritten on each activation of a different program.
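A hedged sketch of the frequent-value (FV) encoding idea from the last two excerpts follows; the table size and lookup are illustrative, not the chapter's design.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_FV 32                    /* e.g., 32 frequent values -> 5-bit codes */

/* Table of frequently occurring data values; per the excerpt above, this
 * could be a register file rewritten on each activation of a program. */
static uint32_t fv_table[N_FV];

/* If v is a frequent value, return true and its short code; a matching
 * word can then be stored compressed, letting most of its cache cells
 * sit in a low-leakage state to reduce static energy. Otherwise the
 * word is stored uncompressed. */
static bool fv_encode(uint32_t v, uint8_t *code)
{
    for (uint8_t i = 0; i < N_FV; i++) {
        if (fv_table[i] == v) {
            *code = i;
            return true;
        }
    }
    return false;
}
```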