# Static Energy Reduction Techniques for Microprocessor Caches

Heather Hanson, Student Member, IEEE, M. S. Hrishikesh, Student Member, IEEE, Vikas Agarwal, Stephen W. Keckler, Member, IEEE, and Doug Burger, Member, IEEE

*Abstract*—Microprocessor performance has been improved by increasing the capacity of on-chip caches. However, the performance gain comes at the price of static energy consumption due to subthreshold leakage current in cache memory arrays. This paper compares three techniques for reducing static energy consumption in on-chip level-1 and level-2 caches. One technique employs low-leakage transistors in the memory cell. Another technique, power supply switching, can be used to turn off memory cells and discard their contents. A third alternative is dynamic threshold modulation, which places memory cells in a standby state that preserves cell contents. In our experiments, we explore the energy and performance tradeoffs of these techniques. We also investigate the sensitivity of microprocessor performance and energy consumption to additional cache latency caused by leakage-reduction techniques.

Index Terms—Dual- $V_t$ , gated- $V_{dd}$ , leakage current, low-power design, multithreshold-CMOS (MTCMOS), power-consumption model, static energy.

## I. INTRODUCTION

CONTINUED improvements in integrated circuit fabrication technology have enabled the number of transistors in microprocessors to more than double each generation. A vast majority of transistors in modern microprocessors are used for on-chip storage, including level-1 and level-2 caches, and meta-state, such as renaming registers, numerous predictor structures, and trace caches. As leakage current increases with each process technology generation, the energy consumption of memory structures will increase dramatically. In this paper, we explore the energy/performance tradeoffs of three leakage-reduction techniques for on-chip level-1 and level-2 caches.

One method, *dual-$V_t$*, employs slower transistors with a higher threshold voltage, and hence lower leakage, in SRAM arrays. Transistors in the remainder of the cache circuit have a lower threshold voltage for faster switching speed. This dual-$V_t$ method decreases subthreshold leakage currents but increases the cell access time compared with an SRAM composed of fast, leaky transistors [1], [2]. Another method dynamically adjusts the effective size of the array by employing a circuit technique dubbed *gated-$V_{dd}$*. In this scheme, a low-leakage transistor is used to selectively shut off the power supply to a subset of SRAM cells [3]. Thus, the capacity of the array adjusts dynamically as the amount of active information in the cache changes throughout the duration of the program.

Manuscript received June 3, 2002; revised September 10, 2002. This work was supported in part by Intel and IBM, in part by the NSF CADRE Program, Grant EIA-9975286, and in part by a Grant from the Intel Research Council.

H. Hanson, M. S. Hrishikesh, and V. Agarwal are with the Electrical and Computer Engineering Department, University of Texas, Austin, TX 78712–3993 USA.

S. W. Keckler and D. Burger are with the Computer Architecture and Technology (CART) Laboratory, Computer Science Department, University of Texas, Austin, TX 78712–3993 USA (e-mail: cart@cs.utexas.edu).

Digital Object Identifier 10.1109/TVLSI.2003.812370

A third technique, multithreshold CMOS (*MTCMOS*), dynamically changes the threshold voltage by modulating the back-gate bias voltage [4], [5]. With this technique, memory cells can be placed into a low-leakage "sleep" mode yet still retain their state. Cells in the active mode are accessed at full speed, while accesses to cells in the sleep mode must wait until the cell has been awakened by adjusting the bias voltage. While the MTCMOS technique has been implemented for an entire SRAM [5], we examine this idea using fine-grain control of each cache line.

The fundamental circuits for leakage reduction have been introduced by other researchers; our contribution in this paper is to examine the energy/performance tradeoffs of these techniques applied to the memory hierarchy of a modern microprocessor. This paper is an extension of our prior work in [6] and is organized as follows. Section II introduces leakage current and its effects on cache energy. Section III describes the three methods for reducing leakage current in memory cells; Section IV explains our experimental methodology. Results of the experiments and a comparison of these techniques are presented in Section V. Section VI highlights relevant related work and is followed by concluding remarks in Section VII.

## II. LEAKAGE CURRENT

Power consumption in a digital integrated circuit is governed by

$$P = \alpha C V^2 f + I_{\text{off}} V \tag{1}$$

where $\alpha$ is the average switching activity factor of the transistors, $C$ is capacitance, $V$ is the power-supply voltage, $f$ is the clock frequency, and $I_{\text{off}}$ is the leakage current. The first term of the equation is dynamic power and the second term is static power. Smaller feature sizes in each generation of silicon process technologies have been accompanied by reduced power supply voltages that have helped mitigate the impact of increased transistor counts and higher clock frequencies on dynamic power. However, as the power-supply voltage decreases, threshold voltages of the transistors must also decrease to achieve fast switching speeds and sufficient noise margins. Subthreshold leakage current $I_{\text{off}}$ is dependent on temperature $T$ and transistor threshold voltage $V_t$, illustrated by the following relation:

$$I_{\text{off}} \propto e^{(-V_t/T)}. \tag{2}$$

Thus, lower-threshold voltages lead to increased subthreshold leakage current and increased static power [7]. Most previous efforts at power reduction have focused on dynamic power sources because static power due to leakage current has been a small fraction of the total power dissipated by a chip. However, as transistor threshold voltages are reduced, subthreshold leakage current increases dramatically. Fig. 1 shows estimated static power consumption due to leakage current in large secondary caches through five technology generations. In this chart, cache capacities are scaled from 1 MB to 16 MB, reflecting high-performance microprocessor cache sizes projected by [8]. Four leakage-current scaling models are charted: a linear projection from [9] for 180–100 nm that is extended to the 50-nm node, two experimental leakage models based on our SPICE models for high $V_t$ (low leakage) and low $V_t$ (high performance) devices, and a projection based on the static power estimates for high-performance transistors from [10]. In these models, supply voltages are scaled from 1.6 V down to 0.6 V across the technology generations. The high-performance roadmap projection is charted for 25 °C, while the other projections reflect a circuit temperature of 110 °C. Note that due to the exponential dependence on temperature, leakage current from the roadmap model would be higher if it were also plotted for 110 °C. While estimates of leakage current vary due to different scaling assumptions, each projection shows that if left unchecked, leakage current and static power will increase as feature sizes and threshold voltages decrease.

Fig. 1. Projected leakage power of level-2 caches through technology generations.
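As an illustration of how the two terms of (1) respond to supply scaling, the following Python sketch evaluates the dynamic and static components separately. The function name `power_components` and every numeric input are hypothetical values chosen for readability; they are not taken from the projections in Fig. 1.

```python
def power_components(alpha, capacitance_f, vdd_v, freq_hz, i_off_a):
    """Return (dynamic_w, static_w) following Eq. (1): P = a*C*V^2*f + I_off*V."""
    dynamic_w = alpha * capacitance_f * vdd_v ** 2 * freq_hz
    static_w = i_off_a * vdd_v
    return dynamic_w, static_w


# Hypothetical operating point (illustrative numbers only).
dyn_hi, stat_hi = power_components(0.1, 1e-9, 1.0, 1e9, 0.5)

# Halving the supply voltage at a fixed leakage current quarters the dynamic
# term but only halves the static term.
dyn_lo, stat_lo = power_components(0.1, 1e-9, 0.5, 1e9, 0.5)

print(dyn_hi / dyn_lo, stat_hi / stat_lo)
```

In practice the leakage current does not stay fixed when the supply voltage is lowered, because the threshold voltage is lowered with it, so the static term shrinks even less than this sketch suggests and may grow.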

## III. LEAKAGE REDUCTION TECHNIQUES

This section describes our implementation of each leakage reduction strategy and our experimental methodology to simulate each technique applied to the level-1 instruction cache (IL1), level-1 data cache (DL1), and level-2 cache (L2). Table I summarizes the primary advantages and disadvantages of the three techniques for reducing leakage energy.

### A. Static Threshold Selection: Dual-$V_t$

The dual-$V_t$ technique employs transistors with higher threshold voltages in memory cells and faster, leakier transistors elsewhere within the SRAM. This technique requires no additional control circuitry and can substantially reduce the leakage current when compared to low-$V_t$ devices. The amount of leakage current is engineered at design time, rather than controlled dynamically during operation. No data are discarded and no additional cache misses are incurred. However, high-$V_t$ transistors have slower switching speeds and lower current drive. In our experiments, we consider an additional cycle of access time for SRAMs composed of these high-threshold devices.

TABLE I

SUMMARY OF LEAKAGE REDUCTION TECHNIQUES

| Technique      | Benefit                    | Detriment                     |
|----------------|----------------------------|-------------------------------|
| Dual-$V_t$     | no additional circuitry    | each read access is slower    |
| Gated-$V_{dd}$ | simple circuit             | additional cache misses       |
| MTCMOS         | no additional cache misses | complex circuitry with diodes |

Fig. 2. Gated-$V_{dd}$ and MTCMOS SRAM cell schematics.

### B. Power Supply Switching: Gated-$V_{dd}$

The gated-$V_{dd}$ technique interposes a high-threshold transistor between the circuit and one of the power supply rails. This study uses an n-channel transistor (NFET) as the control mechanism to take advantage of the greater current reduction from the stacking effect of the NFETs in the SRAM cell and bitline pass gates [3]. The left circuit in Fig. 2 shows the schematic of a gated-$V_{dd}$ SRAM cell with an NFET selectively connecting the cell to the ground rail. When the *active* signal is asserted, the SRAM cell operates normally, but when *active* is deasserted, the cell is disconnected from the ground and the state contained within the cell is lost. The activation transistor and the control mechanism for *active* can be shared by all cells within a cache line to minimize the extra area needed by the control transistor. We assume that this power supply gating transistor is sized so that the increase in memory array access time is negligible.

### C. Dynamic Threshold Modulation: MTCMOS

Leakage current may also be reduced by dynamically raising the transistor threshold voltage, typically by modulating the back-gate bias voltage. A technique amenable to fine-grain control is the auto-backgate-controlled multithreshold-CMOS (which we will refer to as MTCMOS), as shown in the right circuit of Fig. 2 [4], [5]. During normal operation, when *sleep* is deasserted, the SRAM is connected to $V_{dd}$ and ground, and back-gate voltages are set to the appropriate power rails. When *sleep* is asserted, the p-channel transistor (pFET) wells are biased using an alternative power supply voltage, $V_{dd+}$, at a higher voltage level than the source terminals. Increasing the negative source-substrate voltage potential increases the effective threshold voltage for the pFETs. Diodes allow the voltage levels of source terminals of the NFETs to increase by two diode drop voltages while the NFET well remains at ground, increasing the source-substrate voltage potential and raising the effective $V_t$ for the NFETs. Thus, all transistors experience higher threshold voltages and a corresponding drop in leakage current. As with gated-$V_{dd}$, we assume that any increase in memory array access time is negligible while *sleep* is not asserted.

The advantage of adjusting the threshold voltage dynamically, rather than gating the power supply, is that memory cell values are preserved during sleep mode, so there are no additional cache misses caused by accessing a line in the low-power mode. This technique provides an opportunity to reduce static power consumption without incurring the cost in time and energy to retrieve data from another level of the memory hierarchy. The disadvantages of MTCMOS include an additional power-supply voltage that must be distributed throughout the array, larger electric fields placed across the transistor gates during sleep mode that may adversely affect reliability, and a latency penalty to awaken a line that is in the sleep mode before the data can be accessed.

### D. Decay Intervals

Energy-saving techniques such as gated- $V_{dd}$  and MTCMOS that disable cache lines rely on two properties of the data stored in caches. First, only a small fraction of the information in the cache is *live*, meaning that it will be referenced again before being replaced or over-written. In our experiments, we found that only 1%–30% of a 2 MB level-2 cache holds live data, depending on the application. Even in level-1 caches, less than half of the cache contains useful data across our benchmark suite. Second, most lines that will be reused are accessed within a relatively short time interval.

Cache lines containing information that is either not useful or will not be accessed for a long time can be put into an idle, low-leakage mode to save energy without a significant effect on processor performance. We determine which lines to place in an idle mode in the gated- $V_{dd}$  and MTCMOS methods by measuring inter-access times, similar to Kaxiras *et al.* [11], [12] who proposed low-frequency counters to measure the time since last reference for every cache line. A read or write to a cache line resets its counter; when the counter reaches its maximum value after a duration named the *decay interval*, the line is deactivated.
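The counter-based policy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulator modification used in the experiments: the class name `DecayLine`, its fields, and the one-tick-per-call granularity are hypothetical.

```python
class DecayLine:
    """One cache line with an idle counter (hypothetical model of decay)."""

    def __init__(self, decay_interval):
        self.decay_interval = decay_interval  # counter ticks until deactivation
        self.counter = 0                      # ticks since the last reference
        self.active = True
        self.transitions = 0                  # number of active/idle switches

    def access(self):
        """A read or write resets the counter; returns True if the line was idle."""
        was_idle = not self.active
        if was_idle:
            # Gated-Vdd: this access is a miss (contents were lost).
            # MTCMOS: this access pays a wake-up delay instead.
            self.active = True
            self.transitions += 1
        self.counter = 0
        return was_idle

    def tick(self):
        """Advance time by one counter tick; deactivate when the interval expires."""
        if not self.active:
            return
        self.counter += 1
        if self.counter >= self.decay_interval:
            self.active = False
            self.transitions += 1


line = DecayLine(decay_interval=4)
for _ in range(3):
    line.tick()
print(line.active)        # still active: only 3 of 4 ticks elapsed
line.access()             # the reference resets the counter
for _ in range(4):
    line.tick()
print(line.active)        # deactivated after 4 idle ticks
print(line.access())      # an access to the idle line reports it was idle
```

The `transitions` count is the quantity that a per-transition switching energy would be multiplied by when the energy of the policy is tallied.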

## IV. EXPERIMENTAL METHODOLOGY

To evaluate the effectiveness of the leakage-reduction techniques, we modified a version of the SimpleScalar simulator [13]. We added the capability to discard cache lines or put them to sleep after a specified decay interval had passed since the last access to the cache line.

### A. Simulation Methodology

Our benchmark suite for this study consists of five SPEC2000 benchmarks that represent a range of cache usage characteristics: *gcc, eon, equake, mcf,* and *vpr*. The benchmarks are compiled for the Alpha instruction set. The simulation execution core is configured as a 4-wide superscalar pipeline organization roughly comparable to the Compaq Alpha 21264. The memory hierarchy consists of a 64 kB, two-way set associative level-1 instruction cache with a single-cycle hit latency, a 64 kB, two-way set associative level-1 data cache with a three-cycle hit latency, and a unified 2-MB four-way level-2 cache with a 12-cycle hit latency. The level-1 caches have cache line sizes of 64 B, and the level-2 cache line size is 128 B. In the gated-$V_{dd}$ and MTCMOS techniques, data bits may be placed into an idle mode, while cache tags are kept in the active state to provide fast lookup times.

In each experiment, we applied a leakage reduction technique to one cache and simulated benchmark execution with our modified SimpleScalar simulator. The simulations executed 1 billion instructions after fast-forwarding through the first 500 million instructions. We measured instructions-per-cycle (IPC), active and inactive durations for each cache line, the number of hits and misses in each level of the hierarchy, and the number of times any cache line is enabled or disabled. For gated- $V_{dd}$ , disabling a cache line is equivalent to switching off the power supply, while for MTCMOS, it is equivalent to placing the cache line into sleep mode. We calculated the total energy by multiplying these measured quantities by the relevant static and dynamic energy parameters described below and summing the energy consumed by individual components of the system.

### B. Energy Parameters

Leakage currents and energy values were measured with the HSPICE circuit simulator. Physical parameters used in this study originally targeted a 70-nm process and corresponding clock rate of 16 fanout-of-four inverter delays [14]. With information now available in [10], the process parameters used in this study are more closely aligned with 100-nm technology parameters. We retained the original data, and have renamed the technology generation to reflect industrial trends.

TABLE II

EXPERIMENTAL PARAMETERS FOR ENERGY CALCULATIONS

| Technique      | Clock Rate (GHz) | $V_{dd}$ (V) | $I_{\max}$ (nA) | $I_{\min}$ (nA) | $E_{\text{switch}}$ (fJ) | $E_{IL1}$ (nJ) | $E_{DL1}$ (nJ) | $E_{L2}$ (nJ) | $E_{\text{pins}}$ (nJ) |
|----------------|------------------|--------------|-----------------|-----------------|--------------------------|----------------|----------------|---------------|------------------------|
| Baseline       | 2.5              | 0.75         | 1941            | -               | -                        | 0.07           | 0.07           | 4.5           | 0.9                    |
| Dual-$V_t$     | 2.5              | 0.75         | -               | 26              | -                        | 0.07           | 0.07           | 4.5           | 0.9                    |
| Gated-$V_{dd}$ | 2.5              | 0.75         | 1939            | 9.7             | 0.35                     | 0.07           | 0.07           | 4.5           | 0.9                    |
| MTCMOS         | 2.5              | 0.75         | 1941            | 12              | 50                       | 0.07           | 0.07           | 4.5           | 0.9                    |

Clock rate and $V_{dd}$ are for the 100-nm technology; $I_{\max}$ and $I_{\min}$ are per-bit leakage currents at 110 °C; $E_{\text{switch}}$ is the per-bit transition energy; $E_{IL1}$, $E_{DL1}$, $E_{L2}$, and $E_{\text{pins}}$ are dynamic energies per cache access.

Table II summarizes the experimental parameters used in this study. In this table, $I_{\max}$ and $I_{\min}$ are leakage currents when SRAM cells are active and disabled, respectively. The SRAM cell circuit and Level 3 HSPICE transistor models are adapted from the cache tool CACTI 2.0 [15], with parameters scaled for the 100-nm technology generation. In each experiment, $V_t = 0.4$ V for high threshold voltage transistors and $V_t = 0.2$ V for low threshold voltage transistors. $E_{\text{switch}}$ approximates the energy required to switch the cell between active and inactive modes. $E_{IL1}$, $E_{DL1}$, and $E_{L2}$ represent the energy to read data from the level-1 instruction cache, level-1 data cache, and level-2 caches, respectively, based on a modified version of CACTI 2.0 [15] and our projected process parameters. We estimate the energy to drive the I/O pins with a simple model based on the following equation [16]:

$$E_{\rm pin} = 1.3 C_{\rm pin} V_{\rm pin}^2. \tag{3}$$

We set $C_{\text{pin}} = 10$ pF, according to the multichip module estimates in [16], and use a value for the pin supply voltage of $V_{\text{pin}} = 1.5$ V [17]. With a 32-bit address bus, this results in an energy cost of 0.9 nJ per off-chip access. We account only for the pin energy that is expended in driving the address to the pins of the CPU, and not energy expended to receive data.
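The arithmetic behind the 0.9-nJ figure can be checked with a short Python sketch that applies (3) to each of the 32 address pins. The function name `pin_energy_j` is a hypothetical helper introduced here; the parameter values are the ones quoted in the text.

```python
def pin_energy_j(c_pin_f, v_pin_v, n_pins):
    """Energy to drive n_pins once, using E_pin = 1.3 * C_pin * V_pin^2 per pin."""
    return n_pins * 1.3 * c_pin_f * v_pin_v ** 2


# C_pin = 10 pF, V_pin = 1.5 V, 32-bit address bus (values from the text).
e_access = pin_energy_j(c_pin_f=10e-12, v_pin_v=1.5, n_pins=32)
print(f"{e_access * 1e9:.2f} nJ per off-chip access")  # prints 0.94 nJ per off-chip access
```

Each pin costs about 29 pJ, and 32 pins give roughly 0.94 nJ, which the text rounds to 0.9 nJ.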

The total dynamic energy is calculated as the number of cache accesses multiplied by the appropriate energy per access parameter, plus the number of transitions into and out of idle mode multiplied by the energy per transition (for MTCMOS and gated- $V_{dd}$  techniques). To compute the dynamic energy expended in cache accesses, we make the following approximations:

- Level-1 cache miss energy is equal to two cache hit accesses, one to detect the miss and one to load new data.
- Level-2 cache miss energy is equal to two cache hit accesses plus the energy to drive an address to 32 address pins for off-chip memory.
- Power consumed outside the CPU chip is not included in this study.

Static energy is computed as the product of static power per cycle and the number of cycles of program execution. In this paper, we focus only on the leakage in the cache memory arrays; this approximation neglects the leakage current due to the small fraction of transistors in the peripheral circuitry. The total energy is the sum of dynamic and static energy calculations.
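The energy accounting described in this subsection can be sketched in Python as follows. This is an illustration under stated assumptions rather than the accounting code used in the study: the function names are hypothetical, the per-bit transition energy is assumed to scale with the number of bits in a line, 1 MB is taken as $2^{20}$ bytes, and the cycle and access counts in the example are invented.

```python
def dynamic_energy_j(hits, misses, e_access_j, e_pins_j=0.0,
                     transitions=0, bits_per_line=0, e_switch_j=0.0):
    """Dynamic energy of one cache under the approximations listed above."""
    # A miss costs two hit-equivalent accesses; pass e_pins_j > 0 for the
    # level-2 cache to add the off-chip address drive energy per miss.
    access_j = hits * e_access_j + misses * (2.0 * e_access_j + e_pins_j)
    # Assumption: the per-bit transition energy scales with the line size.
    switch_j = transitions * bits_per_line * e_switch_j
    return access_j + switch_j


def static_energy_j(active_bit_cycles, idle_bit_cycles,
                    i_max_a, i_min_a, vdd_v, clock_hz):
    """Leakage energy of the data array: per-bit I*V summed over cycles."""
    seconds_per_cycle = 1.0 / clock_hz
    amp_cycles = active_bit_cycles * i_max_a + idle_bit_cycles * i_min_a
    return vdd_v * amp_cycles * seconds_per_cycle


# Baseline 2-MB level-2 data array, always active, Table II parameters.
L2_BITS = 2 * 2**20 * 8      # assumes 1 MB = 2**20 bytes
CYCLES = 600_000_000         # hypothetical run length
e_static = static_energy_j(L2_BITS * CYCLES, 0, 1941e-9, 0.0, 0.75, 2.5e9)
e_dynamic = dynamic_energy_j(hits=1_000_000, misses=10_000,
                             e_access_j=4.5e-9, e_pins_j=0.9e-9)
print(f"static {e_static:.2f} J, dynamic {e_dynamic:.4f} J")
```

With these inputs the static term is several joules while the dynamic term is a few millijoules, which shows why leakage dominates the energy of a large, infrequently accessed cache.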

Energy consumption and performance of the leakage-reduction techniques are compared to a baseline case to evaluate the experimental techniques' effectiveness in static energy reduction and performance. Implementation details specific to this baseline and the experimental techniques are outlined below.

*Baseline:* The baseline for comparison in this study is a high-performance cache without leakage current control. Each transistor in the SRAM cell has a threshold voltage of 0.2 V, with a high leakage current of $I_{\max}$ at all times. The baseline case has the maximum performance and maximum energy consumption for the set of experiments.

*Dual-$V_t$:* Though the dual-$V_t$ technique has low-leakage transistors in memory cells and high-leakage transistors elsewhere, we account for static energy only in the memory array, and thus only use the reduced-leakage current, $I_{\min}$. The dual-$V_t$ technique does not transition between idle and active states and thus does not incur extra cache misses or additional time to access sleeping cells.

*Gated-$V_{dd}$:* For the gated-$V_{dd}$ technique, $I_{\max}$ is the leakage current when the memory cell is in the active state, and $I_{\min}$ is the leakage current when the memory cell is disconnected from the power supplies. The gating transistor has a high threshold voltage of 0.4 V, and the other SRAM cell transistors' threshold voltages are the low-$V_t$ value of 0.2 V. The value of $E_{\text{switch}}$ is based on the gate capacitance of the activation transistor and the wire capacitance to reach all of the cells in the cache line. Only "clean" lines that do not require a write back to the memory hierarchy are disabled; "dirty" lines that are not accessed before the decay interval expires are kept in the active state.

*MTCMOS:* The circuit design for the MTCMOS technique is adapted from [4]. In our example, the leakage current for MTCMOS SRAM arrays is controlled on the granularity of a cache line rather than the full cache. The transistors in our SRAM cells have a $V_t$ of 0.2 V, and the total voltage drop across the diodes is 3.2 V. The second power supply, $V_{dd+}$, is 3.3 V. $I_{\max}$ is the leakage current when the memory cell is awake, and $I_{\min}$ is the leakage current when the cells have transitioned into sleep mode. $E_{\text{switch}}$ is the energy required to charge the cache line's well plus the energy consumed to discharge the source terminals of the NFETs. The time and energy to enter and exit sleep mode depend directly on the effective capacitance of the well that contains the pFETs in the SRAM cell; in this study, we vary the delay to awaken a sleeping cache line from 1 to 10 cycles to examine the sensitivity to wakeup latency.

## V. RESULTS

This section presents our experimental results and compares tradeoffs between performance and energy reduction for three leakage-reduction techniques. We analyze each technique's energy-saving potential and impact on performance using the combined energy-delay metric. Then, we explore the effects of additional cache access latency due to each leakage reduction technique.

### A. Energy-Delay

We use a metric of the energy-delay product to balance the benefits of lower leakage with the potential penalty of reduced performance. We calculate the energy-delay product as the total energy divided by IPC, which is equivalent to the product of energy and a measure of time (cycles per instruction, with a fixed number of instructions).
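As a small illustration of why the metric divides by IPC, the Python sketch below compares two configurations with invented numbers. The function name `energy_delay` and both data points are hypothetical and are not drawn from the experiments.

```python
def energy_delay(total_energy_j, ipc):
    """Energy-delay product for a fixed instruction count: E / IPC."""
    return total_energy_j / ipc


# Hypothetical: a technique saves 25% of the energy but halves the IPC.
baseline = energy_delay(total_energy_j=4.0, ipc=2.0)
technique = energy_delay(total_energy_j=3.0, ipc=1.0)
print(baseline, technique)
```

Here the technique uses less energy yet has the larger (worse) energy-delay product, because the lost performance outweighs the energy saved.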

To evaluate the gated-$V_{dd}$ and MTCMOS strategies, we observed each technique's performance throughout a range of decay intervals, and chose intervals that resulted in the minimum energy-delay product. The best-case decay interval depends upon program cache access patterns and circuit parameters unique to each leakage-reduction technique [18]. In our study, the best decay interval for the gated-$V_{dd}$ technique was found to be 64K cycles for each cache. For the MTCMOS technique, the best decay interval is 8K cycles for the level-1 instruction cache, 1K cycles for the level-1 data cache, and immediate sleep mode (zero-cycle decay interval) for the level-2 cache. Table III summarizes the experimental results, reported as the harmonic mean of IPC, energy, and energy-delay product for simulated program execution across the benchmark suite.

TABLE III

SUMMARY OF EXPERIMENTAL RESULTS: HARMONIC MEAN ACROSS BENCHMARK SUITE

| Cache                     | Technique      | Decay Interval (cycles) | IPC   | Total Energy (J) | Dynamic Energy (J) | Leakage Energy (J) | Energy-Delay (E/IPC) |
|---------------------------|----------------|-------------------------|-------|------------------|--------------------|--------------------|----------------------|
| Level-1 Instruction Cache | Baseline       | -                       | 1.645 | 4.688            | 4.539              | 0.141              | 2.663                |
| Level-1 Instruction Cache | Dual-$V_t$     | -                       | 0.680 | 4.525            | 4.520              | 0.005              | 6.181                |
| Level-1 Instruction Cache | Gated-$V_{dd}$ | 64K                     | 1.641 | 4.584            | 4.539              | 0.039              | 2.613                |
| Level-1 Instruction Cache | MTCMOS         | 8K                      | 1.644 | 4.580            | 4.539              | 0.035              | 2.607                |
| Level-1 Data Cache        | Baseline       | -                       | 1.645 | 1.679            | 1.530              | 0.141              | 0.942                |
| Level-1 Data Cache        | Dual-$V_t$     | -                       | 1.540 | 1.520            | 1.518              | 0.002              | 0.898                |
| Level-1 Data Cache        | Gated-$V_{dd}$ | 64K                     | 1.643 | 1.571            | 1.531              | 0.030              | 0.885                |
| Level-1 Data Cache        | MTCMOS         | 1K                      | 1.639 | 1.547            | 1.530              | 0.017              | 0.874                |
| Level-2 Unified Cache     | Baseline       | -                       | 1.645 | 4.540            | 0.004              | 4.513              | 2.424                |
| Level-2 Unified Cache     | Dual-$V_t$     | -                       | 1.625 | 0.084            | 0.004              | 0.061              | 0.042                |
| Level-2 Unified Cache     | Gated-$V_{dd}$ | 64K                     | 1.386 | 0.239            | 0.005              | 0.225              | 0.112                |
| Level-2 Unified Cache     | MTCMOS         | 0                       | 1.626 | 0.140            | 0.004              | 0.115              | 0.072                |
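The selection of a best decay interval can be sketched in Python as a search over candidate intervals for the smallest energy-delay product. The text does not state whether the minimum was taken per benchmark or over the suite-wide mean, so this sketch assumes the harmonic mean across benchmarks is minimized; the function name `best_decay_interval` and the sample data are hypothetical.

```python
from statistics import harmonic_mean


def best_decay_interval(results):
    """Pick the interval whose harmonic-mean energy-delay product is smallest.

    results maps a decay interval (cycles) to a list of (energy_j, ipc)
    pairs, one pair per benchmark.
    """
    def mean_energy_delay(interval):
        return harmonic_mean([e / ipc for e, ipc in results[interval]])

    return min(results, key=mean_energy_delay)


# Invented measurements for two benchmarks at two candidate intervals.
sample = {
    1024: [(2.0, 1.0), (4.0, 2.0)],    # energy-delay 2.0 and 2.0
    65536: [(1.5, 1.0), (3.0, 2.0)],   # energy-delay 1.5 and 1.5
}
print(best_decay_interval(sample))
```

A longer interval wins in this invented data set because it lowers energy without reducing IPC; with real measurements the balance depends on how many extra misses or wake-ups each interval causes.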

Fig. 3 shows the total energy required for program execution for each leakage-reduction technique applied independently to one cache. The charts present data from the best decay interval in the gated- $V_{dd}$  and MTCMOS techniques. In the figures of the left column, stacked bar charts illustrate the contribution of static and dynamic energy for each benchmark. Note that in the level-1 caches, the majority of energy consumption is due to dynamic energy, whereas in level-2 caches, static energy dominates the total energy. Charts in the right column of Fig. 3 show the energy-delay product for each benchmark and highlight the variation between techniques. Each of the three leakage-reduction methods in this study achieves lower-leakage energy compared to the baseline case with high-performance SRAM cells but sacrifices performance to do so, whether by slowing cache accesses or causing delays to refetch data.

*Dual-$V_t$:* The dual-$V_t$ cache is effective at reducing leakage; however, with an extra cycle of delay, the technique has a negative effect on performance for level-1 caches. The dual-$V_t$ technique reduces the static energy consumed by the IL1 cache by 96%, at the expense of reducing the IPC by over half. The energy-delay product of the dual-$V_t$ technique is more than twice that of the IL1 baseline case. Although the leakage current and, therefore, static energy is reduced, the performance penalty may be unacceptable for a dual-$V_t$ method applied to an instruction cache or other structures that rely on fast access times. The dual-$V_t$ DL1 cache reduces static energy by 98%, with an energy-delay product that is 4% better than the baseline case. In the level-2 dual-$V_t$ cache experiment, static energy decreases by 98% with negligible performance degradation and the energy-delay product improves by over a factor of 50.

*Gated-$V_{dd}$:* With gated-$V_{dd}$, static energy savings are offset by the dynamic energy required to service additional misses to prematurely disabled cache lines. The total energy of the frequently accessed primary caches is dominated by dynamic energy of read accesses, and despite substantial static energy savings, the energy-delay product is only slightly better than the baseline case. The gated-$V_{dd}$ technique applied to an IL1 with a 64K decay interval produces a 72% static energy savings, with a 2% improvement in energy-delay compared with the baseline. In the level-1 data cache, the technique had similar results: 79% reduction in static energy, with a 6% improvement in the energy-delay product. In the level-2 cache, the penalty for additional execution time creates a noticeable drop in IPC. However, the energy savings with the gated-$V_{dd}$ technique is 95%, for an overall effect of improving the energy-delay by a factor of 20.

*MTCMOS:* The MTCMOS level-1 instruction cache with an 8K decay interval reduces static energy by 75%, with an improvement in energy-delay of 2%. In the level-1 data cache, the MTCMOS technique with a 1K decay interval decreases static energy by 88%, while improving the energy-delay product by 7%. For the level-2 cache and an aggressive sleep policy, leakage current is dramatically reduced at the expense of a slightly lower IPC. The level-2 cache with MTCMOS circuitry and an immediate sleep mode reduces static energy by 97% and improves the energy-delay product by a factor of approximately 34.
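The percentages and improvement factors quoted for the level-2 cache follow from the leakage-energy and energy-delay columns of Table III. The Python sketch below reproduces that arithmetic; the helper names `static_savings_pct` and `energy_delay_gain` are hypothetical, and the numbers are copied from the level-2 rows of the table.

```python
def static_savings_pct(baseline_leak_j, technique_leak_j):
    """Percent reduction in leakage energy relative to the baseline."""
    return 100.0 * (1.0 - technique_leak_j / baseline_leak_j)


def energy_delay_gain(baseline_ed, technique_ed):
    """Factor by which the energy-delay product improves over the baseline."""
    return baseline_ed / technique_ed


# Level-2 rows of Table III: (leakage energy in J, energy-delay product).
BASELINE_L2 = (4.513, 2.424)
L2_RESULTS = {
    "dual-Vt": (0.061, 0.042),
    "gated-Vdd": (0.225, 0.112),
    "MTCMOS": (0.115, 0.072),
}

for name, (leak_j, ed) in L2_RESULTS.items():
    savings = static_savings_pct(BASELINE_L2[0], leak_j)
    gain = energy_delay_gain(BASELINE_L2[1], ed)
    print(f"{name}: {savings:.1f}% static savings, {gain:.1f}x energy-delay gain")
```

The computed values are consistent with the rounded figures in the text: roughly 98% and over 50 times for dual-$V_t$, 95% and about 20 times for gated-$V_{dd}$, and 97% and about 34 times for MTCMOS.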

### B. Sensitivity to Delay

Although leakage reduction techniques attempt to reduce static energy consumption, the performance penalties they can impose act in opposition to such savings and can reduce the techniques' effectiveness. In particular, if a program takes more time to complete with leakage reduction techniques enabled, then all remaining leaky components of the chip will leak for a longer period of time. In this section, we investigate the effects of additional latency on processor performance and static energy consumption. In dual-$V_t$ and gated-$V_{dd}$, delays are manifested in cache access time overhead, while the most interesting variable for MTCMOS is the time to wake a sleeping line.

Fig. 3. Energy and energy-delay product for L1 and L2 caches.
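The link between execution time and static energy can be written as a first-order relation. This is a simplifying sketch, not the model used in the experiments: it assumes a component with constant leakage power $P_{leak}$, a fixed instruction count $N$, and a fixed clock frequency $f$, all symbols introduced here for illustration.

$$E_{static} = P_{leak}\,T_{exec}, \qquad T_{exec} = \frac{N}{\mathrm{IPC}\cdot f} \quad\Longrightarrow\quad \frac{E'_{static}}{E_{static}} = \frac{\mathrm{IPC}}{\mathrm{IPC}'}$$

Under these assumptions, a component whose leakage power is unchanged consumes static energy in inverse proportion to IPC, so halving the IPC doubles its static energy.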

*Dual-$V_t$:* Cache access time for dual-$V_t$ can increase if the speed reduction of the higher threshold devices in the cache is significant. Likewise, the high-$V_t$ cutoff transistor implemented in a gated-$V_{dd}$ strategy could also increase overall cache access time. The increase in access latency can extend the execution time of the program and degrade performance. Graphs in the left column of Fig. 4 show the performance degradation for processors accessing dual-$V_t$ caches as the access latency is increased by one and two cycles. The IPC values are calculated as the harmonic mean of measured IPC results from all five benchmarks. Fig. 4(a) shows the IPC for the level-1 instruction cache drops from 1.65 to 0.41, a substantial 74% reduction in performance as the latency increases by two cycles. The processor is less sensitive to additional delays in the level-1 data cache, as illustrated in Fig. 4(b). The mean IPC values dip from 1.64 to 1.50, an average performance reduction of 4% when the level-1 data cache latency increases by two cycles. Fig. 4(c) shows that additional latency in the level-2 cache causes the least impact on performance, with an average of 2% decrease in IPC for two extra cycles of latency.

Fig. 4. IPC and energy sensitivity to access delay for level-1 and level-2 dual-$V_t$ caches.
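The harmonic-mean aggregation of IPC described above can be sketched as follows. The per-benchmark IPC values are hypothetical placeholders, not the measured results, and `mean_ipc_drop` is a helper introduced only for this illustration.

```python
from statistics import harmonic_mean

# Hypothetical per-benchmark IPC values for five benchmarks; the measured
# per-benchmark numbers are not reproduced here.
baseline_ipc = [2.0, 1.0, 2.0, 1.0, 2.0]
slowed_ipc = [1.0, 0.5, 1.0, 0.5, 1.0]


def mean_ipc_drop(before, after):
    """Relative drop in harmonic-mean IPC, as a fraction of the 'before' mean."""
    hm_before = harmonic_mean(before)
    hm_after = harmonic_mean(after)
    return (hm_before - hm_after) / hm_before


print(f"harmonic-mean IPC: {harmonic_mean(baseline_ipc):.3f} -> "
      f"{harmonic_mean(slowed_ipc):.3f}")
print(f"relative drop: {mean_ipc_drop(baseline_ipc, slowed_ipc):.1%}")
```

The harmonic mean is the usual choice for rates such as IPC: if every benchmark executes the same number of instructions, it equals total instructions divided by total cycles, so slow benchmarks are weighted by the time they actually take.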

The right column of Fig. 4 indicates how longer access times translate into increased static energy for individual program execution. In addition, the harmonic mean over the full benchmark suite is reported in this discussion on sensitivity trends. In the level-1 instruction cache, the mean static energy increases by 157% for one additional cycle and 387% for two additional cycles of level-1 instruction cache latency. Fig. 4(d) shows how each extra cycle of latency adds to static energy consumption for each program in the benchmark suite. The short bars in Fig. 4(e) indicate that static energy of the level-1 data cache is not as strongly affected by additional access latency. In the level-1 data cache, the static energy increases for one and two additional cycles of latency are 5% and 9%, respectively. The unified level-2 cache shows an overall 1% increase in static energy for each additional cycle of latency. Fig. 4(d) illustrates that the static energy consumption depends upon program behavior; the increase is more pronounced in the benchmarks *mcf* and *gcc* than in *equake*.

Fig. 5. IPC and energy sensitivity to access delay for level-1 and level-2 MTCMOS caches.

*MTCMOS:* While MTCMOS does not suffer from additional latency to access cache lines in an awake state, its effectiveness does depend on the speed at which cache lines can be reawakened. Additional clock cycles used to awaken sleeping cache lines can extend the program execution time, with the effect of reducing processor performance and increasing the static energy expended. The wakeup transition time is determined by the circuit configuration and physical parameters; this section explores the sensitivity of the MTCMOS technique applied to primary and secondary caches as the experimental wakeup penalty is varied from one to ten cycles. Results are reported as the harmonic mean of IPC and the harmonic mean of the static energy for program execution of all benchmarks in the suite.

Graphs in the left column of Fig. 5 show the combined effect of decay interval and wakeup latency on processor performance. In Fig. 5(a)–(c), the processor's performance is plotted as a function of the wakeup latency for four cache decay intervals: immediate sleep, 1K, 8K, and 64K processor cycles. Graphs in the right column of Fig. 5 show the static energy expended by the processor as a function of the wakeup latency for the same four decay intervals. Unlike the dual-$V_t$ scenario in which extra latency affects each cache access, MTCMOS caches incur extra latency only for accesses to sleeping cache lines.
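The interaction between decay interval and wakeup penalty can be sketched with a toy trace-driven model. This is an illustrative simplification, not the simulator used in the experiments: the function `simulate_decay`, the trace format, and the example trace are hypothetical, all lines are assumed to start awake and last touched at cycle 0, and stall cycles are counted but not fed back into the trace timestamps.

```python
def simulate_decay(trace, num_lines, decay_interval, wake_penalty, total_cycles):
    """Toy model of a per-line sleep policy.

    trace: (cycle, line_index) pairs sorted by cycle.
    A line falls asleep once it has been idle for more than decay_interval
    cycles; an access to a sleeping line costs wake_penalty extra cycles.
    Returns (stall_cycles, fraction of line-cycles spent asleep).
    """
    last_access = [0] * num_lines  # assumption: every line is touched at cycle 0
    wakeups = 0
    asleep_line_cycles = 0

    for cycle, line in trace:
        idle = cycle - last_access[line]
        if idle > decay_interval:
            # The line decayed before this access, so it must be woken up.
            wakeups += 1
            asleep_line_cycles += idle - decay_interval
        last_access[line] = cycle

    # Account for sleep time between each line's final access and the end.
    for prev in last_access:
        asleep_line_cycles += max(0, total_cycles - prev - decay_interval)

    stall_cycles = wakeups * wake_penalty
    asleep_fraction = asleep_line_cycles / (num_lines * total_cycles)
    return stall_cycles, asleep_fraction


# Hypothetical two-line cache observed for 10,000 cycles.
trace = [(10, 0), (20, 0), (5000, 0), (5010, 1)]
for interval in (0, 1024, 65536):  # immediate sleep, 1K, 64K
    stalls, asleep = simulate_decay(trace, 2, interval, 10, 10_000)
    print(f"decay={interval:>5}: stall cycles={stalls}, asleep={asleep:.2%}")
```

The sketch reproduces the qualitative tradeoff discussed here: a short decay interval keeps lines asleep for most of the run but pays the wakeup penalty on nearly every access, while a long interval avoids the penalty and saves little leakage.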

An MTCMOS level-1 instruction cache causes the largest performance degradation in IPC when short decay intervals with long wakeup latencies are employed, as illustrated in Fig. 5(a). For a level-1 instruction cache with an MTCMOS immediate sleep policy, the measured IPC drops by 93% when the wakeup penalty is ten cycles compared to a wakeup penalty of one cycle. For a larger decay interval of 64K cycles, when most useful cache lines are kept awake, the IPC is reduced by less than 1% when the wakeup penalty is increased from one to ten cycles. With a decay interval of 8K, the best-case interval in this study for MTCMOS level-1 instruction caches, the IPC is 1.35% lower for a ten-cycle wakeup time. Fig. 5(d) shows that an MTCMOS level-1 instruction cache with an immediate sleep mode uses 18 times more static energy with a wakeup penalty of ten cycles than with a one-cycle penalty. However, since dynamic energy dominates the total energy for the primary caches, the total level-1 instruction cache energy consumption increases by only 3%. With a decay interval of 64K, the program execution time is not noticeably affected, and the static energy is essentially unchanged.

The MTCMOS DL1 cache also causes performance degradation with short decay intervals. As Fig. 5(b) illustrates, an MTCMOS level-1 data cache with an immediate sleep policy causes an IPC drop of 31% from one-cycle to ten-cycle wakeup penalties. The extra execution time for this case leads to an additional 3 mJ of static energy, an 86% increase. Longer decay intervals, however, show only a slight decrease in performance, and the static energy shows more sensitivity to the decay interval than to extra latency, as seen in Fig. 5(e).

Since level-2 accesses are relatively infrequent, program execution time is only mildly extended due to waiting for sleeping level-2 cache lines to transition to the active mode. A zero-cycle decay interval leads to the largest IPC drop of 8%. With most lines in a low-leakage mode, additional processor cycles contribute only a small amount of extra leakage current. The largest static energy increase was 7% for the immediate-sleep policy. Fig. 5(e) shows that as the decay interval increases, the effect of additional latency decreases. Since static energy is the largest component of the total energy in the level-2 cache, the effect of increased static energy is an overall energy increase of 5% for the immediate-sleep configuration.

## VI. RELATED WORK

Leakage-reducing circuit techniques can be incorporated into architectural solutions that rely on the programs' use of system resources to reduce static energy. One example employs a gated-$V_{dd}$ circuit to selectively disable cache lines based on miss rates, dynamically resizing the instruction cache to a size appropriate for the currently executing program. Yang *et al.* found that, compared to a standard cache, this technique reduced the energy-delay product by 62% with a 4% increase in execution time on SPEC95 benchmarks [19].

Kaxiras *et al.* have developed improvements to the gated-$V_{dd}$ technique with an adaptive control on the gating transistor, and have shown that their technique can reduce leakage energy in level-1 caches by a factor of 5 [12]. Zhou *et al.* have proposed a low-leakage cache design named adaptive mode control that dynamically adjusts the number of cache lines turned off by the gated-$V_{dd}$ method throughout program execution to keep the number of extra cache misses caused by disabling cache lines proportional to the number of misses that would be incurred with a standard cache [20]. With adaptive mode control, a level-1 instruction cache with an average of 74% of the cache lines disabled and a level-1 data cache with an average of 50% of the cache lines disabled result in an IPC drop of less than 1.6%.

Recently, Flautner *et al.* introduced a technique that in principle is similar to the cache-line level control we introduce for MTCMOS [21]. Instead of modulating the back-gate bias, their drowsy caches modulate the power supply voltage to the cache's memory cells to reduce the voltage and, thus, the leakage current, when a cache line has not been accessed for a while. The advantages of this technique are that the circuit to control leakage is simpler and is likely to enable faster transitions into and out of the sleep mode. However, according to our estimates, MTCMOS can provide an additional order of magnitude reduction in leakage current. Thus, the technique of Flautner *et al.* is better suited for latency-critical caches while MTCMOS is better suited to leakage-critical caches.

## VII. CONCLUSION

In this paper, we have explored energy and performance tradeoffs associated with three techniques for reducing static energy consumption in on-chip caches: high- $V_t$  transistors in memory arrays, power supply switching, and dynamic transistor threshold modulation.

Each of the techniques is effective in reducing energy consumption in primary and secondary caches. We found that with careful selection of decay intervals, the MTCMOS and gated-$V_{dd}$ techniques yielded better energy-delay products than the dual-$V_t$ technique in the primary caches, due to their overall lower access time. With our assumptions, both the gated-$V_{dd}$ and MTCMOS techniques improve the energy-delay product by 2% in the level-1 instruction cache, and yield an improvement of 6% and 7%, respectively, in the level-1 data cache compared to the experimental baseline. The dual-$V_t$ technique improves the energy-delay product of the level-1 data cache by 4%, and degrades energy-delay product in the level-1 instruction cache. For the secondary cache, the dual-$V_t$ technique has the best energy-delay characteristics, with a 50-fold improvement compared to the baseline case. The gated-$V_{dd}$ and MTCMOS techniques were also effective at improving the energy-delay of level-2 caches, with overall reductions of factors of 20 and 34, respectively.

However, additional latency and energy penalties contributed by the leakage reduction strategy [18] can extend program execution time and increase static energy consumption, especially when applied to the primary instruction cache. Increasing the dual-$V_t$ level-1 instruction cache access by two extra cycles results in performance degradation of 74%, and a 387% increase in static energy expenditure. For an MTCMOS level-1 instruction cache with a zero-cycle decay interval, performance drops by 93% and static energy increases by a factor of 18 when the wakeup latency is ten cycles rather than one. In the level-1 data cache, the effect of additional access time was less detrimental. A dual-$V_t$ level-1 data cache with two additional cycles of access time reduces performance by 4% and increases static energy by 9%. An MTCMOS level-1 data cache with a ten-cycle wakeup latency causes performance to drop by 31% with the shortest decay interval; longer decay intervals do not suffer such performance degradation. The unified level-2 cache is the least sensitive to additional delays, with a 2% dip in IPC for the dual-$V_t$ level-2 cache accompanied by a 2% increase in static energy; an MTCMOS level-2 cache with the worst case of an immediate sleep policy causes an 8% reduction in IPC and a 7% increase in static energy consumed.

This paper has emphasized static energy reduction in cache memories while considering the effect on processor performance and total energy. The same principles may be applied to other hardware structures, as well. For example, the static energy required to maintain the state of branch predictor table entries may be balanced against the dynamic energy required to execute with fewer correct predictions. Future work will include static energy analysis of other microarchitectural features and their impact on microprocessor performance and total energy.

#### References

- [1] T. McPherson, R. Averill, D. Balazich, K. Barkley, S. Carey, Y. Chan, Y. H. Chan, R. Crea, A. Dansky, R. Dwyer, A. Haen, D. Hoffman, A. Jatkowski, M. Mayo, D. Merrill, T. McNamara, G. Northrop, J. Rawlins, L. Sigal, T. Slegel, and D. Webber, "760 MHz G6 S/390 microprocessor exploiting multiple V<sub>t</sub> and copper interconnects," in *Proc. Int. Solid-State Circuits Conf.*, 2000, pp. 96–97.
- [2] K. Roy, "Leakage power reduction in low-voltage CMOS designs," in *Proc. Int. Conf. Electron., Circuits Syst.*, 1998, pp. 167–173.
- [3] M. Powell, S. H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, "Gated-V<sub>dd</sub>: A circuit technique to reduce leakage in deep-submicron cache memories," in *Proc. Int. Symp. Low-Power Electron. Design*, 2000, pp. 90–95.
- [4] H. Makino, Y. Tujihashi, K. Nii, C. Morishima, Y. Hayakawa, T. Shimizu, and T. Arakawa, "An auto-backgate-controlled MT-CMOS circuit," in *Proc. Symp. VLSI Circuits*, 1998, pp. 42–43.
- [5] K. Nii, H. Makino, Y. Tujihashi, C. Morishima, Y. Hayakawa, H. Nunogami, T. Arakawa, and H. Hamano, "A low power SRAM using auto-backgate-controlled MT-CMOS," in *Proc. Int. Symp. Low-Power Electron. Design*, 1998, pp. 293–298.
- [6] H. Hanson, M. Hrishikesh, V. Agarwal, S. Keckler, and D. Burger, "Static energy reduction for microprocessor caches," in *Proc. Int. Conf. Comput. Design*, 2001.
- [7] J. A. Butts and G. Sohi, "A static power model for architects," in *Proc.* 33rd Annu. Int. Symp. Microarchitecture, Dec. 2000, pp. 191–201.
- [8] Int. Technol. Roadmap for Semiconductors, 2000 Update, Overall Technol. Roadmap Characteristics (2000). [Online]. Available: http://public.itrs.net/Files/2000UpdateFinal/ORTC2000final.pdf
- [9] S. Borkar, "Design challenges of technology scaling," *IEEE Micro.*, vol. 19, no. 4, pp. 23–29, July-Aug. 1999.
- [10] Int. Technol. Roadmap for Semiconductors, 2001 Edition (2001). [Online]. Available: http://public.itrs.net/Files/2001ITRS/Home.htm

- [11] S. Kaxiras, Z. Hu, G. Narlikar, and R. McLellan, "Cache-line decay: A mechanism to reduce cache leakage power," in *Proc. Workshop on Power Aware Comput. Syst.*, 2000.
- [12] S. Kaxiras, Z. Hu, and M. Martonosi, "Cache-line decay: Exploiting generational behavior to reduce leakage power," in *Proc. 28th Annu. Int. Symp. Comput. Architecture*, July 2001, pp. 240–251.
- [13] D. Burger and T. Austin, "The Simplescalar Tool Set Version 2.0," Comput. Sci. Dept., Univ. Wisconsin-Madison, 1997.
- [14] M. Horowitz, R. Ho, and K. Mai, "The future of wires," in Proc. Semiconductor Res. Corp. Workshop Interconnects SoC, May 1999.
- [15] G. Reinman and N. Jouppi. (1999) An integrated cache timing and power model. [Online]. Available: http://research.compaq.com/wrl/people/jouppi/cacti2.pdf
- [16] D. Liu and C. Svensson, "Power consumption estimation in CMOS VLSI chips," *IEEE J. Solid-State Circuits*, vol. 29, pp. 663–670, June 1994.
- [17] "Pentium III Processor for the sc242 at 450 MHz to 1.13 GHz," Intel Corp., Order Number 244 452-008, 2000.
- [18] H. Hanson, "Comparison of Leakage Energy Reduction Techniques," Univ. Texas, Austin, Comput. Sci. Dept., TR-01-18, 2001.
- [19] S. H. Yang, M. Powell, B. Falsafi, K. Roy, and T. N. Vijaykumar, "An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance caches," in *Proc. Int. Symp. High-Performance Comput. Architecture*, 2001, pp. 147–157.
- [20] H. Zhou, M. Toburen, E. Rotenberg, and T. Conte, "Adaptive mode-control: A static-power-efficient cache design," in *Proc. Int. Conf. Parallel Architectures and Compilation Techniques*, 2001, pp. 61–72.
- [21] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge, "Drowsy caches: Simple techniques for reducing leakage power," in *Proc. 29th Annu. Int. Symp. Comput. Architecture*, May 2002, pp. 148–157.



**Heather Hanson** (S'99) received the B.S. degree in electrical and computer engineering, the B.A. degree in liberal arts, and the M.S. degree in electrical and computer engineering from the University of Texas at Austin, in 1994 and 2001, respectively. She is currently working toward the Ph.D. degree at that same university, focusing on researching power and energy-efficient microprocessors.

She has worked at Logical Silicon Solutions as a Circuit Designer and at Intel Corporation as a Logic Designer in Austin, TX.



**M. S. Hrishikesh** (S'01) received the B.E. degree in electrical engineering from the University of Madras, Chennai, India, in 1997 and the M.S. degree from the University of Texas at Austin in 1999. He is currently working toward the Ph.D. degree at the University of Texas at Austin.

His research interests include the scalability of processors to very small feature sizes for high performance computing. He is currently investigating clustering mechanisms for very wide issue processors.



**Vikas Agarwal** received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, in 1996 and the M.S. degree in electrical and computer engineering from the University of Texas at Austin in 1998. He is currently working toward the Ph.D. degree at the University of Texas at Austin.

His research interests include modeling the effect of semiconductor technology scaling on microprocessor microarchitectural structures. He is currently investigating reliability issues of large on-chip cache structures.



**Stephen W. Keckler** (S'87–M'98) received the B.S. degree in electrical engineering from Stanford University, Palo Alto, CA, and the M.S. and Ph.D. degrees in computer science from the Massachusetts Institute of Technology, Cambridge.

Currently, he is an Assistant Professor of both computer science and electrical and computer engineering at the University of Texas at Austin, as well as an Alfred P. Sloan Research Fellow. His research interests include computer architecture, parallel and embedded processors, VLSI design, adaptive computing, and the relationship between technology and computer systems development. He is co-director of the Computer Architecture and Technology (CART) Laboratory, where his research is currently supported by a National Science Foundation CAREER award, an IBM University Partnership award, and grants from the National Science Foundation, Intel, IBM, and DARPA.

Dr. Keckler is a Member of Sigma Xi and Phi Beta Kappa.



**Doug Burger** (S'93–M'98) received the B.S. degree from Yale University, New Haven, CT, in 1991, and the Ph.D. degree in computer science from the University of Wisconsin-Madison.

Since 1999, he has been an Assistant Professor in computer science and electrical and computer engineering, University of Texas at Austin. He is Co-leader of the TRIPS project at UT-Austin, which is building the microprocessors for a new level of performance and flexibility across many application classes. His research interests include computer architecture, compilers, operating systems, and emerging technologies. In 2000, he was the recipient of the NSF CAREER Award. He is also an IBM Center for Advanced Studies Fellow, and a Sloan Foundation Research Fellow.