Reducing Cache Power with Low-Cost, Multi-bit
Error-Correcting Codes
Chris Wilkerson, Alaa R. Alameldeen, Zeshan Chishti,
Wei Wu, Dinesh Somasekhar, and Shih-Lien Lu
Intel Labs
Hillsboro, Oregon, USA
{chris.wilkerson, alaa.r.alameldeen, zeshan.a.chishti, wei.a.wu, dinesh.somasekhar, shih-lien.l.lu} @intel.com
ABSTRACT
Technology advancements have enabled the integration of large
on-die embedded DRAM (eDRAM) caches. eDRAM is
significantly denser than traditional SRAMs, but must be
periodically refreshed to retain data. Like SRAM, eDRAM is
susceptible to device variations, which play a role in determining
refresh time for eDRAM cells. Refresh power potentially
represents a large fraction of overall system power, particularly
during low-power states when the CPU is idle. Future designs
need to reduce cache power without incurring the high cost of
flushing cache data when entering low-power states.
In this paper, we show the significant impact of variations on
refresh time and cache power consumption for large eDRAM
caches. We propose Hi-ECC, a technique that incorporates multi-
bit error-correcting codes to significantly reduce refresh rate.
Multi-bit error-correcting codes usually have a complex decoder
design and high storage cost. Hi-ECC avoids the decoder
complexity by using strong ECC codes to identify and disable
sections of the cache with multi-bit failures, while providing
efficient single-bit error correction for the common case. Hi-ECC
includes additional optimizations that allow us to amortize the
storage cost of the code over large data words, providing the
benefit of multi-bit correction at the same storage cost as a single-bit
error-correcting (SECDED) code (2% overhead). Our proposal
achieves a 93% reduction in refresh power vs. a baseline eDRAM
cache without error correcting capability, and a 66% reduction in
refresh power vs. a system using SECDED codes.
Categories and Subject Descriptors
B.3.4 [Memory Structures]: Reliability, Testing, Fault-
Tolerance.
General Terms
Design, Reliability, Power.
Keywords
ECC, Multi-Bit ECC, DRAM, eDRAM, refresh power, Vccmin.
1. INTRODUCTION
Advances in technology scaling have led to dramatic yearly
improvements in on-die cache capacity. New process
technologies have also enabled integrating DRAM on a logic
process, leading to the use of embedded DRAM (eDRAM) to
build on-die caches that are much denser than SRAM-based
caches (e.g., IBM Power 7 [14]). However, a side effect of
technology scaling is the increasing susceptibility of cache
structures to device variations [1, 27], where a few weak cells can
constrain the operating range of the whole cache.
In traditional SRAM caches, intrinsic variations force operation at
high voltages due to a few weak cells that fail at lower voltages,
and impede efforts to reduce power [29, 33]. Likewise, in
eDRAM caches, device variations affect the retention time of
individual DRAM cells, with a few particularly weak bits
determining the refresh time of the whole cache. A high refresh
rate significantly increases cache power.
Reducing power consumption is a first-order design constraint for
modern processors. In pursuit of improved power and energy
efficiency, processors implement a number of idle states to
support lower power modes. Reducing the power consumed
during idle states is particularly important because the typical
CPU spends the vast majority of its time in idle state. Many
desktop applications, such as word processors and spreadsheets,
spend much of the time waiting for I/O and tend to require the
CPU to operate only 10-20% of the time during use. Studies done
on the Intel® Core™/Core™ 2 Duo show that an idling processor
will consume an average of 0.5W-1.05W [24] depending on the
processor and frequency of idle state exits caused by events like
OS interrupts. Our analysis projects that a future processor with
128MB of eDRAM cache will consume 926mW just refreshing
the eDRAM. Based on these power numbers, we project that the
power consumption of large memory structures, like eDRAM
caches, will be the biggest contributor to overall idle power.
One popular method to reduce cache power is to power-gate large
blocks of memory at the cost of losing its state [6]. But as cache
density increases, the performance and power costs of this
approach also increase. One of the important goals in
implementing idle states is reducing power consumption, while
minimizing the transition latency into and out of the idle state. In
the Intel® Core™ 2, a transition out of the C4 idle state can take
about 100-200us [9]. In contrast, if the state in a 128MB eDRAM
cache were sacrificed to save power, it would take 4.2ms to re-
fetch the 128MB of lost data assuming full usage of the 30GB/s
bandwidth provided by system memory. In future products with
denser eDRAM caches, navigating the tradeoffs between idle exit

latency and idle power consumption will become increasingly
difficult. As the capacity of embedded memory grows, the
performance and power cost of flushing this memory also grows.
A key challenge for future product designers is to enable flexible
memory structures that can operate at very low idle power,
without dramatically increasing transition latency to and from the
idle power state due to data loss.
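The 4.2ms re-fetch figure quoted above follows directly from the cache size and the memory bandwidth. A quick check (assuming binary units for both quantities):

```python
# Time to refill a flushed 128MB cache at the full 30GB/s of system
# memory bandwidth (binary units assumed for both figures).
cache_bytes = 128 * 2**20
bandwidth_bytes_per_s = 30 * 2**30
print(f"{1e3 * cache_bytes / bandwidth_bytes_per_s:.1f} ms")  # -> 4.2 ms
```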
In this paper, we evaluate a modern processor with a 128MB
eDRAM cache. We show that refresh time plays the key role in
determining the eDRAM’s power. We first explore the role of
variation-related cell failures in determining refresh time. We then
evaluate the potential for using error-correcting codes (ECC) to
mitigate refresh-related failures. Augmenting eDRAM with error-
correcting codes enables reliable cache operation with longer
refresh periods, thereby lowering system power.
We propose Hi-ECC, a practical, low-latency, low-cost, error-
correcting system that can compensate for high failure rates in
eDRAM caches. Hi-ECC implements a strong BCH code with the
ability to correct 5 and detect 6 errors (hereafter referred to as a
5EC6ED code). A traditional approach using strong ECC suffers
from two prohibitive overheads that limit its applicability. First,
building a low-latency decoder for multi-bit ECC codes is
extremely costly. Second, the storage overhead of ECC bits is
high (around 10% for a 5EC6ED ECC code for a 64 byte line).
Hi-ECC proposes architectural solutions to both problems. It uses
a simple ECC decoder optimized for the 99.5% of the lines that
require little or no correction, and provides a high latency
alternative for lines that require complex multi-bit correction. To
minimize the performance impact of processing high latency
multi-bit corrections, Hi-ECC disables lines with multi-bit
failures. Finally, Hi-ECC leverages the natural spatial locality of
the data to reduce the cost of ECC storage. We make the
following main contributions:
1. We demonstrate that device variations lead to significant
increases in cache refresh rates.
2. We propose Hi-ECC, a practical system for using strong
error correcting codes that avoids decoder complexity and
latency.
3. We show how Hi-ECC can be extended to reduce the
storage overhead of the error-correcting codes by amortizing the
cost of the code over larger data words. This allows implementing
Hi-ECC with a 2% storage overhead, comparable to that of a
single error correcting code (SECDED) over 64 byte lines.
4. For a system with a 128 MB eDRAM cache, we show that
Hi-ECC can reduce cache refresh power by 93% compared to an
eDRAM with no error correction capability, and 66% compared
to an eDRAM with SECDED, all for about the same storage
overhead as a SECDED code. When accounting for dynamic
power, an optimized Hi-ECC reduces total power by 61% relative
to SECDED.
The remainder of this paper is organized as follows. In Section 2,
we review some of the design tradeoffs for eDRAM caches,
including a discussion of retention failures and previous work on
mitigating them. Section 3 describes our proposed Hi-ECC
architecture, and is followed by a review of the mathematical
properties of BCH codes that Hi-ECC relies on in Section 4. We
describe our evaluation methodology and results in Section 5 and
conclude in Section 6.
2. BACKGROUND
Embedded DRAM technology enables smaller memory cells as
compared to SRAM cells, resulting in a three to four times
increase in memory density [21]. The higher density of eDRAM
makes it a promising candidate to replace SRAM as the last-level
on-chip cache in future high performance processors. IBM has
recently announced that its upcoming Power7 processor will use a
32 MB on-chip eDRAM cache [14]. As feature sizes continue to
decrease, even larger eDRAM caches can be incorporated on
chip. In this paper, we model a 128 MB eDRAM cache, two
technology generations ahead of IBM Power7’s eDRAM cache.
One of the main problems with eDRAM cells is that they lose
charge over time due to leakage currents. The retention time of an
eDRAM cell is defined as the length of time for which the cell
can retain its state. Cell retention time depends on the leakage
current, which, in turn, depends on the access device leakage. To
preserve the state of stored data, eDRAM cells need to be
refreshed on a periodic basis. In order to prevent failures, the
refresh period needs to be less than the cell retention time.
Because eDRAM uses fast logic transistors with a higher leakage
current than conventional DRAM, the refresh time for eDRAM is
about a thousand times shorter than conventional DRAM. For
example, Barth, et al. [2] report the refresh time of a small
eDRAM to be 40us as compared to 64ms refresh time in
commodity DRAM [22]. This low refresh time poses serious
problems for eDRAM because it not only increases the idle
power, but also leads to reduced availability.
Previous work has
shown that variations in threshold voltage cause retention times of
different DRAM cells to vary significantly [8, 10, 11, 18]. These
variations are caused predominantly by random dopant
fluctuations and manifest themselves as a random distribution of
retention times amongst eDRAM cells. We use the data published
in [18] to model the retention time distribution in Figure 1.
[Figure omitted: probability of failure (log scale, 1E-21 to 1E+00) vs. refresh time (0-500 usec) for the pfbit and pfCache curves; the 30us baseline refresh time is marked.]
Figure 1: eDRAM retention time distribution
The pfbit curve in Figure 1 represents the probability of a
retention failure in a single bit cell (derived from Figure 3 in [18])
and the pfCache curve represents the failure probability of a
128MB eDRAM cache for different refresh times. Our model
assumes that bit retention failures are distributed randomly
throughout the eDRAM cache, consistent with [18]. A cache
containing even a single failure must be discarded. Therefore, the

probability of failure for the entire cache is (1 – probability of
success), where the probability of success is the probability that
each bit in the cache stays failure-free. We assume that the
pfCache must be kept at less than 1 out of 1000 to achieve
acceptable manufacturing yields [33]. Under these assumptions,
data in Figure 1 shows that the refresh time for a baseline 128MB
eDRAM cache is 30 microseconds, close to the 40 microseconds
refresh time reported in [2] for a 13.5Mbit eDRAM macro.
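The pfCache curve is derived from pfbit under the independence assumption stated above. The following is a minimal sketch of that yield model, with pf_bit left as a placeholder rather than the measured data from [18]:

```python
import math

# Cache-level yield model: bit retention failures are assumed independent
# and uniformly distributed, and a cache containing even one failing bit
# must be discarded.
CACHE_BITS = 128 * 2**20 * 8  # 128MB of data bits

def pf_cache(pf_bit: float, n_bits: int = CACHE_BITS) -> float:
    """1 - P(every bit in the cache stays failure-free)."""
    return 1.0 - (1.0 - pf_bit) ** n_bits

# Largest per-bit failure probability meeting the 1-in-1000 yield target:
pf_bit_max = 1.0 - math.exp(math.log(1.0 - 1e-3) / CACHE_BITS)
print(f"pf_bit <= {pf_bit_max:.2e}")                   # ~9.3e-13
print(f"check: pfCache = {pf_cache(pf_bit_max):.2e}")  # ~1.0e-3
```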
Refresh mechanisms in eDRAM designs typically use a single
worst case refresh period dictated by the cell with the lowest
retention time. As eDRAM capacity increases in future
generations, eDRAM idle power, dominated by refresh, will
grow. Some previous papers have proposed hardware mechanisms
to exploit retention time variations by refreshing different DRAM
cells at different refresh rates [16, 26]. Venkatesan, et al. [32]
proposed a software mechanism that allocates DRAM pages with
longer retention time before allocating pages with shorter
retention time, and then chooses a refresh period that is
determined only by the populated pages instead of the entire
DRAM. These approaches require additional storage to track
retention times and rely on memory tests to identify marginal bit
cells. In [33], Wilkerson et al. propose the bit-fix algorithm,
another testing-based approach, to address the problem of high
failure rates in the context of Vccmin reduction in SRAM caches
instead of DRAM refresh time. Since test time grows
proportionately with the capacity of the memory being tested,
increasing cache capacities may limit the applicability of all
testing-based approaches.
Ghosh and Lee [10] recently proposed a SmartRefresh technique
to reduce refresh power by adding timeout counters in each
DRAM row and avoiding unnecessary refreshes for those rows
which were recently read or written. However, SmartRefresh is
ineffective during the idle mode when the cache is not being
accessed, and therefore does not improve idle power.
Another promising approach to increase DRAM refresh times is
the use of error-correcting codes (ECC) to dynamically identify
and repair bits that fail [8, 15]. This approach sets refresh time
irrespective of the weakest bits, using ECC to compensate for
failures. With this approach, a stronger error-correcting code, with
the ability to correct multiple bits, implies increased refresh time
and reduced power. However, strong ECC codes have a high
storage and complexity overhead, which limits their applicability.
In the following two sections, we propose an architectural
mechanism that uses strong ECC codes with a low storage and
complexity overhead.
3. STRONG ECC ARCHITECTURE
When designing a large eDRAM cache, a designer strives to
minimize eDRAM power consumption in the low-power
operating modes without penalizing performance in the normal
operating mode. To achieve this objective, we propose Hi-ECC,
which implements a multi-bit error-correcting code with very
small area, latency, and power overheads.
We propose a system with a large (128MB) eDRAM last level
cache. In a baseline configuration with no error correction
capability, the time between refreshes for such a cache will be 30
microseconds, leading to a significant amount of power consumed
even when the processor is idle. Refresh power can be reduced by
flushing and power gating the cache during the low-power
operating modes. This, however, causes a significant performance
penalty when the processor wakes up from the idle mode since it
will need to reload the cache, thereby incurring a large number of
cold start misses. Alternatively, we can lower refresh power
consumption by decreasing the refresh frequency (i.e., increasing
time between refreshes). However, as we show in Figure 1,
decreasing refresh frequency implies the need to tolerate a higher
number of failures for each cache line. Implementing a strong
error-correcting code with the capability to correct multiple errors
is necessary to achieve this goal.
At the core of Hi-ECC is a strong 5EC6ED (five bit correction,
six bit detection) BCH code. We explain implementation details
for BCH codes in Section 4. Traditional implementations of a
5EC6ED BCH code would suffer from two key drawbacks: high
storage overhead for the code itself, and high decoder complexity
and latency. In this section, we describe how Hi-ECC addresses
both of these drawbacks. Since our implementation requires
architectural changes that would increase dynamic power, we also
propose an architectural optimization to lower the impact on the
cache dynamic power.
3.1 Reducing Storage Overhead
The storage required for a 5EC6ED code for a 64B cache line is
51 bits, a 10% overhead. Since the cache occupies a large portion
of the die area (50% or higher), augmenting the eDRAM cache
with 5EC6ED code will significantly increase the die area and
cost. In contrast, a single error correcting, double error detecting
(SECDED) code for a 64B line requires 11 bits, an overhead of
around 2%. Our goal is to implement the 5EC6ED code with the
same storage overhead as the SECDED code.
To achieve this goal, we leverage two important properties. First,
the size of an ECC code relative to that of the data word
diminishes as the size of the data word grows, as we show in
Section 4.3. While a SECDED code for a 64B line has an 11-bit
overhead (2%), a SECDED code for a 1KB line has a 15-bit
overhead (0.18%). Second, the efficacy of a code only diminishes
slightly as the size of the data word increases. In Figure 2, we
show the failure probability (i.e., the probability that the line will
have more failures than those correctible by ECC) for three
different codes: SECDED on a 64B line, SECDED on a 1KB line,
and double error correcting, triple error detecting code
(DECTED) on a 1KB line. At very low refresh times, failure rates
of the SECDED 64B line and the SECDED 1KB line are very
close. DECTED-1KB (with only a 29-bit overhead, 0.35%) has a
lower failure probability than both SECDED codes, except at high
refresh times, such as 500us, where it is very close to the
SECDED-64B code. By choosing a stronger code and amortizing
the cost of the code over larger cache lines, we can improve our
ability to tolerate failures with a very small storage overhead.
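Both properties can be checked numerically. The sketch below sizes each code with the BCH rule detailed in Section 4.3 (t·m check bits plus one extra detection bit, with m set by the codeword length) and models a line's failure probability as a binomial tail; the pf_bit value is an illustrative placeholder, not the data behind Figure 2.

```python
import math

def bch_check_bits(data_bits: int, t: int) -> int:
    """Check bits for a code correcting t errors and detecting t+1:
    r = t*m + 1, where m is the smallest value with 2^m >= data_bits + r."""
    m = 1
    while 2**m < data_bits + t * m + 1:
        m += 1
    return t * m + 1

def pf_line(pf_bit: float, line_bits: int, t: int) -> float:
    """P(more than t of line_bits cells fail): binomial tail."""
    p_ok = sum(math.comb(line_bits, k) * pf_bit**k
               * (1.0 - pf_bit) ** (line_bits - k) for k in range(t + 1))
    return 1.0 - p_ok

for name, line_bytes, t in [("SECDED-64B", 64, 1), ("SECDED-1KB", 1024, 1),
                            ("DECTED-1KB", 1024, 2), ("5EC6ED-1KB", 1024, 5)]:
    bits = line_bytes * 8
    r = bch_check_bits(bits, t)
    print(f"{name}: {r} check bits ({100 * r / bits:.2f}%), "
          f"pf_line = {pf_line(1e-6, bits, t):.2e}")
# Reproduces the bit counts quoted in the text: 11, 15, 29, and 71.
```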
Our Hi-ECC design implements a 5EC6ED code on each 1KB
line (5EC6ED-1KB), requiring an additional 71 bits (0.87%
overhead) for each line to store the code. In Figure 3, we compare
the refresh time of a 128MB cache augmented with Hi-ECC to the
baseline configuration with no error correction capability as well
as a configuration using a SECDED code for each 64B sub-block
(SECDED-64B). Like previous work that focused on SRAM [33],
we assume that the refresh time will be chosen such that no more
than 1E-03 (i.e., 1/1000) of the caches will fail. The baseline
configuration with no failure mitigation must operate at the

baseline refresh time of 30us. Adding a SECDED code allows a
5X increase in refresh time to 150us. Hi-ECC allows us to
increase the refresh time to 440us (almost a 15X reduction in
refresh frequency compared to the baseline).
[Figure omitted: probability of failure (log scale, 1E-21 to 1E+00) vs. refresh time (0-500us) for SECDED 1KB, SECDED 64B, and DECTED 1KB.]
Figure 2. Comparing bit failure probabilities for three code/line size combinations. DECTED on 1KB lines achieves higher refresh time than SECDED on 64B lines
[Figure omitted: probability of failure vs. refresh time for Base (30us), SECDED-64B (150us), and Hi-ECC 5EC6ED-1KB (440us).]
Figure 3. Hi-ECC achieves a higher refresh time than SECDED at the same failure probability
3.2 Reducing Latency
A hardware implementation of a 5EC6ED code is very complex
and imposes a long decoding latency penalty, proportional to both
the number of error bits corrected and the number of data bits
(Section 4.1). If the full strength encoding/decoding was required
for every cache access, this could significantly increase cache
access latency. However, our proposal leverages the fact that
error-prone portions of the cache can be disabled, avoiding the
high latency of decode during typical operation.
The Hi-ECC technique relies on a simple, one cycle ECC block to
correct a single bit error, and an un-pipelined, high-latency ECC
processing block to correct multiple bit failures in a cache line [7,
20]. When a line is read from the cache, a simple decoder
generates the syndrome for the line, which includes information
on whether it has zero, one, or a higher number of errors (Section
4.2). If the line has zero or one bit failures, the simple ECC
decoder can perform the correction in a single cycle. Figure 4
shows a high-level block diagram for Hi-ECC. The block referred
to as Quick-ECC contains the syndrome generation logic and the
error correction logic for lines with zero or one failures. The
Quick-ECC block also classifies lines into two groups based on
the syndrome: those that require complex multi-bit error
correction and those that have zero or one errors. Lines that
require multi-bit error correction are forwarded to a high latency
(potentially hundreds of cycles) ECC processing unit that
performs error correction using either software or a simple state
machine. This allows us to simplify the design at the expense of
increased latency for lines with two or more failures. Lines that
require at most one correction can be corrected immediately
and forwarded to the unit requesting the line.
Figure 4: Block diagram for Hi-ECC
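The dispatch in Figure 4 can be summarized as a short control-flow sketch. The helpers below are hypothetical stand-ins for the BCH machinery of Section 4, stubbed out so the dispatch logic itself runs:

```python
# Control-flow sketch of the Hi-ECC read path (Figure 4). All five
# helpers are placeholders for the Section 4 arithmetic, not real logic.

def make_syndrome(data, ecc_bits):       # simple, single-cycle logic
    return 0

def classify(syndrome):                  # returns 0, 1, or "multi"
    return 0

def flip_single_bit(data, syndrome):     # Quick-ECC single-bit repair
    return data

def slow_multibit_correct(data, ecc_bits):  # software or state machine
    return data

def disable_line(addr):                  # mark the line unusable
    pass

def read_line(addr, data, ecc_bits):
    syndrome = make_syndrome(data, ecc_bits)
    n_errors = classify(syndrome)
    if n_errors == 0:
        return data                      # common case: forward unchanged
    if n_errors == 1:
        return flip_single_bit(data, syndrome)  # one-cycle fast path
    # Rare case: un-pipelined, high-latency correction; disabling the
    # line afterwards ensures the multi-bit penalty is paid only once.
    data = slow_multibit_correct(data, ecc_bits)
    disable_line(addr)
    return data
```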
The high latency of handling multi-bit failures could significantly
reduce performance. To avoid incurring this latency, problematic
lines could be completely disabled or a mechanism such as bit-fix
[33] could be integrated as shown by the dotted box labeled
optional repair in Figure 4. This guarantees that the performance
penalty of multi-bit decoding is incurred only once, the first time
a failure is identified. The frequency of failures plays a role in
the disable strategy that we choose. Low multi-bit failure rates
motivate a simple approach such as disabling cache lines
containing multi-bit failures. On the other hand, cache line
disable will result in unacceptable cache capacity loss if multi-bit
failure rates are high. In this case, a more complex mechanism
such as bit-fix might be used to minimize the capacity lost to
disabling.
Figure 5 shows the probability that N (X-axis) or more lines have
multi-bit failures for a 128MB eDRAM cache at the refresh time
we propose (440us). On average, a 128MB eDRAM will have 750
1KB lines with multi-bit failures that need to be disabled
(0.573% of all lines). As highlighted in the figure, the probability
that 900 or more lines (0.7% of all cache lines) will exhibit multi-
bit errors is 6.77×10⁻⁸. For comparison, Hi-ECC augmented with
a simplified version of bit-fix [33] that repairs a single additional
bit per cache line requires an additional 13 bits per line (0.13%
overhead). However, this 13-bit overhead enables efficient
correction of lines with 2-bit errors and reduces the number of
lines that need to be disabled. Disabling lines with only 3 or more
errors reduces the average number of disabled lines from 750 to
28 (0.02% of all lines), with a probability of 5.95×10⁻⁸ that 60 or
more lines contain three or more errors. Although adding bit-fix
reduces wasted cache capacity, the improvement over simple line
disable is marginal for the failure rates in our model and doesn’t
justify the additional latency and complexity. As a result, the rest
of this paper will focus on the Hi-ECC approach that relies solely
on line disable for lines with multi-bit (two or more) errors.
Due to the implementation of our eDRAM cache, there are
restrictions on how many and which lines can be disabled. Since
our cache is 16-way set-associative, and since disabling all lines
in a particular cache set could be catastrophic for some
workloads, we limit the maximum number of lines that can be
disabled in a particular set to 14 of the 16 ways. We also limit the
maximum number of failing lines to 900 to quantify overhead.
With a refresh time of 440us, the probability of at least one of the
lines containing more than 5 failing bits is 6.21×10⁻⁴ (Figure 3),
the probability of more than 900 multi-bit failures (disabled lines)
is 6.77×10⁻⁸ (Figure 5), and the probability of more than 14 multi-bit
failures in a single set is 1.12×10⁻⁶¹. This indicates that
disabling cache lines with multi-bit failures will have little effect
on the overall probability that our cache meets the quality
requirements at 440us. The total storage overhead for our
approach is 1.58% including a 0.88% overhead for the code and a
single per-line disable bit, and a 0.7% overhead for the 900
disabled lines.
[Figure omitted: probability that N or more lines fail vs. # lines (N); w/o bit-fix: mean 750 lines, pfail(900 lines) = 6.77E-8; w/ bit-fix: mean 28 lines, pfail(60 lines) = 5.95E-8.]
Figure 5. The distribution of failing lines for a 128MB Cache with 1KB lines with (w/) and without (w/o) bit-fix.
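The 1.58% total can be reproduced from the figures above: 72 bits per 1KB line (71-bit code plus one disable bit) and a 900-line disable budget. A worked check (the quoted total reflects rounding the two components separately):

```python
# Storage-overhead accounting for Hi-ECC on a 128MB cache with 1KB lines.
LINE_BITS = 1024 * 8
N_LINES = (128 * 2**20) // 1024     # 131072 lines

per_line = (71 + 1) / LINE_BITS     # 5EC6ED code + per-line disable bit
disabled = 900 / N_LINES            # worst-case capacity lost to disables
print(f"code+disable: {100 * per_line:.2f}%   "
      f"disabled: {100 * disabled:.2f}%   "
      f"total: {100 * (per_line + disabled):.2f}%")
# -> code+disable: 0.88%   disabled: 0.69%   total: 1.57% (~1.58% quoted)
```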
3.3 Reducing Dynamic Power
Our Hi-ECC proposal uses larger cache line sizes to reduce the
area cost of strong ECC codes. However, larger line sizes
introduce some additional challenges. Although 1KB is a
reasonable line size for a large embedded memory, our baseline
configuration has a much smaller L2 cache with a 64B cache line
(referred to as sub-block). Some implementation issues arise
when we read from or write to our large L3 eDRAM cache due to
the mismatch between its 1KB line size and the 64B sub-blocks
used by other caches.
Most writes to the large L3 eDRAM cache will be in the form of
smaller 64B sub-blocks generated at lower-level caches or fetched
from memory. To modify a 64B sub-block in a 1KB line, we need
to perform a read-modify-write operation since we need to
compute the ECC code. First, the old 64B block that is being
overwritten must be read, along with the ECC code for the entire
line. We then use the old data, old ECC, and new data to compute
the new ECC for the whole 1KB line. We then write the new 64B
sub-block and the new ECC. However, we do not need to read the
whole 1KB line to compute the new ECC, as explained later in
Section 4.3.
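The reason the other sub-blocks need not be read is that BCH codes are linear over GF(2): ECC(a xor b) = ECC(a) xor ECC(b), so the new code is the old code xored with the code of the zero-extended delta. A sketch of this update, using a toy XOR-fold in place of the Section 4 BCH encoder (any GF(2)-linear code obeys the same identity):

```python
SUB_BLOCK = 64   # bytes per sub-block
LINE = 1024      # bytes per cache line

def ecc(word: bytes) -> int:
    """Toy GF(2)-linear code (byte-wise XOR fold) standing in for the
    BCH encoder of Section 4; only linearity matters for this sketch."""
    out = 0
    for b in word:
        out ^= b
    return out

def new_line_ecc(old_sub: bytes, new_sub: bytes, old_ecc: int,
                 index: int) -> int:
    """Read-modify-write ECC update without touching the other 15
    sub-blocks: ecc(new_line) = ecc(old_line) ^ ecc(delta)."""
    delta = bytes(a ^ b for a, b in zip(old_sub, new_sub))
    # Zero-extend the 64B delta to the full 1KB word at its position;
    # zero bytes contribute nothing to a linear code.
    padded = (bytes(index * SUB_BLOCK) + delta
              + bytes(LINE - (index + 1) * SUB_BLOCK))
    return old_ecc ^ ecc(padded)

# Self-check: incremental update matches a full re-encode.
import os
old_line = bytearray(os.urandom(LINE))
new_sub, i = os.urandom(SUB_BLOCK), 5
full = bytearray(old_line)
full[i * SUB_BLOCK:(i + 1) * SUB_BLOCK] = new_sub
assert new_line_ecc(bytes(old_line[i * SUB_BLOCK:(i + 1) * SUB_BLOCK]),
                    new_sub, ecc(bytes(old_line)), i) == ecc(bytes(full))
```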
The purpose of most L3 reads will be to provide cache lines for
allocation in lower-level caches. Processing any sub-block
requires the ECC code to be processed with the entire data word
(1KB cache line) that it protects. Since each 64B sub-block must
be checked, each reference to a 64B sub-block must be
accompanied by a reference to the surrounding 64B sub-blocks.
This implies that any L3 read will access all 16 sub-blocks in the
1KB line, as well as the ECC code that they share. As an
example, if we need to read eight out of the 16 sub-blocks in one
1KB line, we must read all 16 sub-blocks eight times, for a total
of 128 sub-block reads. This large number of additional reads
causes a substantial increase in dynamic power consumption and
a drastic reduction in the useful bandwidth delivered by the
memory.
To address the extra power overhead for L3 reads, we consider
the fact that the vast majority of eDRAM failures are retention
failures. Since the retention time of our baseline eDRAM is 30us,
and each read automatically implies a refresh, we know that
retention failures will not occur for 30us after a line has been
read. Our proposal leverages this property and also the temporal
and spatial locality of the data to minimize the number of
superfluous reads. Using a structure we refer to as the Recently
Accessed Lines Table (RALT), we attempt to track lines that have
been referenced in the last 30us.
The first read to a line causes all sub-blocks in the line to be read
and checked for failures. The address of the line is then placed in
the RALT to indicate that it has recently been checked and will
remain free from retention failures for the next 30us. As long as
the address of the line is held in the RALT, any sub-block reads
from the line can forgo ECC processing and thus avoid reading
the ECC code and other sub-blocks in the line. To operate
correctly, the RALT must ensure that none of its entries are more
than 30us old. To guarantee this, each 30us is divided into four
equal periods (P0, P1, P2, P3). Entries allocated in the RALT
during each period are marked with a 2-bit identifier to specify
the allocation period. Transitions between periods, P0 to P1 for
example, will cause all RALT entries previously allocated in P1
to be invalidated.
Each entry in the RALT consists of the following fields: a line
address to identify the line the entry is associated with; a valid bit,
a 2-bit period identifier field to indicate in which of the four
periods the line was allocated (P0, P1, P2, P3); and a 16-bit parity
consisting of one parity bit for each 64B sub-block in the line.
The RALT is direct mapped, but supports a CAM invalidate on
the 2-bit period field to allow bulk invalidates of RALT entries
during period transitions.
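A behavioral sketch of the RALT as described, with the direct-mapped lookup and the bulk, CAM-style invalidate on the period field; the entry count and index function are illustrative assumptions, not values from the paper:

```python
class RALT:
    """Recently Accessed Lines Table: tracks lines read within the last
    30us so their sub-block reads can skip ECC processing."""

    def __init__(self, n_entries: int = 512):
        self.entries = [None] * n_entries  # (line_addr, period_id, parity16)
        self.period = 0                    # current period, P0..P3

    def _index(self, line_addr: int) -> int:
        return line_addr % len(self.entries)  # direct-mapped

    def insert(self, line_addr: int, parity16: int) -> None:
        self.entries[self._index(line_addr)] = (line_addr, self.period,
                                                parity16)

    def hit(self, line_addr: int) -> bool:
        e = self.entries[self._index(line_addr)]
        return e is not None and e[0] == line_addr

    def advance_period(self) -> None:
        """Called every 30us/4 = 7.5us. Entering period P bulk-invalidates
        all surviving entries tagged P (allocated a full window ago), so
        no entry in the table is ever older than 30us."""
        self.period = (self.period + 1) % 4
        self.entries = [None if (e is not None and e[1] == self.period)
                        else e for e in self.entries]
```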
Figure 6 compares the implementation of the baseline L3
protection scheme (top) with that of Hi-ECC (bottom). The
baseline scheme uses a separate tag for each 1KB line and a
separate SECDED code for each 64B sub-block. To read a 64B
block, first the ECC and the block itself are read, then the ECC is
processed. In our Hi-ECC technique, the first time a sub-block is
read the entire ECC code is read along with each sub-block in the
1KB line to allow ECC processing for a single 64B block. We
update the RALT with the line address of the referenced line, a 2-
bit period ID, and a single parity bit for each sub-block. After the
