Improving Cache Lifetime Reliability at Ultra-low Voltages
Zeshan Chishti, Alaa R. Alameldeen, Chris Wilkerson, Wei Wu and Shih-Lien Lu
Oregon Microarchitecture Research, Intel Labs
ABSTRACT
Voltage scaling is one of the most effective mechanisms to
reduce microprocessor power consumption. However, the
increased severity of manufacturing-induced parameter variations
at lower voltages limits voltage scaling to a minimum voltage,
Vccmin, below which a processor cannot operate reliably.
Memory cell failures in large memory structures (e.g., caches)
typically determine the Vccmin for the whole processor. Memory
failures can be persistent (i.e., failures at time zero which cause
yield loss) or non-persistent (e.g., soft errors or erratic bit
failures). Both types of failures increase as supply voltage
decreases and both need to be addressed to achieve reliable
operation at low voltages.
In this paper, we propose a novel adaptive technique to
improve cache lifetime reliability and enable low voltage
operation. This technique, multi-bit segmented ECC (MS-ECC),
addresses both persistent and non-persistent failures. Like
previous work on mitigating persistent failures, MS-ECC trades
off cache capacity for lower voltages. However, unlike previous
schemes, MS-ECC does not rely on testing to identify and isolate
defective bits, and therefore enables error tolerance for non-
persistent failures like erratic bits and soft errors at low voltages.
Furthermore, MS-ECC’s design can allow the operating system to
adaptively change the cache size and ECC capability to adjust to
system operating conditions. Compared to current designs with
single-bit correction, the most aggressive implementation for MS-
ECC enables a 30% reduction in supply voltage, reducing power
by 71% and energy per instruction by 42%.
Categories and Subject Descriptors
B.3.4 [Memory Structures]: Reliability, Testing, and Fault-
Tolerance.
General Terms
Design, Reliability
1. INTRODUCTION
As semiconductor technology continues to scale, energy
efficiency is becoming the key design concern for computer
systems. Microprocessors often use multiple power modes to
exploit the power-performance tradeoff in order to improve
energy efficiency. Many processors (e.g., the Intel® Celeron®
processor [11]) have high-performance and low-power modes of
operation. In the high-performance mode, the processor uses a
high voltage and runs at a high frequency to achieve the best
performance. In the low-power mode(s), the processor runs at a
lower frequency and uses a lower voltage to conserve energy.
Such power saving features are becoming prevalent in current
processor designs.
Reducing supply voltage is one of the most effective
methods to reduce power consumption. However, as supply
voltage decreases, manufacturing-induced parameter variations
increase in severity, causing many circuits to fail. These variations
restrict voltage scaling to a minimum value, often called Vccmin
(or Vmin), which is the minimum supply voltage for a die to
operate reliably. Failures in memory cells typically determine the
Vccmin for a processor as a whole [31]. Reducing Vccmin in the
context of memory failures is important for enabling ultra-low
power modes that are more energy-efficient.
Prior work [21, 31] has proposed techniques to
enable ultra-low voltage cache operation in the context of high
memory cell failure rates. The proposed techniques trade off
cache capacity for reliable low voltage operation. In the high-
voltage mode of operation, cell failure rate is low and the entire
cache is available for use. During the low-power, low-voltage
mode, cache size is sacrificed to increase reliability. These
techniques enable a significant reduction in supply voltage.
However, they require conducting thorough memory tests at low
voltages to isolate defective bits. Memory tests must be performed
whenever the processor boots, and the location of defective bits
needs to be stored in the memory hierarchy. While these tests can
detect voltage-dependent, persistent bit failures, other non-
persistent sources of bit failures are dynamic in nature and cannot
be detected by testing. Examples of such non-persistent bit
failures include erratic bit failures [1] and soft errors. Non-
persistent failures also increase when supply voltage decreases as
we show in Section 3. Techniques like those proposed in [21, 31]
cannot mitigate non-persistent errors and therefore must rely on a
voltage guardband to enable reliable cache operation.
Other prior work addresses non-persistent, transient bit
failures at normal operating voltages. Examples of such work
include error-correcting schemes such as the two-dimensional
ECC proposal by Kim et al. [14]. These techniques effectively
tolerate multiple bit errors due to non-persistent faults. However,
prior work focused only on failure rates at normal operating
voltages, and did not address persistent failures at low voltages
which result in yield loss.
To enable a cache design that tolerates both non-persistent
and persistent failures at low voltages, we need a unified solution
that does not rely on testing. We propose using redundancy to
enable ultra-low voltage cache operation. Our solution, multi-bit
segmented ECC (MS-ECC), employs error correction codes to
tolerate both persistent and non-persistent bit failures at low
voltages. During low-voltage operation, some ways in each cache
set are used to store ECC check bits for the remaining ways,
thereby increasing reliability against high failure rates. The
number of ways used for storing ECC can be adaptively chosen
by the operating system on the basis of the desired reliability
level, which in turn depends on the target Vccmin. Increasing the
number of ECC ways increases the redundancy and thus the
reliability but decreases the cache capacity available during low-
voltage operation.
To simplify ECC implementation, MS-ECC divides a cache
line into multiple small segments and corrects errors on a per-
segment basis. Performing error correction at finer granularities
enables more errors to be fixed with lower latency and
complexity. To further reduce the logic complexity of error
correction, we leverage the previously proposed Orthogonal Latin
Square Codes (OLSC) [9]. OLSC enables faster encoding and
decoding than traditional ECC, at the cost of more check bits.
Furthermore, OLSC uses modular error correction hardware
which can be used, adaptively, to correct a varying number of
errors. This adaptive design can be used to trade off reliability for
performance in the low-voltage operating mode. If performance is
insensitive to cache size in the low-voltage operating mode as
shown in [31], then the design should target maximum reliability.
If performance for some applications is sensitive to cache
size, then the error correction capability can be sacrificed to
increase cache size.
This paper makes the following main contributions:
1. To our knowledge, our paper is the first to quantify the
impact of both persistent and non-persistent (erratic bits, soft
errors) failures on cache yield loss and lifetime reliability.
2. We propose a novel error tolerance mechanism, multi-bit
segmented ECC (MS-ECC), which uses Orthogonal Latin
Square Codes to reduce Vccmin by supporting multi-bit error
correction for small cache line segments and cache tags.
3. We propose an adaptive mechanism that enables a variable
part of the cache to be used for error correction. This
mechanism can correct 1-4 errors for each 64-bit segment,
where a higher correction capability increases reliability at
the expense of sacrificing a bigger percentage of the cache
size.
4. We show that the most aggressive implementation of MS-
ECC can achieve reliable cache operation at 520 mV,
incurring minimal additional latency, while sacrificing half
of the cache capacity at low voltages. Compared to previous
schemes, our proposal addresses both persistent and non-
persistent failures, and reduces the overhead of thorough
testing at boot time.
In the remainder of this paper, we discuss the impact of bit
failures on Vccmin in Section 2. We describe two types of non-
persistent failures and their impact on lifetime reliability in
Section 3. We discuss our proposed technique in detail in Section
4. We introduce the experimental methodology in Section 5 and
evaluate our technique in Section 6. We conclude in Section 7.
2. BACKGROUND
2.1 Bit Failures and Vccmin
Large SRAM caches make up a significant percentage of
transistors in a microprocessor die. Parameter variations, induced
by imperfections in the semiconductor process, limit the minimum
operational supply voltage to Vccmin, below which an SRAM cell
fails to operate reliably. For each of the SRAM caches, the bit
with the highest Vccmin determines the Vccmin of the cache as a
whole [31].
Bit failures can be classified into two broad categories:
persistent and non-persistent. The first category contains the
majority of bit failures, where bits exhibit persistent failing
behavior. Several papers [2, 15, 31] have analyzed persistent
failures in detail and have shown that intra-die random dopant
fluctuations (or RDF) play a primary role in these failures. Prior
work has also demonstrated that persistent failures exhibit a
strong dependence on supply voltage. For example, Kulkarni et al.
[15] show that the bit failure rate increases exponentially with a
decrease in voltage. Since these types of failures can be reliably
identified using standard memory testing methodologies, we refer
to them as testable failures. Memory tests are performed on each
die before it is shipped to a customer, and dies with irreparable
failures identified by the memory tests are disposed of. As a
result, persistent failures typically contribute to yield loss but do
not play a direct role in determining the lifetime reliability of a
microprocessor. The lifetime reliability of a processor is usually
represented by FIT (Failures In Time), the number of failures that
occur in a billion hours for a particular unit.
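As a quick illustration of the conversion (a sketch in our own
notation; the 250,000-hour example reappears in Section 3.3):

    # Convert a mean interval between failures into a FIT rate,
    # i.e., failures per 10^9 device-hours.
    def fit_rate(hours_per_failure):
        return 1e9 / hours_per_failure

    # One cache failure every 250,000 hours corresponds to 4000 FIT,
    # the lifetime-reliability target used later in this paper.
    assert fit_rate(250_000) == 4000.0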
The second category of failures consists of non-persistent bit
failures, where bits exhibit sporadic failing behavior. Failures
resulting from particle strikes (soft errors) as well as erratic bit
failures, both discussed in greater detail in Section 3, are examples
of this category. Since these failures are non-persistent and occur
randomly, they can’t be identified with memory tests. As a result,
these failures don’t directly contribute to yield loss, instead
contributing directly to a unit’s FIT rate. We classify failures of
this type as non-testable failures.
2.2 Related Work
One solution to improve cache reliability is to implement
true column and/or row redundancy by adding multiple spare
rows and/or columns to the cache array [24]. This solution is able
to tolerate errors that cause a few rows or columns to become
defective. However, it cannot deal with thousands of randomly
distributed cache bits in large caches which would become
defective in the low voltage mode due to high cell failure rates.
Kim et al. [14] propose a scheme to use two-dimensional
ECC to correct multi-bit errors. This scheme is tailored to deal
with clustered multi-bit failures that cause contiguous bits across
multiple rows and columns to fail concurrently. The ability of this
scheme to correct errors is strongly dependent upon the location
of defective bits. While this scheme is able to tolerate correlated
failures that affect several contiguous bits, it cannot tolerate
failures in multiple randomly distributed bits in each cache line
that become defective at low voltages due to random dopant
fluctuations.
Another solution to improve cache reliability at low voltages
is to change the SRAM cell design. The designer can upsize the
transistors or use cell design variants such as the 8T, 10T, and ST
SRAM cells [15]. However, the resulting Vccmin reduction
comes at the cost of significant increases in area (e.g., 100% area
increase for the ST cell). Furthermore, larger SRAM cells result in
increased leakage power in both high-performance and low-power
modes.
A recent paper [31] proposed two architectural schemes to
enable cache operation at ultra-low voltages. The first scheme,
word-disable, disables 32-bit words that contain one or more
defective bits. Physical lines in two consecutive ways combine to
form one logical line, where only non-failing words are used. A
similar scheme was proposed by [21]. The second scheme, bit-fix,
uses a quarter of the cache ways to store the location of defective
bits in other cache ways, as well as the correct value of these bits.
That work focused on testable, voltage dependent, persistent
failures, and evaluated Vccmin in the context of yield loss.

Vccmin was defined as the voltage at which the cache is
functional in 999 of every 1000 dies. However, the paper did not
evaluate the impact of FIT rate on Vccmin, although the authors
acknowledged that additional ECC and supply voltage guardbands
might be required. In comparison, our work addresses both
persistent failures and non-persistent failures, and extends the
previously proposed failure probability model to account for the
impact of FIT rates on Vccmin.
Because several previous papers [2, 15, 31] have discussed
the causes of persistent cell failures in detail and attributed them
to random intra-die dopant fluctuations (RDF), we omit the
discussion of persistent failures and instead focus on non-
persistent bit failures in the next section.
3. NON-PERSISTENT BIT FAILURES
3.1 Soft Errors
Soft errors have increased in significance in recent years due
to the increasing number of devices per die, while the soft error
rate per device has remained constant or slightly decreased across
technology generations [8, 16, 25]. This increase, as well as future
expected increases in soft error rates (SER), triggered research to
mitigate the impact of soft errors on the correctness of a program
execution. Computer architecture research has recently focused on
providing architecture-level solutions to mitigate soft errors [2,
19, 26, 30, 32].
Our target of running a processor in a low-power mode at
low voltages further exacerbates the soft error problem. Previous
measurement studies have shown that reducing supply voltage
increases the soft error rate exponentially [12, 13, 22, 29]. These
measurements for SRAM cells and flip-flop designs from multiple
vendors all confirm that soft error rates will increase at lower
voltages. This increase is caused primarily by the exponential
relationship between the soft error rate and the charge stored at a
particular node, which in turn changes linearly with supply
voltage [29].
In our studies, we used the data measured by Ünlü et al. [29]
to estimate the soft error rate per SRAM bit. We extrapolated this
data to lower voltages. However, since this data was measured by
inducing neutrons from a nuclear reactor, we scaled the soft error
rates by a factor of one billion to estimate the soft error rate under
normal conditions at sea-level [33].
While previous measurements show that soft error rates
increase exponentially with reduction in supply voltage, the rate
of increase is limited to 2.5x-3x for every 500mV decrease in
supply voltage. This soft error rate increase is much lower than
the increase in persistent failures, where a similar decrease in
supply voltage leads to an increase in failure probability of more
than a billion times [15, 31]. Figure 1 compares the probability of
different types of cell failures as a function of supply voltage. The
persistent failure probabilities in Figure 1 are based on results
reported in [15] which were obtained with circuit simulations
validated against measured data. Compared to voltage-dependent,
persistent failures, Figure 1 shows that soft errors are more
significant at higher voltages. At low voltages, however,
persistent and erratic failures significantly overshadow soft errors.
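The contrast between the two scaling rates can be made concrete
with a minimal sketch; the per-500mV factors come from this
section, while the multiplicative extrapolation and the names are
illustrative assumptions:

    # Extrapolate a failure rate from a reference voltage v0 down to
    # v, multiplying by `factor_per_500mv` per 500 mV of reduction.
    def scale_rate(rate_at_v0, v0, v, factor_per_500mv):
        steps = (v0 - v) / 0.5
        return rate_at_v0 * factor_per_500mv ** steps

    # SER grows by ~2.5x-3x per 500 mV of voltage reduction, while
    # persistent failures grow by more than ~1e9x over the same span.
    ser_growth = scale_rate(1.0, 1.0, 0.5, 3.0)           # ~3x
    persistent_growth = scale_rate(1.0, 1.0, 0.5, 1e9)    # ~1e9x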
3.2 Erratic Bit Failures
Erratic bit failures have played a key role in setting Vccmin
in the past, and they are likely to re-emerge as a reliability
concern in the future [1, 6]. Erratic behavior in the Vccmin of an
SRAM cell can occur when an NMOS pull down device in an
SRAM cell experiences soft breakdown. In soft breakdown,
random telegraph signal noise in the gate oxide leakage of the
NMOS device can cause the SRAM cell Vccmin to erratically
fluctuate by as much as 200mV [1]. Due to their random nature,
erratic cells may escape standard testing and appear as normal
cells, but may cause bit failures later. Erratic behavior in SRAM
cells depends strongly on process parameters. Agostinelli et al.
[1] report discovering erraticism on the 90nm process technology
node, but were able to mitigate it successfully. However, the
authors point out that continued device scaling is likely to re-
introduce erraticism in future process technologies.
Detailed information on erratic bit failures is scarce and good
physical models are non-existent. Despite the lack of good
physical models, it is important to consider erratic bit failures
especially when considering operation at low voltages. In our
studies, we use a very simple model for erratic bit failures and
address the sensitivity of erratic bit failures to process parameters
by modeling two hypothetical processes with both high and low
rates of erraticism. Since cell stability and oxide strength play a
role in both erratic bit failures and persistent failures, we expect
that the probability of an erratic bit failure will be proportional to
the probability of a voltage–dependent, persistent failure. Our
studies reflect this by setting the probability of an erratic bit
failure as a fixed percentage of persistent failures.

[Figure 1. Probability of persistent failures, soft errors (SER),
and erratic failures per hour vs. Vcc, for low-erratic and
high-erratic processes. Y-axis: Pfail (probability of failure),
log scale from 1.00E-15 to 1.00E+00; X-axis: Vcc (V), 0.4 to 0.85.]

Furthermore, we expect that the frequency and severity of erratic
bit failures
will be highly process-dependent. To reflect process sensitivity,
we model both a high-erratic process with a high rate of erratic bit
failures, and a low-erratic process with a low rate of erratic bit
failures. Our high-erratic process produces one erratic bit failure
for every ten persistent failures. The probability of an erratic
failure on our low-erratic process is 1000 times lower, or one
erratic failure for every 10,000 persistent failures. We model the
frequency and duration of erratic bit failures in the same way that
we model soft error failures; the probability of an erratic bit
failure reflects the probability of an erratic failure in an hour; and
we assume that the erratic failure (or soft error) lasts an hour.
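A minimal sketch of this erratic-bit model (the ratios are from
the text; the function and names are ours):

    # Hourly probability of an erratic bit failure, modeled as a
    # fixed fraction of the persistent-failure probability.
    ERRATIC_RATIO = {"high": 1 / 10, "low": 1 / 10_000}

    def p_erratic_per_hour(p_persistent, process):
        # One erratic failure per 10 persistent failures on the
        # high-erratic process; one per 10,000 on the low-erratic one.
        return ERRATIC_RATIO[process] * p_persistent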
Figure 1 shows the probability of different types of failures
as a function of voltage. We note that since we model erratic
failures as a fixed proportion of persistent failures, the probability
of erratic failures is sensitive to supply voltage. It is also worth
noting that since Figure 1 shows different failure rates on the
same Y-axis, the increase in SER (2.5x – 3x for 500 mV decrease
in supply voltage) appears flat relative to the much larger increase
in persistent errors (higher than one billion times for 500 mV
decrease in supply voltage).
3.3 Comprehensive Yield Loss/FIT Model for
Cache Failures
To account for the impact of FIT rate on Vccmin, we use a
comprehensive model for cache failures that includes the impact
of voltage–dependent, persistent failures on yield loss as well as
the impact of soft error rates (SER) and erratic bits on FIT rate.
Our model for yield loss is similar to the model evaluated in [31],
but we extend the model with a time component to model failures
in time. To calculate the FIT rate of our cache, we divide the
cache lifetime into discrete 1-hour periods and compute the
probability of a bit failing during each 1-hour period.
We conservatively assume that SER failures in the cache will
last an hour before the bit is either rewritten or scrubbed. We
assume the same duration for erratic bit failures. Figure 1 shows
the probability of SER and erratic failures per hour, and the
probability that a bit suffers from a voltage dependent persistent
failure. Using these three probabilities, we can determine the
overall probability that a bit will fail for one of these three reasons
in a 1-hour period. For longer periods (e.g., 250 hours), we
combine the probabilities of failure for each hour to get the
overall probability of failure. The probability of failure for the
entire cache is (1 - probability of success), where the probability
of success is the probability that each bit in the cache stays
failure-free for every hour in the period. Figure 2 shows the pfail
of a base 2MB cache as derived from our model. There are two
pfail curves for each cache type: without ECC (BASE-YL, BASE-FIT)
and with ECC (1-ECC-YL, 1-ECC-FIT). The line marked BASE-YL refers
to the pfail of the cache due solely to persistent failures. A
pfail of 10^-3 corresponds to a yield loss of 0.1%. A separate
line, marked BASE-FIT, indicates the probability that the cache
will fail during a 250-hour period. A pfail of 10^-3 on that line
corresponds to one failure every 250,000 hours, or a FIT rate of
4000 failures in a 10^9-hour period. In the remainder of our
paper, we target a yield loss of 0.1% as suggested in [31] and a
FIT rate of 4000 as suggested in [26]. To meet these requirements,
both FIT and YL pfails must be less than or equal to 10^-3.
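The model lends itself to a short sketch. The code below follows
the description above literally, combining all three failure types
in each 1-hour period; it is our rendering, not the authors' code
(a 2MB cache holds 2 * 2^20 * 8 data bits):

    # Probability that a single bit fails, for any of the three
    # reasons, during one 1-hour period.
    def p_bit_fail_hour(p_persistent, p_ser, p_erratic):
        return 1 - (1 - p_persistent) * (1 - p_ser) * (1 - p_erratic)

    # The cache fails unless every bit stays failure-free for every
    # hour in the period: pfail = 1 - p(success).
    def cache_pfail(p_persistent, p_ser, p_erratic, n_bits, hours):
        p_hour_ok = 1 - p_bit_fail_hour(p_persistent, p_ser, p_erratic)
        return 1 - p_hour_ok ** (n_bits * hours)

    # Example: a 2MB cache over the 250-hour window used for BASE-FIT.
    print(cache_pfail(1e-12, 1e-15, 1e-13, 2 * 2**20 * 8, 250))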
Data in Figure 2 shows that although the simple 2MB cache
(BASE) meets yield loss requirements at 830mV, FIT
requirements are not met. Adding 1-bit ECC to the cache enables
it to meet yield loss requirements at 670mV and FIT requirements
at 725mV. Since both FIT requirements and yield loss
requirements must be met, Vccmin for this cache would be set at
725mV. In the next section, we propose a technique to reduce
Vccmin further.
4. MULTI-BIT ERROR CORRECTION FOR
VCCMIN REDUCTION
Redundancy is the most widely used approach to increase
reliability. There are several options to implement redundancy.
Employing error detection and correction codes is one of the
methods that improve the reliability of a system through
information redundancy. Single Error Correction, Double Error
Detection (SECDED) codes such as the Hamming code [17] are
well known and have been used in memory chips and on-chip
caches due to their simplicity. With continued transistor scaling,
multi-bit error correcting codes are becoming more important for
large on-chip memory arrays. For example, a Double Error
Correction Triple Error Detection (DECTED) code can be
designed based on a binary BCH code [4]. While SECDED and
DECTED codes provide sufficient redundancy to deal with bit
failures at normal operating voltages, neither technique provides
enough redundancy to allow cache operation at an ultra-low voltage
level (e.g., 500 mV).

[Figure 2. FIT rate and yield loss (YL) for a baseline 2 MB cache
and a cache with SECDED ECC. Y-axis: Pfail (probability of
failure), log scale from 1.00E-15 to 1.00E+00; X-axis: Vcc (V),
0.4 to 0.9; curves: BASE-YL, BASE-FIT, 1-ECC-YL, 1-ECC-FIT.
Annotations: the BASE cache meets YL requirements but not FIT
requirements; an arrow marks the guardband shift due to
low-erratic failures and SER.]

Wilkerson et al. [31] showed that
enabling 500mV operation in a 2MB cache requires 10-bit error
correction for each 512-bit cache line. Extending conventional
ECC to correct 10-bit errors over an entire 512-bit cache line
requires significant area, latency, and power overheads [5, 14],
which we briefly describe later in this section. We next propose
our multi-bit segmented ECC that achieves the same purpose with
lower overhead.
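For a sense of scale, a standard BCH sizing bound (an assumption
on our part, not a figure from this paper) puts t-error correction
over an n-bit block at roughly t * ceil(log2(n+1)) check bits,
with decoder complexity that also grows with t:

    import math

    # Approximate BCH check-bit count for t-error correction over an
    # n-bit block (standard bound; a rough sizing aid only).
    def bch_check_bits(n, t):
        return t * math.ceil(math.log2(n + 1))

    # The 10-bit-per-512-bit-line requirement of [31] would take
    # about 100 check bits per line, before counting the multi-bit
    # BCH decoding logic itself.
    print(bch_check_bits(512, 10))   # -> 100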
4.1 Multi-bit Segmented ECC
To enable correcting a large number of bits with manageable
complexity and latency overhead, we propose multi-bit segmented
ECC (MS-ECC). In this technique, we trade off cache capacity for
reliability at low voltage. The main idea behind MS-ECC is to
correct multi-bit errors by implementing error correction at finer
granularity segments within a cache line. In the high-voltage,
high-performance operating mode, the entire cache capacity is
available for use by the processor, and conventional ECC is used.
In the low voltage, low power operating mode, a portion of the
cache is used to store additional ECC information on granularities
finer than a cache line, thereby enabling more errors to be fixed
on a per-line basis. Because the size of each segment is smaller
than the entire cache line, the latency and complexity for MS-
ECC is significantly less than conventional (un-segmented) ECC.
Example: Consider a 2MB 8-way L2 cache with 64-byte
lines. In the low voltage mode, we divide the eight physical ways
in each set amongst (i) data ways and (ii) ECC ways. The ratio of
data ways to ECC ways depends on the desired reliability level,
which in turn depends on the target Vccmin. If the operating
system decides to operate at a higher reliability level, this ratio
would be adaptively increased to increase the redundancy and
decrease the cache capacity available during low voltage
operation. We analyze the impact of this ratio on Vccmin and
performance in Section 6.2. For this example, assume that there is
one ECC way for every data way (50% cache capacity available
in low voltage mode). We use a fixed mapping to associate data
ways with their corresponding ECC ways (Figure 3(a)): physical
way 1 stores the ECC for physical way 0, physical way 3 stores
the ECC for physical way 2, and so on. Thus, in the low-voltage
mode, the cache effectively becomes a 1MB 4-way set-associative
cache.
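A sketch of this fixed mapping (constants and names are ours):

    # Fixed data-way -> ECC-way pairing for the 8-way example: each
    # odd physical way stores the ECC bits of the preceding even way.
    N_WAYS = 8

    def ecc_way_for(data_way):
        assert data_way % 2 == 0, "only even ways hold data in this mode"
        return data_way + 1

    data_ways = [w for w in range(N_WAYS) if w % 2 == 0]  # 0, 2, 4, 6
    # With half the ways holding ECC, the 2MB 8-way cache behaves as
    # a 1MB 4-way set-associative cache in the low-voltage mode.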
We divide each data way into multiple segments and store
the ECC for each segment in the corresponding ECC way. Figure
3 shows an example of multi-bit segmented ECC with eight 64-bit
segments in each 512-bit line. On a read hit (Figure 3(b)), we
fetch both the data line and the corresponding ECC line. There are
separate ECC decoders for each of the eight segments that decode
segments in parallel by using information from both the data and
ECC ways. The decoded segments are then concatenated to obtain
the entire 512-bit line. On a write hit to the L2 cache (Figure
3(c)), we first use the ECC encoders to obtain the ECC for the
data line. Like the ECC decoders, there are separate encoders for
each segment that perform ECC encoding in parallel. We then
write the new data to the data line and the new ECC to the
corresponding ECC line. A similar encoding is performed when a
new line is brought into the L2 cache upon a cache miss. We note
that each cache access in the low-voltage mode requires access to
both the data way and the ECC way. To avoid increasing cache
latency, we assume double-width buses for concurrent transfer of
the data and ECC ways to the ECC encoder/decoder.
Alternatively, one could also read the two cache lines
sequentially, thereby incurring additional latency overhead in the
low-voltage mode.
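The per-segment dataflow can be sketched as follows;
decode_segment and encode_segment stand in for the OLSC logic of
Section 4.2 and are placeholders, not real implementations:

    SEG_BITS, SEGS_PER_LINE = 64, 8   # eight 64-bit segments per line

    def read_line(data_segs, ecc_segs, decode_segment):
        # Read hit: fetch the data line and its ECC line, correct each
        # 64-bit segment independently (in parallel in hardware), then
        # concatenate the corrected segments into the 512-bit line.
        return [decode_segment(d, e) for d, e in zip(data_segs, ecc_segs)]

    def write_line(new_segs, encode_segment):
        # Write hit (or fill on a miss): encode each segment, then
        # write the new data and the new ECC to their respective ways.
        ecc_segs = [encode_segment(d) for d in new_segs]
        return new_segs, ecc_segs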
4.2 Orthogonal Latin Square Codes (OLSC)
Performing error correction at finer granularities helps in
decreasing the complexity of multi-bit error correction. However,
our evaluation shows that enabling ultra-low voltage operation
requires three or more errors to be fixed per segment, even for
small segment sizes. Conventional error correction codes based on
BCH codes usually fix one (SECDED) or two errors (DECTED).
These codes have been optimized for storage overhead (i.e., the
number of check bits).
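By contrast, an Orthogonal Latin Square code over m = k*k data
bits corrects t errors with 2*t*k check bits and one-step
majority-logic decoding; this property of OLS codes is taken from
the literature [9], not restated in this excerpt:

    import math

    # OLSC check-bit count: t-error correction over m = k*k data
    # bits takes 2*t*k check bits (standard OLS-code property).
    def olsc_check_bits(m, t):
        k = math.isqrt(m)
        assert k * k == m, "OLSC needs a square number of data bits"
        return 2 * t * k

    # A 64-bit segment (k = 8) with 4-bit correction needs 64 check
    # bits, i.e., one full ECC way per data way as in Section 4.1.
    print(olsc_check_bits(64, 4))   # -> 64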
[Figure 3. Example of multi-bit segmented ECC with eight 64-bit
segments.]

References

Yuan Taur et al. Fundamentals of Modern VLSI Devices. Book.
Shu Lin and Daniel J. Costello, Jr. Error Control Coding, 2nd
Edition. Book.
Modeling the effect of technology trends on the soft error rate
of combinational logic. Proceedings article.
On a class of error correcting binary group codes. Journal article.
The impact of intrinsic device fluctuations on CMOS SRAM cell
stability. Journal article.