Improving Cache Lifetime Reliability at Ultra-low Voltages
Zeshan Chishti, Alaa R. Alameldeen, Chris Wilkerson, Wei Wu and Shih-Lien Lu
Oregon Microarchitecture Research, Intel Labs
ABSTRACT
Voltage scaling is one of the most effective mechanisms to
reduce microprocessor power consumption. However, the
increased severity of manufacturing-induced parameter variations
at lower voltages limits voltage scaling to a minimum voltage,
Vccmin, below which a processor cannot operate reliably.
Memory cell failures in large memory structures (e.g., caches)
typically determine the Vccmin for the whole processor. Memory
failures can be persistent (i.e., failures at time zero which cause
yield loss) or non-persistent (e.g., soft errors or erratic bit
failures). Both types of failures increase as supply voltage
decreases and both need to be addressed to achieve reliable
operation at low voltages.
In this paper, we propose a novel adaptive technique to
improve cache lifetime reliability and enable low voltage
operation. This technique, multi-bit segmented ECC (MS-ECC),
addresses both persistent and non-persistent failures. Like
previous work on mitigating persistent failures, MS-ECC trades
off cache capacity for lower voltages. However, unlike previous
schemes, MS-ECC does not rely on testing to identify and isolate
defective bits, and therefore enables error tolerance for non-
persistent failures like erratic bits and soft errors at low voltages.
Furthermore, MS-ECC’s design can allow the operating system to
adaptively change the cache size and ECC capability to adjust to
system operating conditions. Compared to current designs with
single-bit correction, the most aggressive implementation for MS-
ECC enables a 30% reduction in supply voltage, reducing power
by 71% and energy per instruction by 42%.
Categories and Subject Descriptors
B.3.4 [Memory Structures]: Reliability, Testing, and Fault-
Tolerance.
General Terms
Design, Reliability
1. INTRODUCTION
As semiconductor technology continues to scale, energy
efficiency is becoming the key design concern for computer
systems. Microprocessors often use multiple power modes to
exploit the power-performance tradeoff in order to improve
energy efficiency. Many processors (e.g., the Intel® Celeron®
processor [11]) have high-performance and low-power modes of
operation. In the high-performance mode, the processor uses a
high voltage and runs at a high frequency to achieve the best
performance. In the low-power mode(s), the processor runs at a
lower frequency and uses a lower voltage to conserve energy.
Such power saving features are becoming prevalent in current
processor designs.
Reducing supply voltage is one of the most effective
methods to reduce power consumption. However, as supply
voltage decreases, manufacturing-induced parameter variations
increase in severity, causing many circuits to fail. These variations
restrict voltage scaling to a minimum value, often called Vccmin
(or Vmin), which is the minimum supply voltage for a die to
operate reliably. Failures in memory cells typically determine the
Vccmin for a processor as a whole [31]. Reducing Vccmin in the
context of memory failures is important for enabling ultra-low
power modes that are more energy-efficient.
Prior work [21, 31] has proposed techniques to
enable ultra-low voltage cache operation in the context of high
memory cell failure rates. The proposed techniques trade off
cache capacity for reliable low voltage operation. In the high-
voltage mode of operation, cell failure rate is low and the entire
cache is available for use. During the low-power, low-voltage
mode, cache size is sacrificed to increase reliability. These
techniques enable a significant reduction in supply voltage.
However, they require conducting thorough memory tests at low
voltages to isolate defective bits. Memory tests must be performed
whenever the processor boots, and the location of defective bits
needs to be stored in the memory hierarchy. While these tests can
detect voltage-dependent, persistent bit failures, other non-
persistent sources of bit failures are dynamic in nature and cannot
be detected by testing. Examples of such non-persistent bit
failures include erratic bit failures [1] and soft errors. Non-
persistent failures also increase when supply voltage decreases as
we show in Section 3. Techniques like those proposed in [21, 31]
cannot mitigate non-persistent errors and therefore must rely on a
voltage guardband to enable reliable cache operation.
Other prior work addresses non-persistent, transient bit
failures at normal operating voltages. Examples of such work
include error-correcting schemes such as the two-dimensional
ECC proposal by Kim et al. [14]. These techniques effectively
tolerate multiple bit errors due to non-persistent faults. However,
prior work focused only on failure rates at normal operating
voltages, and did not address persistent failures at low voltages
which result in yield loss.
To enable a cache design that tolerates both non-persistent
and persistent failures at low voltages, we need a unified solution
that does not rely on testing. We propose using redundancy to
enable ultra-low voltage cache operation. Our solution, multi-bit
segmented ECC (MS-ECC), employs error correction codes to
tolerate both persistent and non-persistent bit failures at low
voltages. During low-voltage operation, some ways in each cache
set are used to store ECC check bits for the remaining ways,
thereby increasing reliability against high failure rates. The
number of ways used for storing ECC can be adaptively chosen
by the operating system on the basis of the desired reliability
level, which in turn depends on the target Vccmin. Increasing the
number of ECC ways increases the redundancy and thus the
reliability but decreases the cache capacity available during low-
voltage operation.
To simplify ECC implementation, MS-ECC divides a cache
line into multiple small segments and corrects errors on a per-
segment basis. Performing error correction at finer granularities
enables more errors to be fixed with lower latency and
complexity. To further reduce the logic complexity of error
correction, we leverage the previously proposed Orthogonal Latin
Square Codes (OLSC) [9]. OLSC enables faster encoding and
decoding than traditional ECC, at the cost of more check bits.
Furthermore, OLSC uses modular error correction hardware
which can be used, adaptively, to correct a varying number of
errors. This adaptive design can be used to trade off reliability for
performance in the low-voltage operating mode. If performance is
insensitive to cache size in the low-voltage operating mode as
shown in [31], then the design should target maximum reliability.
If performance for some applications is sensitive to cache
size, then the error correction capability can be sacrificed to
increase cache size.
This paper makes the following main contributions:
1. To our knowledge, our paper is the first to quantify the
impact of both persistent and non-persistent (erratic bits, soft
errors) failures on cache yield loss and lifetime reliability.
2. We propose a novel error tolerance mechanism, multi-bit
segmented ECC (MS-ECC), which uses Orthogonal Latin
Square Codes to reduce Vccmin by supporting multi-bit error
correction for small cache line segments and cache tags.
3. We propose an adaptive mechanism that enables a variable
part of the cache to be used for error correction. This
mechanism can correct 1-4 errors for each 64-bit segment,
where a higher correction capability increases reliability at
the expense of sacrificing a bigger percentage of the cache
size.
4. We show that the most aggressive implementation of MS-
ECC can achieve reliable cache operation at 520 mV,
incurring minimal additional latency, while sacrificing half
of the cache capacity at low voltages. Compared to previous
schemes, our proposal addresses both persistent and non-
persistent failures, and reduces the overhead of thorough
testing at boot time.
In the remainder of this paper, we discuss the impact of bit
failures on Vccmin in Section 2. We describe two types of non-
persistent failures and their impact on lifetime reliability in
Section 3. We discuss our proposed technique in detail in Section
4. We introduce the experimental methodology in Section 5 and
evaluate our technique in Section 6. We conclude in Section 7.
2. BACKGROUND
2.1 Bit Failures and Vccmin
Large SRAM caches make up a significant percentage of
transistors in a microprocessor die. Parameter variations, induced
by imperfections in the semiconductor process, limit the minimum
operational supply voltage to Vccmin, below which an SRAM cell
fails to operate reliably. For each of the SRAM caches, the bit
with the highest Vccmin determines the Vccmin of the cache as a
whole [31].
Bit failures can be classified into two broad categories:
persistent and non-persistent. The first category contains the
majority of bit failures, where bits exhibit persistent failing
behavior. Several papers [2, 15, 31] have analyzed persistent
failures in detail and have shown that intra-die random dopant
fluctuations (or RDF) play a primary role in these failures. Prior
work has also demonstrated that persistent failures exhibit a
strong dependence on supply voltage. For example, Kulkarni et al.
[15] show that the bit failure rate increases exponentially with a
decrease in voltage. Since these types of failures can be reliably
identified using standard memory testing methodologies, we refer
to them as testable failures. Memory tests are performed on each
die before it is shipped to a customer, and dies with irreparable
failures identified by the memory tests are disposed of. As a
result, persistent failures typically contribute to yield loss but do
not play a direct role in determining the lifetime reliability of a
microprocessor. The lifetime reliability of a processor is usually
represented by FIT (Failures In Time), the number of failures that
occur in a billion hours for a particular unit.
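As a quick illustration of the conversion (a sketch in our own
notation; the 250,000-hour example reappears in Section 3.3):

    # Convert a mean interval between failures into a FIT rate,
    # i.e., failures per 10^9 device-hours.
    def fit_rate(hours_per_failure):
        return 1e9 / hours_per_failure

    # One cache failure every 250,000 hours corresponds to 4000 FIT,
    # the lifetime-reliability target used later in this paper.
    assert fit_rate(250_000) == 4000.0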
The second category of failures consists of non-persistent bit
failures, where bits exhibit sporadic failing behavior. Failures
resulting from particle strikes (soft errors) as well as erratic bit
failures, both discussed in greater detail in Section 3, are examples
of this category. Since these failures are non-persistent and occur
randomly, they can’t be identified with memory tests. As a result,
these failures don’t directly contribute to yield loss, instead
contributing directly to a unit’s FIT rate. We classify failures of
this type as non-testable failures.
2.2 Related Work
One solution to improve cache reliability is to implement
true column and/or row redundancy by adding multiple spare
rows and/or columns to the cache array [24]. This solution is able
to tolerate errors that cause a few rows or columns to become
defective. However, it cannot deal with thousands of randomly
distributed cache bits in large caches which would become
defective in the low voltage mode due to high cell failure rates.
Kim et al. [14] propose a scheme to use two-dimensional
ECC to correct multi-bit errors. This scheme is tailored to deal
with clustered multi-bit failures that cause contiguous bits across
multiple rows and columns to fail concurrently. The ability of this
scheme to correct errors is strongly dependent upon the location
of defective bits. While this scheme is able to tolerate correlated
failures that affect several contiguous bits, it cannot tolerate
failures in multiple randomly distributed bits in each cache line
that become defective at low voltages due to random dopant
fluctuations.
Another solution to improve cache reliability at low voltages
is to change the SRAM cell design. The designer can upsize the
transistors or use cell design variants such as the 8T, 10T, and ST
SRAM cells [15]. However, the resulting Vccmin reduction
comes at the cost of significant increases in area (e.g., 100% area
increase for the ST cell). Furthermore, larger SRAM cells result in
increased leakage power in both high-performance and low-power
modes.
A recent paper [31] proposed two architectural schemes to
enable cache operation at ultra-low voltages. The first scheme,
word-disable, disables 32-bit words that contain one or more
defective bits. Physical lines in two consecutive ways combine to
form one logical line, where only non-failing words are used. A
similar scheme was proposed by [21]. The second scheme, bit-fix,
uses a quarter of the cache ways to store the location of defective
bits in other cache ways, as well as the correct value of these bits.
That work focused on testable, voltage dependent, persistent
failures, and evaluated Vccmin in the context of yield loss.

Vccmin was defined as the voltage at which the cache is
functional in 999 of every 1000 dies. However, the paper did not
evaluate the impact of FIT rate on Vccmin, although the authors
acknowledged that additional ECC and supply voltage guardbands
might be required. In comparison, our work addresses both
persistent failures and non-persistent failures, and extends the
previously proposed failure probability model to account for the
impact of FIT rates on Vccmin.
Because several previous papers [2, 15, 31] have discussed
the causes of persistent cell failures in detail and attributed them
to random intra-die dopant fluctuations (RDF), we omit the
discussion of persistent failures and instead focus on non-
persistent bit failures in the next section.
3. NON-PERSISTENT BIT FAILURES
3.1 Soft Errors
Soft errors have increased in significance in recent years due
to the increasing number of devices per die, while the soft error
rate per device has remained constant or slightly decreased across
technology generations [8, 16, 25]. This increase, as well as future
expected increases in soft error rates (SER), triggered research to
mitigate the impact of soft errors on the correctness of a program
execution. Computer architecture research has recently focused on
providing architecture-level solutions to mitigate soft errors [2,
19, 26, 30, 32].
Our target of running a processor in a low-power mode at
low voltages further exacerbates the soft error problem. Previous
measurement studies have shown that reducing supply voltage
increases the soft error rate exponentially [12, 13, 22, 29]. These
measurements for SRAM cells and flip-flop designs from multiple
vendors all confirm that soft error rates will increase at lower
voltages. This increase is caused primarily by the exponential
relationship between the soft error rate and the charge stored at a
particular node, which in turn changes linearly with supply
voltage [29].
In our studies, we used the data measured by Ünlü et al. [29]
to estimate the soft error rate per SRAM bit. We extrapolated this
data to lower voltages. However, since this data was measured by
inducing neutrons from a nuclear reactor, we scaled the soft error
rates by a factor of one billion to estimate the soft error rate under
normal conditions at sea-level [33].
While previous measurements show that soft error rates
increase exponentially with reduction in supply voltage, the rate
of increase is limited to 2.5x-3x for every 500mV decrease in
supply voltage. This soft error rate increase is much lower than
the increase in persistent failures, where a similar decrease in
supply voltage leads to an increase in failure probability of more
than a billion times [15, 31]. Figure 1 compares the probability of
different types of cell failures as a function of supply voltage. The
persistent failure probabilities in Figure 1 are based on results
reported in [15] which were obtained with circuit simulations
validated against measured data. Compared to voltage-dependent,
persistent failures, Figure 1 shows that soft errors are more
significant at higher voltages. At low voltages, however,
persistent and erratic failures significantly overshadow soft errors.
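The contrast between the two scaling rates can be made concrete
with a minimal sketch; the per-500mV factors come from this
section, while the multiplicative extrapolation and the names are
illustrative assumptions:

    # Extrapolate a failure rate from a reference voltage v0 down to
    # v, multiplying by `factor_per_500mv` per 500 mV of reduction.
    def scale_rate(rate_at_v0, v0, v, factor_per_500mv):
        steps = (v0 - v) / 0.5
        return rate_at_v0 * factor_per_500mv ** steps

    # SER grows by ~2.5x-3x per 500 mV of voltage reduction, while
    # persistent failures grow by more than ~1e9x over the same span.
    ser_growth = scale_rate(1.0, 1.0, 0.5, 3.0)           # ~3x
    persistent_growth = scale_rate(1.0, 1.0, 0.5, 1e9)    # ~1e9x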
3.2 Erratic Bit Failures
Erratic bit failures have played a key role in setting Vccmin
in the past, and they are likely to re-emerge as a reliability
concern in the future [1, 6]. Erratic behavior in the Vccmin of an
SRAM cell can occur when an NMOS pull down device in an
SRAM cell experiences soft breakdown. In soft breakdown,
random telegraph signal noise in the gate oxide leakage of the
NMOS device can cause the SRAM cell Vccmin to erratically
fluctuate by as much as 200mV [1]. Due to their random nature,
erratic cells may escape standard testing and appear as normal
cells, but may cause bit failures later. Erratic behavior in SRAM
cells depends strongly on process parameters. Agostinelli et al.
[1] report discovering erraticism on the 90nm process technology
node, but were able to mitigate it successfully. However, the
authors point out that continued device scaling is likely to re-
introduce erraticism in future process technologies.
Detailed information on erratic bit failures is scarce and good
physical models are non-existent. Despite the lack of good
physical models, it is important to consider erratic bit failures
especially when considering operation at low voltages. In our
studies, we use a very simple model for erratic bit failures and
address the sensitivity of erratic bit failures to process parameters
by modeling two hypothetical processes with both high and low
rates of erraticism. Since cell stability and oxide strength play a
role in both erratic bit failures and persistent failures, we expect
that the probability of an erratic bit failure will be proportional to
the probability of a voltage–dependent, persistent failure. Our
studies reflect this by setting the probability of an erratic bit
failure as a fixed percentage of persistent failures.

[Figure 1. Probability of persistent failures, soft errors (SER),
and erratic failures per hour vs. Vcc, for low-erratic and
high-erratic processes. Y-axis: Pfail (probability of failure),
log scale from 1.00E-15 to 1.00E+00; X-axis: Vcc (V), 0.4 to 0.85.]

Furthermore, we expect that the frequency and severity of erratic
bit failures
will be highly process-dependent. To reflect process sensitivity,
we model both a high-erratic process with a high rate of erratic bit
failures, and a low-erratic process with a low rate of erratic bit
failures. Our high-erratic process produces one erratic bit failure
for every ten persistent failures. The probability of an erratic
failure on our low-erratic process is 1000 times lower, or one
erratic failure for every 10,000 persistent failures. We model the
frequency and duration of erratic bit failures in the same way that
we model soft error failures; the probability of an erratic bit
failure reflects the probability of an erratic failure in an hour; and
we assume that the erratic failure (or soft error) lasts an hour.
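A minimal sketch of this erratic-bit model (the ratios are from
the text; the function and names are ours):

    # Hourly probability of an erratic bit failure, modeled as a
    # fixed fraction of the persistent-failure probability.
    ERRATIC_RATIO = {"high": 1 / 10, "low": 1 / 10_000}

    def p_erratic_per_hour(p_persistent, process):
        # One erratic failure per 10 persistent failures on the
        # high-erratic process; one per 10,000 on the low-erratic one.
        return ERRATIC_RATIO[process] * p_persistent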
Figure 1 shows the probability of different types of failures
as a function of voltage. We note that since we model erratic
failures as a fixed proportion of persistent failures, the probability
of erratic failures is sensitive to supply voltage. It is also worth
noting that since Figure 1 shows different failure rates on the
same Y-axis, the increase in SER (2.5x – 3x for 500 mV decrease
in supply voltage) appears flat relative to the much larger increase
in persistent errors (higher than one billion times for 500 mV
decrease in supply voltage).
3.3 Comprehensive Yield Loss/FIT Model for
Cache Failures
To account for the impact of FIT rate on Vccmin, we use a
comprehensive model for cache failures that includes the impact
of voltage–dependent, persistent failures on yield loss as well as
the impact of soft error rates (SER) and erratic bits on FIT rate.
Our model for yield loss is similar to the model evaluated in [31],
but we extend the model with a time component to model failures
in time. To calculate the FIT rate of our cache, we divide the
cache lifetime into discrete 1-hour periods and compute the
probability of a bit failing during each 1-hour period.
We conservatively assume that SER failures in the cache will
last an hour before the bit is either rewritten or scrubbed. We
assume the same duration for erratic bit failures. Figure 1 shows
the probability of SER and erratic failures per hour, and the
probability that a bit suffers from a voltage dependent persistent
failure. Using these three probabilities, we can determine the
overall probability that a bit will fail for one of these three reasons
in a 1-hour period. For longer periods (e.g., 250 hours), we
combine the probabilities of failure for each hour to get the
overall probability of failure. The probability of failure for the
entire cache is (1 - probability of success), where the probability
of success is the probability that each bit in the cache stays
failure-free for every hour in the period. Figure 2 shows the pfail
of a base 2MB cache as derived from our model. There are two
pfail curves for each cache type: without ECC (BASE-YL, BASE-FIT)
and with ECC (1-ECC-YL, 1-ECC-FIT). The line marked BASE-YL refers
to the pfail of the cache due solely to persistent failures. A
pfail of 10^-3 corresponds to a yield loss of 0.1%. A separate
line, marked BASE-FIT, indicates the probability that the cache
will fail during a 250-hour period. A pfail of 10^-3 on that line
corresponds to one failure every 250,000 hours, or a FIT rate of
4000 failures in a 10^9-hour period. In the remainder of our
paper, we target a yield loss of 0.1% as suggested in [31] and a
FIT rate of 4000 as suggested in [26]. To meet these requirements,
both FIT and YL pfails must be less than or equal to 10^-3.
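The model lends itself to a short sketch. The code below follows
the description above literally, combining all three failure types
in each 1-hour period; it is our rendering, not the authors' code
(a 2MB cache holds 2 * 2^20 * 8 data bits):

    # Probability that a single bit fails, for any of the three
    # reasons, during one 1-hour period.
    def p_bit_fail_hour(p_persistent, p_ser, p_erratic):
        return 1 - (1 - p_persistent) * (1 - p_ser) * (1 - p_erratic)

    # The cache fails unless every bit stays failure-free for every
    # hour in the period: pfail = 1 - p(success).
    def cache_pfail(p_persistent, p_ser, p_erratic, n_bits, hours):
        p_hour_ok = 1 - p_bit_fail_hour(p_persistent, p_ser, p_erratic)
        return 1 - p_hour_ok ** (n_bits * hours)

    # Example: a 2MB cache over the 250-hour window used for BASE-FIT.
    print(cache_pfail(1e-12, 1e-15, 1e-13, 2 * 2**20 * 8, 250))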
Data in Figure 2 shows that although the simple 2MB cache
(BASE) meets yield loss requirements at 830mV, FIT
requirements are not met. Adding 1-bit ECC to the cache enables
it to meet yield loss requirements at 670mV and FIT requirements
at 725mV. Since both FIT requirements and yield loss
requirements must be met, Vccmin for this cache would be set at
725mV. In the next section, we propose a technique to reduce
Vccmin further.
4. MULTI-BIT ERROR CORRECTION FOR
VCCMIN REDUCTION
Redundancy is the most widely used approach to increase
reliability. There are several options to implement redundancy.
Employing error detection and correction codes is one of the
methods that improve the reliability of a system through
information redundancy. Single Error Correction, Double Error
Detection (SECDED) codes such as the Hamming code [17] are
well known and have been used in memory chips and on-chip
caches due to their simplicity. With continued transistor scaling,
multi-bit error correcting codes are becoming more important for
large on-chip memory arrays. For example, a Double Error
Correction Triple Error Detection (DECTED) code can be
designed based on a binary BCH code [4]. While SECDED and
DECTED codes provide sufficient redundancy to deal with bit
failures at normal operating voltages, neither technique provides
enough redundancy to allow cache operation at an ultra-low voltage
level (e.g., 500 mV).

[Figure 2. FIT rate and yield loss (YL) for a baseline 2 MB cache
and a cache with SECDED ECC. Y-axis: Pfail (probability of
failure), log scale from 1.00E-15 to 1.00E+00; X-axis: Vcc (V),
0.4 to 0.9; curves: BASE-YL, BASE-FIT, 1-ECC-YL, 1-ECC-FIT.
Annotations: the BASE cache meets YL requirements but not FIT
requirements; an arrow marks the guardband shift due to
low-erratic failures and SER.]

Wilkerson et al. [31] showed that
enabling 500mV operation in a 2MB cache requires 10-bit error
correction for each 512-bit cache line. Extending conventional
ECC to correct 10-bit errors over an entire 512-bit cache line
requires significant area, latency, and power overheads [5, 14],
which we briefly describe later in this section. We next propose
our multi-bit segmented ECC that achieves the same purpose with
lower overhead.
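For a sense of scale, a standard BCH sizing bound (an assumption
on our part, not a figure from this paper) puts t-error correction
over an n-bit block at roughly t * ceil(log2(n+1)) check bits,
with decoder complexity that also grows with t:

    import math

    # Approximate BCH check-bit count for t-error correction over an
    # n-bit block (standard bound; a rough sizing aid only).
    def bch_check_bits(n, t):
        return t * math.ceil(math.log2(n + 1))

    # The 10-bit-per-512-bit-line requirement of [31] would take
    # about 100 check bits per line, before counting the multi-bit
    # BCH decoding logic itself.
    print(bch_check_bits(512, 10))   # -> 100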
4.1 Multi-bit Segmented ECC
To enable correcting a large number of bits with manageable
complexity and latency overhead, we propose multi-bit segmented
ECC (MS-ECC). In this technique, we trade off cache capacity for
reliability at low voltage. The main idea behind MS-ECC is to
correct multi-bit errors by implementing error correction at finer
granularity segments within a cache line. In the high-voltage,
high-performance operating mode, the entire cache capacity is
available for use by the processor, and conventional ECC is used.
In the low voltage, low power operating mode, a portion of the
cache is used to store additional ECC information on granularities
finer than a cache line, thereby enabling more errors to be fixed
on a per-line basis. Because the size of each segment is smaller
than the entire cache line, the latency and complexity for MS-
ECC is significantly less than conventional (un-segmented) ECC.
Example: Consider a 2MB 8-way L2 cache with 64-byte
lines. In the low voltage mode, we divide the eight physical ways
in each set amongst (i) data ways and (ii) ECC ways. The ratio of
data ways to ECC ways depends on the desired reliability level,
which in turn depends on the target Vccmin. If the operating
system decides to operate at a higher reliability level, this ratio
would be adaptively increased to increase the redundancy and
decrease the cache capacity available during low voltage
operation. We analyze the impact of this ratio on Vccmin and
performance in Section 6.2. For this example, assume that there is
one ECC way for every data way (50% cache capacity available
in low voltage mode). We use a fixed mapping to associate data
ways with their corresponding ECC ways (Figure 3(a)): physical
way 1 stores the ECC for physical way 0, physical way 3 stores
the ECC for physical way 2, and so on. Thus, in the low-voltage
mode, the cache effectively becomes a 1MB 4-way set-associative
cache.
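A sketch of this fixed mapping (constants and names are ours):

    # Fixed data-way -> ECC-way pairing for the 8-way example: each
    # odd physical way stores the ECC bits of the preceding even way.
    N_WAYS = 8

    def ecc_way_for(data_way):
        assert data_way % 2 == 0, "only even ways hold data in this mode"
        return data_way + 1

    data_ways = [w for w in range(N_WAYS) if w % 2 == 0]  # 0, 2, 4, 6
    # With half the ways holding ECC, the 2MB 8-way cache behaves as
    # a 1MB 4-way set-associative cache in the low-voltage mode.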
We divide each data way into multiple segments and store
the ECC for each segment in the corresponding ECC way. Figure
3 shows an example of multi-bit segmented ECC with eight 64-bit
segments in each 512-bit line. On a read hit (Figure 3(b)), we
fetch both the data line and the corresponding ECC line. There are
separate ECC decoders for each of the eight segments that decode
segments in parallel by using information from both the data and
ECC ways. The decoded segments are then concatenated to obtain
the entire 512-bit line. On a write hit to the L2 cache (Figure
3(c)), we first use the ECC encoders to obtain the ECC for the
data line. Like the ECC decoders, there are separate encoders for
each segment that perform ECC encoding in parallel. We then
write the new data to the data line and the new ECC to the
corresponding ECC line. A similar encoding is performed when a
new line is brought into the L2 cache upon a cache miss. We note
that each cache access in the low-voltage mode requires access to
both the data way and the ECC way. To avoid increasing cache
latency, we assume double-width buses for concurrent transfer of
the data and ECC ways to the ECC encoder/decoder.
Alternatively, one could also read the two cache lines
sequentially, thereby incurring additional latency overhead in the
low-voltage mode.
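The per-segment dataflow can be sketched as follows;
decode_segment and encode_segment stand in for the OLSC logic of
Section 4.2 and are placeholders, not real implementations:

    SEG_BITS, SEGS_PER_LINE = 64, 8   # eight 64-bit segments per line

    def read_line(data_segs, ecc_segs, decode_segment):
        # Read hit: fetch the data line and its ECC line, correct each
        # 64-bit segment independently (in parallel in hardware), then
        # concatenate the corrected segments into the 512-bit line.
        return [decode_segment(d, e) for d, e in zip(data_segs, ecc_segs)]

    def write_line(new_segs, encode_segment):
        # Write hit (or fill on a miss): encode each segment, then
        # write the new data and the new ECC to their respective ways.
        ecc_segs = [encode_segment(d) for d in new_segs]
        return new_segs, ecc_segs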
4.2 Orthogonal Latin Square Codes (OLSC)
Performing error correction at finer granularities helps in
decreasing the complexity of multi-bit error correction. However,
our evaluation shows that enabling ultra-low voltage operation
requires three or more errors to be fixed per segment, even for
small segment sizes. Conventional error correction codes based on
BCH codes usually fix one (SECDED) or two errors (DECTED).
These codes have been optimized for storage overhead (i.e., the
number of check bits).
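By contrast, an Orthogonal Latin Square code over m = k*k data
bits corrects t errors with 2*t*k check bits and one-step
majority-logic decoding; this property of OLS codes is taken from
the literature [9], not restated in this excerpt:

    import math

    # OLSC check-bit count: t-error correction over m = k*k data
    # bits takes 2*t*k check bits (standard OLS-code property).
    def olsc_check_bits(m, t):
        k = math.isqrt(m)
        assert k * k == m, "OLSC needs a square number of data bits"
        return 2 * t * k

    # A 64-bit segment (k = 8) with 4-bit correction needs 64 check
    # bits, i.e., one full ECC way per data way as in Section 4.1.
    print(olsc_check_bits(64, 4))   # -> 64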
[Figure 3. Example of multi-bit segmented ECC with eight 64-bit
segments.]

References

Yuan Taur et al. Fundamentals of Modern VLSI Devices. Book.
Shu Lin and Daniel J. Costello, Jr. Error Control Coding, 2nd
Edition. Book.
Modeling the effect of technology trends on the soft error rate
of combinational logic. Proceedings article.
On a class of error correcting binary group codes. Journal article.
The impact of intrinsic device fluctuations on CMOS SRAM cell
stability. Journal article.