Reducing Cache Power with Low-Cost, Multi-bit
Error-Correcting Codes
Chris Wilkerson, Alaa R. Alameldeen, Zeshan Chishti,
Wei Wu, Dinesh Somasekhar, and Shih-Lien Lu
Intel Labs
Hillsboro, Oregon, USA
{chris.wilkerson, alaa.r.alameldeen, zeshan.a.chishti, wei.a.wu, dinesh.somasekhar, shih-lien.l.lu} @intel.com
ABSTRACT
Technology advancements have enabled the integration of large
on-die embedded DRAM (eDRAM) caches. eDRAM is
significantly denser than traditional SRAMs, but must be
periodically refreshed to retain data. Like SRAM, eDRAM is
susceptible to device variations, which play a role in determining
refresh time for eDRAM cells. Refresh power potentially
represents a large fraction of overall system power, particularly
during low-power states when the CPU is idle. Future designs
need to reduce cache power without incurring the high cost of
flushing cache data when entering low-power states.
In this paper, we show the significant impact of variations on
refresh time and cache power consumption for large eDRAM
caches. We propose Hi-ECC, a technique that incorporates multi-
bit error-correcting codes to significantly reduce refresh rate.
Multi-bit error-correcting codes usually have a complex decoder
design and high storage cost. Hi-ECC avoids the decoder
complexity by using strong ECC codes to identify and disable
sections of the cache with multi-bit failures, while providing
efficient single-bit error correction for the common case. Hi-ECC
includes additional optimizations that allow us to amortize the
storage cost of the code over large data words, providing the
benefit of multi-bit correction at the same storage cost as a single-bit
error-correcting (SECDED) code (2% overhead). Our proposal
achieves a 93% reduction in refresh power vs. a baseline eDRAM
cache without error correcting capability, and a 66% reduction in
refresh power vs. a system using SECDED codes.
Categories and Subject Descriptors
B.3.4 [Memory Structures]: Reliability, Testing, Fault-
Tolerance.
General Terms
Design, Reliability, Power.
Keywords
ECC, Multi-Bit ECC, DRAM, eDRAM, refresh power, Vccmin.
1. INTRODUCTION
Advances in technology scaling have led to dramatic yearly
improvements in on-die cache capacity. New process
technologies have also enabled integrating DRAM on a logic
process, leading to the use of embedded DRAM (eDRAM) to
build on-die caches that are much denser than SRAM-based
caches (e.g., IBM Power 7 [14]). However, a side effect of
technology scaling is the increasing susceptibility of cache
structures to device variations [1, 27], where a few weak cells can
constrain the operating range of the whole cache.
In traditional SRAM caches, intrinsic variations force operation at
high voltages due to a few weak cells that fail at lower voltages,
and impede efforts to reduce power [29, 33]. Likewise, in
eDRAM caches, device variations affect the retention time of
individual DRAM cells, with a few particularly weak bits
determining the refresh time of the whole cache. A high refresh
rate significantly increases cache power.
Reducing power consumption is a first-order design constraint for
modern processors. In pursuit of improved power and energy
efficiency, processors implement a number of idle states to
support lower power modes. Reducing the power consumed
during idle states is particularly important because the typical
CPU spends the vast majority of its time in idle state. Many
desktop applications, such as word processors and spreadsheets,
spend much of the time waiting for I/O and tend to require the
CPU to operate only 10-20% of the time during use. Studies done
on the Intel® Core™/Core™ 2 Duo show that an idling processor
will consume an average of 0.5W-1.05W [24] depending on the
processor and frequency of idle state exits caused by events like
OS interrupts. Our analysis projects that a future processor with
128MB of eDRAM cache will consume 926mW just refreshing
the eDRAM. Based on these power numbers, we project that the
power consumption of large memory structures, like eDRAM
caches, will be the biggest contributor to overall idle power.
One popular method to reduce cache power is to power-gate large
blocks of memory at the cost of losing its state [6]. But as cache
density increases, the performance and power costs of this
approach also increase. One of the important goals in
implementing idle states is reducing power consumption, while
minimizing the transition latency into and out of the idle state. In
the Intel® Core™ 2, a transition out of the C4 idle state can take
about 100-200us [9]. In contrast, if the state in a 128MB eDRAM
cache were sacrificed to save power, it would take 4.2ms to re-
fetch the 128MB of lost data assuming full usage of the 30GB/s
bandwidth provided by system memory. In future products with
denser eDRAM caches, navigating the tradeoffs between idle exit

latency and idle power consumption will become increasingly
difficult. As the capacity of embedded memory grows, the
performance and power cost of flushing this memory also grows.
A key challenge for future product designers is to enable flexible
memory structures that can operate at very low idle power,
without dramatically increasing transition latency to and from the
idle power state due to data loss.
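The 4.2ms re-fetch figure quoted above follows directly from the cache size and the memory bandwidth. A quick check (assuming binary units for both quantities):

```python
# Time to refill a flushed 128MB cache at the full 30GB/s of system
# memory bandwidth (binary units assumed for both figures).
cache_bytes = 128 * 2**20
bandwidth_bytes_per_s = 30 * 2**30
print(f"{1e3 * cache_bytes / bandwidth_bytes_per_s:.1f} ms")  # -> 4.2 ms
```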
In this paper, we evaluate a modern processor with a 128MB
eDRAM cache. We show that refresh time plays the key role in
determining the eDRAM’s power. We first explore the role of
variation-related cell failures in determining refresh time. We then
evaluate the potential for using error-correcting codes (ECC) to
mitigate refresh-related failures. Augmenting eDRAM with error-
correcting codes enables reliable cache operation with longer
refresh periods, thereby lowering system power.
We propose Hi-ECC, a practical, low-latency, low-cost, error-
correcting system that can compensate for high failure rates in
eDRAM caches. Hi-ECC implements a strong BCH code with the
ability to correct 5 and detect 6 errors (hereafter referred to as a
5EC6ED code). A traditional approach using strong ECC suffers
from two prohibitive overheads that limit its applicability. First,
building a low-latency decoder for multi-bit ECC codes is
extremely costly. Second, the storage overhead of ECC bits is
high (around 10% for a 5EC6ED ECC code for a 64 byte line).
Hi-ECC proposes architectural solutions to both problems. It uses
a simple ECC decoder optimized for the 99.5% of the lines that
require little or no correction, and provides a high latency
alternative for lines that require complex multi-bit correction. To
minimize the performance impact of processing high latency
multi-bit corrections, Hi-ECC disables lines with multi-bit
failures. Finally, Hi-ECC leverages the natural spatial locality of
the data to reduce the cost of ECC storage. We make the
following main contributions:
1. We demonstrate that device variations lead to significant
increases in cache refresh rates.
2. We propose Hi-ECC, a practical system for using strong
error correcting codes that avoids decoder complexity and
latency.
3. We show how Hi-ECC can be extended to reduce the
storage overhead of the error-correcting codes by amortizing the
cost of the code over larger data words. This allows implementing
Hi-ECC with a 2% storage overhead, comparable to that of a
single error correcting code (SECDED) over 64 byte lines.
4. For a system with a 128 MB eDRAM cache, we show that
Hi-ECC can reduce cache refresh power by 93% compared to an
eDRAM with no error correction capability, and 66% compared
to an eDRAM with SECDED, all for about the same storage
overhead as a SECDED code. When accounting for dynamic
power, an optimized Hi-ECC reduces total power by 61% relative
to SECDED.
The remainder of this paper is organized as follows. In Section 2,
we review some of the design tradeoffs for eDRAM caches,
including a discussion of retention failures and previous work on
mitigating them. Section 3 describes our proposed Hi-ECC
architecture, and is followed by a review of the mathematical
properties of BCH codes that Hi-ECC relies on in Section 4. We
describe our evaluation methodology and results in Section 5 and
conclude in Section 6.
2. BACKGROUND
Embedded DRAM technology enables smaller memory cells as
compared to SRAM cells, resulting in a three to four times
increase in memory density [21]. The higher density of eDRAM
makes it a promising candidate to replace SRAM as the last-level
on-chip cache in future high performance processors. IBM has
recently announced that its upcoming Power7 processor will use a
32 MB on-chip eDRAM cache [14]. As feature sizes continue to
decrease, even larger eDRAM caches can be incorporated on
chip. In this paper, we model a 128 MB eDRAM cache, two
technology generations ahead of IBM Power7’s eDRAM cache.
One of the main problems with eDRAM cells is that they lose
charge over time due to leakage currents. The retention time of an
eDRAM cell is defined as the length of time for which the cell
can retain its state. Cell retention time depends on the leakage
current, which, in turn, depends on the access device leakage. To
preserve the state of stored data, eDRAM cells need to be
refreshed on a periodic basis. In order to prevent failures, the
refresh period needs to be less than the cell retention time.
Because eDRAM uses fast logic transistors with a higher leakage
current than conventional DRAM, the refresh time for eDRAM is
about a thousand times shorter than conventional DRAM. For
example, Barth, et al. [2] report the refresh time of a small
eDRAM to be 40us as compared to 64ms refresh time in
commodity DRAM [22]. This low refresh time poses serious
problems for eDRAM because it not only increases the idle
power, but also leads to reduced availability.
Previous work has
shown that variations in threshold voltage cause retention times of
different DRAM cells to vary significantly [8, 10, 11, 18]. These
variations are caused predominantly by random dopant
fluctuations and manifest themselves as a random distribution of
retention times amongst eDRAM cells. We use the data published
in [18] to model the retention time distribution in Figure 1.
[Figure omitted: probability of failure (log scale, 1E-21 to 1E+00) vs. refresh time (0-500 usec) for the pfbit and pfCache curves; the 30us baseline refresh time is marked.]
Figure 1: eDRAM retention time distribution
The pfbit curve in Figure 1 represents the probability of a
retention failure in a single bit cell (derived from Figure 3 in [18])
and the pfCache curve represents the failure probability of a
128MB eDRAM cache for different refresh times. Our model
assumes that bit retention failures are distributed randomly
throughout the eDRAM cache, consistent with [18]. A cache
containing even a single failure must be discarded. Therefore, the

probability of failure for the entire cache is (1 – probability of
success), where the probability of success is the probability that
each bit in the cache stays failure-free. We assume that the
pfCache must be kept at less than 1 out of 1000 to achieve
acceptable manufacturing yields [33]. Under these assumptions,
data in Figure 1 shows that the refresh time for a baseline 128MB
eDRAM cache is 30 microseconds, close to the 40 microseconds
refresh time reported in [2] for a 13.5Mbit eDRAM macro.
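The pfCache curve is derived from pfbit under the independence assumption stated above. The following is a minimal sketch of that yield model, with pf_bit left as a placeholder rather than the measured data from [18]:

```python
import math

# Cache-level yield model: bit retention failures are assumed independent
# and uniformly distributed, and a cache containing even one failing bit
# must be discarded.
CACHE_BITS = 128 * 2**20 * 8  # 128MB of data bits

def pf_cache(pf_bit: float, n_bits: int = CACHE_BITS) -> float:
    """1 - P(every bit in the cache stays failure-free)."""
    return 1.0 - (1.0 - pf_bit) ** n_bits

# Largest per-bit failure probability meeting the 1-in-1000 yield target:
pf_bit_max = 1.0 - math.exp(math.log(1.0 - 1e-3) / CACHE_BITS)
print(f"pf_bit <= {pf_bit_max:.2e}")                   # ~9.3e-13
print(f"check: pfCache = {pf_cache(pf_bit_max):.2e}")  # ~1.0e-3
```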
Refresh mechanisms in eDRAM designs typically use a single
worst case refresh period dictated by the cell with the lowest
retention time. As eDRAM capacity increases in future
generations, eDRAM idle power, dominated by refresh, will
grow. Some previous papers have proposed hardware mechanisms
to exploit retention time variations by refreshing different DRAM
cells at different refresh rates [16, 26]. Venkatesan, et al. [32]
proposed a software mechanism that allocates DRAM pages with
longer retention time before allocating pages with shorter
retention time, and then chooses a refresh period that is
determined only by the populated pages instead of the entire
DRAM. These approaches require additional storage to track
retention times and rely on memory tests to identify marginal bit
cells. In [33], Wilkerson et al. propose the bit-fix algorithm,
another testing-based approach, to address the problem of high
failure rates in the context of Vccmin reduction in SRAM caches
instead of DRAM refresh time. Since test time grows
proportionately with the capacity of the memory being tested,
increasing cache capacities may limit the applicability of all
testing-based approaches.
Ghosh and Lee [10] recently proposed a SmartRefresh technique
to reduce refresh power by adding timeout counters in each
DRAM row and avoiding unnecessary refreshes for those rows
which were recently read or written. However, SmartRefresh is
ineffective during the idle mode when the cache is not being
accessed, and therefore does not improve idle power.
Another promising approach to increase DRAM refresh times is
the use of error-correcting codes (ECC) to dynamically identify
and repair bits that fail [8, 15]. This approach sets refresh time
irrespective of the weakest bits, using ECC to compensate for
failures. With this approach, a stronger error-correcting code, with
the ability to correct multiple bits, implies increased refresh time
and reduced power. However, strong ECC codes have a high
storage and complexity overhead, which limits their applicability.
In the following two sections, we propose an architectural
mechanism that uses strong ECC codes with a low storage and
complexity overhead.
3. STRONG ECC ARCHITECTURE
When designing a large eDRAM cache, a designer strives to
minimize eDRAM power consumption in the low-power
operating modes without penalizing performance in the normal
operating mode. To achieve this objective, we propose Hi-ECC,
which implements a multi-bit error-correcting code with very
small area, latency, and power overheads.
We propose a system with a large (128MB) eDRAM last level
cache. In a baseline configuration with no error correction
capability, the time between refreshes for such a cache will be 30
microseconds, leading to a significant amount of power consumed
even when the processor is idle. Refresh power can be reduced by
flushing and power gating the cache during the low-power
operating modes. This, however, causes a significant performance
penalty when the processor wakes up from the idle mode since it
will need to reload the cache, thereby incurring a large number of
cold start misses. Alternatively, we can lower refresh power
consumption by decreasing the refresh frequency (i.e., increasing
time between refreshes). However, as we show in Figure 1,
decreasing refresh frequency implies the need to tolerate a higher
number of failures for each cache line. Implementing a strong
error-correcting code with the capability to correct multiple errors
is necessary to achieve this goal.
At the core of Hi-ECC is a strong 5EC6ED (five bit correction,
six bit detection) BCH code. We explain implementation details
for BCH codes in Section 4. Traditional implementations of a
5EC6ED BCH code would suffer from two key drawbacks: high
storage overhead for the code itself, and high decoder complexity
and latency. In this section, we describe how Hi-ECC addresses
both of these drawbacks. Since our implementation requires
architectural changes that would increase dynamic power, we also
propose an architectural optimization to lower the impact on the
cache dynamic power.
3.1 Reducing Storage Overhead
The storage required for a 5EC6ED code for a 64B cache line is
51 bits, a 10% overhead. Since the cache occupies a large portion
of the die area (50% or higher), augmenting the eDRAM cache
with 5EC6ED code will significantly increase the die area and
cost. In contrast, a single error correcting, double error detecting
(SECDED) code for a 64B line requires 11 bits, an overhead of
around 2%. Our goal is to implement the 5EC6ED code with the
same storage overhead as the SECDED code.
To achieve this goal, we leverage two important properties. First,
the size of an ECC code relative to that of the data word
diminishes as the size of the data word grows, as we show in
Section 4.3. While a SECDED code for a 64B line has an 11-bit
overhead (2%), a SECDED code for a 1KB line has a 15-bit
overhead (0.18%). Second, the efficacy of a code only diminishes
slightly as the size of the data word increases. In Figure 2, we
show the failure probability (i.e., the probability that the line will
have more failures than those correctible by ECC) for three
different codes: SECDED on a 64B line, SECDED on a 1KB line,
and double error correcting, triple error detecting code
(DECTED) on a 1KB line. At very low refresh times, failure rates
of the SECDED 64B line and the SECDED 1KB line are very
close. DECTED-1KB (with only a 29-bit overhead, 0.35%) has a
lower failure probability than both SECDED codes, except at high
refresh times, such as 500us, where it is very close to the
SECDED-64B code. By choosing a stronger code and amortizing
the cost of the code over larger cache lines, we can improve our
ability to tolerate failures with a very small storage overhead.
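Both properties can be checked numerically. The sketch below sizes each code with the BCH rule detailed in Section 4.3 (t·m check bits plus one extra detection bit, with m set by the codeword length) and models a line's failure probability as a binomial tail; the pf_bit value is an illustrative placeholder, not the data behind Figure 2.

```python
import math

def bch_check_bits(data_bits: int, t: int) -> int:
    """Check bits for a code correcting t errors and detecting t+1:
    r = t*m + 1, where m is the smallest value with 2^m >= data_bits + r."""
    m = 1
    while 2**m < data_bits + t * m + 1:
        m += 1
    return t * m + 1

def pf_line(pf_bit: float, line_bits: int, t: int) -> float:
    """P(more than t of line_bits cells fail): binomial tail."""
    p_ok = sum(math.comb(line_bits, k) * pf_bit**k
               * (1.0 - pf_bit) ** (line_bits - k) for k in range(t + 1))
    return 1.0 - p_ok

for name, line_bytes, t in [("SECDED-64B", 64, 1), ("SECDED-1KB", 1024, 1),
                            ("DECTED-1KB", 1024, 2), ("5EC6ED-1KB", 1024, 5)]:
    bits = line_bytes * 8
    r = bch_check_bits(bits, t)
    print(f"{name}: {r} check bits ({100 * r / bits:.2f}%), "
          f"pf_line = {pf_line(1e-6, bits, t):.2e}")
# Reproduces the bit counts quoted in the text: 11, 15, 29, and 71.
```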
Our Hi-ECC design implements a 5EC6ED code on each 1KB
line (5EC6ED-1KB), requiring an additional 71 bits (0.87%
overhead) for each line to store the code. In Figure 3, we compare
the refresh time of a 128MB cache augmented with Hi-ECC to the
baseline configuration with no error correction capability as well
as a configuration using a SECDED code for each 64B sub-block
(SECDED-64B). Like previous work that focused on SRAM [33],
we assume that the refresh time will be chosen such that no more
than 1E-03 (i.e., 1/1000) of the caches will fail. The baseline
configuration with no failure mitigation must operate at the

baseline refresh time of 30us. Adding a SECDED code allows a
5X increase in refresh time to 150us. Hi-ECC allows us to
increase the refresh time to 440us (almost a 15X reduction in
refresh frequency compared to the baseline).
[Figure omitted: probability of failure (log scale, 1E-21 to 1E+00) vs. refresh time (0-500us) for SECDED 1KB, SECDED 64B, and DECTED 1KB.]
Figure 2. Comparing bit failure probabilities for three code/line size combinations. DECTED on 1KB lines achieves higher refresh time than SECDED on 64B lines
[Figure omitted: probability of failure vs. refresh time for Base (30us), SECDED-64B (150us), and Hi-ECC 5EC6ED-1KB (440us).]
Figure 3. Hi-ECC achieves a higher refresh time than SECDED at the same failure probability
3.2 Reducing Latency
A hardware implementation of a 5EC6ED code is very complex
and imposes a long decoding latency penalty, proportional to both
the number of error bits corrected and the number of data bits
(Section 4.1). If the full strength encoding/decoding was required
for every cache access, this could significantly increase cache
access latency. However, our proposal leverages the fact that
error-prone portions of the cache can be disabled, avoiding the
high latency of decode during typical operation.
The Hi-ECC technique relies on a simple, one cycle ECC block to
correct a single bit error, and an un-pipelined, high-latency ECC
processing block to correct multiple bit failures in a cache line [7,
20]. When a line is read from the cache, a simple decoder
generates the syndrome for the line, which includes information
on whether it has zero, one, or a higher number of errors (Section
4.2). If the line has zero or one bit failures, the simple ECC
decoder can perform the correction in a single cycle. Figure 4
shows a high-level block diagram for Hi-ECC. The block referred
to as Quick-ECC contains the syndrome generation logic and the
error correction logic for lines with zero or one failures. The
Quick-ECC block also classifies lines into two groups based on
the syndrome: those that require complex multi-bit error
correction and those that have zero or one errors. Lines that
require multi-bit error correction are forwarded to a high latency
(potentially hundreds of cycles) ECC processing unit that
performs error correction using either software or a simple state
machine. This allows us to simplify the design at the expense of
increased latency for lines with two or more failures. Lines that
require at most one correction can be corrected immediately
and forwarded to the unit requesting the line.
Figure 4: Block diagram for Hi-ECC
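The dispatch in Figure 4 can be summarized as a short control-flow sketch. The helpers below are hypothetical stand-ins for the BCH machinery of Section 4, stubbed out so the dispatch logic itself runs:

```python
# Control-flow sketch of the Hi-ECC read path (Figure 4). All five
# helpers are placeholders for the Section 4 arithmetic, not real logic.

def make_syndrome(data, ecc_bits):       # simple, single-cycle logic
    return 0

def classify(syndrome):                  # returns 0, 1, or "multi"
    return 0

def flip_single_bit(data, syndrome):     # Quick-ECC single-bit repair
    return data

def slow_multibit_correct(data, ecc_bits):  # software or state machine
    return data

def disable_line(addr):                  # mark the line unusable
    pass

def read_line(addr, data, ecc_bits):
    syndrome = make_syndrome(data, ecc_bits)
    n_errors = classify(syndrome)
    if n_errors == 0:
        return data                      # common case: forward unchanged
    if n_errors == 1:
        return flip_single_bit(data, syndrome)  # one-cycle fast path
    # Rare case: un-pipelined, high-latency correction; disabling the
    # line afterwards ensures the multi-bit penalty is paid only once.
    data = slow_multibit_correct(data, ecc_bits)
    disable_line(addr)
    return data
```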
The high latency of handling multi-bit failures could significantly
reduce performance. To avoid incurring this latency, problematic
lines could be completely disabled or a mechanism such as bit-fix
[33] could be integrated as shown by the dotted box labeled
optional repair in Figure 4. This guarantees that the performance
penalty of multi-bit decoding is incurred only once, the first time
a failure is identified. The frequency of failures plays a role in
the disable strategy that we choose. Low multi-bit failure rates
motivate a simple approach such as disabling cache lines
containing multi-bit failures. On the other hand, cache line
disable will result in unacceptable cache capacity loss if multi-bit
failure rates are high. In this case, a more complex mechanism
such as bit-fix might be used to minimize the capacity lost to
disabling.
Figure 5 shows the probability that N (X-axis) or more lines have
multi-bit failures for a 128MB eDRAM cache at the refresh time
we propose (440us). On average, a 128MB eDRAM will have 750
1KB lines with multi-bit failures that need to be disabled
(0.573% of all lines). As highlighted in the figure, the probability
that 900 or more lines (0.7% of all cache lines) will exhibit multi-
bit errors is 6.77×10⁻⁸. For comparison, Hi-ECC augmented with
a simplified version of bit-fix [33] that repairs a single additional
bit per cache line requires an additional 13 bits per line (0.13%
overhead). However, this 13-bit overhead enables efficient
correction of lines with 2-bit errors and reduces the number of
lines that need to be disabled. Disabling lines with only 3 or more
errors reduces the average number of disabled lines from 750 to
28 (0.02% of all lines), with a probability of 5.95×10⁻⁸ that 60 or
more lines contain three or more errors. Although adding bit-fix
reduces wasted cache capacity, the improvement over simple line
disable is marginal for the failure rates in our model and doesn’t
justify the additional latency and complexity. As a result, the rest
of this paper will focus on the Hi-ECC approach that relies solely
on line disable for lines with multi-bit (two or more) errors.
Due to the implementation of our eDRAM cache, there are
restrictions on how many and which lines can be disabled. Since
our cache is 16-way set-associative, and since disabling all lines
in a particular cache set could be catastrophic for some
workloads, we limit the maximum number of lines that can be
disabled in a particular set to 14 of the 16 ways. We also limit the
maximum number of failing lines to 900 to quantify overhead.
With a refresh time of 440us, the probability of at least one of the
lines containing more than 5 failing bits is 6.21×10⁻⁴ (Figure 3),
the probability of more than 900 multi-bit failures (disabled lines)
is 6.77×10⁻⁸ (Figure 5), and the probability of more than 14 multi-bit
failures in a single set is 1.12×10⁻⁶¹. This indicates that
disabling cache lines with multi-bit failures will have little effect
on the overall probability that our cache meets the quality
requirements at 440us. The total storage overhead for our
approach is 1.58% including a 0.88% overhead for the code and a
single per-line disable bit, and a 0.7% overhead for the 900
disabled lines.
[Figure omitted: probability that N or more lines fail vs. # lines (N); w/o bit-fix: mean 750 lines, pfail(900 lines) = 6.77E-8; w/ bit-fix: mean 28 lines, pfail(60 lines) = 5.95E-8.]
Figure 5. The distribution of failing lines for a 128MB Cache with 1KB lines with (w/) and without (w/o) bit-fix.
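The 1.58% total can be reproduced from the figures above: 72 bits per 1KB line (71-bit code plus one disable bit) and a 900-line disable budget. A worked check (the quoted total reflects rounding the two components separately):

```python
# Storage-overhead accounting for Hi-ECC on a 128MB cache with 1KB lines.
LINE_BITS = 1024 * 8
N_LINES = (128 * 2**20) // 1024     # 131072 lines

per_line = (71 + 1) / LINE_BITS     # 5EC6ED code + per-line disable bit
disabled = 900 / N_LINES            # worst-case capacity lost to disables
print(f"code+disable: {100 * per_line:.2f}%   "
      f"disabled: {100 * disabled:.2f}%   "
      f"total: {100 * (per_line + disabled):.2f}%")
# -> code+disable: 0.88%   disabled: 0.69%   total: 1.57% (~1.58% quoted)
```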
3.3 Reducing Dynamic Power
Our Hi-ECC proposal uses larger cache line sizes to reduce the
area cost of strong ECC codes. However, larger line sizes
introduce some additional challenges. Although 1KB is a
reasonable line size for a large embedded memory, our baseline
configuration has a much smaller L2 cache with a 64B cache line
(referred to as sub-block). Some implementation issues arise
when we read from or write to our large L3 eDRAM cache due to
the mismatch between its 1KB line size and the 64B sub-blocks
used by other caches.
Most writes to the large L3 eDRAM cache will be in the form of
smaller 64B sub-blocks generated at lower-level caches or fetched
from memory. To modify a 64B sub-block in a 1KB line, we need
to perform a read-modify-write operation since we need to
compute the ECC code. First, the old 64B block that is being
overwritten must be read, along with the ECC code for the entire
line. We then use the old data, old ECC, and new data to compute
the new ECC for the whole 1KB line. We then write the new 64B
sub-block and the new ECC. However, we do not need to read the
whole 1KB line to compute the new ECC, as explained later in
Section 4.3.
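The reason the other sub-blocks need not be read is that BCH codes are linear over GF(2): ECC(a xor b) = ECC(a) xor ECC(b), so the new code is the old code xored with the code of the zero-extended delta. A sketch of this update, using a toy XOR-fold in place of the Section 4 BCH encoder (any GF(2)-linear code obeys the same identity):

```python
SUB_BLOCK = 64   # bytes per sub-block
LINE = 1024      # bytes per cache line

def ecc(word: bytes) -> int:
    """Toy GF(2)-linear code (byte-wise XOR fold) standing in for the
    BCH encoder of Section 4; only linearity matters for this sketch."""
    out = 0
    for b in word:
        out ^= b
    return out

def new_line_ecc(old_sub: bytes, new_sub: bytes, old_ecc: int,
                 index: int) -> int:
    """Read-modify-write ECC update without touching the other 15
    sub-blocks: ecc(new_line) = ecc(old_line) ^ ecc(delta)."""
    delta = bytes(a ^ b for a, b in zip(old_sub, new_sub))
    # Zero-extend the 64B delta to the full 1KB word at its position;
    # zero bytes contribute nothing to a linear code.
    padded = (bytes(index * SUB_BLOCK) + delta
              + bytes(LINE - (index + 1) * SUB_BLOCK))
    return old_ecc ^ ecc(padded)

# Self-check: incremental update matches a full re-encode.
import os
old_line = bytearray(os.urandom(LINE))
new_sub, i = os.urandom(SUB_BLOCK), 5
full = bytearray(old_line)
full[i * SUB_BLOCK:(i + 1) * SUB_BLOCK] = new_sub
assert new_line_ecc(bytes(old_line[i * SUB_BLOCK:(i + 1) * SUB_BLOCK]),
                    new_sub, ecc(bytes(old_line)), i) == ecc(bytes(full))
```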
The purpose of most L3 reads will be to provide cache lines for
allocation in lower-level caches. Processing any sub-block
requires the ECC code to be processed with the entire data word
(1KB cache line) that it protects. Since each 64B sub-block must
be checked, each reference to a 64B sub-block must be
accompanied by a reference to the surrounding 64B sub-blocks.
This implies that any L3 read will access all 16 sub-blocks in the
1KB line, as well as the ECC code that they share. As an
example, if we need to read eight out of the 16 sub-blocks in one
1KB line, we must read all 16 sub-blocks eight times, for a total
of 128 sub-block reads. This large number of additional reads
causes a substantial increase in dynamic power consumption and
a drastic reduction in the useful bandwidth delivered by the
memory.
To address the extra power overhead for L3 reads, we consider
the fact that the vast majority of eDRAM failures are retention
failures. Since the retention time of our baseline eDRAM is 30us,
and each read automatically implies a refresh, we know that
retention failures will not occur for 30us after a line has been
read. Our proposal leverages this property and also the temporal
and spatial locality of the data to minimize the number of
superfluous reads. Using a structure we refer to as the Recently
Accessed Lines Table (RALT), we attempt to track lines that have
been referenced in the last 30us.
The first read to a line causes all sub-blocks in the line to be read
and checked for failures. The address of the line is then placed in
the RALT to indicate that it has recently been checked and will
remain free from retention failures for the next 30us. As long as
the address of the line is held in the RALT, any sub-block reads
from the line can forgo ECC processing and thus avoid reading
the ECC code and other sub-blocks in the line. To operate
correctly, the RALT must ensure that none of its entries are more
than 30us old. To guarantee this, each 30us is divided into four
equal periods (P0, P1, P2, P3). Entries allocated in the RALT
during each period are marked with a 2-bit identifier to specify
the allocation period. Transitions between periods, P0 to P1 for
example, will cause all RALT entries previously allocated in P1
to be invalidated.
Each entry in the RALT consists of the following fields: a line
address to identify the line the entry is associated with; a valid bit,
a 2-bit period identifier field to indicate in which of the four
periods the line was allocated (P0, P1, P2, P3); and a 16-bit parity
consisting of one parity bit for each 64B sub-block in the line.
The RALT is direct mapped, but supports a CAM invalidate on
the 2-bit period field to allow bulk invalidates of RALT entries
during period transitions.
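A behavioral sketch of the RALT as described, with the direct-mapped lookup and the bulk, CAM-style invalidate on the period field; the entry count and index function are illustrative assumptions, not values from the paper:

```python
class RALT:
    """Recently Accessed Lines Table: tracks lines read within the last
    30us so their sub-block reads can skip ECC processing."""

    def __init__(self, n_entries: int = 512):
        self.entries = [None] * n_entries  # (line_addr, period_id, parity16)
        self.period = 0                    # current period, P0..P3

    def _index(self, line_addr: int) -> int:
        return line_addr % len(self.entries)  # direct-mapped

    def insert(self, line_addr: int, parity16: int) -> None:
        self.entries[self._index(line_addr)] = (line_addr, self.period,
                                                parity16)

    def hit(self, line_addr: int) -> bool:
        e = self.entries[self._index(line_addr)]
        return e is not None and e[0] == line_addr

    def advance_period(self) -> None:
        """Called every 30us/4 = 7.5us. Entering period P bulk-invalidates
        all surviving entries tagged P (allocated a full window ago), so
        no entry in the table is ever older than 30us."""
        self.period = (self.period + 1) % 4
        self.entries = [None if (e is not None and e[1] == self.period)
                        else e for e in self.entries]
```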
Figure 6 compares the implementation of the baseline L3
protection scheme (top) with that of Hi-ECC (bottom). The
baseline scheme uses a separate tag for each 1KB line and a
separate SECDED code for each 64B sub-block. To read a 64B
block, first the ECC and the block itself are read, then the ECC is
processed. In our Hi-ECC technique, the first time a sub-block is
read the entire ECC code is read along with each sub-block in the
1KB line to allow ECC processing for a single 64B block. We
update the RALT with the line address of the referenced line, a 2-
bit period ID, and a single parity bit for each sub-block. After the
