A case for small row buffers in non-volatile main memories

30 Sep 2012, Vol. 2012, pp. 484-485
TL;DR: On a multi-core system, reducing the row buffer size can greatly reduce main memory dynamic energy compared to a DRAM baseline with large row sizes, without greatly affecting endurance, and for some NVM technologies it also improves performance.
Abstract: DRAM-based main memories have read operations that destroy the read data, and as a result, must buffer large amounts of data on each array access to keep chip costs low. Unfortunately, system-level trends such as increased memory contention in multi-core architectures and data mapping schemes that improve memory parallelism cause only a small amount of the buffered data to be accessed. This makes buffering large amounts of data on every memory array access energy-inefficient; yet organizing DRAM chips to buffer small amounts of data is costly, as others have shown [11]. Emerging non-volatile memories (NVMs) such as PCM, STT-RAM, and RRAM, however, do not have destructive read operations, opening up opportunities for employing small row buffers without incurring additional area penalty and/or design complexity. In this work, we discuss and evaluate architectural changes to enable small row buffers at a low cost in NVMs. We find that on a multi-core system, reducing the row buffer size can greatly reduce main memory dynamic energy compared to a DRAM baseline with large row sizes, without greatly affecting endurance, and for some NVM technologies it also improves performance.

Summary (2 min read)

Introduction

  • DRAM-based main memories have read operations that destroy the read data, and as a result, must buffer large amounts of data on each array access to keep chip costs low.
  • Emerging non-volatile memories (NVMs) such as PCM, STT-RAM, and RRAM, however, do not have destructive read operations, opening up opportunities for employing small row buffers without incurring additional area penalty and/or design complexity.
  • A DRAM cell stores data as charge on a capacitor; over time, this charge leaks, causing the stored data to be lost.
  • As a result, the performance benefit of large row buffers may decrease in multi-core systems.

II. MOTIVATION

  • Emerging NVM technologies have several promising attributes compared to existing memory technologies such as SRAM (used in on-chip caches), DRAM, and Flash.
  • NVMs provide cost advantages compared to SRAM and DRAM, and latency advantages compared to Flash.
  • Typical DRAM chip micro-architectures (JEDEC-standard DDR-type SDRAM) are divided into banks that consist of rows and columns.
  • Comparing the 1- and 8-core row-interleaved data, the authors see that while row interleaving does enable more row buffer locality, its benefits diminish as memory system contention increases with more cores: row buffer hit rate is less than 50% for row interleaving even with large, 1KB rows.

III. A SMALL ROW BUFFER NVM ARCHITECTURE

  • Figure 1(b) shows the organization of their NVM architecture.
  • Compared to a traditional DRAM organization, the physical placement of the row buffer and the column multiplexer (part of the I/O gating circuitry in DRAM designs) are swapped in the data path (shown in gray).
  • This rearrangement makes better use of resources by sharing a smaller number of sense amplifiers (the devices which store bits in the row buffer) among multiple bitlines.
  • Note that this is not possible in DRAM (without reducing the row size) because a sense amplifier for each bit in the row is required in DRAM to restore the charge of the cell after it is read.
  • Unlike DRAM, however, their organization requires decoding both the row address and the column address during a RAS command, so that only a subset of the row containing the bits of interest will be selected, sensed, and stored in the row buffer.

IV. RESULTS

  • The authors modify their memory simulator timings according to those in Table I for PCM and STT-RAM.
  • The authors evaluate 31 multiprogrammed workloads composed of SPEC, TPC, and STREAM benchmarks.
  • Note that this energy reduction is achieved despite worse underlying technology parameters than DRAM. For more details, please refer to their accompanying tech report [5].
  • For a given memory technology, reducing the row buffer size does not greatly affect system performance due to the already low row buffer locality present on their multi-core system.
  • NVM cells have a limited lifetime in terms of the number of times they can be written to before their ability to store data fails; the paper evaluates this limit under the heading of durability.

A Case for Small Row Buffers in Non-Volatile Main Memories
Justin Meza (Carnegie Mellon University, meza@cmu.edu)
Jing Li (IBM T.J. Watson Research Center, jli@us.ibm.com)
Onur Mutlu (Carnegie Mellon University, onur@cmu.edu)
Abstract—DRAM-based main memories have read operations that
destroy the read data, and as a result, must buffer large amounts of data
on each array access to keep chip costs low. Unfortunately, system-level
trends such as increased memory contention in multi-core architectures
and data mapping schemes that improve memory parallelism cause only
a small amount of the buffered data to be accessed. This makes buffering
large amounts of data on every memory array access energy-inefficient;
yet organizing DRAM chips to buffer small amounts of data is costly, as
others have shown [11].
Emerging non-volatile memories (NVMs) such as PCM, STT-RAM,
and RRAM, however, do not have destructive read operations, opening
up opportunities for employing small row buffers without incurring
additional area penalty and/or design complexity. In this work, we
discuss and evaluate architectural changes to enable small row buffers
at a low cost in NVMs. We find that on a multi-core system, reducing
the row buffer size can greatly reduce main memory dynamic energy
compared to a DRAM baseline with large row sizes, without greatly
affecting endurance, and for some NVM technologies it also improves
performance.
I. INTRODUCTION
Modern main memory is composed of dynamic random-access
memory (DRAM). A DRAM cell stores data as charge on a capacitor.
Over time, this charge leaks, causing the stored data to be lost. To
prevent this, data stored in DRAM must be periodically read out and
rewritten, a process called refreshing. In addition, reading data stored
in a DRAM cell destroys its state, requiring data to be later restored,
leading to increased cell access time and energy. For this reason,
DRAM devices require buffering data which are read. To keep costs
low, the buffering circuitry in DRAM devices is amortized among
large rows of cells, in peripheral storage called the row buffer, at
least one per bank [2]. Refreshing data and buffering large amounts of
data waste energy in DRAM devices, causing main memory power
to constitute a significant fraction of the total system power.
Data fetched into the row buffer, however, can be accessed at much
lower latencies and less energy than accessing the DRAM memory
array. Therefore, large row buffer sizes can improve performance
and efficiency if many accesses can be served in the same row.
Unfortunately, there are several reasons why such row buffer locality
can be low in systems: (1) some applications inherently do not have
significant locality within rows (e.g., random access applications),
(2) as more cores are placed on chip, applications running on those
cores interfere with each other in the row buffers, leading to reduced
locality, especially if the memory scheduling policy is unaware of
applications’ interference in the row buffers [7], as also observed
by others [10, 11], and (3) interleaving techniques that improve
parallelism in the memory system (e.g., cache block interleaving)
tend to reduce row buffer locality because they stripe consecutive
cache blocks across different banks. As a result, the performance
benefit of large row buffers may decrease in multi-core systems.
New non-volatile memory (NVM) technologies, such as phase-
change memory (PCM), spin-transfer torque RAM (STT-RAM), and
resistive RAM (RRAM), on the other hand, provide non-destructive
reads and do not require refreshing and restoring their data after
sensing. This is because NVMs do not store their data as charge, and
thus their data persists after being read. This not only eliminates the
refresh problem of DRAM devices but also opens up opportunities for
employing smaller row buffers in NVMs without incurring additional
area penalty and/or design complexity.
II. MOTIVATION
Emerging NVM technologies have several promising attributes
compared to existing memory technologies such as SRAM (used in
on-chip caches), DRAM, and Flash. For example, NVMs provide cost
advantages compared to SRAM and DRAM, and latency advantages
compared to Flash. Importantly, these NVMs feature non-destructive
read operations, which DRAM does not have (i.e., data sensing does
not destroy the contents of cells).
Typical DRAM chip micro-architectures (JEDEC-standard DDR-
type SDRAM) are divided into banks that consist of rows (wordlines)
and columns (bitlines). Due to physical pin limitations, all the
information required to service a memory request must be supplied
over multiple commands. The Row Address Strobe (RAS) command
sends the row and bank address to select one of the banks and a row
within that bank. Then, an entire row (usually 1 to 2KB per chip)
is read out into latch-based sense amplifiers which comprise the row
buffer [2]. The Column Address Strobe (CAS) command then selects
a subset (i.e., column) of data from the row buffer (8B in a DDR3
×8 device [6]). Thus, a DRAM access first fetches many kilobytes
of data into the row buffer (RAS) and, in the worst case, uses only
a tiny portion of it (CAS). If multiple columns of the row buffer are
needed, multiple consecutive CAS commands can be issued, which
amortizes the cost of fetching the large row into the row buffer.
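The worst case described above can be made concrete with a short calculation. This is a minimal sketch, assuming a 1KB per-chip row (the low end of the range quoted in the text) and the 8B column of a DDR3 ×8 device; the function name is illustrative, and it assumes every CAS reads a distinct column.

```python
def row_buffer_utilization(row_bytes: int, column_bytes: int, cas_count: int) -> float:
    """Fraction of the bytes sensed by one RAS that are later read by CAS commands.

    Assumes each CAS reads a distinct column of the open row.
    """
    used_bytes = min(cas_count * column_bytes, row_bytes)
    return used_bytes / row_bytes


ROW_BYTES = 1024    # 1KB row per chip (assumed; the text gives 1 to 2KB)
COLUMN_BYTES = 8    # 8B per CAS in a DDR3 x8 device

# Worst case: one CAS per row activation.
worst = row_buffer_utilization(ROW_BYTES, COLUMN_BYTES, 1)
# Best case: every column of the row is read before the row is closed.
best = row_buffer_utilization(ROW_BYTES, COLUMN_BYTES, ROW_BYTES // COLUMN_BYTES)

print(f"{worst:.2%}")  # 0.78%
print(f"{best:.0%}")   # 100%
```

With a single CAS per activation, under 1% of the sensed data is used, which is the inefficiency that consecutive CAS commands to the same row are meant to amortize.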
To illustrate how much buffered data is actually used in real
applications, Figure 1a shows average row buffer locality (row hit
rate) when employing various row buffer sizes on several system
configurations using the FR-FCFS scheduling policy [8].¹ In particular,
we show 1- and 8-core systems employing two different schemes
for mapping data in main memory: (1) row interleaving, which
places consecutive memory addresses in the same row, and (2) block
interleaving, which stripes data in consecutive memory addresses
(usually cache blocks) across different banks. Row interleaving helps
exploit row buffer locality by enabling data with spatial locality
to reside in the same row buffer, while block interleaving aims
to improve memory parallelism by enabling concurrent access of
memory channels/banks for consecutive memory addresses.
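The two mapping schemes can be sketched as address-decoding functions. The parameters below (64B cache blocks, 8 banks, 8KB rows across a rank) and the exact bit ordering are assumptions chosen for illustration; real memory controllers use a variety of mappings.

```python
BLOCK_BYTES = 64     # cache block size (assumed)
NUM_BANKS = 8        # banks per rank (assumed)
ROW_BYTES = 8192     # row size across the rank (assumed)


def row_interleaved(addr: int) -> tuple[int, int]:
    """Consecutive addresses fill one row before moving to the next bank."""
    row_index = addr // ROW_BYTES          # global row number
    return row_index % NUM_BANKS, row_index // NUM_BANKS   # (bank, row)


def block_interleaved(addr: int) -> tuple[int, int]:
    """Consecutive cache blocks are striped across banks."""
    block = addr // BLOCK_BYTES            # global cache block number
    blocks_per_row = ROW_BYTES // BLOCK_BYTES
    bank = block % NUM_BANKS
    row = (block // NUM_BANKS) // blocks_per_row
    return bank, row


addresses = [0, 64, 128, 192]              # four consecutive cache blocks
print([row_interleaved(a) for a in addresses])
# [(0, 0), (0, 0), (0, 0), (0, 0)]  -> same bank, same row
print([block_interleaved(a) for a in addresses])
# [(0, 0), (1, 0), (2, 0), (3, 0)]  -> four different banks
```

Under row interleaving the four blocks can all be served from one open row; under block interleaving they can be fetched from four banks in parallel, but each bank's row buffer serves only one of them.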
Comparing the 1- and 8-core row-interleaved data, we see that
while row interleaving does enable more row buffer locality, its
benefits diminish as memory system contention increases with more
cores: row buffer hit rate is less than 50% for row interleaving even
with large, 1KB rows. Block interleaving reduces row buffer locality
over row interleaving, to less than 10% in the 8-core case. While
it is clear that row locality is lower on multi-core systems, what
is less obvious is how row buffer size affects system-level tradeoffs,
such as energy-efficiency, performance, and durability, in NVM main
memories. This work evaluates these tradeoffs.
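The effect of contention on row buffer locality can be illustrated with a toy model. This is a deliberately pessimistic sketch, not the simulator used in this work: each core streams through its own memory region, requests are served strictly in round-robin arrival order (no FR-FCFS reordering), and all parameters are assumptions. It does not reproduce the numbers in Figure 1a.

```python
def row_hit_rate(num_cores: int, row_bytes: int, accesses_per_core: int = 1024,
                 block_bytes: int = 64, num_banks: int = 8) -> float:
    """Row buffer hit rate for sequential streams under row interleaving.

    Toy assumptions: open-row policy, one row buffer per bank, in-order service,
    and a private 16MB-aligned region per core.
    """
    region_bytes = 1 << 24
    open_row = {}                      # bank -> row currently held in its row buffer
    hits = total = 0
    for i in range(accesses_per_core):
        for core in range(num_cores):  # cores take turns issuing one request each
            addr = core * region_bytes + i * block_bytes
            row_index = addr // row_bytes
            bank, row = row_index % num_banks, row_index // num_banks
            if open_row.get(bank) == row:
                hits += 1
            open_row[bank] = row       # a miss replaces the open row
            total += 1
    return hits / total


print(row_hit_rate(num_cores=1, row_bytes=8192))  # 0.9921875
print(row_hit_rate(num_cores=8, row_bytes=8192))  # 0.0
```

In this extreme setup the aligned regions make all eight cores collide in the same bank, so every access evicts another core's row. Real workloads and schedulers fall between the two printed extremes, but the direction of the effect matches the observation above.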
[Figure 1(a): row buffer locality (row hit rate, 0.0 to 1.0) versus per-DRAM-chip row buffer size (8B to 1KB) for 1-core and 8-core systems with row interleaving and with cache block interleaving. Figure 1(b): bank organization (Bank 0 to Bank N), each bank with a memory array, row decoder, column decoder and multiplexer, and row buffer.]
Fig. 1: Row size affects row locality (a); our NVM architecture (b).
¹ Application-aware memory request scheduling policies (e.g., [1, 3, 7])
provide better performance, but they can reduce row buffer locality.

[Figure 2(a): normalized memory energy versus per-DRAM-chip row buffer size (8B to 1KB, plus the baseline) for STT-RAM, DRAM, and PCM with block interleaving. Figure 2(b): weighted speedup versus per-DRAM-chip row buffer size for STT-RAM, DRAM, and PCM with block interleaving. Figure 2(c): normalized writes versus per-DRAM-chip row buffer size with no cache and with a 32MB cache.]
Fig. 2: Multi-core results for energy (normalized to DRAM with 1KB rows), performance, and number of writes (normalized to 1KB rows).
III. A SMALL ROW BUFFER NVM ARCHITECTURE
Figure 1(b) shows the organization of our NVM architecture. Com-
pared to a traditional DRAM organization, the physical placement of
the row buffer and the column multiplexer (part of the I/O gating
circuitry in DRAM designs) are swapped in the data path (shown in
gray). This rearrangement makes better use of resources by sharing a
smaller number of sense amplifiers (the devices which store bits in the
row buffer) among multiple bitlines. Note that this is not possible in
DRAM (without reducing the row size) because a sense amplifier for
each bit in the row is required in DRAM to restore the charge of the
cell after it is read. Unlike DRAM, however, our organization requires
decoding both the row address and the column address during a RAS
command, so that only a subset of the row containing the bits of
interest will be selected, sensed, and stored in the row buffer. During
a CAS command, the data bits from the row buffer corresponding to
the desired column are further selected by the I/O gating circuitry
and sent to a prefetch buffer.²
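The difference between the two RAS behaviors can be sketched in a few lines. This is a simplified functional model under stated assumptions, not the circuit-level design: rows are byte strings, the small row buffer holds 64B, offsets are multiples of the 8B column size, and the function names are hypothetical.

```python
ROW_BYTES = 1024      # per-chip row size (assumed)
COLUMN_BYTES = 8      # bytes returned per CAS (DDR3 x8 device)


def make_array(num_rows: int) -> list[bytes]:
    """Build a small memory array with a recognizable byte pattern."""
    return [bytes((r + i) % 256 for i in range(ROW_BYTES)) for r in range(num_rows)]


def dram_read(array: list[bytes], row: int, offset: int) -> tuple[bytes, int]:
    """DRAM-style access: RAS senses the whole row, CAS picks one column."""
    row_buffer = array[row]                              # RAS: row address only
    data = row_buffer[offset:offset + COLUMN_BYTES]      # CAS: column select
    return data, len(row_buffer)                         # (data, bytes sensed)


def nvm_read(array: list[bytes], row: int, offset: int,
             buffer_bytes: int = 64) -> tuple[bytes, int]:
    """Small-row-buffer access: RAS decodes row and column, sensing one segment."""
    start = (offset // buffer_bytes) * buffer_bytes      # segment holding the column
    row_buffer = array[row][start:start + buffer_bytes]  # RAS: row + column address
    local = offset - start
    data = row_buffer[local:local + COLUMN_BYTES]        # CAS: select within buffer
    return data, len(row_buffer)


array = make_array(4)
dram_data, dram_sensed = dram_read(array, row=3, offset=200)
nvm_data, nvm_sensed = nvm_read(array, row=3, offset=200)
print(dram_data == nvm_data, dram_sensed, nvm_sensed)    # True 1024 64
```

Both paths return the same 8 bytes, but the second senses 16 times fewer bits per activation, which is where the energy savings evaluated later come from.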
While related prior work [4] employed multiple, narrow rows
in a PCM main memory for reducing array reads and writes, it
(1) used a traditional DRAM data path design, (2) performed an
iso-area reorganization, which requires more area overhead than our
technique of employing smaller row buffers, and (3) assumed a
standard DRAM protocol for device access.
IV. RESULTS
We developed a cycle-accurate DDR3 memory simulator which we
use as part of an in-house x86 multi-core simulator, whose front-end
is based on Pin. We modify our memory simulator timings according
to those in Table I for PCM and STT-RAM. We show results for an
8-core system with different memories and row buffer sizes, though
reducing row buffer size in DRAM incurs significant area overhead
and chip cost, as discussed in [10, 11], which we do not evaluate.
We evaluate 31 multiprogrammed workloads composed of SPEC,
TPC, and STREAM benchmarks. We will focus on a DRAM chip
micro-architecture with 1KB row buffers and block interleaving as
our baseline (our findings are similar for row interleaving [5]).
Technology    Energy (Read/Write)    Latency (Read/Write)
PCM           2×/100×                5×/10×
STT-RAM       0.5×/1×                1×/1×
TABLE I: NVM array parameters, relative to DRAM.
Energy (Figure 2a): In all cases, reducing the row buffer size can
significantly reduce memory energy consumption, though there are
diminishing marginal returns. Returns diminish because, as the row
buffer size decreases, memory energy becomes dominated by the
energy required to transfer data between the row buffer and I/O pads
during read and write operations.
A modest row buffer size of 64B per chip leads to 47%/67%
less main memory energy for PCM/STT-RAM, compared to an all-
DRAM main memory with large rows (1KB per chip). Note that this
reduction is achieved despite worse underlying technology parameters
than DRAM (cf. Table I) because the energy saved by reducing the
row buffer size more than makes up for the higher average memory
array access energy. Hence, an NVM main memory with smaller
row buffers can significantly reduce memory energy consumption
compared to a DRAM baseline with large row buffers.
² For more details, please refer to our accompanying tech report [5].
Performance (Figure 2b): We evaluate the performance of our
system using the weighted speedup metric [9] (higher is better). For
a given memory technology, reducing the row buffer size does not
greatly affect system performance due to the already low row buffer
locality present on our multi-core system (cf. Figure 1a). Interestingly,
with similar technology-dependent timing parameters as DRAM, an
STT-RAM main memory can achieve better performance because our
new organization enables a more efficient access protocol (detailed
in [5]) which eliminates the precharge delay incurred on row buffer
misses, and relaxes the tRRD and tFAW timing parameters to enable
more banks to be accessed simultaneously.
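The text cites [9] for the weighted speedup metric without spelling it out. A minimal sketch, assuming the form commonly used in multi-core memory studies (the sum, over all applications in a workload, of each application's IPC when sharing the system divided by its IPC when running alone); the function name and the IPC values are illustrative and are not measurements from this work.

```python
def weighted_speedup(ipc_shared: list[float], ipc_alone: list[float]) -> float:
    """Sum of per-application slowdown-normalized throughput (higher is better)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


# Hypothetical 4-application workload.
ipc_alone = [2.0, 1.0, 0.5, 1.0]     # each application running by itself
ipc_shared = [1.0, 0.5, 0.25, 1.0]   # the same applications running together

print(weighted_speedup(ipc_shared, ipc_alone))  # 2.5
```

Under this definition an N-application workload scores at most N when no application is slowed down by sharing, which gives a reference point for reading the 8-core results in Figure 2b.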
Durability (Figure 2c): NVM cells have a limited lifetime in terms
of the number of times they can be written to before their ability
to store data fails. We examine the effects of different row buffer
sizes on device durability with and without a small 32MB e-DRAM
cache to a PCM main memory. We find that with or without a cache,
decreasing the row buffer size has only a small effect on the number
of NVM writes performed due to the low row buffer locality present
in the system. In contrast, the addition of a reasonably-sized e-DRAM
cache has a large impact on the reduction of writes, decreasing the
number of writes by 39% to 47% across the various row buffer sizes.
V. CONCLUSIONS
We showed that on a multi-core system, reducing the row buffer
size can greatly reduce main memory dynamic energy compared
to a DRAM baseline with large rows, without greatly affecting
performance and durability. Our future work includes exploring
architectural techniques which effectively leverage small row buffer
sizes for improved performance and energy-efficiency.
REFERENCES
[1] Y. Kim et al. ATLAS: A scalable and high-performance scheduling
algorithm for multiple memory controllers. HPCA ’10.
[2] Y. Kim et al. A case for exploiting subarray-level parallelism (SALP)
in DRAM. ISCA ’12.
[3] Y. Kim et al. Thread cluster memory scheduling: Exploiting differences
in memory access behavior. MICRO ’10.
[4] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting phase change
memory as a scalable DRAM alternative. ISCA ’09.
[5] J. Meza, J. Li, and O. Mutlu. A case for small row buffers in non-volatile
main memories. http://safari.ece.cmu.edu/tr/tr-2012-002.pdf.
[6] Micron. 1Gb: ×4, ×8, ×16 DDR3 SDRAM data sheet.
http://download.micron.com/pdf/datasheets/dram/ddr/1GbDDRx4x8x16.pdf.
[7] S. P. Muralidhara et al. Reducing memory interference in multicore
systems via application-aware memory channel partitioning. MICRO ’11.
[8] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens.
Memory access scheduling. ISCA ’00.
[9] A. Snavely et al. Symbiotic jobscheduling for a simultaneous
multithreading processor. ASPLOS ’00.
[10] K. Sudan, N. Chatterjee, D. Nellans, et al. Micro-pages: increasing DRAM
efficiency with locality-aware data placement. ASPLOS ’10.
[11] A. N. Udipi, N. Muralimanohar, N. Chatterjee, et al. Rethinking DRAM
design and organization for energy-constrained multi-cores. ISCA ’10.
Citations
Posted Content
TL;DR: This paper summarizes the idea of Subarray-Level Parallelism (SALP) in DRAM, which was published in ISCA 2012, examines the work's significance and future potential, and proposes three new mechanisms, SALP-1, SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the serialization of different requests that go to the same bank.
Abstract: This paper summarizes the idea of Subarray-Level Parallelism (SALP) in DRAM, which was published in ISCA 2012, and examines the work's significance and future potential. Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of on-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low-cost approach. To this end, we propose three new mechanisms, SALP-1, SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the serialization of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing few global peripheral structures. Our three proposed mechanisms mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank. SALP-1 requires no changes to the existing DRAM structure, and needs to only reinterpret some of the existing DRAM timing parameters. SALP-2 and MASA require only modest changes (< 0.15% area overhead) to the DRAM peripheral structures, which are much less design constrained than the DRAM core. Our evaluations show that SALP-1, SALP-2 and MASA significantly improve performance for both single-core systems (7%/13%/17%) and multi-core systems (15%/16%/20%), averaged across a wide range of workloads. We also demonstrate that our mechanisms can be combined with application-aware memory request scheduling in multicore systems to further improve performance and fairness.

4 citations


Cites background from "A case for small row buffers in non..."

  • ...In addition, SALP may be applied to future emerging memory technologies as long as their banks are organized hierarchically [69, 92], similar to how a DRAM bank consists of subarrays....

    [...]

Posted Content
TL;DR: A policy is devised that caches heavily-reused data that frequently misses in the NVM row buffers into DRAM, and tracks the row buffer miss counts of recently-used rows in NVM, and caches in DRAM the rows that are predicted to incur frequent row buffer misses.
Abstract: Non-volatile memory (NVM) is a class of promising scalable memory technologies that can potentially offer higher capacity than DRAM at the same cost point. Unfortunately, the access latency and energy of NVM is often higher than those of DRAM, while the endurance of NVM is lower. Many DRAM-NVM hybrid memory systems use DRAM as a cache to NVM, to achieve the low access latency, low energy, and high endurance of DRAM, while taking advantage of the large capacity of NVM. A key question for a hybrid memory system is what data to cache in DRAM to best exploit the advantages of each technology while avoiding the disadvantages of each technology as much as possible. We propose a new memory controller design that improves hybrid memory performance and energy efficiency. We observe that both DRAM and NVM banks employ row buffers that act as a cache for the most recently accessed memory row. Accesses that are row buffer hits incur similar latencies (and energy consumption) in both DRAM and NVM, whereas accesses that are row buffer misses incur longer latencies (and higher energy consumption) in NVM than in DRAM. To exploit this, we devise a policy that caches heavily-reused data that frequently misses in the NVM row buffers into DRAM. Our policy tracks the row buffer miss counts of recently-used rows in NVM, and caches in DRAM the rows that are predicted to incur frequent row buffer misses. Our proposed policy also takes into account the high write latencies of NVM, in addition to row buffer locality and more likely places the write-intensive pages in DRAM instead of NVM.

4 citations


Cites background or result from "A case for small row buffers in non..."

  • ...For example, STT-MRAM devices can make use of a row buffer [2, 54, 77, 78], and expensive reduced-latency DRAM devices [80, 104] also make use of a row buffer....

    [...]

  • ...For example, PCM’s long cooling duration required to crystallize chalcogenide leads to high PCM write latency, high read (sensing) latency, high read energy, and high write energy compared to those of DRAM [77]....

    [...]

  • ...[77] examine the case for small row buffers for NVM devices....

    [...]

Proceedings ArticleDOI
24 Aug 2014
TL;DR: A compiler-runtime cooperative data layout optimization approach that takes as input an irregular program that has already been optimized for cache locality and generates an output code with the same cache performance but better row-buffer locality (lower number of row- buffer misses).
Abstract: Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention. In particular, to the best of our knowledge, there is no single compiler based approach that can improve row-buffer locality in executing irregular applications. This presents a critical problem considering the fact that executing irregular applications in a power and performance efficient manner will be a key requirement to extract maximum benefits from emerging multicore machines and exascale systems. Motivated by these observations, this paper makes the following contributions. First, it presents a compiler-runtime cooperative data layout optimization approach that takes as input an irregular program that has already been optimized for cache locality and generates an output code with the same cache performance but better row-buffer locality (lower number of row-buffer misses). Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance). The ultimate goal of this strategy is to find the right tradeoff point between cache performance and row-buffer performance so that the overall application performance is improved. Third, the paper performs a detailed evaluation of these two approaches using both an AMD Opteron based multicore system and a multicore simulator. 
The experimental results, collected using five real-world irregular applications, show that (i) conventional cache optimizations do not improve row-buffer locality significantly; (ii) our first approach achieves about 9.8% execution time improvement by keeping the number of cache misses the same as a cache-optimized code but reducing the number of row-buffer misses; and (iii) our second approach achieves even higher execution time improvements (13.8% on average) by sacrificing cache performance for additional memory performance.

3 citations

Posted Content
TL;DR: This paper summarizes the idea of Adaptive-Latency DRAM, which was published in HPCA 2015, and examines the work's significance and future potential, and characterizes the extra margin that is built into the DRAM timing parameters.
Abstract: This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was published in HPCA 2015, and examines the work's significance and future potential. AL-DRAM is a mechanism that optimizes DRAM latency based on the DRAM module and the operating temperature, by exploiting the extra margin that is built into the DRAM timing parameters. DRAM manufacturers provide a large margin for the timing parameters as a provision against two worst-case scenarios. First, due to process variation, some outlier DRAM chips are much slower than others. Second, chips become slower at higher temperatures. The timing parameter margin ensures that the slow outlier chips operate reliably at the worst-case temperature, and hence leads to a high access latency. Using an FPGA-based DRAM testing platform, our work first characterizes the extra margin for 115 DRAM modules from three major manufacturers. The experimental results demonstrate that it is possible to reduce four of the most critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55C while maintaining reliable operation. AL-DRAM uses these observations to adaptively select reliable DRAM timing parameters for each DRAM module based on the module's current operating conditions. AL-DRAM does not require any changes to the DRAM chip or its interface; it only requires multiple different timing parameters to be specified and supported by the memory controller. Our real system evaluations show that AL-DRAM improves the performance of memory-intensive workloads by an average of 14% without introducing any errors. Our characterization and proposed techniques have inspired several other works on analyzing and/or exploiting different sources of latency and performance variation within DRAM chips.

2 citations


Cites background from "A case for small row buffers in non..."

  • ...We believe there is significant potential for approaches that could reduce the latency of Phase Change Memory (PCM) [40, 80, 81, 82, 105, 135, 139, 140, 170, 172], STT-MRAM [79, 105], RRAM [169], and NAND flash memory [16,17,18,19,20,21,22,22,23,24,25,26,27,102,103,104,107]....

    [...]

Journal ArticleDOI
TL;DR: This paper presents an integer linear programming (ILP) formulation which minimizes energy consumption in the STT-RAM-based L3C exploiting the row buffer locality and the prominent features of STT-RAM.
Abstract: Spin-transfer torque random access memory (STT-RAM) is a suitable alternative to DRAM in the large last-level caches (L3Cs) on account of low leakage, the absence of refresh energy and good scalability. However, long latency and high energy consumption for write operations are disadvantages of this technology. The proper utilization of row buffer locality can improve energy efficiency and mitigate negative effects of writing operations in the STT-RAM L3Cs. In this paper, we present an integer linear programming (ILP) formulation which minimizes energy consumption in the STT-RAM-based L3C exploiting the row buffer locality and the prominent features of STT-RAM. Since ILP solvers may not achieve the better result in a reasonable time, we propose a sub-optimal algorithm that obtains the results in a polynomial time. Evaluations demonstrate that on average, our ILP model reduces dynamic energy about 19% and improves row buffer hit rate about 23% compared to the state of the art.

2 citations

References
Proceedings ArticleDOI
20 Jun 2009
TL;DR: This work proposes, crafted from a fundamental understanding of PCM technology parameters, area-neutral architectural enhancements that address these limitations and make PCM competitive with DRAM.
Abstract: Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as DRAM. In contrast, phase change memory (PCM) storage relies on scalable current and thermal mechanisms. To exploit PCM's scalability as a DRAM alternative, PCM must be architected to address relatively long latencies, high energy writes, and finite endurance. We propose, crafted from a fundamental understanding of PCM technology parameters, area-neutral architectural enhancements that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. Partial writes enhance memory endurance, providing 5.6 years of lifetime. Process scaling will further reduce PCM energy costs and improve endurance.

1,568 citations

Proceedings ArticleDOI
01 May 2000
TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.
Abstract: The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.

1,009 citations
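
The reordering idea in the memory access scheduling abstract above can be illustrated with a small sketch. It is not the paper's implementation: the `Ref` type, the `schedule` and `activations` functions, and the single-bank example trace are all hypothetical, and the policy shown (serve the oldest pending reference that hits an already open row, otherwise the oldest reference) is just one simple way to exploit row locality.

```python
from dataclasses import dataclass


@dataclass
class Ref:
    """A memory reference addressed by bank, row, and column (hypothetical type)."""
    bank: int
    row: int
    col: int


def schedule(refs, open_rows=None):
    """Reorder refs so that references to an already open row go first.

    open_rows maps bank -> currently open row. This is an illustrative
    policy, not the scheduler evaluated in the cited paper.
    """
    open_rows = dict(open_rows or {})
    pending = list(refs)
    order = []
    while pending:
        # Oldest pending reference that hits the open row of its bank, if any.
        hit = next((r for r in pending if open_rows.get(r.bank) == r.row), None)
        chosen = hit if hit is not None else pending[0]
        pending.remove(chosen)
        open_rows[chosen.bank] = chosen.row
        order.append(chosen)
    return order


def activations(order):
    """Count row activations (row switches per bank) for a given service order."""
    open_rows = {}
    count = 0
    for r in order:
        if open_rows.get(r.bank) != r.row:
            count += 1
            open_rows[r.bank] = r.row
    return count


# Example trace: two rows of one bank accessed in alternation.
trace = [Ref(0, 1, 0), Ref(0, 2, 0), Ref(0, 1, 1), Ref(0, 2, 1)]
in_order_activations = activations(trace)
reordered = schedule(trace)
reordered_activations = activations(reordered)
```

In this toy trace, serving references in arrival order opens a new row on every access, while the reordered sequence groups accesses to the same row together and needs fewer activations.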

Journal ArticleDOI
12 Nov 2000
TL;DR: It is demonstrated that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler, and that a small sample of the possible schedules is sufficient to identify a good schedule quickly.
Abstract: Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule. This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs which run well together. We demonstrate an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase which collects information about various possible schedules, and a symbiosis phase which uses that information to predict which schedule will provide the best performance. We show that a small sample of the possible schedules is sufficient to identify a good schedule quickly. On a system with random job arrivals and departures, response time is improved as much as 17% over a schedule which does not incorporate symbiosis.

619 citations

Proceedings ArticleDOI
01 Apr 2010
TL;DR: It is shown that the implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput, and ATLAS's performance benefit increases as the number of cores increases.
Abstract: Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant fine-grained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with least-attained-service is borrowed from the queueing theory literature, where, in the context of a single-server queue it is known that least-attained-service optimally schedules jobs, assuming a Pareto (or any decreasing hazard rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers.
We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8%, and system throughput by 8.4%, compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.

439 citations
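
The key idea stated in the ATLAS abstract above (rank threads by the service they have attained so far and prioritize the least-served ones) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ATLAS implementation: the `Request` type, the `rank_threads` and `pick_next` functions, the service values, and the tie-breaking by arrival order are all hypothetical, and the sketch covers a single period with no coordination between controllers.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A queued memory request (hypothetical type for illustration)."""
    thread_id: int
    arrival: int  # arrival order; smaller means older


def rank_threads(attained_service):
    """Order thread ids from least to most attained service.

    attained_service maps thread id -> service received so far
    (units are arbitrary here). Ties are broken by thread id.
    """
    return sorted(attained_service, key=lambda t: (attained_service[t], t))


def pick_next(queue, ranking):
    """Pick the request from the highest-ranked thread; oldest first within a thread."""
    position = {t: i for i, t in enumerate(ranking)}
    return min(queue, key=lambda r: (position[r.thread_id], r.arrival))


# Example period: thread 1 has attained the least service so far.
attained = {0: 120.0, 1: 15.0, 2: 60.0}
ranking = rank_threads(attained)

queue = [Request(0, 0), Request(2, 1), Request(1, 2), Request(1, 3)]
chosen = pick_next(queue, ranking)
```

Here the ranking is recomputed only once per period, which mirrors the abstract's point that a long accumulation period keeps coordination between controllers infrequent.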

Proceedings ArticleDOI
04 Dec 2010
TL;DR: This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both, and evaluates TCM on a wide variety of multiprogrammed workloads and compares its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness.
Abstract: In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput, 2) we introduce a "niceness" metric that captures a thread's propensity to interfere with other threads, 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such "light" threads only use a small fraction of the total available memory bandwidth. On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved.
We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6% compared to ATLAS (previous work providing the best system throughput) and 7.6%/4.6% compared to PAR-BS (previous work providing the best fairness).

375 citations

Frequently Asked Questions (2)
Q1. What have the authors contributed in "A case for small row buffers in non-volatile main memories"?

In this work, the authors discuss and evaluate architectural changes to enable small row buffers at a low cost in NVMs. The authors find that on a multi-core system, reducing the row buffer size can greatly reduce main memory dynamic energy compared to a DRAM baseline with large row sizes, without greatly affecting endurance, and for some NVM technologies, leads to improved performance. 

The authors' future work includes exploring architectural techniques which effectively leverage small row buffer sizes for improved performance and energy-efficiency.