TL;DR: On a multi-core system, reducing the row buffer size can greatly reduce main memory dynamic energy compared to a DRAM baseline with large row sizes, without greatly affecting endurance, and for some NVM technologies it also improves performance.
Abstract: DRAM-based main memories have read operations that destroy the read data, and as a result, must buffer large amounts of data on each array access to keep chip costs low. Unfortunately, system-level trends such as increased memory contention in multi-core architectures and data mapping schemes that improve memory parallelism mean that only a small fraction of the buffered data is actually accessed. This makes buffering large amounts of data on every memory array access energy-inefficient; yet organizing DRAM chips to buffer small amounts of data is costly, as others have shown [11]. Emerging non-volatile memories (NVMs) such as PCM, STT-RAM, and RRAM, however, do not have destructive read operations, opening up opportunities for employing small row buffers without incurring additional area penalty and/or design complexity. In this work, we discuss and evaluate architectural changes to enable small row buffers at a low cost in NVMs. We find that on a multi-core system, reducing the row buffer size can greatly reduce main memory dynamic energy compared to a DRAM baseline with large row sizes, without greatly affecting endurance, and for some NVM technologies, leads to improved performance.
A DRAM cell stores data as charge on a capacitor; over time, this charge leaks, causing the stored data to be lost unless the cell is periodically refreshed.
As a result, the performance benefit of large row buffers may decrease in multi-core systems.
II. MOTIVATION
Emerging NVM technologies have several promising attributes compared to existing memory technologies such as SRAM (used in on-chip caches), DRAM, and Flash.
NVMs provide cost advantages compared to SRAM and DRAM, and latency advantages compared to Flash.
Typical DRAM chip micro-architectures (JEDEC-standard DDR-type SDRAM) are divided into banks that consist of rows and columns.
Comparing the 1- and 8-core row-interleaved data, the authors see that while row interleaving does enable more row buffer locality, its benefits diminish as memory system contention increases with more cores: the row buffer hit rate is below 50% under row interleaving even with large 1KB rows.
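To make the locality effect concrete, here is a minimal Python sketch (not the authors' simulator; the row size, bank count, and address stream are illustrative assumptions) of how a row-interleaved address mapping translates into row buffer hits and misses:

```python
# Hypothetical sketch: estimate row buffer hit rate for a stream of
# physical addresses under row interleaving. Parameters are
# illustrative, not the paper's exact configuration.
ROW_SIZE = 1024          # 1KB row buffer, as in the paper's baseline
NUM_BANKS = 8

def decode(addr):
    """Map a physical address to (bank, row) under row interleaving:
    consecutive rows of memory go to consecutive banks."""
    row_id = addr // ROW_SIZE        # which row of memory overall
    return row_id % NUM_BANKS, row_id // NUM_BANKS

def row_buffer_hit_rate(addresses):
    open_row = {}                    # bank -> currently buffered row
    hits = 0
    for addr in addresses:
        bank, row = decode(addr)
        if open_row.get(bank) == row:
            hits += 1                # row buffer hit: no array access
        else:
            open_row[bank] = row     # miss: activate a new row
    return hits / len(addresses)

# A single sequential scanner enjoys high locality (~94% here)...
print(row_buffer_hit_rate(range(0, 64 * 1024, 64)))
# ...but interleaved requests from many cores evict each other's open
# rows, which is the contention effect the paper measures on 8 cores.
```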
III. A SMALL ROW BUFFER NVM ARCHITECTURE
Figure 1(b) shows the organization of their NVM architecture.
Compared to a traditional DRAM organization, the physical positions of the row buffer and the column multiplexer (part of the I/O gating circuitry in DRAM designs) are swapped in the data path (shown in gray).
This rearrangement makes better use of resources by sharing a smaller number of sense amplifiers (the devices which store bits in the row buffer) among multiple bitlines.
Note that this is not possible in DRAM (without reducing the row size) because a sense amplifier for each bit in the row is required in DRAM to restore the charge of the cell after it is read.
Unlike DRAM, however, their organization requires decoding both the row address and the column address during a RAS command, so that only a subset of the row containing the bits of interest will be selected, sensed, and stored in the row buffer.
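The following Python sketch models that activation path functionally (a simplification for illustration; ROW_SIZE, BUFFER_SIZE, and the NVMBank class are assumptions, not the paper's design parameters):

```python
# Minimal functional model of the proposed activation path (a sketch,
# not the authors' circuit design): during RAS, both the row address
# and the high column bits are decoded, so only one segment of the row
# is sensed into a small row buffer.
ROW_SIZE = 1024      # bits stored per physical row (illustrative)
BUFFER_SIZE = 64     # small row buffer: only 64 bits are sensed

class NVMBank:
    def __init__(self, num_rows):
        self.array = [[0] * ROW_SIZE for _ in range(num_rows)]
        self.row_buffer = None
        self.tag = None                       # (row, segment) in buffer

    def activate(self, row, segment):
        """RAS: decode row AND column segment, sense only that subset.
        Safe in NVM because reads are non-destructive; DRAM would need
        to sense (and later restore) the entire row."""
        start = segment * BUFFER_SIZE
        self.row_buffer = self.array[row][start:start + BUFFER_SIZE]
        self.tag = (row, segment)

    def read(self, row, segment, offset):
        if self.tag != (row, segment):        # miss: re-activate
            self.activate(row, segment)
        return self.row_buffer[offset]
```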
IV. RESULTS
The authors modify their memory simulator timings according to those in Table I for PCM and STT-RAM.
The authors evaluate 31 multiprogrammed workloads composed of SPEC, TPC, and STREAM benchmarks.
Note that this reduction is achieved despite worse underlying technology parameters. For more details, please refer to their accompanying tech report [5].
For a given memory technology, reducing the row buffer size does not greatly affect system performance due to the already low row buffer locality present on their multi-core system.
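A back-of-envelope model shows why the combination of low locality and smaller rows saves dynamic energy. The per-bit energy constants below are placeholders, not values from Table I:

```python
# Back-of-envelope dynamic-energy model (an editorial sketch; the
# per-bit energies are assumptions, NOT the paper's numbers).
E_ARRAY_PER_BIT = 2.0    # pJ to sense one bit from the cell array (assumed)
E_BUFFER_PER_BIT = 0.1   # pJ to stream one bit out of the row buffer (assumed)

def dynamic_energy(accesses, hit_rate, row_bits, access_bits=512):
    misses = accesses * (1 - hit_rate)
    activation = misses * row_bits * E_ARRAY_PER_BIT   # bits sensed per miss
    column_io = accesses * access_bits * E_BUFFER_PER_BIT
    return activation + column_io  # total pJ

# Holding the ~50% 8-core hit rate fixed for illustration (in practice
# it shifts somewhat with buffer size), a 64B buffer senses 16x fewer
# bits per miss than a 1KB buffer:
for row_bytes in (1024, 64):
    print(row_bytes, dynamic_energy(1e6, 0.5, row_bytes * 8))
```

Because activation energy dominates when hit rates are low, shrinking the number of bits sensed per miss translates almost directly into the dynamic energy reduction the paper reports.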
NVM cells can be written only a limited number of times before they lose the ability to store data, a property known as endurance.
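As a first-order illustration, under ideal wear-leveling a device's lifetime is its total write budget divided by the rate of writes reaching memory. Every constant below is an assumption for illustration, not a measurement from the paper:

```python
# First-order endurance model (editorial sketch; all constants assumed).
CELL_ENDURANCE = 1e8           # writes per cell, a typical PCM figure
CAPACITY_BITS = 8 * 2**33      # an 8GB module (illustrative)
WRITE_RATE_BITS = 8e9          # 1GB/s of write traffic to memory (assumed)

# Ideal wear-leveled lifetime = total write budget / write rate.
lifetime_s = (CELL_ENDURANCE * CAPACITY_BITS) / WRITE_RATE_BITS
print(lifetime_s / (3600 * 24 * 365), "years")   # roughly 27 years here
```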
TL;DR: This work proposes a write-back aware last-level cache management scheme for the hybrid main memory, which improves the cache hit ratio of NVM memory blocks and minimizes write-backs to NVM.
Abstract: Hybrid main memory combining DRAM and emerging non-volatile memory (NVM) is a promising solution for high-performance and energy-efficient embedded systems. The last-level cache plays an important role in such systems, since it strongly affects the number of write-backs to NVM and DRAM blocks. However, existing cache policies fail to fully address the significant asymmetry between NVM operations (especially writes) and DRAM operations, leading to non-optimal system designs. We propose a write-back aware last-level cache management scheme for hybrid main memory, which improves the cache hit ratio of NVM memory blocks and minimizes write-backs to NVM. Experimental results show that our proposed framework leads to better performance and energy savings compared with the state-of-the-art cache management scheme for hybrid main memory architectures.
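A minimal sketch of what write-back-aware victim selection could look like, based only on this abstract (the block fields and cost values are assumptions, not the authors' exact scheme):

```python
# Editorial sketch of a write-back-aware eviction policy: when choosing
# a victim, prefer clean blocks and DRAM-backed blocks over dirty
# NVM-backed blocks, since an NVM write-back is the most expensive case.
def eviction_cost(block):
    """Lower cost = better victim. `block` is assumed to carry
    .dirty and .in_nvm flags set by the cache controller."""
    if not block.dirty:
        return 0                      # clean: eviction writes nothing back
    return 2 if block.in_nvm else 1   # dirty NVM block is the costliest

def choose_victim(cache_set):
    # Break cost ties by LRU age (block.age: larger = older), so the
    # policy degrades to plain LRU when all costs are equal.
    return min(cache_set, key=lambda b: (eviction_cost(b), -b.age))
```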
TL;DR: Three major new research challenges and solution directions are described: enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system; designing a memory system that employs emerging non-volatile memory technologies and takes advantage of multiple different technologies; and providing predictable performance and QoS to applications sharing the memory system.
Abstract: The memory system is a fundamental performance and energy bottleneck in almost all computing systems. Recent system design, application, and technology trends that require more capacity, bandwidth, efficiency, and predictability out of the memory system make it an even more important system bottleneck. At the same time, DRAM technology is experiencing difficult technology scaling challenges that make the maintenance and enhancement of its capacity, energy-efficiency, and reliability significantly more costly with conventional techniques. In this article, after describing the demands and challenges faced by the memory system, we examine some promising research and design directions to overcome challenges posed by memory scaling. Specifically, we describe three major new research challenges and solution directions: 1) enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system (an approach we call system-DRAM co-design), 2) designing a memory system that employs emerging non-volatile memory technologies and takes advantage of multiple different technologies (i.e., hybrid memory systems), 3) providing predictable performance and QoS to applications sharing the memory system (i.e., QoS-aware memory systems). We also briefly describe our ongoing related work in combating scaling challenges of NAND flash memory.
19 citations
Cites background from "A case for small row buffers in non..."
...Given the trends and requirements described in Section 2, it is likely time to re-examine the partitioning of computation between processors and DRAM, treating memory as a first-class accelerator as an integral part of a heterogeneous parallel computing system [129]....
[...]
...Enabling In-Memory Computation
Today's systems waste significant amounts of energy, DRAM bandwidth and time (as well as valuable on-chip cache space) by sometimes unnecessarily moving data from main memory to processor caches....
TL;DR: HEBE mitigates circuit aging in the peripheral circuitry of NVM-based main memory; among its techniques, it introduces an isolation transistor to decouple parts of a peripheral circuit operating at different voltages, allowing the decoupled logic blocks to undergo long-latency de-stress operations independently and off the critical path of memory read and write accesses. HEBE significantly improves both performance and lifetime of NVM-based main memory.
Abstract: Modern computing systems are embracing non-volatile memory (NVM) to implement high-capacity and low-cost main memory. Elevated operating voltages of NVM accelerate the aging of CMOS transistors in the peripheral circuitry of each memory bank. Aggressive device scaling increases power density and temperature, which further accelerates aging, challenging the reliable operation of NVM-based main memory. We propose HEBE, an architectural technique to mitigate the circuit aging-related problems of NVM-based main memory. HEBE is built on three contributions. First, we propose a new analytical model that can dynamically track the aging in the peripheral circuitry of each memory bank based on the bank's utilization. Second, we develop an intelligent memory request scheduler that exploits this aging model at run time to de-stress the peripheral circuitry of a memory bank only when its aging exceeds a critical threshold. Third, we introduce an isolation transistor to decouple parts of a peripheral circuit operating at different voltages, allowing the decoupled logic blocks to undergo long-latency de-stress operations independently and off the critical path of memory read and write accesses, improving performance. We evaluate HEBE with workloads from the SPEC CPU2017 Benchmark suite. Our results show that HEBE significantly improves both performance and lifetime of NVM-based main memory.
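A minimal sketch of the utilization-driven aging tracking and threshold-triggered de-stress described above (the constants and data structures are illustrative assumptions, not HEBE's calibrated model):

```python
# Editorial sketch: accrue a per-bank aging estimate as requests are
# served, and take the bank off-line for a long de-stress operation
# only when aging crosses a critical threshold.
AGING_PER_ACCESS = 1e-6   # aging accrued per serviced request (assumed)
RECOVERY = 0.5            # fraction of aging relieved by one de-stress
THRESHOLD = 1.0           # critical aging level triggering de-stress

class Bank:
    def __init__(self):
        self.aging = 0.0
        self.destress_until = 0    # cycle when de-stress completes

def schedule(bank, queue, now, destress_latency=1000):
    if now < bank.destress_until:
        return None                       # bank off-line, recovering
    if bank.aging >= THRESHOLD:
        bank.destress_until = now + destress_latency
        bank.aging *= (1 - RECOVERY)      # long-latency de-stress
        return None
    if queue:
        bank.aging += AGING_PER_ACCESS    # utilization drives aging
        return queue.pop(0)               # serve the oldest request
    return None
```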
17 citations
Cites background from "A case for small row buffers in non..."
...enchmark suite [7]. Our results show that HEBE significantly improves both performance and lifetime of NVM-based main memory. 2 BACKGROUND NVM, like DRAM [22, 42, 49, 78], is organized hierarchically [56, 58, 73, 81, 84, 85, 100]. For example, a 128GB NVM can have 2 channels, 1 rank/channel and 8 banks/rank. A bank can have 64 partitions [81]. Each bit in NVM is represented by the resistance of an NVM cell: low resistance is ...
TL;DR: The evaluations show that the GPU-aware cache and memory management techniques proposed in this dissertation are effective at mitigating the interference caused by GPUs on current and future GPU-based systems.
Abstract: The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high-performance computing applications. Graphics Processing Units (GPUs) are a prime example of throughput processors that can deliver high performance for applications ranging from typical graphics applications to general-purpose data-parallel (GPGPU) applications. However, this success has been accompanied by new performance bottlenecks throughout the memory hierarchy of GPU-based systems. We identify and eliminate performance bottlenecks caused by major sources of interference throughout the memory hierarchy.
We introduce changes to the memory hierarchy for systems with GPUs that allow the memory hierarchy to be aware of both CPU and GPU applications' characteristics. We introduce mechanisms to dynamically analyze different applications' characteristics and propose four major changes throughout the memory hierarchy. We propose changes to the cache management and memory scheduling mechanisms to mitigate intra-application interference in GPGPU applications. We propose changes to the memory controller design and its scheduling policy to mitigate inter-application interference in heterogeneous CPU-GPU systems. We redesign the MMU and the memory hierarchy in GPUs to be aware of address-translation data in order to mitigate inter-address-space interference. We introduce a hardware-software cooperative technique that modifies the memory allocation policy to enable large page support in order to further reduce inter-address-space interference at the shared Translation Lookaside Buffer (TLB). Our evaluations show that the GPU-aware cache and memory management techniques proposed in this dissertation are effective at mitigating the interference caused by GPUs on current and future GPU-based systems.
12 citations
Cites methods from "A case for small row buffers in non..."
...Aside from memory scheduling and memory partitioning techniques, previous works propose new designs that are capable of reducing memory latency in conventional DRAM [21, 22, 67, 69, 70, 71, 72, 75, 151, 160, 165, 210, 222, 238, 239, 240, 241, 242, 262, 276, 289, 295, 320, 338, 367, 382, 391, 431, 452] as well as non-volatile memory [227, 231, 232, 233, 273, 274, 275, 348, 351, 442]....
TL;DR: A Write Priority overlap Read (WPoR) scheduling scheme is proposed, which preferentially serves a write request in one partition and allows the other partitions to perform as many read requests as possible within that write's program duration.
Abstract: Phase Change Memory (PCM) is an emerging non-volatile memory with the salient features of good scalability, high speed, low power, and radiation resistance. It is hence an ideal candidate for the next-generation storage medium of main memory. However, PCM suffers from inefficient I/O performance due to its long write latency. Recent studies propose a multi-partition (or multi-subarray) architecture within each bank to enhance internal parallelism. However, conventional scheduling schemes fail to exploit the advantage of multiple partitions and incur inefficient bank utilization. In this paper, we propose a Write Priority overlap Read (WPoR) scheduling scheme that preferentially serves a write request in one partition and allows the other partitions to perform as many read requests as possible within that write's program duration. Experimental results demonstrate that WPoR reduces write latency by 24.7% on average compared with state-of-the-art scheduling algorithms. Meanwhile, the IPC of WPoR scheduling increases by 6%, 7%, and 26% on average compared with the Read Priority, Write Pausing, and Write Cancellation schemes, respectively.
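A minimal reconstruction of the WPoR idea from this abstract (the queue format, latencies, and state layout are assumptions for illustration, not the paper's hardware):

```python
# Editorial sketch of WPoR: give priority to a pending write, then
# overlap reads from all *other* partitions with its long program time.
def wpor_tick(queues, now, state, WRITE_LAT=150, READ_LAT=50):
    """queues: {partition: [("R" or "W", addr), ...]} in arrival order.
    state: {"busy_partition": p or None, "write_done": cycle}."""
    if state["busy_partition"] is not None and now >= state["write_done"]:
        state["busy_partition"] = None            # program phase finished
    if state["busy_partition"] is None:
        for p, q in queues.items():               # write priority
            if q and q[0][0] == "W":
                q.pop(0)                          # start the long write
                state["busy_partition"] = p
                state["write_done"] = now + WRITE_LAT
                break
    issued = []
    for p, q in queues.items():                   # reads overlap the write
        if p != state["busy_partition"] and q and q[0][0] == "R":
            issued.append((p, q.pop(0), now + READ_LAT))
    return issued
```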
TL;DR: Crafted from a fundamental understanding of PCM technology parameters, this work proposes area-neutral architectural enhancements that address PCM's long latencies, high-energy writes, and finite endurance, making PCM competitive with DRAM.
Abstract: Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as DRAM. In contrast, phase change memory (PCM) storage relies on scalable current and thermal mechanisms. To exploit PCM's scalability as a DRAM alternative, PCM must be architected to address relatively long latencies, high-energy writes, and finite endurance. We propose, crafted from a fundamental understanding of PCM technology parameters, area-neutral architectural enhancements that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. Partial writes enhance memory endurance, providing 5.6 years of lifetime. Process scaling will further reduce PCM energy costs and improve endurance.
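The "partial writes" mechanism can be sketched as dirty-word tracking in the row buffer; this is an illustration of the general technique, not the paper's exact implementation:

```python
# Editorial sketch of partial writes: track which words of the buffered
# row were modified and, on eviction, write back only those words,
# reducing write energy and wear on PCM cells.
WORDS_PER_ROW = 16   # illustrative row width in words

class PartialWriteBuffer:
    def __init__(self):
        self.data = [0] * WORDS_PER_ROW
        self.dirty = [False] * WORDS_PER_ROW

    def store(self, word_idx, value):
        self.data[word_idx] = value
        self.dirty[word_idx] = True      # repeated stores coalesce here

    def evict(self, array_row):
        writes = 0
        for i, is_dirty in enumerate(self.dirty):
            if is_dirty:                 # write only modified words
                array_row[i] = self.data[i]
                writes += 1
        self.dirty = [False] * WORDS_PER_ROW
        return writes                    # cells written << full row size
```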
TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.
Abstract: The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.
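The "conservative reordering" policy described above can be sketched as an open-row-first pick over an arrival-ordered queue (a simplified illustration, not the paper's full scheduler):

```python
# Editorial sketch: among queued requests, issue one that hits the
# currently open row first (a cheap column access), falling back to
# the oldest request otherwise.
def pick_next(queue, open_row):
    """queue: list of {"row": r, ...} dicts in arrival order."""
    for req in queue:
        if req["row"] == open_row:
            queue.remove(req)
            return req                        # row hit: column access only
    return queue.pop(0) if queue else None    # else oldest request (FCFS)
```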
TL;DR: It is demonstrated that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler, and that a small sample of the possible schedules is sufficient to identify a good schedule quickly.
Abstract: Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule. This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs which run well together. We demonstrate an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase which collects information about various possible schedules, and a symbiosis phase which uses that information to predict which schedule will provide the best performance. We show that a small sample of the possible schedules is sufficient to identify a good schedule quickly. On a system with random job arrivals and departures, response time is improved as much as 17% over a schedule which does not incorporate symbiosis.
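A minimal sketch of the two-phase SOS idea based on this abstract (the function names and the counter-backed measure_throughput callback are assumptions for illustration):

```python
# Editorial sketch: sample a few candidate coschedules, score each with
# an observed-throughput measurement, then run the best one.
import random

def sos(jobs, group_size, samples, measure_throughput):
    """measure_throughput(group) -> observed throughput of coscheduling
    that group; assumed to come from hardware counters during sampling."""
    best_schedule, best_score = None, float("-inf")
    for _ in range(samples):              # sample phase
        pool = jobs[:]
        random.shuffle(pool)
        schedule = [pool[i:i + group_size]
                    for i in range(0, len(pool), group_size)]
        score = sum(measure_throughput(g) for g in schedule)
        if score > best_score:            # symbiosis phase: keep the best
            best_schedule, best_score = schedule, score
    return best_schedule
```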
TL;DR: It is shown that the implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput, and ATLAS's performance benefit increases as the number of cores increases.
Abstract: Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant fine-grained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with least-attained-service is borrowed from the queueing theory literature, where, in the context of a single-server queue, it is known that least-attained-service optimally schedules jobs, assuming a Pareto (or any decreasing hazard rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers. We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4-32 cores and 1-16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8%, and system throughput by 8.4%, compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.
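The least-attained-service ranking at ATLAS's core can be sketched in a few lines (a simplification; the data structures are illustrative, not the paper's hardware):

```python
# Editorial sketch: each controller tallies memory service given to each
# thread; every long period, all controllers adopt a common thread order
# that favors whoever has attained the least service so far.
def end_of_period_ranking(attained_service):
    """attained_service: {thread_id: cycles of memory service}.
    Returns thread ids, highest priority (least service) first."""
    return sorted(attained_service, key=attained_service.get)

def pick_request(queue, ranking):
    rank = {t: i for i, t in enumerate(ranking)}
    # Serve the queued request from the highest-ranked thread; ties go
    # to the earliest arrival, since min() is stable over a FIFO queue.
    return min(queue, key=lambda req: rank[req["thread"]])
```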
TL;DR: This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both, and evaluates TCM on a wide variety of multiprogrammed workloads and compares its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughputand fairness.
Abstract: In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput, 2) we introduce a "niceness" metric that captures a thread's propensity to interfere with other threads, 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such "light" threads only use a small fraction of the total available memory bandwidth. On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved. We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6% compared to ATLAS (previous work providing the best system throughput) and 7.6%/4.6% compared to PAR-BS (previous work providing the best fairness).
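TCM's clustering step can be sketched as follows (a simplified illustration; the 20% bandwidth share and the random shuffle standing in for niceness-based shuffling are assumptions, not the paper's exact parameters):

```python
# Editorial sketch: sort threads by memory intensity, fill the
# latency-sensitive cluster with the lightest threads until they
# consume a set share of total bandwidth, and periodically shuffle
# priorities inside the bandwidth-sensitive cluster.
import random

def tcm_cluster(threads, cluster_share=0.2):
    """threads: {tid: bandwidth_usage}. Returns (latency_cluster,
    bandwidth_cluster); the latency cluster is prioritized first."""
    total = sum(threads.values())
    latency, bandwidth, used = [], [], 0.0
    for tid in sorted(threads, key=threads.get):   # lightest first
        if used + threads[tid] <= cluster_share * total:
            latency.append(tid)                    # latency-sensitive
            used += threads[tid]
        else:
            bandwidth.append(tid)                  # bandwidth-sensitive
    random.shuffle(bandwidth)   # stand-in for niceness-based shuffling
    return latency, bandwidth
```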
Q1. What have the authors contributed in "A case for small row buffers in non-volatile main memories" ?
In this work, the authors discuss and evaluate architectural changes to enable small row buffers at a low cost in NVMs. The authors find that on a multi-core system, reducing the row buffer size can greatly reduce main memory dynamic energy compared to a DRAM baseline with large row sizes, without greatly affecting endurance, and for some NVM technologies, leads to improved performance.
Q2. What are the future works in "A case for small row buffers in non-volatile main memories" ?
Their future work includes exploring architectural techniques that effectively leverage small row buffer sizes for improved performance and energy-efficiency.