
Showing papers on "Memory controller" published in 2017


Proceedings ArticleDOI
14 Oct 2017
TL;DR: DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, is proposed to provide both powerful computing capability and large memory capacity/bandwidth to address the memory wall problem in traditional von Neumann architecture.
Abstract: Data movement between the processing units and the memory in the traditional von Neumann architecture is creating the “memory wall” problem. To bridge the gap, two approaches have been studied: the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory). However, the first has strong computing capability but limited memory capacity/bandwidth, whereas the second is the exact opposite. To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions by combining the functionally complete Boolean logic operations with the proposed hierarchical internal data movement designs. We further optimize DRISA to achieve high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations. CCS CONCEPTS: • Hardware → Dynamic memory; • Computer systems organization → Reconfigurable computing; Neural networks;
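
Since NOR is functionally complete, any Boolean function can be built from it alone, which is what makes the reconfiguration claim work. A minimal software illustration in C++ (a model of the composition only, not DRISA's actual bitline microarchitecture):

```cpp
#include <cstdio>
#include <cstdint>

// All operators below are composed from NOR alone, the primitive that
// DRISA's memory bitlines provide.
static uint64_t nor_(uint64_t a, uint64_t b) { return ~(a | b); }
static uint64_t not_(uint64_t a)             { return nor_(a, a); }
static uint64_t or_ (uint64_t a, uint64_t b) { return not_(nor_(a, b)); }
static uint64_t and_(uint64_t a, uint64_t b) { return nor_(not_(a), not_(b)); }
static uint64_t xor_(uint64_t a, uint64_t b) { return and_(or_(a, b), not_(and_(a, b))); }

int main() {
    uint64_t a = 0b1100, b = 0b1010;
    printf("AND=%llu OR=%llu XOR=%llu\n",          // prints 8, 14, 6
           (unsigned long long)and_(a, b),
           (unsigned long long)or_(a, b),
           (unsigned long long)xor_(a, b));
}
```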

315 citations


Journal ArticleDOI
TL;DR: A tool is designed that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies; a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels is also introduced.
Abstract: Historically, server designers have opted for simple memory systems by picking one of a few commoditized DDR memory products. We are already witnessing a major upheaval in the off-chip memory hierarchy, with the introduction of many new memory products—buffer-on-board, LRDIMM, HMC, HBM, and NVMs, to name a few. Given the plethora of choices, it is expected that different vendors will adopt different strategies for their high-capacity memory systems, often deviating from DDR standards and/or integrating new functionality within memory systems. These strategies will likely differ in their choice of interconnect and topology, with a significant fraction of memory energy being dissipated in I/O and data movement. To make the case for memory interconnect specialization, this paper makes three contributions. First, we design a tool that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies. The tool is validated against SPICE models, and is integrated into version 7 of the popular CACTI package. Our analysis with the tool shows that several design parameters have a significant impact on I/O power. We then use the tool to help craft novel specialized memory system channels. We introduce a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels. We show that this simple change to the channel topology can improve performance by 22% and lower cost by up to 65% for DDR DRAM. This new architecture does not require any changes to DIMMs, and it efficiently supports hybrid DRAM/NVM systems. Finally, as an example of a more disruptive architecture, we design a custom DIMM and parallel bus that moves away from the DDR3/DDR4 standards. To reduce energy and improve performance, the baseline data channel is split into three narrow parallel channels and the on-DIMM interconnects are operated at a lower frequency. In addition, this allows us to design a two-tier error protection strategy that reduces data transfers on the interconnect. This architecture yields a performance improvement of 18% and a memory power reduction of 23%. The cascaded channel and narrow channel architectures serve as case studies for the new tool and show the potential benefit of re-organizing basic memory interconnects.

217 citations


Proceedings ArticleDOI
14 Oct 2017
TL;DR: A new high-bandwidth DRAM architecture, Fine-Grained DRAM (FGDRAM), is proposed, which improves bandwidth by 4× and improves the energy efficiency of DRAM by 2× relative to the highest-bandwidth, most energy-efficient contemporary DRAM, High Bandwidth Memory (HBM2).
Abstract: Future GPUs and other high-performance throughput processors will require multiple TB/s of bandwidth to DRAM. Satisfying this bandwidth demand within an acceptable energy budget is a challenge in these extreme-bandwidth memory systems. We propose a new high-bandwidth DRAM architecture, Fine-Grained DRAM (FGDRAM), which improves bandwidth by 4× and improves the energy efficiency of DRAM by 2× relative to the highest-bandwidth, most energy-efficient contemporary DRAM, High Bandwidth Memory (HBM2). These benefits are in large measure achieved by partitioning the DRAM die into many independent units, called grains, each of which has a local, adjacent I/O. This approach unlocks the bandwidth of all the banks in the DRAM for simultaneous use, eliminating the shared buses interconnecting the banks. Furthermore, on-DRAM data movement energy is significantly reduced due to the much shorter wiring distance between the cell array and the local I/O. The FGDRAM architecture also readily lends itself to existing techniques for reducing the effective DRAM row size in an area-efficient manner, reducing wasteful row-activation energy in applications with low locality. In addition, when FGDRAM is paired with a memory controller optimized to exploit the additional concurrency provided by the independent grains, it improves GPU system performance by 19% over an iso-bandwidth and iso-capacity future HBM baseline. Thus, this energy-efficient, high-bandwidth FGDRAM architecture addresses the needs of future extreme-bandwidth memory systems. CCS CONCEPTS: • Hardware → Dynamic memory; Power and energy; • Computing methodologies → Graphics processors; • Computer systems organization → Parallel architectures;
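
To see why independent grains expose more concurrency, consider how a controller might decode an address so that consecutive blocks land in different grains. The bitfield layout below is purely illustrative; the paper does not specify this encoding:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical address decode for a grain-partitioned DRAM: low-order bits
// select the grain, so consecutive accesses spread across independent
// grains and can proceed in parallel. Field widths are illustrative only.
struct GrainAddr { uint32_t col, grain, row; };

GrainAddr decode(uint64_t addr) {
    GrainAddr g;
    g.col   = (addr >>  6) & 0x3F;    // 64 columns of 64 B
    g.grain = (addr >> 12) & 0xFF;    // 256 grains per die
    g.row   = (addr >> 20) & 0x3FFF;  // 16K rows per grain
    return g;
}

int main() {
    for (uint64_t a = 0; a < 4; ++a) {
        GrainAddr g = decode(a * 4096);  // consecutive 4 KB blocks
        printf("addr %llu -> grain %u row %u\n",
               (unsigned long long)(a * 4096), g.grain, g.row);
    }
}
```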

142 citations


Proceedings ArticleDOI
01 Feb 2017
TL;DR: SoftMC (Soft Memory Controller), an FPGA-based testing platform that can control and test memory modules designed for the commonly-used DDR interface, is developed as the first publicly-available DRAM testing infrastructure that can flexibly and efficiently test DRAM chips in a manner accessible to both software and hardware developers.
Abstract: DRAM is the primary technology used for main memory in modern systems. Unfortunately, as DRAM scales down to smaller technology nodes, it faces key challenges in both data integrity and latency, which strongly affects overall system reliability and performance. To develop reliable and high-performance DRAM-based main memory in future systems, it is critical to characterize, understand, and analyze various aspects (e.g., reliability, latency) of existing DRAM chips. To enable this, there is a strong need for a publicly-available DRAM testing infrastructure that can flexibly and efficiently test DRAM chips in a manner accessible to both software and hardware developers. This paper develops the first such infrastructure, SoftMC (Soft Memory Controller), an FPGA-based testing platform that can control and test memory modules designed for the commonly-used DDR (Double Data Rate) interface. SoftMC has two key properties: (i) it provides flexibility to thoroughly control memory behavior or to implement a wide range of mechanisms using DDR commands, and (ii) it is easy to use as it provides a simple and intuitive high-level programming interface for users, completely hiding the low-level details of the FPGA. We demonstrate the capability, flexibility, and programming ease of SoftMC with two example use cases. First, we implement a test that characterizes the retention time of DRAM cells. Experimental results we obtain using SoftMC are consistent with the findings of prior studies on retention time in modern DRAM, which serves as a validation of our infrastructure. Second, we validate two recently-proposed mechanisms, which rely on accessing recently-refreshed or recently-accessed DRAM cells faster than other DRAM cells. Using our infrastructure, we show that the expected latency reduction effect of these mechanisms is not observable in existing DRAM chips, which demonstrates the usefulness of SoftMC in testing new ideas on existing memory modules. We discuss several other use cases of SoftMC, including the ability to characterize emerging non-volatile memory modules that obey the DDR standard. We hope that our open-source release of SoftMC fills a gap in the space of publicly-available experimental memory testing infrastructures and inspires new studies, ideas, and methodologies in memory system design.
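
A retention-time test of the kind the paper describes could look roughly like the sketch below. The SoftMC struct and its methods here are hypothetical placeholders for the platform's high-level interface, not the actual SoftMC API:

```cpp
#include <cstdint>
#include <cstdio>
#include <thread>
#include <chrono>

// Hypothetical facade over the FPGA platform; each method would translate
// into low-level DDR command sequences on the real hardware.
struct SoftMC {
    void write_row(int row, uint8_t pattern) { /* ACT + WR commands */ }
    void disable_refresh() { /* stop issuing REF */ }
    void enable_refresh()  { /* resume REF */ }
    int  count_bit_errors(int row, uint8_t pattern) { return 0; /* RD + compare */ }
};

int main() {
    SoftMC mc;
    const int row = 42;
    mc.write_row(row, 0xFF);                      // charge all cells in the row
    mc.disable_refresh();                         // let the cells leak
    std::this_thread::sleep_for(std::chrono::seconds(2));
    int errors = mc.count_bit_errors(row, 0xFF);  // read back and compare
    mc.enable_refresh();
    printf("bit errors after 2 s without refresh: %d\n", errors);
}
```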

138 citations


Proceedings ArticleDOI
14 Oct 2017
TL;DR: This paper proposes a new logging approach for durable transactions, Proteus, which achieves the favorable characteristics of both prior software and hardware approaches, and adds hardware support, primarily within the core, to manage the execution of its new logging instructions.
Abstract: Emerging non-volatile memory (NVM) technologies, such as phase-change memory, spin-transfer torque magnetic memory, memristor, and 3D XPoint, are encouraging the development of new architectures that support the challenges of persistent programming. An important remaining challenge is dealing with the high logging overheads introduced by durable transactions. In this paper, we propose a new logging approach, Proteus, for durable transactions that achieves the favorable characteristics of both prior software and hardware approaches. Like software, it has no hardware constraint limiting the number of transactions or logs available to it, and like hardware, it has very low overhead. Our approach introduces two new instructions: log-load creates a log entry by loading the original data, and log-flush writes the log entry into the log. We add hardware support, primarily within the core, to manage the execution of these instructions and the critical ordering requirements between logging operations and updates to data. We also propose a novel optimization at the memory controller that is enabled by a persistent write pending queue in the memory controller: we drop log updates that have not yet been written back to NVMM by the time a transaction is considered durable. We implemented our design on a cycle-accurate simulator, MarssX86, and compared it against state-of-the-art hardware logging, ATOM [19], and a software-only approach. Our experiments show that Proteus improves performance by 1.44-1.47×, depending on configuration, on average compared to a system without hardware logging, and is 9-11% faster than ATOM. A significant advantage of our approach is dropping writes to the log when they are not needed. On average, ATOM makes 3.4× more writes to memory than our design. CCS CONCEPTS: • Computer systems organization → Serial architectures; • Hardware → Memory and dense storage;
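
The intended use of the two proposed instructions can be sketched as follows. log_load and log_flush are modeled here as ordinary C++ functions (hypothetical stand-ins for the ISA extensions), with the hardware's in-core buffering reduced to an array:

```cpp
#include <cstdint>
#include <cstdio>

static uint64_t undo_log[64];   // models the log region in NVM
static int log_len = 0;

// log-load: create a log entry by loading the original (pre-update) value.
uint64_t log_load(uint64_t* addr) {
    uint64_t old = *addr;
    undo_log[log_len++] = old;  // hardware would buffer this in-core
    return old;
}
// log-flush: force the buffered log entry toward the log (modeled as a no-op).
void log_flush() {}

int main() {
    uint64_t account = 100;
    // --- durable transaction: debit 30 ---
    log_load(&account);         // capture the undo image first
    log_flush();                // ordering: log persists before the data update
    account -= 30;
    printf("balance=%llu, logged entries=%d\n",
           (unsigned long long)account, log_len);
}
```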

91 citations


Book ChapterDOI
TL;DR: RowClone, a mechanism that exploits DRAM technology to perform bulk copy and initialization operations completely inside main memory, and a complementary work that uses DRAM to perform bulk bitwise AND and OR operations inside main memory, significantly improve the performance and energy efficiency of the respective operations.
Abstract: In existing systems, the off-chip memory interface allows the memory controller to perform only read or write operations. Therefore, to perform any operation, the processor must first read the source data and then write the result back to memory after performing the operation. This approach incurs high latency and consumes significant bandwidth and energy for operations that work on a large amount of data. Several works have proposed techniques to process data near memory by adding a small amount of compute logic closer to the main memory chips. In this chapter, we describe two techniques proposed by recent works that take this approach of processing in memory further by exploiting the underlying operation of the main memory technology to perform more complex tasks. First, we describe RowClone, a mechanism that exploits DRAM technology to perform bulk copy and initialization operations completely inside main memory. We then describe a complementary work that uses DRAM to perform bulk bitwise AND and OR operations inside main memory. These two techniques significantly improve the performance and energy efficiency of the respective operations.

75 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: TEMPO as mentioned in this paper is a low-overhead hardware mechanism that boosts memory performance by exploiting the operating system's (OS) virtual memory subsystem: when a page table lookup goes to DRAM, the data that the page table entry points to is prefetched.
Abstract: We propose translation-enabled memory prefetching optimizations or TEMPO, a low-overhead hardware mechanism to boost memory performance by exploiting the operating system's (OS) virtual memory subsystem. We are the first to make the following observations: (1) a substantial fraction (20-40%) of DRAM references in modern big-data workloads are devoted to accessing page tables; and (2) when memory references require page table lookups in DRAM, the vast majority of them (98%+) also look up DRAM for the subsequent data access. TEMPO exploits these observations to enable DRAM row-buffer and on-chip cache prefetching of the data that page tables point to. TEMPO requires trivial changes to the memory controller (under 3% additional area), no OS or application changes, and improves performance by 10-30% and energy by 1-14%.
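
The core observation translates into a simple controller-side hook: on a page-table-walk read, prefetch the page that the fetched PTE maps. A toy model of that hook (all structures here are illustrative, not the paper's implementation):

```cpp
#include <cstdint>
#include <cstdio>

// Simplified page table entry: physical frame number plus flag bits.
struct PTE { uint64_t pfn : 52, flags : 12; };

// Invoked when the memory controller services a read; if the read was a
// page-table walk, the data page the PTE points to is very likely the next
// DRAM access (98%+ per the paper), so prefetch it.
void on_memory_read(uint64_t addr, bool is_ptw, PTE pte) {
    if (is_ptw) {
        uint64_t data_addr = pte.pfn << 12;  // page the PTE maps
        printf("PTW at %#llx -> prefetch row buffer + cache line at %#llx\n",
               (unsigned long long)addr, (unsigned long long)data_addr);
        // open_dram_row(data_addr); prefetch_into_llc(data_addr);  // modeled
    }
}

int main() {
    PTE pte{0xABCDE, 0x3};
    on_memory_read(0x7F0010, /*is_ptw=*/true, pte);
}
```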

58 citations


Journal ArticleDOI
TL;DR: A dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed; results show that DARP improves memory access efficiency by 25.4%.
Abstract: The increasing demand on main memory capacity is one of the main big data challenges. Dynamic random access memory (DRAM) is not the best choice for main memory, due to its high power consumption and low density. Nonvolatile memory, such as phase-change memory (PCM), offers an alternative because of its low power consumption and high density. Nevertheless, high access latency and limited write endurance have so far prevented PCM from replacing DRAM. Therefore, a hybrid memory, which combines both DRAM and PCM, has become a good alternative to the traditional DRAM memory. The disadvantages of both DRAM and PCM are challenges for the hybrid memory. In this paper, a dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed. DARP classifies cache data as PCM data or DRAM data, and then adopts a different replacement policy for each data type. Specifically, for the PCM data, the least recently used (LRU) replacement policy is adopted, and for the DRAM data, the DARP is employed according to process behavior. Experimental results show that DARP improves memory access efficiency by 25.4%.
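
A skeleton of the policy's structure is easy to sketch: classify each cache line by its backing memory and dispatch to a per-type policy. The paper uses plain LRU for PCM data and an adaptive, behavior-driven policy for DRAM data; the DRAM-side scoring below is purely illustrative:

```cpp
#include <vector>
#include <cstdio>

// Each LLC line records whether its home memory is PCM or DRAM.
struct Line { bool valid; bool from_pcm; unsigned lru_age; unsigned reuse; };

// Victim selection: plain LRU for PCM-backed lines; an illustrative adaptive
// score (age, biased against lines with little reuse) for DRAM-backed lines.
int pick_victim(const std::vector<Line>& set, bool penalize_dram) {
    int victim = -1; long best = -1;
    for (int i = 0; i < (int)set.size(); ++i) {
        if (!set[i].valid) return i;                 // free way available
        long score = set[i].from_pcm
            ? (long)set[i].lru_age                   // LRU (PCM data)
            : (long)set[i].lru_age - set[i].reuse    // adaptive (DRAM data)
              + (penalize_dram ? 8 : 0);
        if (score > best) { best = score; victim = i; }
    }
    return victim;
}

int main() {
    std::vector<Line> set = {{true, true, 9, 0}, {true, false, 3, 1},
                             {true, false, 7, 0}, {true, true, 1, 2}};
    printf("victim way = %d\n", pick_victim(set, true));  // prints 2
}
```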

55 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and the iGPU, so as to motivate the need for novel paradigms for memory-centric scheduling mechanisms.
Abstract: Most of today's mixed-criticality platforms feature Systems on Chip (SoC) where a multi-core CPU complex (the host) competes with an integrated Graphic Processor Unit (iGPU, the device) for accessing central memory. The multi-core host and the iGPU share the same memory controller, which has to arbitrate data access for both clients through often undisclosed or non-priority-driven mechanisms. This aspect becomes critical when the iGPU is a high-performance, massively parallel computing complex potentially able to saturate the available DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and the iGPU, so as to motivate the need for novel paradigms for memory-centric scheduling mechanisms. We analyzed different well-known and commercially available platforms in order to estimate variations in throughput and latencies under various memory access patterns, both on the host and the device side.

53 citations


Patent
16 May 2017
TL;DR: In this paper, the memory controller is configured to associate one or more physical blocks with each of a plurality of stream IDs, and execute a first command containing a first stream ID received from a host, by storing write data included in the write IO in the one or more physical blocks associated with the first stream ID.
Abstract: A storage device includes a nonvolatile semiconductor memory device including a plurality of physical blocks and a memory controller. The memory controller is configured to associate one or more physical blocks with each of a plurality of stream IDs; execute a first command containing a first stream ID received from a host, by storing write data included in the write IO in the one or more physical blocks associated with the first stream ID; and execute a second command containing a second stream ID received from the host, by selecting a first physical block that includes valid data and invalid data, transferring the valid data stored in the first physical block to a second physical block, and associating the first physical block, from which the valid data has been transferred, with the second stream ID.

51 citations


Journal ArticleDOI
TL;DR: A cycle-accurate simulator for hybrid memory cube called CasHMC provides a cycle-by-cycle simulation of every module in an HMC and generates analysis results including a bandwidth graph and statistical data.
Abstract: 3D-stacked DRAM has been actively studied to overcome the limits of conventional DRAM. The Hybrid Memory Cube (HMC) is a type of 3D-stacked DRAM that has drawn great attention because of its usability for server systems and processing-in-memory (PIM) architectures. Since the HMC is not directly stacked on the processor die where the central processing units (CPUs) and graphics processing units (GPUs) are integrated, it has to be linked to other processor components through high-speed serial links. Therefore, the communication bandwidth and latency should be carefully estimated to evaluate the performance of the HMC. However, most existing HMC simulators employ only simple HMC modeling. In this paper, we propose CasHMC, a cycle-accurate simulator for the Hybrid Memory Cube. It provides a cycle-by-cycle simulation of every module in an HMC and generates analysis results including a bandwidth graph and statistical data. Furthermore, CasHMC is implemented in C++ as a single wrapped object that includes an HMC controller, communication links, and HMC memory. Instantiating this single wrapped object facilitates simultaneous simulation in parallel with other simulators that generate memory access patterns, such as a processor simulator or a memory trace generator.
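
The single-wrapped-object design implies a co-simulation loop like the following. Class and method names are hypothetical, not CasHMC's real interface; the point is that one object encapsulating controller, links, and memory is clocked in lock-step with another simulator:

```cpp
#include <cstdint>
#include <cstdio>

// Stand-in for the wrapped object: one instance covers the HMC controller,
// the serial links, and the stacked DRAM.
struct HMCSim {
    void add_request(uint64_t addr, bool is_write) { /* enqueue request */ }
    void tick() { /* advance controller, links, and DRAM by one cycle */ }
};

int main() {
    HMCSim hmc;                        // the whole memory side in one object
    for (uint64_t cycle = 0; cycle < 1000; ++cycle) {
        if (cycle % 100 == 0)          // e.g., trace-driven read every 100 cycles
            hmc.add_request(cycle * 64, /*is_write=*/false);
        hmc.tick();                    // lock-step with a CPU/GPU simulator
    }
    puts("co-simulation finished");
}
```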

Journal ArticleDOI
TL;DR: This paper proposes to partially replace DRAM with PCM to optimize the management of flash memory metadata for better system reliability in the presence of power failure and system crash, and presents a write-activity-aware PCM-assisted flash memory management scheme, called PCM-FTL.
Abstract: Phase change memory (PCM) is a promising DRAM alternative because of its non-volatility, high density, low standby power and close-to-DRAM performance. These features make PCM an attractive solution for optimizing the management of NAND flash memory in embedded systems. However, PCM's limited write endurance hinders its application in embedded systems. Therefore, how to manage flash memory with PCM, and particularly how to guarantee PCM a reasonable lifetime, becomes a challenging issue. In this paper, we propose to partially replace DRAM with PCM to optimize the management of flash memory metadata for better system reliability in the presence of power failures and system crashes. To prolong PCM's lifetime, we present a write-activity-aware PCM-assisted flash memory management scheme, called PCM-FTL. By differentiating sequential and random I/O behaviors, a novel two-level mapping mechanism and a customized wear-leveling scheme are developed to reduce writes to PCM and extend its lifetime. We evaluate PCM-FTL with a variety of general-purpose and mobile I/O workloads. Experimental results show that PCM-FTL can significantly reduce write activities and achieve an even distribution of writes in PCM with very low overhead.
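
The benefit of differentiating sequential from random I/O can be illustrated with a toy two-level mapping: a long sequential write updates one coarse block-mapping entry in PCM instead of many page-mapping entries. This is a simplified model of the idea, not PCM-FTL's exact scheme:

```cpp
#include <unordered_map>
#include <cstdio>

constexpr int PAGES_PER_BLOCK = 64;
std::unordered_map<int, int> block_map;  // logical block -> physical block (coarse)
std::unordered_map<int, int> page_map;   // logical page  -> physical page  (fine)
int metadata_writes = 0;                 // writes to PCM-resident metadata

void map_write(int lpn_start, int count, int phys_start) {
    if (count == PAGES_PER_BLOCK && lpn_start % PAGES_PER_BLOCK == 0) {
        // Sequential, block-aligned: one coarse entry covers the whole block.
        block_map[lpn_start / PAGES_PER_BLOCK] = phys_start / PAGES_PER_BLOCK;
        metadata_writes += 1;
    } else {
        // Random: fall back to per-page mapping, one PCM write per page.
        for (int i = 0; i < count; ++i) page_map[lpn_start + i] = phys_start + i;
        metadata_writes += count;
    }
}

int main() {
    map_write(0, 64, 1024);   // sequential block write: 1 metadata write
    map_write(200, 3, 4096);  // small random write: 3 metadata writes
    printf("PCM metadata writes: %d\n", metadata_writes);  // prints 4
}
```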

Journal ArticleDOI
TL;DR: This paper explores the interactions between DRAM and PCM to improve both the performance and the endurance of a DRAM-PCM hybrid main memory and develops a proactive strategy to allocate pages taking both program segments and DRAM conflict misses into consideration.
Abstract: Phase change memory (PCM), given its nonvolatility, potential high density, and low standby power, is a promising candidate to be used as main memory in next generation computer systems. However, to hide its shortcomings of limited endurance and slow write performance, state-of-the-art solutions tend to construct a dynamic RAM (DRAM)-PCM hybrid memory and place write-intensive pages in DRAM. While existing optimizations to this hybrid architecture focus on tuning DRAM configurations to reduce the number of write operations to PCM, this paper explores the interactions between DRAM and PCM to improve both the performance and the endurance of a DRAM-PCM hybrid main memory. Specifically, it exploits the flexibility of mapping virtual pages to physical pages, and develops a proactive strategy to allocate pages taking both program segments and DRAM conflict misses into consideration, thus distributing those heavily written pages across different DRAM sets. Meanwhile, a lifetime-aware DRAM replacement algorithm and a conflict-aware page remapping strategy are proposed to further reduce DRAM misses and PCM writes. Experiments confirm that the proposed techniques are able to improve average memory hit time and reduce maximum PCM write counts thus enhancing both performance and lifetime of a DRAM-PCM hybrid main memory.

Proceedings ArticleDOI
14 Oct 2017
TL;DR: This work presents the Data Processing Unit or DPU, a shared-memory many-core that is specifically designed for high-bandwidth analytics workloads and provides acceleration for core-to-core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine.
Abstract: For many years, the highest energy cost in processing has been data movement rather than computation, and energy is the limiting factor in processor design [21]. As the data needed for a single application grows to exabytes [56], there is clearly an opportunity to design a bandwidth-optimized architecture for big data computation by specializing hardware for data movement. We present the Data Processing Unit or DPU, a shared-memory many-core that is specifically designed for high-bandwidth analytics workloads. The DPU contains a unique Data Movement System (DMS), which provides hardware acceleration for data movement and partitioning operations at the memory controller that is sufficient to keep up with DDR bandwidth. The DPU also provides acceleration for core-to-core communication via a unique hardware RPC mechanism called the Atomic Transaction Engine. Comparison of a DPU chip fabricated in 40nm with a Xeon processor on a variety of data processing applications shows a 3×-15× performance per watt advantage. CCS CONCEPTS: • Computer systems organization → Multicore architectures; Special purpose systems;

Proceedings ArticleDOI
18 Apr 2017
TL;DR: A novel DRAM controller is designed that bundles and executes memory requests of hard real-time applications in consecutive rounds based on their type, to reduce read/write switching delay, while providing a configurable, guaranteed bandwidth for soft real-time requests.
Abstract: We design a novel DRAM controller that bundles and executes memory requests of hard real-time applications in consecutive rounds based on their type to reduce read/write switching delay. At the same time, our controller provides a configurable, guaranteed bandwidth for soft real-time requests. We show that there is a fundamental trade-off between the latency guarantee for hard real-time requests and the bandwidth provided to soft requests. Finally, we compare our approach analytically and experimentally with the current state-of-the-art real-time memory controller for single-rank DRAM devices, which applies type reordering at the level of DRAM commands rather than requests. Our evaluation shows that for tasks exhibiting average row hit ratios, or for which computing a row hit guarantee might be difficult, our controller provides both smaller guaranteed latency and larger bandwidth.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work designs a software interface that programmers can use to identify data structures that are resilient to approximations and proposes a runtime quality control framework that automatically determines the error constraints for the identified data structures such that a given target application-level quality is maintained.
Abstract: Memory subsystems are a major energy bottleneck in computing platforms due to frequent transfers between processors and off-chip memory. We propose approximate memory compression, a technique that leverages the intrinsic resilience of emerging workloads such as machine learning and data analytics to reduce off-chip memory traffic and energy. To realize approximate memory compression, we enhance the memory controller to be aware of memory regions that contain approximation-resilient data, and to transparently compress/decompress the data written to/read from these regions. To provide control over approximations, the quality-aware memory controller conforms to a specified error constraint for each approximate memory region. We design a software interface that programmers can use to identify data structures that are resilient to approximations. We also propose a runtime quality control framework that automatically determines the error constraints for the identified data structures such that a given target application-level quality is maintained. We evaluate our proposal by implementing a hardware prototype using the Intel UniPHY-DDR3 memory controller and NIOS-II processor, a Hynix DDR3 DRAM module, and a Stratix-IV FPGA development board. Across a suite of 8 machine learning benchmarks, approximate memory compression obtains a 1.28× benefit in DRAM energy and a simultaneous 11.5% improvement in execution time for a small (< 1.5%) loss in output quality.
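
The programmer-facing side of such a scheme might look like the hypothetical allocator below, which tags a region as approximation-resilient and attaches an error constraint for the quality-aware memory controller to enforce. approx_malloc and its semantics are illustrative, not the paper's actual API:

```cpp
#include <cstdlib>
#include <cstdio>

// Hypothetical interface: allocate a memory region that the quality-aware
// memory controller may transparently compress/decompress, subject to the
// given per-region error constraint.
void* approx_malloc(size_t bytes, double max_error_pct) {
    void* p = malloc(bytes);
    // A real system would register [p, p + bytes) with the memory controller
    // as an approximate region carrying this error bound.
    printf("registered %zu B approximate region, error <= %.2f%%\n",
           bytes, max_error_pct);
    return p;
}

int main() {
    // A feature matrix in an ML workload tolerates small numeric error.
    float* features = (float*)approx_malloc(1 << 20, /*max_error_pct=*/1.5);
    features[0] = 3.14f;   // accesses are (de)compressed transparently
    free(features);
}
```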

Patent
Kang Kyu-Chang1, Yang Hui-Kap1
27 Jul 2017
TL;DR: In this article, the row selection circuit performs an access operation with respect to the memory bank and a hammer refresh operation on a row that is physically adjacent to a row accessed intensively.
Abstract: A memory device includes a memory bank, a row selection circuit and a refresh controller. The memory bank includes a plurality of memory blocks, and each memory block includes a plurality of memory cells arranged in rows and columns. The row selection circuit performs an access operation with respect to the memory bank and a hammer refresh operation with respect to a row that is physically adjacent to a row that is accessed intensively. The refresh controller controls the row selection circuit such that the hammer refresh operation is performed during a row active time for the access operation. The hammer refresh operation may be performed efficiently and performance of the memory device may be enhanced by performing the hammer refresh operation during the row active time for the access operation.

Journal ArticleDOI
TL;DR: A novel approach to schedule memory requests in Mixed Criticality Systems by enabling the MCS designer to specify memory requirements per task is proposed, and a compact time-division-multiplexing scheduler and framework that constructs optimal schedules to manage requests to off-chip memory are introduced.
Abstract: We propose a novel approach to schedule memory requests in Mixed Criticality Systems (MCS). This approach supports an arbitrary number of criticality levels by enabling the MCS designer to specify memory requirements per task. It retains locality within large-size requests to satisfy memory requirements of all tasks. To achieve this target, we introduce a compact time-division-multiplexing scheduler, and a framework that constructs optimal schedules to manage requests to off-chip memory. We also present a static analysis that guarantees meeting requirements of all tasks. We compare the proposed controller against state-of-the-art memory controllers using both a case study and synthetic experiments.

Proceedings ArticleDOI
14 Oct 2017
TL;DR: This paper presents a lightweight hardware mechanism that augments the memory controller and performs the page merging process with minimal hypervisor involvement, and repurposes the Error Correction Codes (ECC) engine to generate accurate and inexpensive ECC-based hash keys.
Abstract: To reduce the memory requirements of virtualized environments, modern hypervisors are equipped with the capability to search the memory address space and merge identical pages – a process called page deduplication. This process uses a combination of data hashing and exhaustive comparison of pages, which consumes processor cycles and pollutes caches. In this paper, we present a lightweight hardware mechanism that augments the memory controller and performs the page merging process with minimal hypervisor involvement. Our concept, called PageForge, is effective. It compares pages in the memory controller, and repurposes the Error Correction Codes (ECC) engine to generate accurate and inexpensive ECC-based hash keys. We evaluate PageForge with simulations of a 10-core processor with a virtual machine (VM) on each core, running a set of applications from the TailBench suite. When compared with RedHat’s KSM, a state-of-the-art software implementation of page merging, PageForge attains identical savings in memory footprint while substantially reducing the overhead. Compared to a system without same-page merging, PageForge reduces the memory footprint by an average of 48%, enabling the deployment of twice as many VMs for the same physical memory. Importantly, it keeps the average latency overhead to 10%, and the 95th-percentile tail latency to 11%. In contrast, in KSM, these latency overheads are 68% and 136%, respectively. CCS CONCEPTS: • Computer systems organization → Cloud computing; • Software and its engineering → Operating systems; Memory management; Virtual machines;
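
The merge flow PageForge moves into hardware can be modeled in a few lines: a cheap hash key (derived from the ECC engine in the real design, a simple FNV hash here) filters candidates, and an exact comparison confirms equality before pages are merged. Data structures are illustrative:

```cpp
#include <cstring>
#include <cstdint>
#include <unordered_map>
#include <cstdio>

constexpr size_t PAGE = 4096;
std::unordered_map<uint64_t, const uint8_t*> seen;  // hash key -> canonical page

// FNV-1a over the page; stands in for the inexpensive ECC-based hash key.
uint64_t hash_key(const uint8_t* p) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < PAGE; ++i) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

// Returns the canonical copy: an existing identical page if one is found
// (merge to a single copy-on-write frame), otherwise the page itself.
const uint8_t* try_merge(const uint8_t* page) {
    uint64_t key = hash_key(page);
    auto it = seen.find(key);
    if (it != seen.end() && memcmp(it->second, page, PAGE) == 0)
        return it->second;                           // hash matched, contents match
    seen.emplace(key, page);
    return page;                                     // unique page
}

int main() {
    static uint8_t a[PAGE] = {1}, b[PAGE] = {1};     // identical contents
    printf("merged: %s\n", try_merge(a) == try_merge(b) ? "yes" : "no");
}
```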

Proceedings ArticleDOI
14 Oct 2017
TL;DR: ConTutto is the first ever FPGA platform on the memory bus of a server-class processor, providing a means for in-line acceleration of certain computations en route to memory and enabling sensitivity analysis for memory latency while running real applications.
Abstract: We demonstrate the use of an FPGA as a memory buffer in a POWER8® system, creating a novel prototyping platform that enables innovation in the memory subsystem of POWER-based servers. Our platform, called ConTutto, is pin-compatible with POWER8 buffered memory DIMMs and plugs into a memory slot of a standard POWER8 processor system, running at aggregate memory channel speeds of 35 GB/s per link. ConTutto, which means “with everything”, is a platform to experiment with different memory technologies, such as STT-MRAM and NAND Flash, in an end-to-end system context. Enablement of STT-MRAM and NVDIMM using ConTutto shows up to 12.5× lower latency and 7.5× higher bandwidth compared to the respective technologies when attached to the PCIe bus. Moreover, due to the unique attach-point of the FPGA between the processor and system memory, ConTutto provides a means for in-line acceleration of certain computations en route to memory, and enables sensitivity analysis for memory latency while running real applications. To the best of our knowledge, ConTutto is the first ever FPGA platform on the memory bus of a server-class processor. CCS CONCEPTS: • Hardware → Emerging technologies → Analysis and design of emerging devices and systems; • Computer systems organization → Architectures → Other architectures → Reconfigurable computing;

Journal ArticleDOI
TL;DR: P-Alloc is presented, a process-variation tolerant reliability management strategy for 3D charge-trapping flash memory that significantly enhances the reliability and reduces the access latency compared to the baseline scheme.
Abstract: Three-dimensional (3D) flash memory is an emerging memory technology that enables a number of improvements over conventional planar NAND flash memory, including larger capacity, less program disturbance, and lower access latency. In contrast to conventional planar flash memory, 3D flash memory adopts a charge-trapping mechanism. NAND strings punch through multiple stacked layers to form the three-dimensional infrastructure. However, the etching processes for NAND strings are unable to produce perfectly vertical features, especially at the scale of 20 nanometers or less. The resulting process variation causes uneven distribution of electrons, which poses a threat to the integrity of data stored in flash. This paper presents P-Alloc, a process-variation-tolerant reliability management strategy for 3D charge-trapping flash memory. P-Alloc offers both hardware and software support to allocate data to the 3D flash in the presence of process variation. P-Alloc predicts the state of a physical page, i.e., the basic unit for each write or read operation in flash memory, and tries to assign critical data to more reliable pages. A hardware-based voltage threshold compensation scheme is also proposed to further reduce faults. We demonstrate the viability of the proposed scheme using a variety of realistic workloads. Our extensive evaluations show that P-Alloc significantly enhances reliability and reduces access latency compared to the baseline scheme.

Patent
13 Jun 2017
TL;DR: In this article, the host directly assigns physical addresses and performs logical-to-physical address translation in a manner that reduces or eliminates the need for a memory controller to handle these functions.
Abstract: This disclosure provides for improvements in managing multi-drive, multi-die or multi-plane NAND flash memory. In one embodiment, the host directly assigns physical addresses and performs logical-to-physical address translation in a manner that reduces or eliminates the need for a memory controller to handle these functions, and initiates functions such as wear leveling in a manner that avoids competition with host data accesses. A memory controller optionally educates the host on array composition, capabilities and addressing restrictions. Host software can therefore interleave write and read requests across dies in a manner unencumbered by memory controller address translation. For multi-plane designs, the host writes related data in a manner consistent with multi-plane device addressing limitations. The host is therefore able to “plan ahead” in a manner supporting host issuance of true multi-plane read commands.

Proceedings ArticleDOI
18 Jun 2017
TL;DR: This work proposes a persistent memory accelerator design, which guarantees NVRAM data persistence in hardware while leaving cache hierarchy and memory controller operations unaltered, and achieves performance close to that of a system without persistence guarantees.
Abstract: Persistent memory places NVRAM on the memory bus, offering fast access to persistent data, yet maintaining NVRAM data persistence raises a host of challenges. Most proposed schemes either incur much performance overhead or require substantial modifications to existing architectures. We propose a persistent memory accelerator design, which guarantees NVRAM data persistence in hardware while leaving cache hierarchy and memory controller operations unaltered. A nonvolatile transaction cache keeps an alternative version of data updates side-by-side with the cache hierarchy and paves a new persistent path without affecting the original processor execution path. As a result, our design achieves performance close to that of a system without persistence guarantees.

Proceedings ArticleDOI
27 Mar 2017
TL;DR: This paper advocates a more general metric, Average Memory Access Time (AMAT), to evaluate the performance of hybrid memories, and proposes a miss penalty-aware LRU-based (MALRU) cache replacement policy for hybrid memory systems that improves system performance over LRU and the state-of-the-art HAP policy.
Abstract: Current DRAM-based memory systems face scalability challenges in terms of storage density, power, and cost. Hybrid memory architecture composed of emerging Non-Volatile Memory (NVM) and DRAM is a promising approach to large-capacity and energy-efficient main memory. However, hybrid memory systems pose a new challenge to on-chip cache management due to the asymmetrical penalty of memory access to DRAM and NVM in case of cache misses. Cache hit rate is no longer an effective metric for evaluating memory access performance in hybrid memory systems, and current cache replacement policies that aim to improve cache hit rate are not efficient either. In this paper, we take into account the asymmetry of cache miss penalty on DRAM and NVM, and advocate a more general metric, Average Memory Access Time (AMAT), to evaluate the performance of hybrid memories. We propose a miss penalty-aware LRU-based (MALRU) cache replacement policy for hybrid memory systems. MALRU is aware of the source (DRAM or NVM) of missing blocks and prevents high-latency NVM blocks, as well as low-latency DRAM blocks with good temporal locality, from being evicted. Experimental results show that MALRU improves system performance over LRU and the state-of-the-art HAP policy by up to 20.4% and 11.7% (11.1% and 5.7% on average), respectively.
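
A worked example shows why AMAT, not hit rate, is the right metric under asymmetric penalties: a policy with a slightly lower hit rate can still win if its misses land on DRAM rather than NVM. The latencies below are illustrative round numbers, not measurements from the paper:

```cpp
#include <cstdio>

// AMAT = hit latency + miss rate * (weighted average of DRAM/NVM penalties).
double amat(double hit_rate, double hit_lat,
            double dram_miss_frac, double dram_pen, double nvm_pen) {
    double miss = 1.0 - hit_rate;
    return hit_lat + miss * (dram_miss_frac * dram_pen
                             + (1.0 - dram_miss_frac) * nvm_pen);
}

int main() {
    // Plain LRU: 92% hit rate, but 60% of misses go to slow NVM.
    double lru   = amat(0.92, 10, 0.40, 100, 300);   // 27.6 cycles
    // MALRU-like: 91% hit rate, but only 20% of misses go to NVM.
    double malru = amat(0.91, 10, 0.80, 100, 300);   // 22.6 cycles
    printf("AMAT  LRU: %.1f cycles   MALRU-like: %.1f cycles\n", lru, malru);
}
```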

Proceedings ArticleDOI
01 Jun 2017
TL;DR: Several building blocks needed for implementing PREM on the NVIDIA Tegra X1 platform are introduced, a modification of the MemGuard tool that makes it practically usable on ARM platforms is proposed, and it is shown that the memory controller's throttling mechanism can be used to make the execution time of CPU tasks more predictable.
Abstract: Many of today's real-time applications, such as Advanced Driver Assistant Systems (ADAS), demand both high computing power and safety guarantees. High computing power can be easily delivered by now-ubiquitous multi-core CPUs or by a heterogeneous system with a multi-core CPU and a parallel accelerator such as a GPU. Reaching the required safety level in such a system is by far more difficult, because commercial off-the-shelf (COTS) high-performance platforms contain many shared resources (e.g. main memory) with arbiters not designed to provide real-time guarantees. A promising approach to address this problem, known as the PRedictable Execution Model (PREM), was introduced by Pellizzoni et al. [1]. We are interested in applying PREM to ARM-based heterogeneous platforms, but so far, all PREM-related work has been done on x86 or PowerPC. In this paper, we introduce several building blocks that are needed for implementing PREM on the NVIDIA Tegra X1 platform. We propose a modification of the MemGuard tool to be practically usable on ARM platforms. We also analyse a throttling mechanism of the Tegra X1 memory controller that allows controlling the memory bandwidth of non-CPU clients such as the GPU. We show that this mechanism can be used to make the execution time of CPU tasks more predictable.
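
MemGuard's underlying idea is budget-based throttling: each core receives a per-period memory-access budget and is stalled once the budget is exhausted. The sketch below models that policy abstractly; it is not tied to the real MemGuard code, a PMU, or the Tegra X1 registers:

```cpp
#include <cstdio>

// Per-core budget of memory accesses (e.g., LLC-miss cache-line fetches)
// allowed within one regulation period.
struct CoreBudget { long budget; long used = 0; };

// Called on each memory access of the core (PMU-interrupt driven in the
// real tool). Returns false when the core must stall for this period.
bool may_access(CoreBudget& c) {
    if (c.used >= c.budget) return false;
    ++c.used;
    return true;
}
void new_period(CoreBudget& c) { c.used = 0; }   // periodic timer tick

int main() {
    CoreBudget rt_core{1000};        // guarantee: 1000 line fetches per period
    long granted = 0, throttled = 0;
    for (int i = 0; i < 1500; ++i)
        if (may_access(rt_core)) ++granted; else ++throttled;
    printf("granted %ld, throttled %ld this period\n", granted, throttled);
    new_period(rt_core);             // budget replenished for the next period
}
```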

Patent
Choi Wonjun1, Yang Hui-Kap1
18 May 2017
TL;DR: In this paper, a row selection circuit performs the access operation and the refresh operation with respect to the memory bank, while the collision controller generates a wait signal causing a delay of the access operation based on a result of a comparison of a row address associated with the access operation and a refresh address associated with the refresh operation.
Abstract: A memory device includes a memory bank, a command control logic circuit, a row selection circuit, a refresh controller and a collision controller. The memory bank includes a plurality of memory blocks. The command control logic circuit decodes commands received from a memory controller to generate control signals. The command control logic receives an active command for an access operation during a refresh operation. The row selection circuit performs the access operation and the refresh operation with respect to the memory bank. The refresh controller controls the refresh operation. The collision controller generates a wait signal causing a delay of the access operation based on a result of a comparison of a row address associated with the access operation and a refresh address associated with the refresh operation.

Journal ArticleDOI
TL;DR: DRAMSpec is introduced, a high-level DRAM bank/chip modeling tool that is able to aid in evaluating novel DRAM architectures, such as the Hybrid Memory Cube (HMC), for which no DRAM datasheets are available.
Abstract: In systems ranging from mobile devices to servers, DRAM has a big impact on performance and contributes a significant part of the total consumed power. The performance and power of the system depend on the architecture of the DRAM chip, the design of the memory controller, and the access patterns received by the memory controller. Thus, evaluating the impact of DRAM design decisions requires a holistic approach that includes an appropriate model of the DRAM bank, a realistic controller and DRAM power model, and a representative workload, which requires a full-system simulator running a complete software stack. In this paper, we introduce DRAMSpec, a high-level DRAM bank/chip modeling tool. Our contribution is to move the DRAM modeling abstraction level from the circuit level to the DRAM bank; by integrating DRAMSpec into full-system simulators, we allow system or processor designers (non-DRAM experts) to tune future DRAM architectures for their target applications. We demonstrate the merits of DRAMSpec by exploring the influence of DRAM row-buffer (page) size and the number of banks on the performance and power of a server application (memcached). Our new DRAM design offers a 16% DRAM performance improvement and 13% DRAM energy saving compared to standard commodity DDR3 devices. Additionally, we demonstrate how our tool is able to aid in evaluating novel DRAM architectures, such as the Hybrid Memory Cube (HMC), for which no DRAM datasheets are available. Finally, we highlight the DRAM technology scaling for a specific HMC architecture and quantify the impact on latency and power.

Journal ArticleDOI
TL;DR: A new memory-based control problem is addressed for neutral systems with time-varying delay, input saturations and energy bounded disturbances by using the combination of a novel delay-dependent polytopic approach, augmented Lyapunov–Krasovskii functionals and some integral inequalities.
Abstract: In this paper, a new memory-based control problem is addressed for neutral systems with time-varying delay, input saturations and energy bounded disturbances. Attention is focused on the design of a memory-based state feedback controller such that the closed-loop system achieves the desirable performance indices including the boundedness of the state trajectories, the H∞ disturbance rejection/attenuation level as well as the asymptotic stability. By using the combination of a novel delay-dependent polytopic approach, augmented Lyapunov–Krasovskii functionals and some integral inequalities, delay-dependent sufficient conditions are first proposed in terms of linear matrix inequalities. Then, three convex optimization problems are formulated whose aims are to, respectively, maximize the disturbance tolerance level, minimize the disturbance attenuation level and maximize the initial condition set. Finally, simulation examples demonstrate the effectiveness and benefits of the obtained results.
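
For readers outside control theory, "memory-based" here means the feedback law uses past (delayed) state as well as the current state. A common general form in this literature is shown below; the paper's specific gain construction via LMIs is not reproduced here:

```latex
% Memory-based state feedback for a system with time-varying delay tau(t):
% K acts on the current state, K_d on the delayed (memorized) state.
\[
  u(t) \;=\; K\,x(t) \;+\; K_d\,x\bigl(t-\tau(t)\bigr)
\]
% The gains K and K_d are obtained by solving delay-dependent
% linear matrix inequalities (LMIs).
```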

Patent
17 Oct 2017
TL;DR: In this paper, the cache tracking database is updated when the NVDIMM-N cache system is switched off and the data and cache tracking data can be retrieved from non-volatile memory devices when the system is restored.
Abstract: An SCM memory mode NVDIMM-N cache system includes an SCM subsystem, and an NVDIMM-N subsystem having volatile memory device(s) and non-volatile memory device(s). A memory controller writes data to the volatile memory device(s) and, in response, updates a cache tracking database. The memory controller then writes a subset of the data to the SCM subsystem subsequent to the writing of that data to the volatile memory device(s) and, in response, updates the cache tracking database. The memory controller then receives a shutdown signal and, in response, copies the cache tracking database to the volatile memory device(s) in the NVDIMM-N subsystem. The NVDIMM-N subsystem then copies at least some of the data and the cache tracking database from the volatile memory device(s) to the non-volatile memory device(s) prior to shutdown. The data and the cache tracking database may then be retrieved from the non-volatile memory device(s) when the system is restored.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: This paper presents SecMC, a secure memory controller that provides efficient memory scheduling with a strong quantitative security guarantee against timing channel attacks, and proposes SecMC-Bound, which enables trading off security for performance with a quantitative information-theoretic bound on information leakage.
Abstract: This paper presents SecMC, a secure memory controller that provides efficient memory scheduling with a strong quantitative security guarantee against timing channel attacks. The first variant, named SecMC-NI, eliminates timing channels while allowing a tight memory schedule by interleaving memory requests that access different banks or ranks. Experimental results show that SecMC-NI significantly improves performance (by 45% on average) over the best known scheme that does not rely on restricting memory placements. To further improve performance, the paper proposes SecMC-Bound, which enables trading off security for performance with a quantitative information-theoretic bound on information leakage. The experimental results show that allowing small information leakage can yield significant performance improvements.