Showing papers on "Memory controller" published in 2023


Journal ArticleDOI
TL;DR: Monarch as discussed by the authors is a resistive 3D stacked memory based on a novel reconfigurable crosspoint array called XAM, which switches between random access and content-addressable modes.
Abstract: 3D die stacking has often been proposed to build large-scale DRAM-based caches. Unfortunately, the power and performance overheads of DRAM limit the efficiency of high-bandwidth memories. Also, DRAM is facing serious scalability challenges that make alternative technologies more appealing. This paper examines Monarch, a resistive 3D stacked memory based on a novel reconfigurable crosspoint array called XAM. The XAM array is capable of switching between random-access and content-addressable modes, which enables Monarch (i) to better utilize the in-package bandwidth and (ii) to satisfy both the random access memory and associative search requirements of various applications. Moreover, the Monarch controller ensures a given target lifetime for the resistive stack. Our simulation results on a set of parallel memory-intensive applications indicate that Monarch outperforms ideal DRAM caching by 1.21× on average. For in-memory hash table and string matching workloads, Monarch improves performance by up to 12× over conventional high-bandwidth memories.

1 citation
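
A minimal Python sketch of the mode-switching idea (names and behaviour are illustrative, not Monarch's actual design): the same array serves random-access reads/writes in one mode and associative searches in the other.

```python
class ReconfigurableArray:
    """Toy model of a RAM/CAM-reconfigurable crosspoint array."""
    RAM, CAM = "ram", "cam"

    def __init__(self, rows, row_bytes):
        self.rows = [bytes(row_bytes) for _ in range(rows)]
        self.mode = self.RAM

    def set_mode(self, mode):
        # The real XAM array reconfigures the crosspoint circuitry; here we
        # only gate which operations are legal in each mode.
        assert mode in (self.RAM, self.CAM)
        self.mode = mode

    def read(self, row):
        assert self.mode == self.RAM
        return self.rows[row]

    def write(self, row, data):
        assert self.mode == self.RAM
        self.rows[row] = bytes(data)

    def search(self, key):
        # Content-addressable mode: every row is compared against the key
        # (in parallel in hardware); all matching row indices are returned.
        assert self.mode == self.CAM
        return [i for i, r in enumerate(self.rows) if r == bytes(key)]

arr = ReconfigurableArray(rows=4, row_bytes=4)
arr.write(2, b"key!")
arr.set_mode(ReconfigurableArray.CAM)
print(arr.search(b"key!"))  # -> [2]
```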


Proceedings ArticleDOI
17 Jun 2023
TL;DR: In this paper, the authors propose a decoupled SSD (dSSD) system that decouples the front-end (i.e., cores, system bus, DRAM) from the back-end (i.e., flash memory).
Abstract: Modern NAND Flash memory-based Solid State Drives (SSDs) are designed to provide high bandwidth for I/O requests through the high-speed NVMe interface and increased internal flash memory bandwidth. In addition to providing high performance for incoming I/O requests, the flash translation layer (FTL) also handles other flash memory management processes, including garbage collection, that can negatively impact I/O performance. In this work, we address how the sharing of system resources (e.g., system bus and DRAM) between I/O requests and garbage collection can cause interference and performance degradation. In particular, we propose to rethink SSD architecture through a Decoupled SSD (dSSD) system that decouples the front-end (i.e., cores, system bus, DRAM) from the back-end (i.e., flash memory). A flash-controller network-on-chip (fNoC) that interconnects the flash controllers is introduced to decouple the I/O path and the garbage collection path to improve performance and reliability. dSSD enables advanced commands such as the copyback command to be exploited for efficient garbage collection, and we propose to extend the copyback command with a global copyback through the fNoC. To improve reliability, we propose to recycle superblocks through a superblock recycle table within the flash controller. Without any modification to the FTL, a hardware-based offloading mechanism within the flash controller of the dSSD is proposed to dynamically re-organize a superblock. Our evaluations show that the decoupled SSD results in up to 42.7% I/O bandwidth improvement and 63.8% GC performance improvement, while achieving approximately 31.4× improvement in tail latency on average. Dynamic superblock management through the dSSD results in approximately 23% improvement in lifetime with minimal impact on performance and cost.
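
As a rough illustration of the superblock-recycling idea (structures and names are hypothetical, not the paper's design), the sketch below re-forms full superblocks from the surviving good blocks of partially-bad ones, as a recycle table in the flash controller might track them:

```python
NUM_CHANNELS = 4  # one block per channel forms a superblock

def recycle_superblocks(retired):
    """retired: list of superblocks, each a dict {channel: block_id, or None
    if that member block went bad}. Returns newly formed full superblocks."""
    spare = {ch: [] for ch in range(NUM_CHANNELS)}
    for sb in retired:
        for ch, blk in sb.items():
            if blk is not None:          # keep surviving good blocks
                spare[ch].append(blk)
    rebuilt = []
    # A full superblock needs one spare block on every channel.
    while all(spare[ch] for ch in range(NUM_CHANNELS)):
        rebuilt.append({ch: spare[ch].pop() for ch in range(NUM_CHANNELS)})
    return rebuilt

retired = [
    {0: 10, 1: None, 2: 12, 3: 13},   # bad block on channel 1
    {0: None, 1: 21, 2: 22, 3: 23},   # bad block on channel 0
]
print(recycle_superblocks(retired))   # one full superblock rebuilt
```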

Posted ContentDOI
29 Jun 2023
TL;DR: In this paper, the authors show that RowPress amplifies DRAM's vulnerability to read-disturb attacks by reducing the number of row activations needed to induce bitflips by one to two orders of magnitude under realistic conditions.
Abstract: Memory isolation is critical for system reliability, security, and safety. Unfortunately, read disturbance can break memory isolation in modern DRAM chips. For example, RowHammer is a well-studied read-disturb phenomenon where repeatedly opening and closing (i.e., hammering) a DRAM row many times causes bitflips in physically nearby rows. This paper experimentally demonstrates and analyzes another widespread read-disturb phenomenon, RowPress, in real DDR4 DRAM chips. RowPress breaks memory isolation by keeping a DRAM row open for a long period of time, which disturbs physically nearby rows enough to cause bitflips. We show that RowPress amplifies DRAM's vulnerability to read-disturb attacks by significantly reducing the number of row activations needed to induce a bitflip by one to two orders of magnitude under realistic conditions. In extreme cases, RowPress induces bitflips in a DRAM row when an adjacent row is activated only once. Our detailed characterization of 164 real DDR4 DRAM chips shows that RowPress 1) affects chips from all three major DRAM manufacturers, 2) gets worse as DRAM technology scales down to smaller node sizes, and 3) affects a different set of DRAM cells from RowHammer and behaves differently from RowHammer as temperature and access pattern change. We demonstrate in a real DDR4-based system with RowHammer protection that 1) a user-level program induces bitflips by leveraging RowPress while conventional RowHammer cannot do so, and 2) a memory controller that adaptively keeps the DRAM row open for a longer period of time based on access pattern can facilitate RowPress-based attacks. To prevent bitflips due to RowPress, we describe and evaluate a new methodology that adapts existing RowHammer mitigation techniques to also mitigate RowPress with low additional performance overhead. We open-source all our code and data to facilitate future research on RowPress.
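
One way to picture the proposed mitigation direction (an illustrative sketch with made-up thresholds, not the paper's exact mechanism) is a counter-based RowHammer defense where each activation is weighted by how long the row stayed open, so long-open RowPress patterns also trigger preventive neighbor refreshes:

```python
HAMMER_THRESHOLD = 50_000    # weighted count that triggers neighbor refresh
T_ON_BASELINE_NS = 36        # roughly tRAS-scale row-open time (illustrative)
counters = {}

def refresh_neighbors(row):
    print(f"preventively refreshing rows adjacent to row {row}")

def on_precharge(row, open_time_ns):
    # Scale the increment by how long the row was kept open: RowPress shows
    # long open times disturb neighbors far more than a plain activation.
    weight = max(1.0, open_time_ns / T_ON_BASELINE_NS)
    counters[row] = counters.get(row, 0.0) + weight
    if counters[row] >= HAMMER_THRESHOLD:
        refresh_neighbors(row)
        counters[row] = 0.0

# A few hundred long-open activations reach the threshold that would
# otherwise require ~50,000 ordinary activations.
for _ in range(240):
    on_precharge(row=42, open_time_ns=7_800)
```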

Proceedings ArticleDOI
08 Mar 2023
TL;DR: In this article, a memory controller design with the AMBA 3 AHB_Lite standard based on a single-master, multiple-slave model is presented and verified using a SystemVerilog verification environment and functional coverage.
Abstract: As technology advances, the on-chip communication bus architecture becomes increasingly prominent in interconnecting various components within a System-on-Chip (SoC). The standard ARM AMBA on-chip interconnect bus is designed as an SoC's high-performance backbone bus, supporting fast communication with internal and external memories. This paper presents a memory controller design with the AMBA 3 AHB_Lite standard based on a single-master, multiple-slave model. We verified the design per ARM's specifications using a SystemVerilog verification environment and functional coverage. Various testbench components, such as the transaction and generator (which generate the input stimulus), the driver (which drives input data to the Design Under Test (DUT)), the monitor (which monitors the signals from the DUT), and the scoreboard (which reports on the design's working condition), are developed to test single, wrapping, and incrementing bursts of various sizes (4, 8, and 16 beats) with waited transfer responses of the AHB_Lite protocol. We also observed different corner cases during burst and wrap transfers.
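
For context, a scoreboard for such a testbench needs a reference model of AHB-Lite burst addressing; the sketch below (Python rather than SystemVerilog, and not the paper's code) computes the per-beat addresses of incrementing and wrapping bursts:

```python
def burst_addresses(start, beats, size_bytes, wrapping):
    """Return the address of every beat of an INCRx or WRAPx AHB burst."""
    addrs = [start]
    boundary = beats * size_bytes           # wrap window, e.g. WRAP4 x 4B = 16B
    base = (start // boundary) * boundary   # aligned start of the wrap window
    for _ in range(beats - 1):
        nxt = addrs[-1] + size_bytes
        if wrapping and nxt >= base + boundary:
            nxt = base                      # wrap back to the window start
        addrs.append(nxt)
    return addrs

# WRAP4 with 4-byte beats starting at 0x38 wraps at the 16-byte boundary:
print([hex(a) for a in burst_addresses(0x38, beats=4, size_bytes=4,
                                       wrapping=True)])
# -> ['0x38', '0x3c', '0x30', '0x34']
```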

Proceedings ArticleDOI
17 Jun 2023
TL;DR: Wang et al. as mentioned in this paper proposed the DRAM Translation Layer (DTL) for host-software/MC-transparent DRAM power management with commodity DRAM devices; placed in the CXL memory controller, DTL provides flexible address mappings between host physical addresses and DRAM device physical addresses.
Abstract: Memory disaggregation is a promising solution to scale memory capacity and bandwidth shared by multiple server nodes in a flexible and cost-effective manner. DRAM power consumption, which is reported to be around 40% of the total system power in the datacenter server, will become an even more serious concern in this high-capacity environment. Given the low average utilization of DRAM capacity in today's datacenters, it is appealing to put unallocated/cold DRAM ranks into a power-saving mode. However, the conventional DRAM address mapping with fine-grained interleaving to maximize rank-level parallelism is incompatible with such rank-level DRAM power management techniques. Furthermore, existing DRAM power-saving techniques often require intrusive changes to the system stack, including the OS, memory controller (MC), or even DRAM devices, posing additional challenges for deployment. Thus, we propose the DRAM Translation Layer (DTL) for host-software/MC-transparent DRAM power management with commodity DRAM devices. Inspired by the Flash Translation Layer (FTL) in modern SSDs, DTL is placed in the CXL memory controller to provide (i) flexible address mappings between host physical addresses and DRAM device physical addresses and (ii) host-transparent memory page migration. Leveraging DTL, we propose two DRAM power-saving techniques with different temporal granularities to maximize the number of DRAM ranks that can enter low-power states while provisioning sufficient DRAM bandwidth: rank-level power-down and hotness-aware self-refresh. The first technique consolidates unallocated memory pages into a subset of ranks at deallocation of a virtual machine (VM) and turns them off transparently to both the OS and the host MC. Our evaluation with CloudSuite benchmarks demonstrates that this technique saves DRAM power by 31.6% on average at a 1.6% performance cost. The hotness-aware self-refresh scheme further reduces DRAM energy consumption by up to 14.9% with negligible performance loss via opportunistically migrating cold pages into a rank and making it enter self-refresh mode.
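
A compact sketch of the DTL indirection idea (hypothetical structures, not the paper's hardware): a table in the CXL controller maps host physical pages to device pages, so pages can be migrated off a rank before it is powered down, invisibly to the host OS and MC.

```python
class DramTranslationLayer:
    """Toy host-page -> (rank, device-page) indirection table."""

    def __init__(self, ranks, pages_per_rank):
        self.map = {}                # host page -> (rank, device page)
        self.free = {r: set(range(pages_per_rank)) for r in range(ranks)}

    def alloc(self, host_page, rank):
        self.map[host_page] = (rank, self.free[rank].pop())

    def migrate(self, host_page, dst_rank):
        # The data copy would happen in hardware; here we only retarget
        # the mapping, which is exactly what keeps the host oblivious.
        src_rank, dev = self.map[host_page]
        self.free[src_rank].add(dev)
        self.map[host_page] = (dst_rank, self.free[dst_rank].pop())

    def drain_rank(self, rank, dst_rank):
        # Consolidate all pages off `rank` so it can enter power-down.
        for hp, (r, _) in list(self.map.items()):
            if r == rank:
                self.migrate(hp, dst_rank)

dtl = DramTranslationLayer(ranks=2, pages_per_rank=1024)
dtl.alloc(host_page=7, rank=1)
dtl.drain_rank(rank=1, dst_rank=0)   # rank 1 is now empty: power it down
```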

Proceedings ArticleDOI
12 Feb 2023
TL;DR: In this paper, the authors use the temporal dimension, implemented with architectural multiplexing coupled with block-level synchronization, to model a complete system-on-chip architecture while preserving timing in sync.
Abstract: High-end FPGAs enable architecture modeling through emulation with high speed and fidelity. However, the available reconfigurable logic and memory resources limit the size, complexity, and speed of the emulated target designs. The challenge is to map and model large and fast memory hierarchies, such as large caches and mixed main memory, various heterogeneous computation instances, such as CPUs, GPUs, AI/ML processing units and accelerator cores, and communication infrastructure, such as buses and networks. In addition to the spatial dimension, this work uses the temporal dimension, implemented with architectural multiplexing coupled with block-level synchronization, to model a complete system-on-chip architecture. Our approach presents mechanisms to abstract instance plurality while preserving timing in sync. With only a subset of the architecture on the FPGA, we freeze a whole emulated module's activity and state during the additional time intervals necessary for the action on the virtualized modules to elapse. We demonstrate this technique by emulating a hypothetical system consisting of a processor and an SRAM memory too large to map on the FPGA. For this, we modify a LiteX-generated SoC consisting of a VexRISC-V processor and DDR memory, with the memory controller issuing stall signals that freeze the processor, effectively "hiding" the memory latency. For Linux boot, we measure significant emulation vs. simulation speedup while matching RTL simulation accuracy. The work is open-sourced.
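
A toy model of the freeze mechanism (illustrative only, not the paper's RTL): when an access targets a module virtualized off-FPGA, the emulated processor is stalled until the virtual module's latency has elapsed, so emulated time stays consistent.

```python
VIRTUAL_MEM_LATENCY = 20   # emulated cycles the virtualized memory needs

def run(trace):
    """Count emulated cycles for a trace of one op per cycle."""
    cycle = 0
    for op in trace:
        if op == "mem":
            # Freeze: hold the processor's clock-enable low while the host
            # services the virtualized memory, then resume in lockstep, so
            # the access appears to take exactly its emulated latency.
            cycle += VIRTUAL_MEM_LATENCY
        cycle += 1
    return cycle

print(run(["cpu", "mem", "cpu"]))   # -> 23 emulated cycles
```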

Journal ArticleDOI
TL;DR: In this paper , an accelerated Direct Memory Access (DMA) controller based on Advanced eXtensible Interface (AXI) bus protocol is proposed, which supports single transmission of data and linked list transmission type, which can verify the transmitted data and ensure the security of data transmission.
Abstract: Because Direct Memory Access (DMA) consumes almost no processor resources when moving high-speed data, an accelerated DMA controller based on the Advanced eXtensible Interface (AXI) bus protocol is proposed in this paper. The accelerated part of the controller replaces the CPU with hardware for descriptor-splitting processing, which frees up substantial CPU computing power. The controller is also equipped with eight deeply configurable channels used to process different types of tasks. The design supports single and linked-list transfer types and can verify the transmitted data to ensure the security of data transmission. In this design, the controller's operating frequency reaches 500 MHz, its power consumption is 1.3 mW, and its data throughput reaches 40 Gbps, greatly improving the system's data-movement efficiency.
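
To make the descriptor-splitting step concrete, the sketch below (illustrative Python, not the controller's RTL) shows the kind of work such hardware offloads from the CPU: breaking a DMA descriptor into AXI-legal bursts that never cross a 4 KiB address boundary.

```python
AXI_BOUNDARY = 4096   # AXI bursts must not cross a 4 KiB address boundary

def split_descriptor(addr, length, max_burst_bytes=256):
    """Yield (addr, nbytes) bursts that respect the AXI 4 KiB rule.
    max_burst_bytes is an illustrative per-burst cap."""
    while length > 0:
        to_boundary = AXI_BOUNDARY - (addr % AXI_BOUNDARY)
        nbytes = min(length, max_burst_bytes, to_boundary)
        yield addr, nbytes
        addr += nbytes
        length -= nbytes

# A descriptor straddling a 4 KiB boundary is split at 0x1000:
print([(hex(a), n) for a, n in split_descriptor(0x0F80, 0x100)])
# -> [('0xf80', 128), ('0x1000', 128)]
```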

Journal ArticleDOI
TL;DR: In this article, the authors propose an FPGA implementation of the associative processor (AP) architecture, including the CAM and its peripheral circuits, such as the controller, data cache, instruction cache, and program counter.
Abstract: In order to deal with increasingly complex computing problems, in-memory computing systems have been proposed to replace traditional von Neumann architectures. In-memory computing saves the time and energy of moving data between the memory and the processor, avoiding the memory-wall bottleneck of the traditional von Neumann architecture. The associative processor (AP) is one such architecture proposed to implement in-memory computing. Content-addressable memory (CAM), a critical component of in-memory computing, plays an important role in an AP. In this paper, we propose a novel FPGA implementation of the AP, including the CAM and its peripheral circuits, such as the controller, data cache, instruction cache, and program counter. The design details of the whole AP architecture are described in Verilog HDL. To the best of our knowledge, this is the first work to implement an associative processor on a real-world FPGA platform.
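
The AP computing style can be illustrated in a few lines (a conceptual sketch, not the paper's Verilog): each instruction is a compare pass that tags matching CAM rows, followed by a write pass that updates only the tagged rows, across all rows at once.

```python
def ap_pass(memory, key, key_mask, write_val, write_mask):
    """One AP pass: tag rows where (row & key_mask) == key, then on every
    tagged row overwrite the bits selected by write_mask with write_val."""
    tags = [(row & key_mask) == key for row in memory]
    return [(row & ~write_mask) | (write_val & write_mask) if t else row
            for row, t in zip(memory, tags)]

# Logical NOT of bit 0 into bit 1, done as two passes over all rows.
# Bit 0 is never modified, so the second pass cannot re-tag rows the
# first pass just wrote.
rows = [0b0, 0b1, 0b0]
rows = ap_pass(rows, key=0b1, key_mask=0b1, write_val=0b00, write_mask=0b10)
rows = ap_pass(rows, key=0b0, key_mask=0b1, write_val=0b10, write_mask=0b10)
print([bin(r) for r in rows])   # -> ['0b10', '0b1', '0b10']
```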

Proceedings ArticleDOI
01 Apr 2023
TL;DR: In this article, the authors propose an extension to ARM's AXI4 protocol that adds irregular stream semantics to memory requests and avoids inefficient narrow-bus transfers by packing multiple narrow data elements onto a wide bus.
Abstract: Data-intensive applications involving irregular memory streams are inefficiently handled by modern processors and memory systems highly optimized for regular, contiguous data. Recent work tackles these inefficiencies in hardware through core-side stream extensions or memory-side prefetchers and accelerators, but fails to provide end-to-end solutions which also achieve high efficiency in on-chip interconnects. We propose AXI-Pack, an extension to ARM's AXI4 protocol introducing bandwidth-efficient strided and indirect bursts to enable end-to-end irregular streams. AXI-Pack adds irregular stream semantics to memory requests and avoids inefficient narrow-bus transfers by packing multiple narrow data elements onto a wide bus. It retains full compatibility with AXI4 and does not require modifications to non-burst-reshaping interconnect IPs. To demonstrate our approach end-to-end, we extend an open-source RISC-V vector processor to leverage AXI-Pack at its memory interface for strided and indexed accesses. On the memory side, we design a banked memory controller efficiently handling AXI-Pack requests. On a system with a 256-bit-wide interconnect running FP32 workloads, AXI-Pack achieves near-ideal peak on-chip bus utilizations of 87% and 39%, speedups of 5.4x and 2.4x, and energy efficiency improvements of 5.3x and 2.1x over a baseline using an AXI4 bus on strided and indirect benchmarks, respectively.
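
A back-of-envelope calculation shows why packing matters for the paper's 256-bit bus with FP32 elements (the 87%/39% figures above are the paper's measured results; the arithmetic below only gives the ideal bounds):

```python
BUS_BITS, ELEM_BITS = 256, 32

# Baseline AXI4: a strided or indexed access issues one narrow transfer per
# element, so each 256-bit beat carries only one useful 32-bit element.
baseline_util = ELEM_BITS / BUS_BITS                           # 12.5%

# AXI-Pack: consecutive stream elements are packed onto the wide bus, so a
# beat can carry BUS_BITS // ELEM_BITS = 8 elements (the ideal bound).
packed_util = (BUS_BITS // ELEM_BITS) * ELEM_BITS / BUS_BITS   # 100%

print(f"baseline {baseline_util:.1%}, packed ideal {packed_util:.0%}")
```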

Journal ArticleDOI
TL;DR: In this article, the authors propose a power-efficient and cost-effective solution consisting of CXL-attached memory hardware and a software suite, which inherits and expands the traditional memory management architecture of existing Linux systems.
Abstract: The rapid development of data-intensive technologies has driven an increasing demand for new architectural solutions with scalable, composable, and coherent computing environments. Compute Express Link (CXL), an open-standard interconnect protocol, overcomes architectural limitations by efficiently expanding memory capacity and bandwidth. In this work, we propose a power-efficient and cost-effective solution consisting of CXL-attached memory hardware and a software suite. The memory module hardware, integrating double-data-rate (DDR) dynamic random access memory and a CXL controller, expands bandwidth by dozens of gigabytes per second and increases memory capacity by a few terabytes. Our software suite, the scalable memory development kit, inherits and expands the traditional memory management architecture of existing Linux systems. We validated the functionality of the proposed solution by integrating it with well-known data-center and compute-intensive applications. The proposed CXL solution improved throughput for in-memory database and artificial intelligence applications by 1.5-fold and 1.99-fold, respectively, compared with a conventional DDR-only memory system.

Proceedings ArticleDOI
01 Feb 2023
TL;DR: Secure persistent buffers (SecPB) as discussed by the authors is a battery-backed persistent structure that moves the point of secure data persistency from the memory controller closer to the core of the processor.
Abstract: The durability of data stored in persistent memory (PM) exposes that data to potential data-leakage attacks. Recent research has identified the requirements for crash-recoverable secure PM, but does not consider recent trends of the persistency domain extending on-chip to include cache hierarchies. In this paper, we explore this design space and identify performance and energy optimization opportunities. We propose secure persistent buffers (SecPB), a battery-backed persistent structure that moves the point of secure data persistency from the memory controller closer to the core. We revisit the fundamentals of how data in PM is secured and show how various subsets of security metadata can be generated lazily while still guaranteeing crash recoverability and integrity verification. We analyze the metadata dependency chain required in securing PM and expose optimization opportunities that allow SecPB to reduce performance overheads by up to 32.8×, with average performance overheads as low as 1.3% observed for reasonable battery capacities.
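
A conceptual sketch of the lazy-metadata idea (our illustration, not SecPB's actual microarchitecture): the battery-backed buffer eagerly persists only what recovery strictly needs, while MAC and integrity-tree updates are deferred off the critical path.

```python
from collections import deque

persist_log = []       # stands in for the battery-backed persistent buffer
lazy_queue = deque()   # metadata work deferred past the persist point

def secure_persist(addr, ciphertext, counter):
    # Eagerly persisted: ciphertext plus its encryption counter are, in this
    # simplified view, sufficient to recover and re-verify after a crash.
    persist_log.append((addr, ciphertext, counter))
    # Lazily generated: the MAC and integrity-tree path can be recomputed
    # from the log, so their updates need not block the store.
    lazy_queue.append(("mac+tree update", addr))

def drain_lazy():
    # Performed opportunistically (e.g., during idle memory cycles).
    while lazy_queue:
        print("applying", *lazy_queue.popleft())

secure_persist(0x80, b"...", counter=17)
drain_lazy()
```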

Book ChapterDOI
TL;DR: In this paper, an existing Coq DRAM controller framework, which can be used to write DRAM scheduling algorithms that comply with a variety of correctness criteria, is connected to a back-end that generates provably logically equivalent hardware.
Abstract: Recent research in both academia and industry has successfully used deductive verification to design hardware and prove its correctness. While tools and languages to write formally proved hardware have been proposed, applications and use cases are often overlooked. In this work, we focus on Dynamic Random Access Memory (DRAM) controllers and the DRAM itself, whose expected temporal and functional behaviours are described in the standards written by the Joint Electron Device Engineering Council (JEDEC). Concretely, we associate an existing Coq DRAM controller framework (which can be used to write DRAM scheduling algorithms that comply with a variety of correctness criteria) with a back-end system that generates proved logically equivalent hardware. This makes it possible to simultaneously enjoy the trustworthiness provided by the Coq framework and use the generated synthesizable hardware in real systems. We validate the approach by using the generated code as a plug-in replacement in an existing DDR4 controller implementation, which includes a host interface (AXI), a physical layer (PHY) from Xilinx, and a model of the Micron MT40A1G8WE-075E:D memory part. We simulate and synthesise the full system.
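
As an example of the kind of JEDEC-style obligation such a framework must provably respect (an illustrative formula, not the chapter's Coq statement), consider the row-cycle constraint between activations of the same bank:

```latex
\forall\, i < j:\quad
  \bigl(\mathrm{cmd}_i = \mathrm{cmd}_j = \mathrm{ACT}
  \,\wedge\, \mathrm{bank}_i = \mathrm{bank}_j\bigr)
  \;\Longrightarrow\; t_j - t_i \ge t_{RC}
```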

Journal ArticleDOI
TL;DR: In this paper, the authors introduce the concept of enhanced memory functions (EMFs) and describe two use cases, one prototyped using a field-programmable gate array-based intelligent memory controller platform.
Abstract: The arrival of the Compute Express Link (CXL) protocol is a significant milestone for the systems community. CXL provides a standardized, cache-coherent memory protocol that can be used to attach devices and memory to a system, while maintaining memory coherency with the host processor. CXL enables accelerators (e.g., graphics processing units and data processing units) to both have direct load/store access to the host memory and the ability to make their own on-device memory likewise accessible to the host central processing unit. Because CXL allows technology interposition on the memory data plane, it opens up the possibility of “pushing down” functions into the memory subsystem. In this article, we introduce the concept of enhanced memory functions (EMFs). We then describe two use cases, one prototyped using a field-programmable gate array-based intelligent memory controller platform. Finally, we show initial experimental results indicating that EMFs could present valuable solutions to problems that are difficult to solve within existing computer architectures.

Posted ContentDOI
23 Mar 2023
TL;DR: In this article, a new cycle-level DRAM cache model for heterogeneous and disaggregated systems is presented, enabling the community to perform design space exploration for future generations of memory systems supporting DRAM caches.
Abstract: The increasing growth of applications' memory capacity and performance demands has led CPU vendors to deploy heterogeneous memory systems, either within a single system or via disaggregation. For instance, systems like Intel's Knights Landing and Sapphire Rapids can be configured to use high-bandwidth memory as a cache to main memory. While there is significant research investigating the designs of DRAM caches, there has been little research investigating DRAM caches from a full-system point of view, because there is no suitable model available to the community to accurately study large-scale systems with DRAM caches at cycle level. In this work we describe a new cycle-level DRAM cache model in the gem5 simulator which can be used for heterogeneous and disaggregated systems. We believe this model enables the community to perform design space exploration for future generations of memory systems supporting DRAM caches.

Posted ContentDOI
23 Mar 2023
TL;DR: In this paper, the authors present a cycle-level DRAM cache model integrated with gem5 that leverages the flexibility of gem5's memory device models and full-system support.
Abstract: To accommodate the growing memory footprints of today's applications, CPU vendors have employed large DRAM caches, backed by large non-volatile memories like Intel Optane (e.g., in Intel's Cascade Lake). Existing computer architecture simulators do not provide support to model and evaluate systems which use DRAM devices as a cache to non-volatile main memory. In this work, we present a cycle-level DRAM cache model which is integrated with gem5. This model leverages the flexibility of gem5's memory device models and full-system support to enable exploration of many different DRAM cache designs. We demonstrate the usefulness of this new tool by exploring the design space of a DRAM cache controller through several case studies, including the impact of scheduling policies, required buffering, combining different memory technologies (e.g., HBM, DDR3/4/5, 3DXPoint, high-latency) as the cache and main memory, and the effect of wear-leveling when the DRAM cache is backed by NVM main memory. We also perform experiments with real workloads in full-system simulations to validate the proposed model and show the sensitivity of these workloads to DRAM cache sizes.
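
A configuration fragment in gem5's classic Python style suggests how such a model might be instantiated. Note that DRAMCacheCtrl below is a stand-in name for the paper's model, not a class in mainline gem5, and the usual boilerplate (workload setup, interrupts, Root, instantiation) is omitted:

```python
import m5
from m5.objects import *

# Hypothetical sketch: `DRAMCacheCtrl` and its parameters stand in for the
# DRAM cache model the paper adds to gem5; real names may differ.
system = System(
    clk_domain=SrcClockDomain(clock="3GHz", voltage_domain=VoltageDomain()),
    mem_mode="timing",
    mem_ranges=[AddrRange("16GB")],
)
system.cpu = TimingSimpleCPU()
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports

# An HBM device fronting NVM main memory, one of the technology
# combinations explored in the paper's case studies.
system.mem_ctrl = DRAMCacheCtrl(            # hypothetical class name
    dram_cache=HBM_2000_4H_1x64(),          # gem5 memory interface model
    main_memory=NVM_2400_1x64(),            # gem5 memory interface model
)
system.mem_ctrl.port = system.membus.mem_side_ports
# ... workload, Root(), m5.instantiate(), m5.simulate() omitted
```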