
Showing papers on "Memory controller published in 2020"


Journal ArticleDOI
TL;DR: Based on the stability criterion, considering both the time-varying delay and its bounds, a generalised memory controller is designed for T-S fuzzy systems, which covers the memoryless and traditional memory ones as its special cases.
Abstract: In this paper, stability analysis and controller synthesis problems for T-S fuzzy systems with time-varying delay are studied. A generalised parameter-dependent reciprocally convex inequality (GPDRCI) is presented to handle the derivative of triple integral terms, which is more general than some existing ones. By choosing suitable flexible polynomials with tunable parameters, novel flexible polynomial-based functions (FPFs) are proposed in delay-product types, which overcome the incompletely slack matrices, higher order time delay and insufficient parameters in the existing functions. Benefitting from completely slack matrices and lower order time delay, coupling relationship among system states and time delay is fully linked. Based on the GPDRCI and FPFs, a stability condition is derived for T-S fuzzy systems. Based on the stability criterion, considering both the time-varying delay and its bounds, a generalised memory controller is designed for T-S fuzzy systems, which covers the memoryless and traditional memory ones as its special cases. In addition, the constraints on introduced slack matrices in some existing works are avoided with the help of a matrix inequality decoupling technique. These provide extra free dimensions in the solution space. Some examples are employed to illustrate the effectiveness of the proposed methods.
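
As a rough illustration of the controller structure described above (a generic form, not the paper's exact formulation), a memory state-feedback law for a T-S fuzzy system uses both the current and the delayed state:

  u(t) = \sum_{i=1}^{r} h_i(\theta(t)) \left[ K_{1i}\, x(t) + K_{2i}\, x(t - \tau(t)) \right]

where h_i are the normalised membership functions, \tau(t) is the time-varying delay, and K_{1i}, K_{2i} are gain matrices obtained from the LMI-based stability conditions. Setting K_{2i} = 0 recovers a memoryless controller, and replacing \tau(t) with a fixed delay recovers the traditional memory controller, which is the sense in which the generalised design covers both as special cases.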

30 citations


Patent
31 Mar 2020
TL;DR: In this article, the authors describe a memory controller implemented error correction code (ECC) memory, where ECC groups may be placed across banks of the memory to restrict a given bank to a single member of the ECC group.
Abstract: Devices and techniques for memory controller implemented error correction code (ECC) memory are disclosed herein. ECC groups may be placed across banks of the memory. In some examples, an ECC group is a collection of bytes equal to one row in one bank. Also, the placement may restrict a given bank to a single member of the ECC group. A memory operation can be received and executed using the ECC groups.
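
A minimal sketch (not the patented implementation) of the placement idea: each ECC group spans the banks of the memory so that any single bank holds at most one member of the group, so a whole-bank failure corrupts at most one symbol per group. The parameters below are assumptions for illustration.

# Hypothetical parameters: 8 banks, one group member (e.g., one row's worth
# of bytes) per bank. Member m of group g maps to bank m, row g.
NUM_BANKS = 8

def place_ecc_group(group_id, member_id, num_banks=NUM_BANKS):
    """Return (bank, row) for one member of an ECC group.

    Placing member m of every group in bank m guarantees that a given bank
    holds only a single member of any ECC group.
    """
    assert 0 <= member_id < num_banks
    return member_id, group_id  # (bank, row)

if __name__ == "__main__":
    # All members of group 5 land in distinct banks.
    print([place_ecc_group(5, m) for m in range(NUM_BANKS)])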

22 citations


Journal ArticleDOI
TL;DR: Network-on-Memory is presented, a lightweight inter-bank data communication scheme that enables direct data copy across both memory banks of a 3D-stacked memory and improves the performance of data-intensive workloads by 3.8X and 75 percent, on average, compared to the baseline conventional DRAM architecture and state-of-the-art techniques.
Abstract: Data copy is a widely-used memory operation in many programs and operating system services. In conventional computers, data copy is often carried out by two separate read and write transactions that pass data back and forth between the DRAM chip and the processor chip. Some prior mechanisms propose to avoid this unnecessary data movement by using the shared internal bus in the DRAM chip to directly copy data within the DRAM chip (e.g., between two DRAM banks). While these methods exhibit superior performance compared to conventional techniques, data copy across different DRAM banks is still greatly slower than data copy within the same DRAM bank. Hence, these techniques have limited benefit for the emerging 3D-stacked memories (e.g., HMC and HBM) that contain hundreds of DRAM banks across multiple memory controllers. In this paper, we present Network-on-Memory (NoM), a lightweight inter-bank data communication scheme that enables direct data copy across both memory banks of a 3D-stacked memory. NoM adopts a TDM-based circuit-switching design, where circuit setup is done by the memory controller. Compared to state-of-the-art approaches, NoM enables both fast data copy between multiple DRAM banks and concurrent data transfer operations. Our evaluation shows that NoM improves the performance of data-intensive workloads by 3.8X and 75 percent, on average, compared to the baseline conventional 3D-stacked DRAM architecture and state-of-the-art techniques, respectively.
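
A toy sketch of the TDM-style circuit-setup idea (the slot count and data structures are assumptions, not the paper's design): the memory controller reserves a time slot on the shared inter-bank links for each pending bank-to-bank copy, so several copies proceed concurrently as long as their banks do not collide in the same slot.

def schedule_copies(copies, num_slots=4):
    """Greedy TDM slot assignment for inter-bank copies.

    copies: list of (src_bank, dst_bank) pairs. Two copies may share a time
    slot only if they touch disjoint banks, mimicking circuit setup done by
    the memory controller. Returns a dict mapping slot -> list of copies.
    """
    slots = {s: [] for s in range(num_slots)}
    busy = {s: set() for s in range(num_slots)}   # banks reserved per slot
    for src, dst in copies:
        for s in range(num_slots):
            if src not in busy[s] and dst not in busy[s]:
                slots[s].append((src, dst))
                busy[s].update((src, dst))
                break
        else:
            raise RuntimeError("no free slot; copy must wait for the next round")
    return slots

if __name__ == "__main__":
    # (0->1) and (2->3) can share a slot; (1->2) conflicts with both.
    print(schedule_copies([(0, 1), (2, 3), (1, 2)]))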

20 citations


Proceedings ArticleDOI
30 May 2020
TL;DR: This paper proposes a transparent and efficient hardware-assisted out-of-place update (HOOP) mechanism that supports atomic data durability without incurring much extra write traffic or performance overhead, and demonstrates scalable data recovery capability on multi-core systems.
Abstract: Byte-addressable non-volatile memory (NVM) is a promising technology that provides near-DRAM performance with scalable memory capacity. However, it requires atomic data durability to ensure memory persistency. Therefore, many techniques, including logging and shadow paging, have been proposed. However, most of them either introduce extra write traffic to NVM or suffer from significant performance overhead on the critical path of program execution, or even both. In this paper, we propose a transparent and efficient hardware-assisted out-of-place update (HOOP) mechanism that supports atomic data durability without incurring much extra write traffic or performance overhead. The key idea is to write the updated data to a new place in NVM, while retaining the old data until the updated data becomes durable. To support this, we develop a lightweight indirection layer in the memory controller to enable efficient address translation and adaptive garbage collection for NVM. We evaluate HOOP with a variety of popular data structures and data-intensive applications, including key-value stores and databases. Our evaluation shows that HOOP achieves low critical-path latency with small write amplification, which is close to that of a native system without persistence support. Compared with state-of-the-art crash-consistency techniques, it improves application performance by up to 1.7×, while reducing the write amplification by up to 2.1×. HOOP also demonstrates scalable data recovery capability on multi-core systems.
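
A simplified sketch of the out-of-place update idea (the names and structures are illustrative, not HOOP's actual hardware): writes go to a fresh NVM location, an indirection table in the memory controller remaps the logical address once the new data is in place, and the old location is reclaimed later by garbage collection.

class OutOfPlaceNVM:
    """Toy model of out-of-place updates with an indirection table."""

    def __init__(self, capacity):
        self.media = [None] * capacity      # simulated NVM cells
        self.remap = {}                     # logical addr -> physical slot
        self.free = list(range(capacity))   # free physical slots
        self.garbage = []                   # old slots awaiting GC

    def write(self, addr, value):
        new_slot = self.free.pop()          # out-of-place: never overwrite in place
        self.media[new_slot] = value
        old_slot = self.remap.get(addr)
        self.remap[addr] = new_slot         # remap update makes the new data visible
        if old_slot is not None:
            self.garbage.append(old_slot)   # reclaim lazily, off the critical path

    def read(self, addr):
        return self.media[self.remap[addr]]

    def gc(self):
        while self.garbage:
            slot = self.garbage.pop()
            self.media[slot] = None
            self.free.append(slot)

if __name__ == "__main__":
    nvm = OutOfPlaceNVM(8)
    nvm.write(0x10, "v1")
    nvm.write(0x10, "v2")
    print(nvm.read(0x10))   # "v2"; the old "v1" survives until gc() runs
    nvm.gc()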

19 citations


Journal ArticleDOI
Sun Hao, Chen Lan, Hao Xiaoran, Liu Chenji, Ni Mao 
TL;DR: A hybrid storage class memory system to reduce the energy consumption and optimize IO performance is presented and a migration scheme implemented in the memory controller is proposed that can reduce energy consumption by 46.2% on average compared with the traditional DRAM-only system.
Abstract: Conventional main memory can no longer meet the requirements of low energy consumption and massive data storage in an artificial intelligence Internet of Things (AIoT) system. Moreover, the efficiency is decreased due to the swapping of data between the main memory and storage. This paper presents a hybrid storage class memory system to reduce the energy consumption and optimize IO performance. Phase change memory (PCM) brings the advantages of low static power and a large capacity to a hybrid memory system. In order to avoid the impact of poor write performance in PCM, a migration scheme implemented in the memory controller is proposed. By counting the write times and row buffer miss times in PCM simultaneously, the write-intensive data can be selected and migrated from PCM to dynamic random-access memory (DRAM) efficiently, which improves the performance of hybrid storage class memory. In addition, a fast mode with a tmpfs-based, in-memory file system is applied to hybrid storage class memory to reduce the number of data movements between memory and external storage. Experimental results show that the proposed system can reduce energy consumption by 46.2% on average compared with the traditional DRAM-only system. The fast mode increases the IO performance of the system by more than 30 times compared with the common ext3 file system.
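
A rough sketch of the migration heuristic described above, with made-up thresholds: the controller tracks per-page write counts and row-buffer misses in PCM and promotes pages that look write-intensive to DRAM.

from collections import defaultdict

# Hypothetical thresholds; the actual scheme tunes these in the memory controller.
WRITE_THRESHOLD = 16
ROWMISS_THRESHOLD = 8

writes = defaultdict(int)
row_misses = defaultdict(int)
in_dram = set()

def record_pcm_access(page, is_write, row_buffer_hit):
    """Update counters for a PCM page and decide whether to migrate it to DRAM."""
    if is_write:
        writes[page] += 1
    if not row_buffer_hit:
        row_misses[page] += 1
    if (page not in in_dram
            and writes[page] >= WRITE_THRESHOLD
            and row_misses[page] >= ROWMISS_THRESHOLD):
        in_dram.add(page)           # stand-in for issuing the actual migration
        return "migrate"
    return "stay"

if __name__ == "__main__":
    for _ in range(20):
        decision = record_pcm_access(page=42, is_write=True, row_buffer_hit=False)
    print(decision)  # "migrate" once both counters cross their thresholds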

16 citations


Proceedings ArticleDOI
30 May 2020
TL;DR: The benefits of VBI are demonstrated with two important use cases: reducing the overheads of address translation (for both native execution and virtual machine environments), as VBI reduces the number of translation requests and associated memory accesses; and two heterogeneous main memory architectures, where VBI significantly improves performance over conventional virtual memory.
Abstract: Computers continue to diversify with respect to system designs, emerging memory technologies, and application memory demands. Unfortunately, continually adapting the conventional virtual memory framework to each possible system configuration is challenging, and often results in performance loss or requires non-trivial workarounds. To address these challenges, we propose a new virtual memory framework, the Virtual Block Interface (VBI). We design VBI based on the key idea that delegating memory management duties to hardware can reduce the overheads and software complexity associated with virtual memory. VBI introduces a set of variable-sized virtual blocks (VBs) to applications. Each VB is a contiguous region of the globally-visible VBI address space, and an application can allocate each semantically meaningful unit of information (e.g., a data structure) in a separate VB. VBI decouples access protection from memory allocation and address translation. While the OS controls which programs have access to which VBs, dedicated hardware in the memory controller manages the physical memory allocation and address translation of the VBs. This approach enables several architectural optimizations to (1) efficiently and flexibly cater to different and increasingly diverse system configurations, and (2) eliminate key inefficiencies of conventional virtual memory. We demonstrate the benefits of VBI with two important use cases: (1) reducing the overheads of address translation (for both native execution and virtual machine environments), as VBI reduces the number of translation requests and associated memory accesses; and (2) two heterogeneous main memory architectures, where VBI increases the effectiveness of managing fast memory regions. For both cases, VBI significantly improves performance over conventional virtual memory.
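
A very small sketch of the virtual-block idea (the data structures are illustrative, not the paper's hardware): the OS only grants a process access to whole VBs, while a controller-side table maps (VB id, offset) to physical frames on demand.

# Hypothetical 4 KiB frames; translation state conceptually lives with the memory controller.
FRAME_SIZE = 4096

class VirtualBlockInterface:
    def __init__(self):
        self.next_frame = 0
        self.vb_perms = {}        # vb_id -> set of allowed process ids (OS-managed)
        self.vb_frames = {}       # (vb_id, page index) -> physical frame (HW-managed)

    def create_vb(self, vb_id, owner_pid):
        self.vb_perms[vb_id] = {owner_pid}

    def translate(self, pid, vb_id, offset):
        """Check access, then translate a VB-relative offset to a physical address."""
        if pid not in self.vb_perms.get(vb_id, set()):
            raise PermissionError("process may not touch this VB")
        key = (vb_id, offset // FRAME_SIZE)
        if key not in self.vb_frames:         # allocate physical memory lazily
            self.vb_frames[key] = self.next_frame
            self.next_frame += 1
        return self.vb_frames[key] * FRAME_SIZE + offset % FRAME_SIZE

if __name__ == "__main__":
    vbi = VirtualBlockInterface()
    vbi.create_vb(vb_id=1, owner_pid=100)
    print(hex(vbi.translate(pid=100, vb_id=1, offset=5000)))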

16 citations


Journal ArticleDOI
TL;DR: MEG is proposed, an open source, configurable, cycle-exact, and RISC-V-based full-system emulation infrastructure using FPGA and HBM that provides a highly modular hardware design and includes a bootable Linux image for a realistic software flow, so that users can perform cross-layer software-hardware co-optimization in a full- system environment.
Abstract: Emerging three-dimensional (3D) memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide high-bandwidth and massive memory-level parallelism. With the growing heterogeneity and complexity of computer systems (CPU cores and accelerators, etc.), efficiently integrating emerging memories into existing systems poses new challenges and requires detailed evaluation in a realistic computing environment. In this article, we propose MEG, an open source, configurable, cycle-exact, and RISC-V-based full-system emulation infrastructure using FPGA and HBM. MEG provides a highly modular hardware design and includes a bootable Linux image for a realistic software flow, so that users can perform cross-layer software-hardware co-optimization in a full-system environment. To improve the observability and debuggability of the system, MEG also provides a flexible performance monitoring scheme to guide the performance optimization. The proposed MEG infrastructure can potentially benefit broad communities across computer architecture, system software, and application software. Leveraging MEG, we present two cross-layer system optimizations as illustrative cases to demonstrate the usability of MEG. In the first case study, we present a reconfigurable memory controller to improve the address mapping of standard memory controller. This reconfigurable memory controller along with its OS support allows us to optimize the address mapping scheme to fully exploit the massive parallelism provided by the emerging three-dimensional (3D) memories. In the second case study, we present a lightweight IOMMU design to tackle the unique challenges brought by 3D memory in providing virtual memory support for near-memory accelerators. We provide a prototype implementation of MEG on a Xilinx VU37P FPGA and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications. We hope MEG fills a gap in the space of publicly available FPGA-based full-system emulation infrastructures, specifically targeting memory systems, and inspires further collaborative software/hardware innovations.
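
As a flavour of what a reconfigurable address-mapping scheme might look like (the bit positions below are made up, not MEG's), the controller can be parameterised with which physical-address bits select the channel, bank and row, so the mapping can be tuned to spread accesses across the parallel 3D-stacked structure.

def extract_bits(addr, positions):
    """Gather the given physical-address bit positions (LSB first) into a field."""
    return sum(((addr >> p) & 1) << i for i, p in enumerate(positions))

# Hypothetical, reconfigurable mapping: which bits pick channel/bank/column/row.
MAPPING = {
    "channel": [6, 7],      # low bits so consecutive cache lines hit different channels
    "bank":    [8, 9, 10],
    "column":  [11, 12, 13, 14, 15, 16],
    "row":     list(range(17, 33)),
}

def decode(addr, mapping=MAPPING):
    return {field: extract_bits(addr, bits) for field, bits in mapping.items()}

if __name__ == "__main__":
    print(decode(0x12345678))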

14 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose approximate memory compression, a technique that leverages the intrinsic resilience of emerging workloads such as machine learning and data analytics to reduce off-chip memory traffic, thereby improving energy and performance.
Abstract: Memory subsystems are a major energy bottleneck in computing platforms due to frequent transfers between processors and off-chip memory. We propose approximate memory compression, a technique that leverages the intrinsic resilience of emerging workloads such as machine learning and data analytics to reduce off-chip memory traffic, thereby improving energy and performance. We realize approximate memory compression by enhancing the memory controller to be aware of approximate memory regions—regions in memory that contain approximation-resilient data—and to transparently compress (decompress) the data written to (read from) these regions. To provide control over approximations, each approximate memory region is associated with an error constraint such as the maximum error that may be introduced in each data element. The quality-aware memory controller subjects memory transactions to a compression scheme that introduces approximations, thereby reducing memory traffic, while adhering to the specified error constraint for each approximate memory region. A software interface is provided to allow programmers to identify data structures (DSs) that are resilient to approximations. A runtime quality control framework automatically determines the error constraints for the identified DSs such that a given target application-level quality is maintained. We evaluate our proposal by applying it to three different main memory technologies in the context of a general-purpose computing system—DDR3 DRAM, LPDDR3 DRAM, and spin-transfer torque magnetic RAM (STT-MRAM). To demonstrate the feasibility of the proposed concepts, we also implement a hardware prototype using the Intel UniPHY-DDR3 memory controller and Nios-II processor, a Hynix DDR3 DRAM module, and a Stratix-IV field-programmable gate array (FPGA) development board. Across a wide range of machine learning benchmarks, approximate memory compression obtains significant benefits in main memory energy (1.18× for DDR3 DRAM, 1.52× for LPDDR3 DRAM, and 2.0× for STT-MRAM) and a simultaneous improvement in execution time (5.2% for DDR3 DRAM, 5.4% for LPDDR3 DRAM, and 9.3% for STT-MRAM) with nearly identical application output quality.
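
A toy illustration of the error-constrained compression idea (the scheme below is simple uniform quantisation, not the controller's actual compressor): data written to an approximate region is quantised so that no element deviates from its original value by more than the region's error constraint.

def compress(values, max_error):
    """Quantise floats so each reconstructed value is within max_error of the original."""
    step = 2.0 * max_error                     # uniform quantisation step
    return [round(v / step) for v in values]   # small integers compress well downstream

def decompress(codes, max_error):
    step = 2.0 * max_error
    return [c * step for c in codes]

if __name__ == "__main__":
    data = [0.11, 0.92, 3.14, -2.5]
    max_error = 0.05
    restored = decompress(compress(data, max_error), max_error)
    assert all(abs(a - b) <= max_error + 1e-9 for a, b in zip(data, restored))
    print(restored)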

13 citations


Proceedings ArticleDOI
30 May 2020
TL;DR: This work proposes to offload the update and verification of system-level redundancy to TVARAK, a new hardware controller co-located with the last-level cache that enables efficient protection of data from bugs in memory controller and NVM DIMM firmware.
Abstract: Production storage systems complement device-level ECC (which covers media errors) with system-checksums and cross-device parity. This system-level redundancy enables systems to detect and recover from data corruption due to device firmware bugs (e.g., reading data from the wrong physical location). Direct access to NVM penalizes software-only implementations of system-level redundancy, forcing a choice between lack of data protection or significant performance penalties. We propose to offload the update and verification of system-level redundancy to Tvarak, a new hardware controller co-located with the last-level cache. Tvarak enables efficient protection of data from such bugs in memory controller and NVM DIMM firmware. Simulation-based evaluation with seven data-intensive applications shows that Tvarak is efficient. For example, Tvarak reduces Redis set-only performance by only 3%, compared to 50% reduction for a state-of-the-art software-only approach.
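
A minimal software analogue of the system-checksum idea that Tvarak offloads to hardware (CRC32 here is just a convenient stand-in): every block gets a checksum on write, and reads are verified against it to catch corruption such as data returned from the wrong physical location.

import zlib

class ChecksummedStore:
    """Keep a per-block checksum alongside the data and verify it on every read."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.checksums = {}   # block id -> crc32

    def write(self, block_id, data: bytes):
        self.blocks[block_id] = data
        self.checksums[block_id] = zlib.crc32(data)

    def read(self, block_id) -> bytes:
        data = self.blocks[block_id]
        if zlib.crc32(data) != self.checksums[block_id]:
            raise IOError(f"corruption detected in block {block_id}")
        return data

if __name__ == "__main__":
    store = ChecksummedStore()
    store.write(7, b"persistent data")
    store.blocks[7] = b"persistenT data"       # simulate a firmware bug
    try:
        store.read(7)
    except IOError as e:
        print(e)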

12 citations


Journal ArticleDOI
TL;DR: A novel scheme which comprises an adaptive refresh method that adjusts refresh period on each DRAM chip, a runtime method of retention-time profiling that operates inside DRAM chips during idle time thereby improving availability, and a dual-row activation method which improves weak cell retention time at a very small area cost is proposed.

Journal ArticleDOI
TL;DR: A step-by-step memory scheduling strategy that isolates the CPU requests from the GPU requests when the memory controller receives the memory request, thereby preventing the GPU request from interfering with the CPU request.
Abstract: Multiple CPUs and GPUs are integrated on the same chip to share memory, and access requests between cores interfere with each other. Memory requests from the GPU seriously interfere with CPU memory access performance. Requests from multiple CPUs are intertwined when accessing memory, and their performance is greatly affected. The difference in access latency between GPU cores increases the average latency of memory accesses. In order to solve the problems encountered in the shared memory of heterogeneous multi-core systems, we propose a step-by-step memory scheduling strategy, which improves system performance. The step-by-step memory scheduling strategy first creates a new memory request queue based on the request source and isolates CPU requests from GPU requests when the memory controller receives a memory request, thereby preventing GPU requests from interfering with CPU requests. Then, for the CPU request queue, a dynamic bank partitioning strategy is implemented, which dynamically maps it to different bank sets according to the different memory characteristics of the applications, and eliminates memory request interference of multiple CPU applications without affecting bank-level parallelism. Finally, for the GPU request queue, criticality is introduced to measure the difference in memory access latency between the cores. Based on the first-ready, first-come-first-served strategy, we implement criticality-aware memory scheduling to balance the locality and criticality of application accesses.
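
A compressed sketch of the three steps (queue separation, CPU bank partitioning, criticality-aware scheduling for the GPU queue); the data structures and the criticality metric here are simplifications of what the paper describes.

from collections import deque

class StepScheduler:
    def __init__(self, num_banks=8, cpu_apps=2):
        self.cpu_q, self.gpu_q = deque(), deque()
        # Step 2: split banks among CPU applications (a static stand-in for
        # the paper's dynamic, behaviour-driven bank partitioning).
        self.cpu_banks = {app: [b for b in range(num_banks) if b % cpu_apps == app]
                          for app in range(cpu_apps)}

    def enqueue(self, req):
        # Step 1: separate queues by request source so GPU traffic
        # cannot sit in front of CPU requests.
        (self.gpu_q if req["source"] == "gpu" else self.cpu_q).append(req)

    def pick_gpu_request(self, open_rows):
        # Step 3: prefer row-buffer hits (first-ready), then pick the most
        # critical core, e.g. the one whose memory latency lags the most.
        hits = [r for r in self.gpu_q if open_rows.get(r["bank"]) == r["row"]]
        pool = hits or list(self.gpu_q)
        best = max(pool, key=lambda r: r["criticality"])
        self.gpu_q.remove(best)
        return best

if __name__ == "__main__":
    s = StepScheduler()
    s.enqueue({"source": "gpu", "bank": 0, "row": 3, "criticality": 0.2})
    s.enqueue({"source": "gpu", "bank": 1, "row": 5, "criticality": 0.9})
    print(s.pick_gpu_request(open_rows={0: 3, 1: 4}))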

Proceedings ArticleDOI
01 Jan 2020
TL;DR: The proposed RRAM Variability Aware Controller (RRAM-VAC) stores and then coalesces the write requests from the processor before triggering the actual write process, which averages the RRAM variability and enables the system to run at the memory programming time distribution mean rather than the worst case tail.
Abstract: The growing need for connected, smart and energy efficient devices requires them to provide both ultra-low standby power and relatively high computing capabilities when awoken. In this context, emerging resistive memory technologies (RRAM) appear as a promising solution as they enable cheap fine grain technology co-integration with CMOS, fast switching and non-volatile storage. However, RRAM technologies suffer from fundamental flaws such as a strong device-to-device and cycle-to-cycle variability which is worsened by aging, forcing the designers to consider worst case design conditions. In this work, we propose, for the first time, a circuit that can take advantage of recently published Write Termination (WT) circuits from both the energy and performance points of view. The proposed RRAM Variability Aware Controller (RRAM-VAC) stores and then coalesces the write requests from the processor before triggering the actual write process. By doing so, it averages the RRAM variability and enables the system to run at the mean of the memory programming time distribution rather than the worst case tail. We explore the design space of the proposed solution for various RRAM variability specifications, benchmark the effect of the proposed memory controller with real application memory traces and show (for the considered RRAM technology specifications) 44% to 50% performance improvement and 10% to 85% energy gains depending on the application memory access patterns.
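
A sketch of the buffering idea (the sizes and drain policy are assumptions): the controller queues incoming writes, merges repeated writes to the same address, and drains the buffer in the background, so the processor sees something close to the average RRAM programming time rather than the worst-case tail.

from collections import OrderedDict
import random

class WriteCoalescingBuffer:
    """Queue and coalesce writes before committing them to slow, variable RRAM."""

    def __init__(self, depth=16):
        self.depth = depth
        self.pending = OrderedDict()   # addr -> value, insertion-ordered

    def write(self, addr, value):
        self.pending[addr] = value     # coalesce: the newest value wins
        self.pending.move_to_end(addr)
        if len(self.pending) > self.depth:
            self.drain_one()           # only now does anyone wait on the RRAM

    def drain_one(self):
        addr, value = self.pending.popitem(last=False)
        # Per-cell programming time varies widely cycle to cycle; with write
        # termination the cost is whatever this particular cell needs.
        program_time_ns = random.lognormvariate(4.0, 0.8)
        return addr, value, program_time_ns

if __name__ == "__main__":
    buf = WriteCoalescingBuffer(depth=4)
    for i in range(10):
        buf.write(addr=i % 3, value=i)   # repeated addresses coalesce
    print(len(buf.pending), "writes left to drain in the background")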

Proceedings ArticleDOI
17 Nov 2020
TL;DR: In this paper, an automatic and reliable method for reverse engineering the DRAM addressing of Intel server-class processors is presented. But, the method mainly relies on CPU hardware performance counters to precisely locate the accessed DRAM component.
Abstract: The memory controller of a processor translates the physical memory address to hardware components such as memory channels, ranks, and banks. This DRAM address mapping is of interest to many researchers in the fields of IT security, hardware architecture, system software, and performance tuning. However, Intel processors use a complex and undocumented DRAM addressing. The addressing can be different for every system because it depends on many aspects such as the processor model, DIMM population on the motherboard, and BIOS settings. Thus an analysis of every individual system is necessary. In this paper, we introduce an automatic and reliable method for reverse engineering the DRAM addressing of Intel server-class processors. In contrast to existing approaches, it is reliable: measurement errors are unlikely to occur, and they can be detected if they do. Our method mainly relies on CPU hardware performance counters to precisely locate the accessed DRAM component. It eliminates the problem of wrong attribution that is common in timing-based approaches. We validated our method by reverse engineering the DRAM addressing of a diverse set of Intel processors. This set includes Broadwell, Haswell, and Skylake micro-architectures, with various core counts, DIMM arrangements, and BIOS settings. We show the correctness of the determined addressing functions using micro-benchmarks that access specific DRAM components.
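
For context on what such an addressing function looks like, many Intel mappings can be expressed as XOR combinations of physical-address bits; the masks below are purely illustrative examples, not the functions recovered in the paper.

def parity(x: int) -> int:
    """Parity (XOR reduction) of the set bits in x."""
    return bin(x).count("1") & 1

# Purely hypothetical example masks: each selects the physical-address bits
# that are XORed together to form one bit of the channel/bank index.
EXAMPLE_MASKS = {
    "channel_bit0": 0x0000_0000_0100,   # e.g. bit 8
    "bank_bit0":    0x0000_0002_2000,   # e.g. bits 13 and 17 XORed
    "bank_bit1":    0x0000_0004_4000,   # e.g. bits 14 and 18 XORed
}

def dram_component(phys_addr: int) -> dict:
    """Apply each XOR mask to a physical address to locate the DRAM component."""
    return {name: parity(phys_addr & mask) for name, mask in EXAMPLE_MASKS.items()}

if __name__ == "__main__":
    print(dram_component(0x3FA4_2180))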

Proceedings ArticleDOI
01 Feb 2020
TL;DR: A RISC-V based System on Chip (SoC) for high voltage and current tissue stimulus, targeting implantable medical devices, is presented, designed in a 0.18μm HV-CMOS process.
Abstract: A RISC-V based System on Chip (SoC) for high voltage and current tissue stimulus, targeting implantable medical devices, is presented. The circuit is designed in a 0.18 μm HV-CMOS process and includes the RISC-V RV32I based microcontroller core, called Siwa (which includes SPI, UART and GPIO interfaces, a packet-based bus and memory controller, and 8 kB of SRAM), combined with several biological tissue stimulus and sensing circuits. The complete test chip (analog + RISC-V) occupies a 5 mm² area, but only 0.82 mm² corresponds to the RISC-V micro-controller, which operates up to 20 MHz, with average energy needs of less than 48 pJ/cycle (3 pJ STD), and for which several reliability and safety issues were considered.

Journal ArticleDOI
TL;DR: An alternative off-chip memory solution that is based on the emerging Reduced Latency DRAM (RLDRAM) protocol is promoted, and a predictable memory controller (RLDC) managing accesses to this memory is proposed.
Abstract: Predictable execution time upon accessing shared memories in multi-core real-time systems is a stringent requirement. A plethora of existing works focus on the analysis of Double Data Rate Dynamic Random Access Memories (DDR DRAMs), or redesigning its memory to provide predictable memory behavior. In this paper, we show that DDR DRAMs by construction suffer inherent limitations associated with achieving such predictability. These limitations lead to (1) highly variable access latencies that fluctuate based on various factors such as access patterns and memory state from previous accesses, and (2) overly pessimistic latency bounds. As a result, DDR DRAMs can be ill-suited for some real-time systems that mandate a strict predictable performance with tight timing constraints. Targeting these systems, we promote an alternative off-chip memory solution that is based on the emerging Reduced Latency DRAM (RLDRAM) protocol, and propose a predictable memory controller (RLDC) managing accesses to this memory. Compared with the state-of-the-art predictable DDR controllers, the proposed solution provides up to 11× less timing variability and a 6.4× reduction in the worst case memory latency.

Book ChapterDOI
21 Sep 2020
TL;DR: This paper presents two new microarchitectural covert channel attacks using the memory controller that allow a privileged adversary to leak information in a native environment and an extension to cross-VM scenarios for unprivileged adversaries.
Abstract: Data confidentiality is put at risk on cloud platforms where multiple tenants share the underlying hardware. As multiple workloads are executed concurrently, conflicts in memory resource occur, resulting in observable timing variations during execution. Malicious tenants can intentionally manipulate the hardware platform to devise a covert channel, enabling them to steal the data of co-residing tenants. This paper presents two new microarchitectural covert channel attacks using the memory controller. The first attack allows a privileged adversary (i.e. process) to leak information in a native environment. The second attack is an extension to cross-VM scenarios for unprivileged adversaries. This work is the first instance of leakage channel based on the memory controller. As opposed to previous denial-of-service attacks, we manage to modulate the load on the channel scheduler with accuracy. Both attacks are implemented on cross-core configurations. Furthermore, the cross-VM covert channel is successfully tested across three different Intel microarchitectures. Finally, a comparison against state-of-the-art covert channel attacks is provided, along with a discussion on potential mitigation techniques.

Proceedings ArticleDOI
24 Feb 2020
TL;DR: DRAM-less, a hardware automation approach that precisely integrates many state-of-the-art phase change memory (PRAM) modules into its data processing network to dramatically reduce unnecessary data copies with a minimum of software modifications is proposed.
Abstract: General purpose hardware accelerators have become major data processing resources in many computing domains. However, the processing capability of hardware accelerators is often limited by costly software interventions and memory copies needed to support compulsory data movement between different processors and solid-state drives (SSDs). This in turn also wastes a significant amount of energy in modern accelerated systems. In this work, we propose DRAM-less, a hardware automation approach that precisely integrates many state-of-the-art phase change memory (PRAM) modules into its data processing network to dramatically reduce unnecessary data copies with a minimum of software modifications. We implement a new memory controller that plugs a real 3x nm multi-partition PRAM into 28 nm technology FPGA logic cells and integrates its design into a real PCIe accelerator emulation platform. The evaluation results reveal that DRAM-less achieves, on average, 47% better performance than advanced acceleration approaches that use peer-to-peer DMA.

Patent
Shallal Aws1
12 Nov 2020
TL;DR: In this article, the authors present techniques for implementing hybrid memory modules with improved inter-memory data transmission paths, which exhibits improved transmission latencies and power consumption when transmitting data between DRAM devices and NVM devices (e.g., flash devices).
Abstract: Disclosed herein are techniques for implementing hybrid memory modules with improved inter-memory data transmission paths. The claimed embodiments address the problem of implementing a hybrid memory module that exhibits improved transmission latencies and power consumption when transmitting data between DRAM devices and NVM devices (e.g., flash devices) during data backup and data restore operations. Some embodiments are directed to approaches for providing a direct data transmission path coupling a non-volatile memory controller and the DRAM devices to transmit data between the DRAM devices and the flash devices. In one or more embodiments, the DRAM devices can be port switched devices, with a first port coupled to the data buffers and a second port coupled to the direct data transmission path. Further, in one or more embodiments, such data buffers can be disabled when transmitting data between the DRAM devices and the flash devices.

Patent
05 Nov 2020
TL;DR: In this article, the authors propose a memory module that includes address-buffer and data-buffer components that together support wide and narrow-data modes, and the address buffer component manages communication between a memory controller and two sets of memory components.
Abstract: Described are memory modules that include address-buffer components and data-buffer components that together support wide- and narrow-data modes. The address-buffer component manages communication between a memory controller and two sets of memory components. In the wide-data mode, the address-buffer enables memory components in each set and instructs the data-buffer components to communicate full-width read and write data by combining data from or to both sets for each memory access. In the narrow-data mode, the address-buffer enables memory components in just one of the two sets and instructs the data-buffer components to communicate half-width read and write data with one set per memory access.

Proceedings ArticleDOI
18 May 2020
TL;DR: A large throughput difference is found, which emphasizes the importance of choosing the best durability domain for each application and system, and confirms that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance.
Abstract: Storing data structures in high-capacity byte-addressable persistent memory instead of DRAM or a storage device offers the opportunity to (1) reduce cost and power consumption compared with DRAM, (2) decrease the latency and CPU resources needed for an I/O operation compared with storage, and (3) allow for fast recovery as the data structure remains in memory after a machine failure. The first commercial offering in this space is Intel® Optane™ Direct Connect (Optane™ DC) Persistent Memory. Optane™ DC promises access time within a constant factor of DRAM, with larger capacity, lower energy consumption, and persistence. We present an experimental evaluation of persistent transactional memory performance, and explore how Optane™ DC durability domains affect the overall results. Given that neither of the two available durability domains can deliver performance competitive with DRAM, we introduce and emulate a new durability domain, called PDRAM, in which the memory controller tracks enough information (and has enough reserve power) to make DRAM behave like a persistent cache of Optane™ DC memory. In this paper we compare the performance of these durability domains on several configurations of five persistent transactional memory applications. We find a large throughput difference, which emphasizes the importance of choosing the best durability domain for each application and system. At the same time, our results confirm that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance, with speedups as high as 6× at 16 threads.

Proceedings ArticleDOI
30 May 2020
TL;DR: Mocktails is a methodology to synthetically recreate the varying spatio-temporal memory access behaviour of proprietary heterogeneous compute devices commonly found in mobile systems that accurately recreates the dynamic behaviour of memory access scheduling for memory controller metrics.
Abstract: Computation demands on mobile and edge devices are increasing dramatically. Mobile devices, such as smart phones, incorporate a large number of dedicated accelerators and fixed-function hardware blocks to deliver the required performance and power efficiency. Due to the heterogeneous nature of these devices, they feature vastly larger design spaces than traditional systems featuring only a CPU. Currently, academia struggles to fully evaluate such heterogeneous systems on chip due to the limited access and availability of proprietary workloads. To address these challenges, we propose Mocktails: a methodology to synthetically recreate the varying spatio-temporal memory access behaviour of proprietary heterogeneous compute devices. We focus on capturing the interspersed address streams of the workload and the burstiness of the injection process for proprietary compute devices commonly found in mobile systems. We evaluate Mocktails in simulation with proprietary memory traces of IP blocks. Mocktails accurately recreates the dynamic behaviour of memory access scheduling for memory controller metrics including read row hits (at most 7.3% error) and write row hits (at most 2.8% error). Architects can use Mocktails in their simulations as a substitute for a proprietary compute device, making the tool a useful conduit between industry and academia.

Journal ArticleDOI
TL;DR: MCsim is an extensible and cycle-accurate MC simulator that is able to run as a trace-based simulator as well as provide an interface to connect with external CPU and memory device simulators.
Abstract: Numerous proposals for memory controller (MC) designs have been presented to the research community. Interest has been growing in the areas of computer architecture and real-time systems to improve the throughput of the system and/or guarantee timing requirements through novel scheduling algorithms. Consequently, comprehensive simulators are in high demand, since they provide an infrastructure for developing new ideas effectively without re-implementing the other parts of the hardware. Although there have been several proposals for off-chip memory device simulators, there is a shortage of MC counterparts. In this letter, we propose MCsim, an extensible and cycle-accurate MC simulator. Designed as an integrable environment, MCsim is able to run as a trace-based simulator as well as provide an interface to connect with external CPU and memory device simulators.

Patent
28 Jan 2020
TL;DR: In this article, the memory controller is configured to switch between the dual channel mode and the single channel mode, such that the second ECC memory device is a spare memory device, and the first error correcting code memory device protects the first memory devices and the second memory devices.
Abstract: A technique relates to operating a memory controller. The memory controller drives first memory devices and second memory devices of the memory controller in a dual channel mode. A first error correcting code (ECC) memory device and a second ECC memory device protect the first memory devices and the second memory devices. The memory controller drives the first memory devices and the second memory devices in a single channel mode such that the second ECC memory device is a spare memory device, and the first ECC memory device protects the first memory devices and the second memory devices. The memory controller is configured to switch between the dual channel mode and the single channel mode.

Patent
03 Mar 2020
TL;DR: In this paper, transaction-based hybrid memory devices include a host memory controller to control operation of the device, coupled to a hybrid memory controller over a memory bus, including non-volatile memory control logic and cache control logic to accelerate cache operations.
Abstract: A transaction-based hybrid memory device includes a host memory controller to control operation of the device. A hybrid memory controller is coupled to the host memory controller over a memory bus. The hybrid memory controller includes non-volatile memory control logic to control operation of non-volatile memory devices, cache control logic to accelerate cache operations, and a direct memory access (DMA) engine to control volatile cache memory and to transfer data between non-volatile memory and cache memory, offloading host cache management and transactions. A host interface couples the host memory controller to the memory bus.


Proceedings ArticleDOI
08 Jun 2020
TL;DR: DSM, a light-weight hardware extension in the memory controller that detects pages with the same content in memory, refreshes only one of them, and redirects requests for the other pages to that one, is presented.
Abstract: The number of cores and the capacities of main memory in modern systems have been growing significantly. Specifically, memory scaling, although at a slower pace than computation scaling, provided opportunities for very large DRAMs with terabytes (TBs) of capacity. Consequently, addressing the performance and energy consumption bottlenecks of DRAMs is more important than ever. The DRAM refresh operation is one of the main contributing factors to memory overheads, especially for large capacity DRAMs used in modern servers and emerging large-scale data centers. This paper addresses the memory refresh problem by leveraging the fact that most cloud servers host virtualized systems that use similar kernels, libraries, etc. We propose and experimentally evaluate a novel approach that exploits this observation to address the DRAM refresh overhead in such systems. More specifically, in this work, we present DSM, a light-weight hardware extension in the memory controller that detects pages with the same content in memory, refreshes only one of them, and redirects requests for the other pages to that one. Our detailed experimental analysis shows that the proposed DSM design can reduce the 99th percentile memory access latency by up to 2.01x, and it also reduces the overall memory energy consumption by up to 8.5%.
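
A software-level sketch of the idea (the hashing and remap table are simplifications of the proposed hardware extension): pages with identical content are grouped, only one representative per group keeps being refreshed, and accesses to the others are redirected to it.

import hashlib

def build_refresh_plan(pages):
    """pages: dict of page_number -> bytes content.

    Returns (pages_to_refresh, redirect) where redirect maps a duplicate page
    to the single representative page that is actually kept refreshed.
    """
    representative = {}     # content digest -> first page seen with that content
    redirect = {}
    for page, content in sorted(pages.items()):
        digest = hashlib.sha256(content).digest()
        if digest in representative:
            redirect[page] = representative[digest]   # serve requests from the twin
        else:
            representative[digest] = page
    pages_to_refresh = set(representative.values())
    return pages_to_refresh, redirect

if __name__ == "__main__":
    pages = {0: b"kernel text", 1: b"libc page", 2: b"kernel text", 3: b"zeros" * 4}
    keep, redirect = build_refresh_plan(pages)
    print(sorted(keep), redirect)   # refresh pages 0, 1, 3; page 2 redirects to page 0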

Journal ArticleDOI
TL;DR: MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in the authors' implementation, and improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.
Abstract: This article describes Memory Squeeze (MemSZ), a new approach for lossy general-purpose memory compression. MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in our implementation. Our compressor is placed between the memory controller and the cache hierarchy of a processor to reduce the memory traffic of applications that tolerate approximations in parts of their data. Thereby, the available off-chip bandwidth is utilized more efficiently improving system performance and energy efficiency. Two alternative multi-core variants of the MemSZ system are described. The first variant has a shared last-level cache (LLC) on the processor-die, which is modified to store both compressed and uncompressed data. The second has a 3D-stacked DRAM cache with larger cache lines that match the granularity of the compressed memory blocks and stores only uncompressed data. For applications that tolerate aggressive approximation in large fractions of their data, MemSZ reduces baseline memory traffic by up to 81%, execution time by up to 62%, and energy costs by up to 25% introducing up to 1.8% error to the application output. Compared to the current state-of-the-art lossy memory compression design, MemSZ improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.
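
A toy, scalar version of the error-bounded prediction idea behind SZ-style compressors (this is not MemSZ's hardware design): each value is predicted from its predecessor and only a small quantised correction is stored, which keeps every reconstruction within the error bound.

def sz_like_compress(values, error_bound):
    """Predict each value from the previous reconstructed one and store the
    quantised residual, keeping every reconstruction within error_bound."""
    codes, prev = [], 0.0
    for v in values:
        residual = v - prev                       # predictor: previous value
        q = round(residual / (2.0 * error_bound)) # quantised correction
        codes.append(q)
        prev = prev + q * 2.0 * error_bound       # track what the decoder will see
    return codes

def sz_like_decompress(codes, error_bound):
    out, prev = [], 0.0
    for q in codes:
        prev = prev + q * 2.0 * error_bound
        out.append(prev)
    return out

if __name__ == "__main__":
    data = [1.00, 1.02, 1.05, 3.90, 3.92]
    restored = sz_like_decompress(sz_like_compress(data, 0.05), 0.05)
    assert all(abs(a - b) <= 0.05 + 1e-9 for a, b in zip(data, restored))
    print(restored)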

Journal ArticleDOI
TL;DR: A novel approximate memory controller architecture (AMC) is proposed that reduces the DRAM latency by opportunistically exploiting row buffer locality and bank level parallelism in memory request scheduling, and leverages approximability of the reply data from DRAM, to reduce the number of reply packets injected into the NoC.
Abstract: High interconnect bandwidth is crucial for achieving better performance in many-core GPGPU architectures that execute highly data parallel applications. The parallel warps of threads running on shader cores generate a high volume of read requests to the main memory due to the limited size of data caches at the shader cores. This leads to scenarios with rapid arrival of an even larger volume of reply data from the DRAM, which creates a bottleneck at memory controllers (MCs) that send reply packets back to the requesting cores over the network-on-chip (NoC). Coping with such high volumes of data requires intelligent memory scheduling and innovative NoC architectures. To mitigate memory bottlenecks in GPGPUs, we first propose a novel approximate memory controller architecture (AMC) that reduces the DRAM latency by opportunistically exploiting row buffer locality and bank level parallelism in memory request scheduling, and leverages approximability of the reply data from DRAM to reduce the number of reply packets injected into the NoC. To further realize high throughput and low energy communication in GPGPUs, we propose a low power, approximate NoC architecture (Dapper) that increases the utilization of the available network bandwidth by using single cycle overlay circuits for the reply traffic between MCs and shader cores. Experimental results show that Dapper and AMC together increase NoC throughput by up to 21 percent, and reduce NoC latency by up to 45.5 percent and energy consumed by the NoC and MC by up to 38.3 percent, with minimal impact on output accuracy, compared to state-of-the-art approximate NoC/MC architectures.

Journal ArticleDOI
TL;DR: Bandwidth-Aware Prefetch Configuration (BAPC) is proposed, a scalable adaptive prefetching algorithm that improves the performance of multi-program workloads and reduces bandwidth consumption relative to the IBM POWER8 default configuration.
Abstract: Advanced hardware prefetch engines are being integrated in current high-performance processors. Prefetching can boost the performance of most applications; however, the induced bandwidth consumption can lead the system to high contention for main memory bandwidth, which is a scarce resource in current multicores. In such a case, system performance can be severely damaged. This article characterizes application behavior on an IBM POWER8 machine, which offers many prefetch settings, under varying bandwidth contention. The study reveals that the best prefetch setting for each application depends on the main memory bandwidth availability, that is, it depends on the co-running applications. Based on this study, we propose Bandwidth-Aware Prefetch Configuration (BAPC), a scalable adaptive prefetching algorithm that improves the performance of multi-program workloads. BAPC increases the performance of the applications by 12, 15, and 16 percent for 6-, 8-, and 10-application workloads over the IBM POWER8 default configuration. In addition, BAPC reduces bandwidth consumption by 39, 42, and 45 percent, respectively.
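
A high-level sketch of the adaptive idea (the thresholds and the set of prefetch settings are invented for illustration): the runtime watches measured memory-bandwidth utilisation and dials an application's prefetcher down when the bus is congested and back up when bandwidth is free.

# Hypothetical prefetch settings ordered from least to most aggressive,
# loosely mirroring the many depth settings the POWER8 engine exposes.
SETTINGS = ["off", "shallow", "default", "deep", "deepest"]

def adjust_prefetch(current, bandwidth_utilisation,
                    high_water=0.85, low_water=0.60):
    """Step one application's prefetch setting based on memory-bus pressure."""
    idx = SETTINGS.index(current)
    if bandwidth_utilisation > high_water and idx > 0:
        return SETTINGS[idx - 1]          # contention: back off prefetching
    if bandwidth_utilisation < low_water and idx < len(SETTINGS) - 1:
        return SETTINGS[idx + 1]          # headroom: prefetch more aggressively
    return current

if __name__ == "__main__":
    setting = "default"
    for util in [0.92, 0.95, 0.40, 0.55]:  # e.g., utilisation sampled each interval
        setting = adjust_prefetch(setting, util)
        print(util, "->", setting)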