
Showing papers on "Memory controller published in 2020"


Journal ArticleDOI
TL;DR: Based on the stability criterion, considering both the time-varying delay and its bounds, a generalised memory controller is designed for T-S fuzzy systems, which covers the memoryless and traditional memory ones as its special cases.
Abstract: In this paper, stability analysis and controller synthesis problems for T-S fuzzy systems with time-varying delay are studied. A generalised parameter-dependent reciprocally convex inequality (GPDRCI) is presented to handle the derivative of triple integral terms, which is more general than some existing ones. By choosing suitable flexible polynomials with tunable parameters, novel flexible polynomial-based functions (FPFs) are proposed in delay-product types, which overcome the incompletely slack matrices, higher order time delay and insufficient parameters in the existing functions. Benefitting from completely slack matrices and lower order time delay, coupling relationship among system states and time delay is fully linked. Based on the GPDRCI and FPFs, a stability condition is derived for T-S fuzzy systems. Based on the stability criterion, considering both the time-varying delay and its bounds, a generalised memory controller is designed for T-S fuzzy systems, which covers the memoryless and traditional memory ones as its special cases. In addition, the constraints on introduced slack matrices in some existing works are avoided with the help of a matrix inequality decoupling technique. These provide extra free dimensions in the solution space. Some examples are employed to illustrate the effectiveness of the proposed methods.
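
As a rough illustration of the controller structure described above (a generic form, not the paper's exact formulation), a memory state-feedback law for a T-S fuzzy system uses both the current and the delayed state:

  u(t) = \sum_{i=1}^{r} h_i(\theta(t)) \left[ K_{1i}\, x(t) + K_{2i}\, x(t - \tau(t)) \right]

where h_i are the normalised membership functions, \tau(t) is the time-varying delay, and K_{1i}, K_{2i} are gain matrices obtained from the LMI-based stability conditions. Setting K_{2i} = 0 recovers a memoryless controller, and replacing \tau(t) with a fixed delay recovers the traditional memory controller, which is the sense in which the generalised design covers both as special cases.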

30 citations


Patent
31 Mar 2020
TL;DR: In this article, the authors describe a memory controller implemented error correction code (ECC) memory, where ECC groups may be placed across banks of the memory to restrict a given bank to a single member of the ECC group.
Abstract: Devices and techniques for memory controller implemented error correction code (ECC) memory are disclosed herein. ECC groups may be placed across banks of the memory. In some examples, an ECC group is a collection of bytes equal to one row in one bank. Also, the placement may restrict a given bank to a single member of the ECC group. A memory operation can be received and executed using the ECC groups.
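
A minimal sketch (not the patented implementation) of the placement idea: each ECC group spans the banks of the memory so that any single bank holds at most one member of the group, so a whole-bank failure corrupts at most one symbol per group. The parameters below are assumptions for illustration.

# Hypothetical parameters: 8 banks, one group member (e.g., one row's worth
# of bytes) per bank. Member m of group g maps to bank m, row g.
NUM_BANKS = 8

def place_ecc_group(group_id, member_id, num_banks=NUM_BANKS):
    """Return (bank, row) for one member of an ECC group.

    Placing member m of every group in bank m guarantees that a given bank
    holds only a single member of any ECC group.
    """
    assert 0 <= member_id < num_banks
    return member_id, group_id  # (bank, row)

if __name__ == "__main__":
    # All members of group 5 land in distinct banks.
    print([place_ecc_group(5, m) for m in range(NUM_BANKS)])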

22 citations


Journal ArticleDOI
TL;DR: Network-on-Memory is presented, a lightweight inter-bank data communication scheme that enables direct data copy across both memory banks of a 3D-stacked memory and improves the performance of data-intensive workloads by 3.8X and 75 percent, on average, compared to the baseline conventional DRAM architecture and state-of-the-art techniques.
Abstract: Data copy is a widely-used memory operation in many programs and operating system services. In conventional computers, data copy is often carried out by two separate read and write transactions that pass data back and forth between the DRAM chip and the processor chip. Some prior mechanisms propose to avoid this unnecessary data movement by using the shared internal bus in the DRAM chip to directly copy data within the DRAM chip (e.g., between two DRAM banks). While these methods exhibit superior performance compared to conventional techniques, data copy across different DRAM banks is still greatly slower than data copy within the same DRAM bank. Hence, these techniques have limited benefit for the emerging 3D-stacked memories (e.g., HMC and HBM) that contain hundreds of DRAM banks across multiple memory controllers. In this paper, we present Network-on-Memory (NoM), a lightweight inter-bank data communication scheme that enables direct data copy across both memory banks of a 3D-stacked memory. NoM adopts a TDM-based circuit-switching design, where circuit setup is done by the memory controller. Compared to state-of-the-art approaches, NoM enables both fast data copy between multiple DRAM banks and concurrent data transfer operations. Our evaluation shows that NoM improves the performance of data-intensive workloads by 3.8X and 75 percent, on average, compared to the baseline conventional 3D-stacked DRAM architecture and state-of-the-art techniques, respectively.
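
A toy sketch of the TDM-style circuit-setup idea (the slot count and data structures are assumptions, not the paper's design): the memory controller reserves a time slot on the shared inter-bank links for each pending bank-to-bank copy, so several copies proceed concurrently as long as their banks do not collide in the same slot.

def schedule_copies(copies, num_slots=4):
    """Greedy TDM slot assignment for inter-bank copies.

    copies: list of (src_bank, dst_bank) pairs. Two copies may share a time
    slot only if they touch disjoint banks, mimicking circuit setup done by
    the memory controller. Returns a dict mapping slot -> list of copies.
    """
    slots = {s: [] for s in range(num_slots)}
    busy = {s: set() for s in range(num_slots)}   # banks reserved per slot
    for src, dst in copies:
        for s in range(num_slots):
            if src not in busy[s] and dst not in busy[s]:
                slots[s].append((src, dst))
                busy[s].update((src, dst))
                break
        else:
            raise RuntimeError("no free slot; copy must wait for the next round")
    return slots

if __name__ == "__main__":
    # (0->1) and (2->3) can share a slot; (1->2) conflicts with both.
    print(schedule_copies([(0, 1), (2, 3), (1, 2)]))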

20 citations


Proceedings ArticleDOI
30 May 2020
TL;DR: This paper proposes a transparent and efficient hardware-assisted out-of-place update (HOOP) mechanism that supports atomic data durability without incurring much extra write traffic or performance overhead, and demonstrates scalable data recovery capability on multi-core systems.
Abstract: Byte-addressable non-volatile memory (NVM) is a promising technology that provides near-DRAM performance with scalable memory capacity. However, it requires atomic data durability to ensure memory persistency. Therefore, many techniques, including logging and shadow paging, have been proposed. However, most of them either introduce extra write traffic to NVM or suffer from significant performance overhead on the critical path of program execution, or even both. In this paper, we propose a transparent and efficient hardware-assisted out-of-place update (HOOP) mechanism that supports atomic data durability without incurring much extra write traffic or performance overhead. The key idea is to write the updated data to a new place in NVM, while retaining the old data until the updated data becomes durable. To support this, we develop a lightweight indirection layer in the memory controller to enable efficient address translation and adaptive garbage collection for NVM. We evaluate HOOP with a variety of popular data structures and data-intensive applications, including key-value stores and databases. Our evaluation shows that HOOP achieves low critical-path latency with small write amplification, which is close to that of a native system without persistence support. Compared with state-of-the-art crash-consistency techniques, it improves application performance by up to 1.7×, while reducing the write amplification by up to 2.1×. HOOP also demonstrates scalable data recovery capability on multi-core systems.
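
A simplified sketch of the out-of-place update idea (the names and structures are illustrative, not HOOP's actual hardware): writes go to a fresh NVM location, an indirection table in the memory controller remaps the logical address once the new data is in place, and the old location is reclaimed later by garbage collection.

class OutOfPlaceNVM:
    """Toy model of out-of-place updates with an indirection table."""

    def __init__(self, capacity):
        self.media = [None] * capacity      # simulated NVM cells
        self.remap = {}                     # logical addr -> physical slot
        self.free = list(range(capacity))   # free physical slots
        self.garbage = []                   # old slots awaiting GC

    def write(self, addr, value):
        new_slot = self.free.pop()          # out-of-place: never overwrite in place
        self.media[new_slot] = value
        old_slot = self.remap.get(addr)
        self.remap[addr] = new_slot         # remap update makes the new data visible
        if old_slot is not None:
            self.garbage.append(old_slot)   # reclaim lazily, off the critical path

    def read(self, addr):
        return self.media[self.remap[addr]]

    def gc(self):
        while self.garbage:
            slot = self.garbage.pop()
            self.media[slot] = None
            self.free.append(slot)

if __name__ == "__main__":
    nvm = OutOfPlaceNVM(8)
    nvm.write(0x10, "v1")
    nvm.write(0x10, "v2")
    print(nvm.read(0x10))   # "v2"; the old "v1" survives until gc() runs
    nvm.gc()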

19 citations


Journal ArticleDOI
Sun Hao, Chen Lan, Hao Xiaoran, Liu Chenji, Ni Mao 
TL;DR: A hybrid storage class memory system to reduce the energy consumption and optimize IO performance is presented and a migration scheme implemented in the memory controller is proposed that can reduce energy consumption by 46.2% on average compared with the traditional DRAM-only system.
Abstract: Conventional main memory can no longer meet the requirements of low energy consumption and massive data storage in an artificial intelligence Internet of Things (AIoT) system. Moreover, the efficiency is decreased due to the swapping of data between the main memory and storage. This paper presents a hybrid storage class memory system to reduce the energy consumption and optimize IO performance. Phase change memory (PCM) brings the advantages of low static power and a large capacity to a hybrid memory system. In order to avoid the impact of poor write performance in PCM, a migration scheme implemented in the memory controller is proposed. By counting the write times and row buffer miss times in PCM simultaneously, the write-intensive data can be selected and migrated from PCM to dynamic random-access memory (DRAM) efficiently, which improves the performance of hybrid storage class memory. In addition, a fast mode with a tmpfs-based, in-memory file system is applied to hybrid storage class memory to reduce the number of data movements between memory and external storage. Experimental results show that the proposed system can reduce energy consumption by 46.2% on average compared with the traditional DRAM-only system. The fast mode increases the IO performance of the system by more than 30 times compared with the common ext3 file system.
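
A rough sketch of the migration heuristic described above, with made-up thresholds: the controller tracks per-page write counts and row-buffer misses in PCM and promotes pages that look write-intensive to DRAM.

from collections import defaultdict

# Hypothetical thresholds; the actual scheme tunes these in the memory controller.
WRITE_THRESHOLD = 16
ROWMISS_THRESHOLD = 8

writes = defaultdict(int)
row_misses = defaultdict(int)
in_dram = set()

def record_pcm_access(page, is_write, row_buffer_hit):
    """Update counters for a PCM page and decide whether to migrate it to DRAM."""
    if is_write:
        writes[page] += 1
    if not row_buffer_hit:
        row_misses[page] += 1
    if (page not in in_dram
            and writes[page] >= WRITE_THRESHOLD
            and row_misses[page] >= ROWMISS_THRESHOLD):
        in_dram.add(page)           # stand-in for issuing the actual migration
        return "migrate"
    return "stay"

if __name__ == "__main__":
    for _ in range(20):
        decision = record_pcm_access(page=42, is_write=True, row_buffer_hit=False)
    print(decision)  # "migrate" once both counters cross their thresholds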

16 citations


Proceedings ArticleDOI
30 May 2020
TL;DR: The benefits of VBI are demonstrated with two important use cases: reducing the overheads of address translation (for both native execution and virtual machine environments), as VBI reduces the number of translation requests and associated memory accesses; and two heterogeneous main memory architectures, where VBI significantly improves performance over conventional virtual memory.
Abstract: Computers continue to diversify with respect to system designs, emerging memory technologies, and application memory demands. Unfortunately, continually adapting the conventional virtual memory framework to each possible system configuration is challenging, and often results in performance loss or requires non-trivial workarounds. To address these challenges, we propose a new virtual memory framework, the Virtual Block Interface (VBI). We design VBI based on the key idea that delegating memory management duties to hardware can reduce the overheads and software complexity associated with virtual memory. VBI introduces a set of variable-sized virtual blocks (VBs) to applications. Each VB is a contiguous region of the globally-visible VBI address space, and an application can allocate each semantically meaningful unit of information (e.g., a data structure) in a separate VB. VBI decouples access protection from memory allocation and address translation. While the OS controls which programs have access to which VBs, dedicated hardware in the memory controller manages the physical memory allocation and address translation of the VBs. This approach enables several architectural optimizations to (1) efficiently and flexibly cater to different and increasingly diverse system configurations, and (2) eliminate key inefficiencies of conventional virtual memory. We demonstrate the benefits of VBI with two important use cases: (1) reducing the overheads of address translation (for both native execution and virtual machine environments), as VBI reduces the number of translation requests and associated memory accesses; and (2) two heterogeneous main memory architectures, where VBI increases the effectiveness of managing fast memory regions. For both cases, VBI significantly improves performance over conventional virtual memory.
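
A very small sketch of the virtual-block idea (the data structures are illustrative, not the paper's hardware): the OS only grants a process access to whole VBs, while a controller-side table maps (VB id, offset) to physical frames on demand.

# Hypothetical 4 KiB frames; translation state conceptually lives with the memory controller.
FRAME_SIZE = 4096

class VirtualBlockInterface:
    def __init__(self):
        self.next_frame = 0
        self.vb_perms = {}        # vb_id -> set of allowed process ids (OS-managed)
        self.vb_frames = {}       # (vb_id, page index) -> physical frame (HW-managed)

    def create_vb(self, vb_id, owner_pid):
        self.vb_perms[vb_id] = {owner_pid}

    def translate(self, pid, vb_id, offset):
        """Check access, then translate a VB-relative offset to a physical address."""
        if pid not in self.vb_perms.get(vb_id, set()):
            raise PermissionError("process may not touch this VB")
        key = (vb_id, offset // FRAME_SIZE)
        if key not in self.vb_frames:         # allocate physical memory lazily
            self.vb_frames[key] = self.next_frame
            self.next_frame += 1
        return self.vb_frames[key] * FRAME_SIZE + offset % FRAME_SIZE

if __name__ == "__main__":
    vbi = VirtualBlockInterface()
    vbi.create_vb(vb_id=1, owner_pid=100)
    print(hex(vbi.translate(pid=100, vb_id=1, offset=5000)))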

16 citations


Journal ArticleDOI
TL;DR: MEG is proposed, an open source, configurable, cycle-exact, and RISC-V-based full-system emulation infrastructure using FPGA and HBM that provides a highly modular hardware design and includes a bootable Linux image for a realistic software flow, so that users can perform cross-layer software-hardware co-optimization in a full- system environment.
Abstract: Emerging three-dimensional (3D) memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide high-bandwidth and massive memory-level parallelism. With the growing heterogeneity and complexity of computer systems (CPU cores and accelerators, etc.), efficiently integrating emerging memories into existing systems poses new challenges and requires detailed evaluation in a realistic computing environment. In this article, we propose MEG, an open source, configurable, cycle-exact, and RISC-V-based full-system emulation infrastructure using FPGA and HBM. MEG provides a highly modular hardware design and includes a bootable Linux image for a realistic software flow, so that users can perform cross-layer software-hardware co-optimization in a full-system environment. To improve the observability and debuggability of the system, MEG also provides a flexible performance monitoring scheme to guide the performance optimization. The proposed MEG infrastructure can potentially benefit broad communities across computer architecture, system software, and application software. Leveraging MEG, we present two cross-layer system optimizations as illustrative cases to demonstrate the usability of MEG. In the first case study, we present a reconfigurable memory controller to improve the address mapping of standard memory controller. This reconfigurable memory controller along with its OS support allows us to optimize the address mapping scheme to fully exploit the massive parallelism provided by the emerging three-dimensional (3D) memories. In the second case study, we present a lightweight IOMMU design to tackle the unique challenges brought by 3D memory in providing virtual memory support for near-memory accelerators. We provide a prototype implementation of MEG on a Xilinx VU37P FPGA and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications. We hope MEG fills a gap in the space of publicly available FPGA-based full-system emulation infrastructures, specifically targeting memory systems, and inspires further collaborative software/hardware innovations.
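
As a flavour of what a reconfigurable address-mapping scheme might look like (the bit positions below are made up, not MEG's), the controller can be parameterised with which physical-address bits select the channel, bank and row, so the mapping can be tuned to spread accesses across the parallel 3D-stacked structure.

def extract_bits(addr, positions):
    """Gather the given physical-address bit positions (LSB first) into a field."""
    return sum(((addr >> p) & 1) << i for i, p in enumerate(positions))

# Hypothetical, reconfigurable mapping: which bits pick channel/bank/column/row.
MAPPING = {
    "channel": [6, 7],      # low bits so consecutive cache lines hit different channels
    "bank":    [8, 9, 10],
    "column":  [11, 12, 13, 14, 15, 16],
    "row":     list(range(17, 33)),
}

def decode(addr, mapping=MAPPING):
    return {field: extract_bits(addr, bits) for field, bits in mapping.items()}

if __name__ == "__main__":
    print(decode(0x12345678))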

14 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose approximate memory compression, a technique that leverages the intrinsic resilience of emerging workloads such as machine learning and data analytics to reduce off-chip memory traffic, thereby improving energy and performance.
Abstract: Memory subsystems are a major energy bottleneck in computing platforms due to frequent transfers between processors and off-chip memory. We propose approximate memory compression, a technique that leverages the intrinsic resilience of emerging workloads such as machine learning and data analytics to reduce off-chip memory traffic, thereby improving energy and performance. We realize approximate memory compression by enhancing the memory controller to be aware of approximate memory regions—regions in memory that contain approximation-resilient data—and to transparently compress (decompress) the data written to (read from) these regions. To provide control over approximations, each approximate memory region is associated with an error constraint such as the maximum error that may be introduced in each data element. The quality-aware memory controller subjects memory transactions to a compression scheme that introduces approximations, thereby reducing memory traffic, while adhering to the specified error constraint for each approximate memory region. A software interface is provided to allow programmers to identify data structures (DSs) that are resilient to approximations. A runtime quality control framework automatically determines the error constraints for the identified DSs such that a given target application-level quality is maintained. We evaluate our proposal by applying it to three different main memory technologies in the context of a general-purpose computing system—DDR3 DRAM, LPDDR3 DRAM, and spin-transfer torque magnetic RAM (STT-MRAM). To demonstrate the feasibility of the proposed concepts, we also implement a hardware prototype using the Intel UniPHY-DDR3 memory controller and Nios-II processor, a Hynix DDR3 DRAM module, and a Stratix-IV field-programmable gate array (FPGA) development board. Across a wide range of machine learning benchmarks, approximate memory compression obtains significant benefits in main memory energy (1.18× for DDR3 DRAM, 1.52× for LPDDR3 DRAM, and 2.0× for STT-MRAM) and a simultaneous improvement in execution time (5.2% for DDR3 DRAM, 5.4% for LPDDR3 DRAM, and 9.3% for STT-MRAM) with nearly identical application output quality.
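
A toy illustration of the error-constrained compression idea (the scheme below is simple uniform quantisation, not the controller's actual compressor): data written to an approximate region is quantised so that no element deviates from its original value by more than the region's error constraint.

def compress(values, max_error):
    """Quantise floats so each reconstructed value is within max_error of the original."""
    step = 2.0 * max_error                     # uniform quantisation step
    return [round(v / step) for v in values]   # small integers compress well downstream

def decompress(codes, max_error):
    step = 2.0 * max_error
    return [c * step for c in codes]

if __name__ == "__main__":
    data = [0.11, 0.92, 3.14, -2.5]
    max_error = 0.05
    restored = decompress(compress(data, max_error), max_error)
    assert all(abs(a - b) <= max_error + 1e-9 for a, b in zip(data, restored))
    print(restored)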

13 citations


Proceedings ArticleDOI
30 May 2020
TL;DR: This work proposes to offload the update and verification of system-level redundancy to TVARAK, a new hardware controller co-located with the last-level cache that enables efficient protection of data from bugs in memory controller and NVM DIMM firmware.
Abstract: Production storage systems complement device-level ECC (which covers media errors) with system-checksums and cross-device parity. This system-level redundancy enables systems to detect and recover from data corruption due to device firmware bugs (e.g., reading data from the wrong physical location). Direct access to NVM penalizes software-only implementations of system-level redundancy, forcing a choice between lack of data protection or significant performance penalties. We propose to offload the update and verification of system-level redundancy to Tvarak, a new hardware controller co-located with the last-level cache. Tvarak enables efficient protection of data from such bugs in memory controller and NVM DIMM firmware. Simulation-based evaluation with seven data-intensive applications shows that Tvarak is efficient. For example, Tvarak reduces Redis set-only performance by only 3%, compared to 50% reduction for a state-of-the-art software-only approach.
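
A minimal software analogue of the system-checksum idea that Tvarak offloads to hardware (CRC32 here is just a convenient stand-in): every block gets a checksum on write, and reads are verified against it to catch corruption such as data returned from the wrong physical location.

import zlib

class ChecksummedStore:
    """Keep a per-block checksum alongside the data and verify it on every read."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.checksums = {}   # block id -> crc32

    def write(self, block_id, data: bytes):
        self.blocks[block_id] = data
        self.checksums[block_id] = zlib.crc32(data)

    def read(self, block_id) -> bytes:
        data = self.blocks[block_id]
        if zlib.crc32(data) != self.checksums[block_id]:
            raise IOError(f"corruption detected in block {block_id}")
        return data

if __name__ == "__main__":
    store = ChecksummedStore()
    store.write(7, b"persistent data")
    store.blocks[7] = b"persistenT data"       # simulate a firmware bug
    try:
        store.read(7)
    except IOError as e:
        print(e)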

12 citations


Journal ArticleDOI
TL;DR: A novel scheme which comprises an adaptive refresh method that adjusts refresh period on each DRAM chip, a runtime method of retention-time profiling that operates inside DRAM chips during idle time thereby improving availability, and a dual-row activation method which improves weak cell retention time at a very small area cost is proposed.

Journal ArticleDOI
TL;DR: A step-by-step memory scheduling strategy that isolates the CPU requests from the GPU requests when the memory controller receives the memory request, thereby preventing the GPU request from interfering with the CPU request.
Abstract: Multiple CPUs and GPUs are integrated on the same chip to share memory, and access requests between cores interfere with each other. Memory requests from the GPU seriously interfere with CPU memory access performance. Requests from multiple CPUs are intertwined when accessing memory, and their performance is greatly affected. The difference in access latency between GPU cores increases the average latency of memory accesses. In order to solve the problems encountered in the shared memory of heterogeneous multi-core systems, we propose a step-by-step memory scheduling strategy, which improves system performance. The step-by-step memory scheduling strategy first creates a new memory request queue based on the request source and isolates CPU requests from GPU requests when the memory controller receives a memory request, thereby preventing GPU requests from interfering with CPU requests. Then, for the CPU request queue, a dynamic bank partitioning strategy is implemented, which dynamically maps it to different bank sets according to the different memory characteristics of the applications, and eliminates memory request interference of multiple CPU applications without affecting bank-level parallelism. Finally, for the GPU request queue, criticality is introduced to measure the difference in memory access latency between the cores. Based on the first-ready, first-come-first-served strategy, we implement criticality-aware memory scheduling to balance the locality and criticality of application accesses.
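
A compressed sketch of the three steps (queue separation, CPU bank partitioning, criticality-aware scheduling for the GPU queue); the data structures and the criticality metric here are simplifications of what the paper describes.

from collections import deque

class StepScheduler:
    def __init__(self, num_banks=8, cpu_apps=2):
        self.cpu_q, self.gpu_q = deque(), deque()
        # Step 2: split banks among CPU applications (a static stand-in for
        # the paper's dynamic, behaviour-driven bank partitioning).
        self.cpu_banks = {app: [b for b in range(num_banks) if b % cpu_apps == app]
                          for app in range(cpu_apps)}

    def enqueue(self, req):
        # Step 1: separate queues by request source so GPU traffic
        # cannot sit in front of CPU requests.
        (self.gpu_q if req["source"] == "gpu" else self.cpu_q).append(req)

    def pick_gpu_request(self, open_rows):
        # Step 3: prefer row-buffer hits (first-ready), then pick the most
        # critical core, e.g. the one whose memory latency lags the most.
        hits = [r for r in self.gpu_q if open_rows.get(r["bank"]) == r["row"]]
        pool = hits or list(self.gpu_q)
        best = max(pool, key=lambda r: r["criticality"])
        self.gpu_q.remove(best)
        return best

if __name__ == "__main__":
    s = StepScheduler()
    s.enqueue({"source": "gpu", "bank": 0, "row": 3, "criticality": 0.2})
    s.enqueue({"source": "gpu", "bank": 1, "row": 5, "criticality": 0.9})
    print(s.pick_gpu_request(open_rows={0: 3, 1: 4}))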

Proceedings ArticleDOI
01 Jan 2020
TL;DR: The proposed RRAM Variability Aware Controller (RRAM-VAC) stores and then coalesces the write requests from the processor before triggering the actual write process, which averages the RRAM variability and enables the system to run at the memory programming time distribution mean rather than the worst case tail.
Abstract: The growing need for connected, smart and energy efficient devices requires them to provide both ultra-low standby power and relatively high computing capabilities when awoken. In this context, emerging resistive memory technologies (RRAM) appear as a promising solution as they enable cheap fine grain technology co-integration with CMOS, fast switching and non-volatile storage. However, RRAM technologies suffer from fundamental flaws such as a strong device-to-device and cycle-to-cycle variability which is worsened by aging, forcing the designers to consider worst case design conditions. In this work, we propose, for the first time, a circuit that can take advantage of recently published Write Termination (WT) circuits from both the energy and performance points of view. The proposed RRAM Variability Aware Controller (RRAM-VAC) stores and then coalesces the write requests from the processor before triggering the actual write process. By doing so, it averages the RRAM variability and enables the system to run at the mean of the memory programming time distribution rather than the worst case tail. We explore the design space of the proposed solution for various RRAM variability specifications, benchmark the effect of the proposed memory controller with real application memory traces and show (for the considered RRAM technology specifications) 44% to 50% performance improvement and 10% to 85% energy gains depending on the application memory access patterns.
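
A sketch of the buffering idea (the sizes and drain policy are assumptions): the controller queues incoming writes, merges repeated writes to the same address, and drains the buffer in the background, so the processor sees something close to the average RRAM programming time rather than the worst-case tail.

from collections import OrderedDict
import random

class WriteCoalescingBuffer:
    """Queue and coalesce writes before committing them to slow, variable RRAM."""

    def __init__(self, depth=16):
        self.depth = depth
        self.pending = OrderedDict()   # addr -> value, insertion-ordered

    def write(self, addr, value):
        self.pending[addr] = value     # coalesce: the newest value wins
        self.pending.move_to_end(addr)
        if len(self.pending) > self.depth:
            self.drain_one()           # only now does anyone wait on the RRAM

    def drain_one(self):
        addr, value = self.pending.popitem(last=False)
        # Per-cell programming time varies widely cycle to cycle; with write
        # termination the cost is whatever this particular cell needs.
        program_time_ns = random.lognormvariate(4.0, 0.8)
        return addr, value, program_time_ns

if __name__ == "__main__":
    buf = WriteCoalescingBuffer(depth=4)
    for i in range(10):
        buf.write(addr=i % 3, value=i)   # repeated addresses coalesce
    print(len(buf.pending), "writes left to drain in the background")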

Proceedings ArticleDOI
17 Nov 2020
TL;DR: In this paper, an automatic and reliable method for reverse engineering the DRAM addressing of Intel server-class processors is presented. But, the method mainly relies on CPU hardware performance counters to precisely locate the accessed DRAM component.
Abstract: The memory controller of a processor translates the physical memory address to hardware components such as memory channels, ranks, and banks. This DRAM address mapping is of interest to many researchers in the fields of IT security, hardware architecture, system software, and performance tuning. However, Intel processors use a complex and undocumented DRAM addressing. The addressing can be different for every system because it depends on many aspects such as the processor model, DIMM population on the motherboard, and BIOS settings. Thus an analysis of every individual system is necessary. In this paper, we introduce an automatic and reliable method for reverse engineering the DRAM addressing of Intel server-class processors. In contrast to existing approaches, it is reliable: measurement errors are unlikely to occur, and they can be detected if they do. Our method mainly relies on CPU hardware performance counters to precisely locate the accessed DRAM component. It eliminates the problem of wrong attribution that is common in timing-based approaches. We validated our method by reverse engineering the DRAM addressing of a diverse set of Intel processors. This set includes Broadwell, Haswell, and Skylake micro-architectures, with various core counts, DIMM arrangements, and BIOS settings. We show the correctness of the determined addressing functions using micro-benchmarks that access specific DRAM components.
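
For context on what such an addressing function looks like, many Intel mappings can be expressed as XOR combinations of physical-address bits; the masks below are purely illustrative examples, not the functions recovered in the paper.

def parity(x: int) -> int:
    """Parity (XOR reduction) of the set bits in x."""
    return bin(x).count("1") & 1

# Purely hypothetical example masks: each selects the physical-address bits
# that are XORed together to form one bit of the channel/bank index.
EXAMPLE_MASKS = {
    "channel_bit0": 0x0000_0000_0100,   # e.g. bit 8
    "bank_bit0":    0x0000_0002_2000,   # e.g. bits 13 and 17 XORed
    "bank_bit1":    0x0000_0004_4000,   # e.g. bits 14 and 18 XORed
}

def dram_component(phys_addr: int) -> dict:
    """Apply each XOR mask to a physical address to locate the DRAM component."""
    return {name: parity(phys_addr & mask) for name, mask in EXAMPLE_MASKS.items()}

if __name__ == "__main__":
    print(dram_component(0x3FA4_2180))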

Proceedings ArticleDOI
01 Feb 2020
TL;DR: A RISC-V based System on Chip (SoC) for high voltage and current tissue stimulus, targeting implantable medical devices, is presented, designed in a 0.18μm HV-CMOS process.
Abstract: A RISC-V based System on Chip (SoC) for high voltage and current tissue stimulus, targeting implantable medical devices, is presented. The circuit is designed in a 0.18 μm HV-CMOS process and includes the RISC-V RV32I based microcontroller core, called Siwa (which includes SPI, UART and GPIO interfaces, a packet-based bus and memory controller, and 8 kB of SRAM), combined with several biological tissue stimulus and sensing circuits. The complete test chip (analog + RISC-V) occupies a 5 mm² area, but only 0.82 mm² corresponds to the RISC-V micro-controller, which operates up to 20 MHz, with average energy needs of less than 48 pJ/cycle (3 pJ STD), and for which several reliability and safety issues were considered.

Journal ArticleDOI
TL;DR: An alternative off-chip memory solution that is based on the emerging Reduced Latency DRAM (RLDRAM) protocol is promoted, and a predictable memory controller (RLDC) managing accesses to this memory is proposed.
Abstract: Predictable execution time upon accessing shared memories in multi-core real-time systems is a stringent requirement. A plethora of existing works focus on the analysis of Double Data Rate Dynamic Random Access Memories (DDR DRAMs), or redesigning its memory to provide predictable memory behavior. In this paper, we show that DDR DRAMs by construction suffer inherent limitations associated with achieving such predictability. These limitations lead to (1) highly variable access latencies that fluctuate based on various factors such as access patterns and memory state from previous accesses, and (2) overly pessimistic latency bounds. As a result, DDR DRAMs can be ill-suited for some real-time systems that mandate a strict predictable performance with tight timing constraints. Targeting these systems, we promote an alternative off-chip memory solution that is based on the emerging Reduced Latency DRAM (RLDRAM) protocol, and propose a predictable memory controller (RLDC) managing accesses to this memory. Compared with the state-of-the-art predictable DDR controllers, the proposed solution provides up to 11× less timing variability and a 6.4× reduction in the worst case memory latency.

Book ChapterDOI
21 Sep 2020
TL;DR: This paper presents two new microarchitectural covert channel attacks using the memory controller that allow a privileged adversary to leak information in a native environment and an extension to cross-VM scenarios for unprivileged adversaries.
Abstract: Data confidentiality is put at risk on cloud platforms where multiple tenants share the underlying hardware. As multiple workloads are executed concurrently, conflicts in memory resource occur, resulting in observable timing variations during execution. Malicious tenants can intentionally manipulate the hardware platform to devise a covert channel, enabling them to steal the data of co-residing tenants. This paper presents two new microarchitectural covert channel attacks using the memory controller. The first attack allows a privileged adversary (i.e. process) to leak information in a native environment. The second attack is an extension to cross-VM scenarios for unprivileged adversaries. This work is the first instance of leakage channel based on the memory controller. As opposed to previous denial-of-service attacks, we manage to modulate the load on the channel scheduler with accuracy. Both attacks are implemented on cross-core configurations. Furthermore, the cross-VM covert channel is successfully tested across three different Intel microarchitectures. Finally, a comparison against state-of-the-art covert channel attacks is provided, along with a discussion on potential mitigation techniques.

Proceedings ArticleDOI
24 Feb 2020
TL;DR: DRAM-less, a hardware automation approach that precisely integrates many state-of-the-art phase change memory (PRAM) modules into its data processing network to dramatically reduce unnecessary data copies with a minimum of software modifications is proposed.
Abstract: General purpose hardware accelerators have become major data processing resources in many computing domains. However, the processing capability of hardware accelerators is often limited by costly software interventions and memory copies needed to support compulsory data movement between different processors and solid-state drives (SSDs). This in turn also wastes a significant amount of energy in modern accelerated systems. In this work, we propose DRAM-less, a hardware automation approach that precisely integrates many state-of-the-art phase change memory (PRAM) modules into its data processing network to dramatically reduce unnecessary data copies with a minimum of software modifications. We implement a new memory controller that plugs a real 3x nm multi-partition PRAM into 28 nm technology FPGA logic cells and integrates its design into a real PCIe accelerator emulation platform. The evaluation results reveal that DRAM-less achieves, on average, 47% better performance than advanced acceleration approaches that use peer-to-peer DMA.

Patent
Shallal Aws1
12 Nov 2020
TL;DR: In this article, the authors present techniques for implementing hybrid memory modules with improved inter-memory data transmission paths, which exhibits improved transmission latencies and power consumption when transmitting data between DRAM devices and NVM devices (e.g., flash devices).
Abstract: Disclosed herein are techniques for implementing hybrid memory modules with improved inter-memory data transmission paths. The claimed embodiments address the problem of implementing a hybrid memory module that exhibits improved transmission latencies and power consumption when transmitting data between DRAM devices and NVM devices (e.g., flash devices) during data backup and data restore operations. Some embodiments are directed to approaches for providing a direct data transmission path coupling a non-volatile memory controller and the DRAM devices to transmit data between the DRAM devices and the flash devices. In one or more embodiments, the DRAM devices can be port switched devices, with a first port coupled to the data buffers and a second port coupled to the direct data transmission path. Further, in one or more embodiments, such data buffers can be disabled when transmitting data between the DRAM devices and the flash devices.

Patent
05 Nov 2020
TL;DR: In this article, the authors propose a memory module that includes address-buffer and data-buffer components that together support wide and narrow-data modes, and the address buffer component manages communication between a memory controller and two sets of memory components.
Abstract: Described are memory modules that include address-buffer components and data-buffer components that together support wide- and narrow-data modes. The address-buffer component manages communication between a memory controller and two sets of memory components. In the wide-data mode, the address-buffer enables memory components in each set and instructs the data-buffer components to communicate full-width read and write data by combining data from or to both sets for each memory access. In the narrow-data mode, the address-buffer enables memory components in just one of the two sets and instructs the data-buffer components to communicate half-width read and write data with one set per memory access.

Proceedings ArticleDOI
18 May 2020
TL;DR: A large throughput difference is found, which emphasizes the importance of choosing the best durability domain for each application and system, and confirms that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance.
Abstract: Storing data structures in high-capacity byte-addressable persistent memory instead of DRAM or a storage device offers the opportunity to (1) reduce cost and power consumption compared with DRAM, (2) decrease the latency and CPU resources needed for an I/O operation compared with storage, and (3) allow for fast recovery as the data structure remains in memory after a machine failure. The first commercial offering in this space is Intel® Optane™ Direct Connect (Optane™ DC) Persistent Memory. Optane™ DC promises access time within a constant factor of DRAM, with larger capacity, lower energy consumption, and persistence. We present an experimental evaluation of persistent transactional memory performance, and explore how Optane™ DC durability domains affect the overall results. Given that neither of the two available durability domains can deliver performance competitive with DRAM, we introduce and emulate a new durability domain, called PDRAM, in which the memory controller tracks enough information (and has enough reserve power) to make DRAM behave like a persistent cache of Optane™ DC memory. In this paper we compare the performance of these durability domains on several configurations of five persistent transactional memory applications. We find a large throughput difference, which emphasizes the importance of choosing the best durability domain for each application and system. At the same time, our results confirm that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance, with speedups as high as 6× at 16 threads.

Proceedings ArticleDOI
30 May 2020
TL;DR: Mocktails is a methodology to synthetically recreate the varying spatio-temporal memory access behaviour of proprietary heterogeneous compute devices commonly found in mobile systems that accurately recreates the dynamic behaviour of memory access scheduling for memory controller metrics.
Abstract: Computation demands on mobile and edge devices are increasing dramatically. Mobile devices, such as smart phones, incorporate a large number of dedicated accelerators and fixed-function hardware blocks to deliver the required performance and power efficiency. Due to the heterogeneous nature of these devices, they feature vastly larger design spaces than traditional systems featuring only a CPU. Currently, academia struggles to fully evaluate such heterogeneous systems on chip due to the limited access and availability of proprietary workloads. To address these challenges, we propose Mocktails: a methodology to synthetically recreate the varying spatio-temporal memory access behaviour of proprietary heterogeneous compute devices. We focus on capturing the interspersed address streams of the workload and the burstiness of the injection process for proprietary compute devices commonly found in mobile systems. We evaluate Mocktails in simulation with proprietary memory traces of IP blocks. Mocktails accurately recreates the dynamic behaviour of memory access scheduling for memory controller metrics including read row hits (at most 7.3% error) and write row hits (at most 2.8% error). Architects can use Mocktails in their simulations as a substitute for a proprietary compute device, making the tool a useful conduit between industry and academia.

Journal ArticleDOI
TL;DR: MCsim is an extensible and cycle-accurate MC simulator that is able to run as a trace-based simulator as well as provide an interface to connect with external CPU and memory device simulators.
Abstract: Numerous proposals for memory controller (MC) designs have been presented to the research community. Interest has been growing in the areas of computer architecture and real-time systems to improve the throughput of the system and/or guarantee timing requirements through novel scheduling algorithms. Consequently, comprehensive simulators are in high demand, since they provide an infrastructure for developing new ideas effectively without re-implementing the other parts of the hardware. Although there have been several proposals for off-chip memory device simulators, there is a shortage of MC counterparts. In this letter, we propose MCsim, an extensible and cycle-accurate MC simulator. Designed as an integrable environment, MCsim is able to run as a trace-based simulator as well as provide an interface to connect with external CPU and memory device simulators.

Patent
28 Jan 2020
TL;DR: In this article, the memory controller is configured to switch between the dual channel mode and the single channel mode, such that the second ECC memory device is a spare memory device, and the first error correcting code memory device protects the first memory devices and the second memory devices.
Abstract: A technique relates to operating a memory controller. The memory controller drives first memory devices and second memory devices of the memory controller in a dual channel mode. A first error correcting code (ECC) memory device and a second ECC memory device protect the first memory devices and the second memory devices. The memory controller drives the first memory devices and the second memory devices in a single channel mode such that the second ECC memory device is a spare memory device, and the first ECC memory device protects the first memory devices and the second memory devices. The memory controller is configured to switch between the dual channel mode and the single channel mode.

Patent
03 Mar 2020
TL;DR: In this paper, transaction-based hybrid memory devices include a host memory controller to control operation of the device, coupled to a hybrid memory controller over a memory bus, including non-volatile memory control logic and cache control logic to accelerate cache operations.
Abstract: A transaction-based hybrid memory device includes a host memory controller to control operation of the device. A hybrid memory controller is coupled to the host memory controller over a memory bus. The hybrid memory controller includes non-volatile memory control logic to control operation of non-volatile memory devices, cache control logic to accelerate cache operations, and a direct memory access (DMA) engine to control volatile cache memory and to transfer data between non-volatile memory and cache memory, offloading host cache management and transactions. A host interface couples the host memory controller to the memory bus.


Proceedings ArticleDOI
08 Jun 2020
TL;DR: DSM, a light-weight hardware extension in the memory controller that detects pages with the same content in memory, refreshes only one of them, and redirects requests for the other pages to that one, is presented.
Abstract: The number of cores and the capacities of main memory in modern systems have been growing significantly. Specifically, memory scaling, although at a slower pace than computation scaling, provided opportunities for very large DRAMs with terabytes (TBs) of capacity. Consequently, addressing the performance and energy consumption bottlenecks of DRAMs is more important than ever. The DRAM refresh operation is one of the main contributing factors to memory overheads, especially for large capacity DRAMs used in modern servers and emerging large-scale data centers. This paper addresses the memory refresh problem by leveraging the fact that most cloud servers host virtualized systems that use similar kernels, libraries, etc. We propose and experimentally evaluate a novel approach that exploits this observation to address the DRAM refresh overhead in such systems. More specifically, in this work, we present DSM, a light-weight hardware extension in the memory controller that detects pages with the same content in memory, refreshes only one of them, and redirects requests for the other pages to that one. Our detailed experimental analysis shows that the proposed DSM design can reduce the 99th percentile memory access latency by up to 2.01x, and it also reduces the overall memory energy consumption by up to 8.5%.
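
A software-level sketch of the idea (the hashing and remap table are simplifications of the proposed hardware extension): pages with identical content are grouped, only one representative per group keeps being refreshed, and accesses to the others are redirected to it.

import hashlib

def build_refresh_plan(pages):
    """pages: dict of page_number -> bytes content.

    Returns (pages_to_refresh, redirect) where redirect maps a duplicate page
    to the single representative page that is actually kept refreshed.
    """
    representative = {}     # content digest -> first page seen with that content
    redirect = {}
    for page, content in sorted(pages.items()):
        digest = hashlib.sha256(content).digest()
        if digest in representative:
            redirect[page] = representative[digest]   # serve requests from the twin
        else:
            representative[digest] = page
    pages_to_refresh = set(representative.values())
    return pages_to_refresh, redirect

if __name__ == "__main__":
    pages = {0: b"kernel text", 1: b"libc page", 2: b"kernel text", 3: b"zeros" * 4}
    keep, redirect = build_refresh_plan(pages)
    print(sorted(keep), redirect)   # refresh pages 0, 1, 3; page 2 redirects to page 0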

Journal ArticleDOI
TL;DR: MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in the authors' implementation, and improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.
Abstract: This article describes Memory Squeeze (MemSZ), a new approach for lossy general-purpose memory compression. MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in our implementation. Our compressor is placed between the memory controller and the cache hierarchy of a processor to reduce the memory traffic of applications that tolerate approximations in parts of their data. Thereby, the available off-chip bandwidth is utilized more efficiently improving system performance and energy efficiency. Two alternative multi-core variants of the MemSZ system are described. The first variant has a shared last-level cache (LLC) on the processor-die, which is modified to store both compressed and uncompressed data. The second has a 3D-stacked DRAM cache with larger cache lines that match the granularity of the compressed memory blocks and stores only uncompressed data. For applications that tolerate aggressive approximation in large fractions of their data, MemSZ reduces baseline memory traffic by up to 81%, execution time by up to 62%, and energy costs by up to 25% introducing up to 1.8% error to the application output. Compared to the current state-of-the-art lossy memory compression design, MemSZ improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.
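
A toy, scalar version of the error-bounded prediction idea behind SZ-style compressors (this is not MemSZ's hardware design): each value is predicted from its predecessor and only a small quantised correction is stored, which keeps every reconstruction within the error bound.

def sz_like_compress(values, error_bound):
    """Predict each value from the previous reconstructed one and store the
    quantised residual, keeping every reconstruction within error_bound."""
    codes, prev = [], 0.0
    for v in values:
        residual = v - prev                       # predictor: previous value
        q = round(residual / (2.0 * error_bound)) # quantised correction
        codes.append(q)
        prev = prev + q * 2.0 * error_bound       # track what the decoder will see
    return codes

def sz_like_decompress(codes, error_bound):
    out, prev = [], 0.0
    for q in codes:
        prev = prev + q * 2.0 * error_bound
        out.append(prev)
    return out

if __name__ == "__main__":
    data = [1.00, 1.02, 1.05, 3.90, 3.92]
    restored = sz_like_decompress(sz_like_compress(data, 0.05), 0.05)
    assert all(abs(a - b) <= 0.05 + 1e-9 for a, b in zip(data, restored))
    print(restored)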

Journal ArticleDOI
TL;DR: A novel approximate memory controller architecture (AMC) is proposed that reduces the DRAM latency by opportunistically exploiting row buffer locality and bank level parallelism in memory request scheduling, and leverages approximability of the reply data from DRAM, to reduce the number of reply packets injected into the NoC.
Abstract: High interconnect bandwidth is crucial for achieving better performance in many-core GPGPU architectures that execute highly data parallel applications. The parallel warps of threads running on shader cores generate a high volume of read requests to the main memory due to the limited size of data caches at the shader cores. This leads to scenarios with rapid arrival of an even larger volume of reply data from the DRAM, which creates a bottleneck at memory controllers (MCs) that send reply packets back to the requesting cores over the network-on-chip (NoC). Coping with such high volumes of data requires intelligent memory scheduling and innovative NoC architectures. To mitigate memory bottlenecks in GPGPUs, we first propose a novel approximate memory controller architecture (AMC) that reduces the DRAM latency by opportunistically exploiting row buffer locality and bank level parallelism in memory request scheduling, and leverages approximability of the reply data from DRAM to reduce the number of reply packets injected into the NoC. To further realize high throughput and low energy communication in GPGPUs, we propose a low power, approximate NoC architecture (Dapper) that increases the utilization of the available network bandwidth by using single cycle overlay circuits for the reply traffic between MCs and shader cores. Experimental results show that Dapper and AMC together increase NoC throughput by up to 21 percent, and reduce NoC latency by up to 45.5 percent and energy consumed by the NoC and MC by up to 38.3 percent, with minimal impact on output accuracy, compared to state-of-the-art approximate NoC/MC architectures.

Journal ArticleDOI
TL;DR: Bandwidth-Aware Prefetch Configuration (BAPC) is proposed, a scalable adaptive prefetching algorithm that improves the performance of multi-program workloads and reduces bandwidth consumption relative to the IBM POWER8 default configuration.
Abstract: Advanced hardware prefetch engines are being integrated in current high-performance processors. Prefetching can boost the performance of most applications; however, the induced bandwidth consumption can lead the system to high contention for main memory bandwidth, which is a scarce resource in current multicores. In such a case, system performance can be severely damaged. This article characterizes application behavior on an IBM POWER8 machine, which offers many prefetch settings, under varying bandwidth contention. The study reveals that the best prefetch setting for each application depends on the main memory bandwidth availability, that is, it depends on the co-running applications. Based on this study, we propose Bandwidth-Aware Prefetch Configuration (BAPC), a scalable adaptive prefetching algorithm that improves the performance of multi-program workloads. BAPC increases the performance of the applications by 12, 15, and 16 percent for 6-, 8-, and 10-application workloads over the IBM POWER8 default configuration. In addition, BAPC reduces bandwidth consumption by 39, 42, and 45 percent, respectively.
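
A high-level sketch of the adaptive idea (the thresholds and the set of prefetch settings are invented for illustration): the runtime watches measured memory-bandwidth utilisation and dials an application's prefetcher down when the bus is congested and back up when bandwidth is free.

# Hypothetical prefetch settings ordered from least to most aggressive,
# loosely mirroring the many depth settings the POWER8 engine exposes.
SETTINGS = ["off", "shallow", "default", "deep", "deepest"]

def adjust_prefetch(current, bandwidth_utilisation,
                    high_water=0.85, low_water=0.60):
    """Step one application's prefetch setting based on memory-bus pressure."""
    idx = SETTINGS.index(current)
    if bandwidth_utilisation > high_water and idx > 0:
        return SETTINGS[idx - 1]          # contention: back off prefetching
    if bandwidth_utilisation < low_water and idx < len(SETTINGS) - 1:
        return SETTINGS[idx + 1]          # headroom: prefetch more aggressively
    return current

if __name__ == "__main__":
    setting = "default"
    for util in [0.92, 0.95, 0.40, 0.55]:  # e.g., utilisation sampled each interval
        setting = adjust_prefetch(setting, util)
        print(util, "->", setting)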