
Showing papers on "Registered memory" published in 2017


Proceedings ArticleDOI
14 Oct 2017
TL;DR: DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, is proposed to provide both powerful computing capability and large memory capacity/bandwidth to address the memory wall problem in traditional von Neumann architecture.
Abstract: Data movement between the processing units and the memory in traditional von Neumann architecture is creating the “memory wall” problem. To bridge the gap, two approaches, the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory), have been studied. However, the first one has strong computing capability but limited memory capacity/bandwidth, whereas the second one is the exact opposite. To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions with the combination of the functionally complete Boolean logic operations and the proposed hierarchical internal data movement designs. We further optimize DRISA to achieve high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations.

CCS Concepts: • Hardware → Dynamic memory; • Computer systems organization → Reconfigurable computing; Neural networks.
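
The abstract's claim that bitline NOR is functionally complete is easy to see on the host side. The sketch below (plain C on 64-bit words standing in for bitline vectors, not DRISA's actual in-DRAM microcode) builds NOT, OR, AND, and XOR out of nothing but NOR.

```c
/* Illustration only: composing other Boolean ops from bitwise NOR,
 * the functionally complete primitive the DRISA abstract describes.
 * Operates on 64-bit words standing in for DRAM bitline vectors. */
#include <stdint.h>
#include <stdio.h>

static uint64_t nor(uint64_t a, uint64_t b)  { return ~(a | b); }

static uint64_t not_(uint64_t a)             { return nor(a, a); }
static uint64_t or_(uint64_t a, uint64_t b)  { return not_(nor(a, b)); }
static uint64_t and_(uint64_t a, uint64_t b) { return nor(not_(a), not_(b)); }
static uint64_t xor_(uint64_t a, uint64_t b) { return or_(and_(a, not_(b)), and_(not_(a), b)); }

int main(void) {
    uint64_t a = 0xF0F0F0F0F0F0F0F0ULL, b = 0xFF00FF00FF00FF00ULL;
    printf("AND %016llx\n", (unsigned long long)and_(a, b));
    printf("OR  %016llx\n", (unsigned long long)or_(a, b));
    printf("XOR %016llx\n", (unsigned long long)xor_(a, b));
    return 0;
}
```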

315 citations


Journal ArticleDOI
TL;DR: A tool is designed that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies, and a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels is introduced.
Abstract: Historically, server designers have opted for simple memory systems by picking one of a few commoditized DDR memory products. We are already witnessing a major upheaval in the off-chip memory hierarchy, with the introduction of many new memory products—buffer-on-board, LRDIMM, HMC, HBM, and NVMs, to name a few. Given the plethora of choices, it is expected that different vendors will adopt different strategies for their high-capacity memory systems, often deviating from DDR standards and/or integrating new functionality within memory systems. These strategies will likely differ in their choice of interconnect and topology, with a significant fraction of memory energy being dissipated in I/O and data movement. To make the case for memory interconnect specialization, this paper makes three contributions. First, we design a tool that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies. The tool is validated against SPICE models, and is integrated into version 7 of the popular CACTI package. Our analysis with the tool shows that several design parameters have a significant impact on I/O power. We then use the tool to help craft novel specialized memory system channels. We introduce a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels. We show that this simple change to the channel topology can improve performance by 22% for DDR DRAM and lower cost by up to 65% for DDR DRAM. This new architecture does not require any changes to DIMMs, and it efficiently supports hybrid DRAM/NVM systems. Finally, as an example of a more disruptive architecture, we design a custom DIMM and parallel bus that moves away from the DDR3/DDR4 standards. To reduce energy and improve performance, the baseline data channel is split into three narrow parallel channels and the on-DIMM interconnects are operated at a lower frequency. In addition, this allows us to design a two-tier error protection strategy that reduces data transfers on the interconnect. This architecture yields a performance improvement of 18% and a memory power reduction of 23%. The cascaded channel and narrow channel architectures serve as case studies for the new tool and show the potential for benefit from re-organizing basic memory interconnects.

217 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: DUDETM is presented, a crash-consistent durable transaction system that avoids the drawbacks of both undo logging and redo logging and can be implemented with existing hardware TMs with minor hardware modifications, leading to a further 1.7× speedup.
Abstract: Emerging non-volatile memory (NVM) offers non-volatility, byte-addressability and fast access at the same time. To make the best use of these properties, it has been shown by empirical evidence that programs should access NVM directly through CPU load and store instructions, so that the overhead of a traditional file system or database can be avoided. Thus, durable transactions become a common choice of applications for accessing persistent memory data in a crash consistent manner. However, existing durable transaction systems employ either undo logging, which requires a fence for every memory write, or redo logging, which requires intercepting all memory reads within transactions. This paper presents DUDETM, a crash-consistent durable transaction system that avoids the drawbacks of both undo logging and redo logging. DUDETM uses shadow DRAM to decouple the execution of a durable transaction into three fully asynchronous steps. The advantage is that only minimal fences and no memory read instrumentation are required. This design also enables an out-of-the-box transactional memory (TM) to be used as an independent component in our system. The evaluation results show that DUDETM adds durability to a TM system with only 7.4%–24.6% throughput degradation. Compared to the existing durable transaction systems, DUDETM provides 1.7× to 4.4× higher throughput. Moreover, DUDETM can be implemented with existing hardware TMs with minor hardware modifications, leading to a further 1.7× speedup.
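
The abstract describes decoupling a durable transaction into three fully asynchronous steps via a DRAM shadow, but does not spell them out. A minimal single-threaded sketch of one plausible decomposition (run the transaction against the DRAM shadow while building a redo log, persist the log, then replay it into the persistent image) follows; the arrays, names, and the absence of real flushes and fences are all simplifications.

```c
/* A minimal, single-threaded sketch of a redo-log-based durable transaction
 * decoupled into three steps, loosely following the DUDETM abstract:
 * (1) execute in a DRAM shadow while appending a redo log,
 * (2) persist the log, (3) replay the log into the persistent image.
 * NVM, fences, and asynchrony are all simulated; names are illustrative. */
#include <stdio.h>
#include <string.h>

#define MEM_WORDS 16
#define LOG_CAP   64

static long nvm_image[MEM_WORDS];   /* stands in for persistent memory */
static long dram_shadow[MEM_WORDS]; /* stands in for the DRAM shadow   */

struct log_entry { int addr; long val; };
static struct log_entry redo_log[LOG_CAP];
static int log_len = 0;

/* Step 1: run the transaction against the shadow only. */
static void tx_write(int addr, long val) {
    dram_shadow[addr] = val;
    redo_log[log_len].addr = addr;
    redo_log[log_len].val  = val;
    log_len++;
}

/* Step 2: persist the redo log (a real system would flush + fence here). */
static void persist_log(void) {
    /* placeholder: the log array itself serves as our "persistent" log */
}

/* Step 3: replay the persisted log into the persistent image. */
static void reproduce(void) {
    for (int i = 0; i < log_len; i++)
        nvm_image[redo_log[i].addr] = redo_log[i].val;
    log_len = 0;
}

int main(void) {
    memcpy(dram_shadow, nvm_image, sizeof nvm_image);
    tx_write(3, 42);
    tx_write(7, 99);
    persist_log();
    reproduce();
    printf("nvm[3]=%ld nvm[7]=%ld\n", nvm_image[3], nvm_image[7]);
    return 0;
}
```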

179 citations


Journal ArticleDOI
07 Aug 2017
TL;DR: With a combination of high performance and nonvolatility, the arrival of 3D XPoint memory promises to fundamentally change the memory-storage hierarchy at the hardware, system software, and application levels.
Abstract: With a combination of high performance and nonvolatility, the arrival of 3D XPoint memory promises to fundamentally change the memory-storage hierarchy at the hardware, system software, and application levels. This memory will be deployed first as a block addressable storage device, known as the Intel Optane SSD, and even in this familiar form it will drive basic system change. Access times consistently as fast, or faster, than the rest of the system will blur the line between storage and memory. The low latencies from these solid-state drives (SSDs) allow rethinking even basic storage methodologies to be more memory-like. For example, the manner in which storage performance is measured shifts from input–output operations (IOs) at a given queue depth to response time for a given load, like memory is typically measured. System changes to match the low latency of these SSDs are already advanced, and in many cases they enable the application to utilize the SSD’s performance. In other cases, additional work is required, particularly on policies set originally with slow storage in mind. On top of these already-capable systems are real applications. System-level tests show that applications such as key–value stores and real-time analytics can benefit immediately. These application benefits include significantly faster runtime (up to 3×) and access to larger data sets than supported in DRAM. Newly viable mechanisms for expanding application memory footprint include native application support or native operating system paging, a significant change in the use of SSDs. The next step in this convergence is 3D XPoint memory accessed through processor load/store operations. Significant operating system support is already in place. The implications of consistently low latency storage and fast persistent memory on computing are great, with applications and systems that take advantage of this new technology as storage being the first to benefit.

179 citations


Journal ArticleDOI
TL;DR: A novel multidimensional dynamic programming data allocation (MDPDA) algorithm is proposed to strategically allocate data blocks to each memory, and experimental results show that MDPDA can efficiently reduce the memory access cost and extend the lifetime of MRAM.
Abstract: Resource scheduling is one of the most important issues in mobile cloud computing due to the constraints in memory, CPU, and bandwidth. High energy consumption and low performance of memory accesses have become overwhelming obstacles for chip multiprocessor (CMP) systems used in cloud systems. In order to address the daunting “memory wall” problem, hybrid on-chip memory architecture has been widely investigated recently. Due to its advantages in size, real-time predictability, power, and software controllability, scratchpad memory (SPM) is a promising technique to replace the hardware cache and bridge the processor–memory gap for CMP systems. In this paper, we present a novel hybrid on-chip SPM that consists of a static random access memory (RAM), a magnetic RAM (MRAM), and a zero-capacitor RAM for CMP systems by fully taking advantage of the benefits of each type of memory. To reduce memory access latency, energy consumption, and the number of write operations to MRAM, we also propose a novel multidimensional dynamic programming data allocation (MDPDA) algorithm to strategically allocate data blocks to each memory. Experimental results show that the proposed MDPDA algorithm can efficiently reduce the memory access cost and extend the lifetime of MRAM.
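
The MDPDA algorithm itself is not given in the abstract; the toy below only illustrates the general shape of a dynamic-programming data allocation: each block goes to SRAM, MRAM, or ZRAM, capacities are enforced in the DP state, and the recurrence picks the placement with minimum total access cost. All costs, capacities, and the cost model are invented for the example.

```c
/* A toy dynamic-programming allocator in the spirit of the MDPDA idea:
 * assign each data block to SRAM, MRAM, or ZRAM to minimize total access
 * cost under per-memory capacity limits. Costs and capacities are made up
 * for illustration; the real algorithm and cost model are in the paper. */
#include <stdio.h>

#define NBLK 5
#define SRAM_CAP 2
#define MRAM_CAP 2

int main(void) {
    /* cost[b][m]: access cost of block b if placed in memory m
     * (m = 0 SRAM, 1 MRAM, 2 ZRAM); e.g. a write-heavy block is
     * made expensive in MRAM because writes wear it out. */
    int cost[NBLK][3] = {
        { 2, 9, 5 }, { 1, 3, 4 }, { 2, 8, 6 }, { 1, 2, 3 }, { 3, 7, 5 }
    };
    static int dp[NBLK + 1][SRAM_CAP + 1][MRAM_CAP + 1];

    for (int s = 0; s <= SRAM_CAP; s++)          /* base case: no blocks left */
        for (int m = 0; m <= MRAM_CAP; m++)
            dp[NBLK][s][m] = 0;

    for (int b = NBLK - 1; b >= 0; b--)
        for (int s = 0; s <= SRAM_CAP; s++)
            for (int m = 0; m <= MRAM_CAP; m++) {
                int best = cost[b][2] + dp[b + 1][s][m];       /* ZRAM */
                if (s < SRAM_CAP) {
                    int c = cost[b][0] + dp[b + 1][s + 1][m];  /* SRAM */
                    if (c < best) best = c;
                }
                if (m < MRAM_CAP) {
                    int c = cost[b][1] + dp[b + 1][s][m + 1];  /* MRAM */
                    if (c < best) best = c;
                }
                dp[b][s][m] = best;
            }

    printf("minimum total access cost: %d\n", dp[0][0][0]);
    return 0;
}
```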

142 citations


Proceedings ArticleDOI
18 Jun 2017
TL;DR: This paper proposes an ultra-efficient approximate processing in-memory architecture, called APIM, which exploits the analog characteristics of non-volatile memories to support addition and multiplication inside the crossbar memory, while storing the data.
Abstract: Recent years have witnessed a rapid growth in the domain of Internet of Things (IoT). This network of billions of devices generates and exchanges huge amounts of data. The limited cache capacity and memory bandwidth make transferring and processing such data on traditional CPUs and GPUs highly inefficient, both in terms of energy consumption and delay. However, many IoT applications are statistical at heart and can tolerate some inaccuracy in their computation. This enables designers to reduce the complexity of processing by approximating the results to a desired accuracy. In this paper, we propose an ultra-efficient approximate processing in-memory architecture, called APIM, which exploits the analog characteristics of non-volatile memories to support addition and multiplication inside the crossbar memory, while storing the data. The proposed design eliminates the overhead involved in transferring data to the processor by virtually bringing the processor inside memory. APIM dynamically configures the precision of computation for each application in order to tune the level of accuracy during runtime. Our experimental evaluation running six general OpenCL applications shows that the proposed design achieves up to 20× performance improvement and provides 480× improvement in energy-delay product, ensuring acceptable quality of service. In exact mode, it achieves 28× energy savings and 4.8× speedup compared to the state-of-the-art GPU cores.

79 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: UH-MEM as discussed by the authors is a page management mechanism for various hybrid memories that systematically estimates the utility of migrating a page between different memory types, and uses this information to guide data placement.
Abstract: While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.
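
As a rough illustration of the utility estimate the abstract describes, the sketch below combines the three per-page factors (access frequency, row buffer locality, and memory-level parallelism) into a migration benefit and scales it by an application weight to approximate the system-level benefit. The formula and constants are illustrative stand-ins, not UH-MEM's actual model.

```c
/* A simplified utility estimate in the spirit of UH-MEM: the benefit of
 * migrating a page to fast memory grows with its access count and with the
 * latency gap, shrinks when accesses hit the row buffer (both memories then
 * behave similarly) or overlap with other misses (high MLP), and is finally
 * scaled by how much the application contributes to system performance.
 * All constants and the exact formula here are illustrative, not the paper's. */
#include <stdio.h>

struct page_stats {
    double accesses;        /* accesses to the page in the last interval     */
    double row_hit_rate;    /* fraction of accesses that hit the row buffer  */
    double mlp;             /* average number of overlapping misses (>= 1)   */
};

static double migration_utility(struct page_stats p,
                                double slow_miss_lat, double fast_miss_lat,
                                double row_hit_lat, double app_weight)
{
    double slow = p.row_hit_rate * row_hit_lat + (1 - p.row_hit_rate) * slow_miss_lat;
    double fast = p.row_hit_rate * row_hit_lat + (1 - p.row_hit_rate) * fast_miss_lat;
    double stall_saved = p.accesses * (slow - fast) / p.mlp; /* overlap hides latency */
    return app_weight * stall_saved;                         /* system-level benefit  */
}

int main(void) {
    struct page_stats hot  = { 500, 0.2, 1.5 };
    struct page_stats cold = {  20, 0.8, 4.0 };
    printf("hot page utility:  %.1f\n", migration_utility(hot,  300, 100, 40, 0.7));
    printf("cold page utility: %.1f\n", migration_utility(cold, 300, 100, 40, 0.7));
    return 0;
}
```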

78 citations


Book ChapterDOI
TL;DR: RowClone, a mechanism that exploits DRAM technology to perform bulk copy and initialization operations completely inside main memory, and a complementary work that uses DRAM to perform bulk bitwise AND and OR operations inside main memory significantly improve the performance and energy efficiency of the respective operations.
Abstract: In existing systems, the off-chip memory interface allows the memory controller to perform only read or write operations. Therefore, to perform any operation, the processor must first read the source data and then write the result back to memory after performing the operation. This approach incurs high latency and consumes significant bandwidth and energy for operations that work on a large amount of data. Several works have proposed techniques to process data near memory by adding a small amount of compute logic closer to the main memory chips. In this chapter, we describe two techniques proposed by recent works that take this approach of processing in memory further by exploiting the underlying operation of the main memory technology to perform more complex tasks. First, we describe RowClone, a mechanism that exploits DRAM technology to perform bulk copy and initialization operations completely inside main memory. We then describe a complementary work that uses DRAM to perform bulk bitwise AND and OR operations inside main memory. These two techniques significantly improve the performance and energy efficiency of the respective operations.

75 citations


Journal ArticleDOI
Wang Kang, Haotian Wang, Zhaohao Wang, Youguang Zhang, Weisheng Zhao
12 May 2017
TL;DR: A cost-efficient IMP/NMP solution in spin-transfer torque magnetic random access memory (STT–MRAM) without adding any processing units on the memory chip is proposed.
Abstract: In the current big data era, the memory wall issue between the processor and the memory becomes one of the most critical bottlenecks for the conventional von Neumann computer architecture. In-memory processing (IMP) or near-memory processing (NMP) paradigms have been proposed to address this problem by adding a small amount of processing units inside/near the memory. Unfortunately, although intensively studied, prior IMP/NMP platforms are practically unsuccessful because of the fabrication complexity and cost inefficiency of integrating the processing units and memory on the same chip. Recently, emerging nonvolatile memories provide new possibilities for efficiently implementing the IMP/NMP paradigm. In this paper, we propose a cost-efficient IMP/NMP solution in spin-transfer torque magnetic random access memory (STT–MRAM) without adding any processing units on the memory chip. The key idea behind the proposed IMP/NMP solution is to exploit the peripheral circuitry already existing inside memory (or with minimal changes) to perform bitwise logic operations. Such an IMP/NMP platform enables rather fast logic operations, as the logic results can be obtained immediately through just a memory-like readout operation. Memory read and logic NOT, AND/NAND, and OR/NOR operations can be achieved and dynamically configured within the same STT–MRAM chip. Functionality and performance are evaluated with hybrid simulations under the 40 nm technology node.

67 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: TEMPO as mentioned in this paper is a low-overhead hardware mechanism to boost memory performance by exploiting the operating system's (OS) virtual memory subsystem, prefetching into DRAM row buffers and on-chip caches the data that page table entries point to.
Abstract: We propose translation-enabled memory prefetching optimizations or TEMPO, a low-overhead hardware mechanism to boost memory performance by exploiting the operating system's (OS) virtual memory subsystem. We are the first to make the following observations: (1) a substantial fraction (20-40%) of DRAM references in modern big-data workloads are devoted to accessing page tables; and (2) when memory references require page table lookups in DRAM, the vast majority of them (98%+) also look up DRAM for the subsequent data access. TEMPO exploits these observations to enable DRAM row-buffer and on-chip cache prefetching of the data that page tables point to. TEMPO requires trivial changes to the memory controller (under 3% additional area), no OS or application changes, and improves performance by 10-30% and energy by 1-14%.
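
The core trigger is simple to express in software: when a page-table-entry fetch goes to DRAM, the controller already holds the frame number the PTE names and can prefetch the data line that the faulting virtual address will touch. The sketch below models just that decision; the PTE layout, 64 B line granularity, and function names are simplified assumptions, not TEMPO's hardware.

```c
/* A schematic sketch of the TEMPO idea: when the memory controller services
 * a page-table-entry (PTE) fetch from DRAM, it already sees the physical
 * frame that the PTE names, so it can prefetch the data line that the
 * original virtual address will touch. Structures and field layouts are
 * simplified stand-ins, not the paper's hardware. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/* Pretend prefetch hook: open the DRAM row / fill the cache for this line. */
static void prefetch_line(uint64_t phys_addr) {
    printf("prefetch data line at 0x%llx\n", (unsigned long long)phys_addr);
}

/* Called when a PTE fetch misses to DRAM during a page walk. */
static void on_pte_fetch_from_dram(uint64_t pte, uint64_t faulting_vaddr) {
    uint64_t frame  = pte >> PAGE_SHIFT;                      /* frame number in PTE */
    uint64_t offset = faulting_vaddr & ((1ULL << PAGE_SHIFT) - 1);
    prefetch_line((frame << PAGE_SHIFT) | (offset & ~63ULL)); /* 64 B line */
}

int main(void) {
    uint64_t pte = (0x1a2b3ULL << PAGE_SHIFT) | 0x1;          /* frame + present bit */
    on_pte_fetch_from_dram(pte, 0x7f0000123a48ULL);
    return 0;
}
```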

58 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: In this article, a lightweight runtime solution is presented that automatically and transparently manages data placement on HMS, without requiring hardware modifications or disruptive changes to applications, to bridge the performance gap between NVM and DRAM.
Abstract: Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace DRAM as main memory. However, because of the relatively high latency and low bandwidth of NVM, NVM is often paired with DRAM to build a heterogeneous memory system (HMS). As a result, data objects of the application must be carefully placed in NVM and DRAM for best performance. In this paper, we introduce a lightweight runtime solution that automatically and transparently manages data placement on HMS without requiring hardware modifications or disruptive changes to applications. Leveraging online profiling and performance models, the runtime characterizes memory access patterns associated with data objects, and minimizes unnecessary data movement. Our runtime solution effectively bridges the performance gap between NVM and DRAM. We demonstrate that using NVM to replace the majority of DRAM can be a feasible solution for future HPC systems with the assistance of software-based data management.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: It is demonstrated that smart memory, memory with compute capability and a packetized interface, can dramatically simplify this problem and achieve one to two orders of magnitude lower overheads for performance, space, energy, and memory bandwidth, compared to prior solutions.
Abstract: A practically feasible low-overhead hardware design that provides strong defenses against memory bus side channels remains elusive. This paper observes that smart memory, memory with compute capability and a packetized interface, can dramatically simplify this problem. InvisiMem expands the trust base to include the logic layer in the smart memory to implement cryptographic primitives, which aid in addressing several memory bus side channel vulnerabilities efficiently. This allows the secure host processor to send encrypted addresses over the untrusted memory bus, and thereby eliminates the need for expensive address obfuscation techniques based on Oblivious RAM (ORAM). In addition, smart memory enables efficient solutions for ensuring freshness without using expensive Merkle trees, and mitigates the memory bus timing channel using constant heart-beat packets. We demonstrate that InvisiMem designs have one to two orders of magnitude lower overheads for performance, space, energy, and memory bandwidth, compared to prior solutions.

Journal ArticleDOI
TL;DR: A dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed and results have shown that the DARP improved the memory access efficiency by 25.4%.
Abstract: The increasing demand on main memory capacity is one of the main big data challenges. Dynamic random access memory (DRAM) does not represent the best choice for a main memory, due to high power consumption and low density. However, nonvolatile memory, such as phase-change memory (PCM), represents an additional choice because of its low power consumption and high-density characteristics. Nevertheless, high access latency and limited write endurance currently prevent PCM from replacing DRAM. Therefore, a hybrid memory, which combines both DRAM and PCM, has become a good alternative to the traditional DRAM memory. The disadvantages of both DRAM and PCM are challenges for the hybrid memory. In this paper, a dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed. The DARP distinguishes the cache data into the PCM data and the DRAM data; then, the algorithm adopts different replacement policies for each data type. Specifically, for the PCM data, the least recently used (LRU) replacement policy is adopted, and for the DRAM data, the DARP is employed according to the process behavior. Experimental results have shown that the DARP improved the memory access efficiency by 25.4%.
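
The abstract only states that PCM-backed and DRAM-backed cache data are treated with different replacement policies. The sketch below shows one plausible simplification of such a hybrid-aware victim choice (prefer evicting DRAM-backed lines, fall back to LRU among PCM-backed lines), which is not necessarily the DARP policy itself.

```c
/* A hedged sketch of a hybrid-memory-aware LLC victim choice in the spirit
 * of DARP: lines backed by PCM and lines backed by DRAM are treated
 * differently, because evicting a dirty PCM-backed line costs a slow,
 * endurance-limited PCM write while a DRAM writeback is cheap. The exact
 * policy in the paper differs; this is one plausible simplification. */
#include <stdio.h>

enum backing { DRAM_DATA, PCM_DATA };

struct line {
    int valid, dirty;
    enum backing mem;
    unsigned lru_age;     /* larger = older */
};

/* Pick a victim in one set: prefer the oldest DRAM-backed line; fall back to
 * LRU among PCM-backed lines only if no DRAM-backed candidate exists. */
static int pick_victim(struct line set[], int ways) {
    int victim = -1;
    for (int pass = 0; pass < 2 && victim < 0; pass++) {
        enum backing want = (pass == 0) ? DRAM_DATA : PCM_DATA;
        unsigned best_age = 0;
        for (int w = 0; w < ways; w++)
            if (set[w].valid && set[w].mem == want && set[w].lru_age >= best_age) {
                best_age = set[w].lru_age;
                victim = w;
            }
    }
    return victim;
}

int main(void) {
    struct line set[4] = {
        {1, 1, PCM_DATA, 9}, {1, 0, DRAM_DATA, 3},
        {1, 1, DRAM_DATA, 7}, {1, 0, PCM_DATA, 5},
    };
    printf("victim way: %d\n", pick_victim(set, 4)); /* expect way 2 */
    return 0;
}
```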

Journal ArticleDOI
TL;DR: A novel memory architecture called a resource-efficient SRAM-based TCAM (REST), which emulates TCAM functionality using optimal resources and increases the overall emulated TCAM bits/SRAM at the cost of reduced throughput.
Abstract: Static random access memory (SRAM)-based ternary content addressable memory (TCAM) offers TCAM functionality by emulating it with SRAM. However, this emulation suffers from reduced memory efficiency while mapping the TCAM table on SRAM units. This is due to the limited capacity of the physical addresses in the SRAM unit. This brief offers a novel memory architecture called a resource-efficient SRAM-based TCAM (REST), which emulates TCAM functionality using optimal resources. The SRAM unit is divided into multiple virtual blocks to store the address information presented in the TCAM table. This approach virtually increases the overall address space of the SRAM unit, mapping a greater portion of the TCAM table in SRAM and increasing the overall emulated TCAM bits/SRAM at the cost of reduced throughput. A 72 × 28-bit REST consumes only one 36-kbit SRAM and a few distributed RAMs via implementation on a Xilinx Kintex-7 field-programmable gate array. It uses only 3.5% of the memory resources compared with a conventional SRAM-based TCAM (hybrid-partitioned TCAM).
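
REST builds on the standard way SRAM emulates TCAM, which the sketch below reproduces in miniature: each stride of the search key indexes an SRAM word holding a bitvector of entries whose pattern matches that stride, and ANDing the per-stride vectors yields the matching entry. REST's virtual-block partitioning of the SRAM is not modeled; sizes and rules are toy values.

```c
/* A generic sketch of how SRAM emulates TCAM, the mechanism REST builds on:
 * the search key is cut into strides, each stride value indexes an SRAM word
 * holding a bitvector of TCAM entries whose pattern (value + don't-care mask)
 * matches that stride, and the per-stride bitvectors are ANDed to find the
 * matching entry. */
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 4
#define STRIDES 2                       /* 8-bit key as two 4-bit strides */

struct rule { uint8_t value, care; };   /* care bit 1 = must match */

static uint32_t sram[STRIDES][16];      /* bitvector of matching entries */

static void build_tables(const struct rule r[ENTRIES]) {
    for (int s = 0; s < STRIDES; s++)
        for (int v = 0; v < 16; v++) {
            uint32_t vec = 0;
            for (int e = 0; e < ENTRIES; e++) {
                uint8_t rv = (r[e].value >> (4 * s)) & 0xF;
                uint8_t rc = (r[e].care  >> (4 * s)) & 0xF;
                if (((v ^ rv) & rc) == 0) vec |= 1u << e;
            }
            sram[s][v] = vec;
        }
}

static int lookup(uint8_t key) {
    uint32_t match = ~0u;
    for (int s = 0; s < STRIDES; s++)
        match &= sram[s][(key >> (4 * s)) & 0xF];
    for (int e = 0; e < ENTRIES; e++)   /* lowest index = highest priority */
        if (match & (1u << e)) return e;
    return -1;
}

int main(void) {
    struct rule rules[ENTRIES] = {
        { 0xA5, 0xFF },   /* exact match on 0xA5 */
        { 0xA0, 0xF0 },   /* 0xAx, low nibble is don't-care */
        { 0x00, 0x00 },   /* matches everything */
        { 0x3C, 0xFF },
    };
    build_tables(rules);
    printf("lookup 0xA5 -> entry %d\n", lookup(0xA5)); /* 0 */
    printf("lookup 0xA7 -> entry %d\n", lookup(0xA7)); /* 1 */
    printf("lookup 0x11 -> entry %d\n", lookup(0x11)); /* 2 */
    return 0;
}
```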

Journal ArticleDOI
TL;DR: A cycle-accurate simulator for hybrid memory cube called CasHMC provides a cycle-by-cycle simulation of every module in an HMC and generates analysis results including a bandwidth graph and statistical data.
Abstract: 3D-stacked DRAM has been actively studied to overcome the limits of conventional DRAM. The Hybrid Memory Cube (HMC) is a type of 3D-stacked DRAM that has drawn great attention because of its usability for server systems and processing-in-memory (PIM) architecture. Since HMC is not directly stacked on the processor die where the central processing units (CPUs) and graphic processing units (GPUs) are integrated, HMC has to be linked to other processor components through high speed serial links. Therefore, the communication bandwidth and latency should be carefully estimated to evaluate the performance of HMC. However, most existing HMC simulators employ only simple HMC modeling. In this paper, we propose a cycle-accurate simulator for hybrid memory cube called CasHMC. It provides a cycle-by-cycle simulation of every module in an HMC and generates analysis results including a bandwidth graph and statistical data. Furthermore, CasHMC is implemented in C++ as a single wrapped object that includes an HMC controller, communication links, and HMC memory. Instantiating this single wrapped object facilitates simultaneous simulation in parallel with other simulators that generate memory access patterns such as a processor simulator or a memory trace generator.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: MemPod is proposed, a scalable and efficient memory management mechanism for flat address space hybrid memories that improves the average main memory access time of multi-programmed workloads, by up to 29% compared to the state of the art.
Abstract: In the near future, die-stacked DRAM will be increasingly present in conjunction with off-chip memories in hybrid memory systems. Research on this subject revolves around using the stacked memory as a cache or as part of a flat address space. This paper proposes MemPod, a scalable and efficient memory management mechanism for flat address space hybrid memories. MemPod monitors memory activity and periodically migrates the most frequently accessed memory pages to the faster on-chip memory. MemPod's partitioned architectural organization allows for efficient scaling with memory system capabilities. Further, a big data analytics algorithm is adapted to develop an efficient, low-cost activity tracking technique. MemPod improves the average main memory access time of multi-programmed workloads by up to 29% (9% on average) compared to the state of the art, and that will increase as the differential between memory speeds widens. MemPod's novel activity tracking approach leads to significant cost reduction (12,800× lower storage space requirements) and improved future prediction accuracy over prior work which maintains a separate counter per page.
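
The abstract says MemPod adapts a big-data analytics algorithm for low-cost activity tracking but does not name it here. The sketch below uses a Misra-Gries style frequent-items summary as one standard low-cost choice: a handful of counters approximate the hottest pages over an interval, and the survivors are migrated to fast memory when the interval ends. The counter count, interval handling, and migration call are illustrative.

```c
/* A sketch of interval-based hot-page tracking with a small, fixed set of
 * counters (a Misra-Gries style "frequent items" summary), followed by
 * migrating the surviving pages to fast memory at the end of the interval.
 * The specific algorithm, parameters, and migration hook are stand-ins. */
#include <stdio.h>
#include <string.h>

#define COUNTERS 4

struct slot { long page; unsigned count; int used; };
static struct slot slots[COUNTERS];

static void track_access(long page) {
    for (int i = 0; i < COUNTERS; i++)             /* already tracked?   */
        if (slots[i].used && slots[i].page == page) { slots[i].count++; return; }
    for (int i = 0; i < COUNTERS; i++)             /* free counter?      */
        if (!slots[i].used) { slots[i] = (struct slot){ page, 1, 1 }; return; }
    for (int i = 0; i < COUNTERS; i++)             /* else decrement all */
        if (--slots[i].count == 0) slots[i].used = 0;
}

static void end_of_interval_migrate(void) {
    for (int i = 0; i < COUNTERS; i++)
        if (slots[i].used)
            printf("migrate page %ld (approx count %u) to fast memory\n",
                   slots[i].page, slots[i].count);
    memset(slots, 0, sizeof slots);                /* start a new interval */
}

int main(void) {
    long trace[] = { 7, 7, 3, 7, 9, 3, 7, 1, 2, 3, 7, 4 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        track_access(trace[i]);
    end_of_interval_migrate();
    return 0;
}
```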

Journal ArticleDOI
TL;DR: This paper proposes to partially replace DRAM using PCM to optimize the management of flash memory metadata for better system reliability in the presence of power failure and system crash, and presents a write-activity-aware PCM-assisted flash memory management scheme, called PCM-FTL.
Abstract: Phase change memory (PCM) is a promising DRAM alternative because of its non-volatility, high density, low standby power and close-to-DRAM performance. These features make PCM an attractive solution to optimize the management of NAND flash memory in embedded systems. However, PCM's limited write endurance hinders its application in embedded systems. Therefore, how to manage flash memory with PCM—particularly guarantee PCM a reasonable lifetime—becomes a challenging issue. In this paper, we propose to partially replace DRAM using PCM to optimize the management of flash memory metadata for better system reliability in the presence of power failure and system crash. To prolong PCM's lifetime, we present a write-activity-aware PCM-assisted flash memory management scheme, called PCM-FTL. By differentiating sequential and random I/O behaviors, a novel two-level mapping mechanism and a customized wear-leveling scheme are developed to reduce writes to PCM and extend its lifetime. We evaluate PCM-FTL with a variety of general-purpose and mobile I/O workloads. Experimental results show that PCM-FTL can significantly reduce write activities and achieve an even distribution of writes in PCM with very low overhead.

Journal ArticleDOI
TL;DR: This paper proposes the smallest solution for soft-error tolerant embedded memory yet to be presented, based on a four-transistor dynamic memory core that internally stores complementary data values to provide an inherent per-bit error detection capability.
Abstract: The limited size and power budgets of space-bound systems often contradict the requirements for reliable circuit operation within high-radiation environments. In this paper, we propose the smallest solution for soft-error tolerant embedded memory yet to be presented. The proposed complementary dual-modular redundancy (CDMR) memory is based on a four-transistor dynamic memory core that internally stores complementary data values to provide an inherent per-bit error detection capability. By adding simple, low-overhead parity, an error-correction capability is added to the memory architecture for robust soft-error protection. The proposed memory was implemented in a 65-nm CMOS technology, displaying as much as a 3.5× smaller silicon footprint than other radiation-hardened bitcells. In addition, the CDMR memory consumes between 48% and 87% less standby power than other considered solutions across the entire operating region.
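
A software reading of the CDMR scheme in the abstract: storing each bit with its complement makes a single upset detectable per bit (the two copies agree), and one parity bit over the word indicates which copy flipped, so the word can be corrected. The model below demonstrates that logic; the actual 4T circuit and array organization are of course not captured.

```c
/* A simplified software model of the CDMR idea described in the abstract:
 * each bit is stored together with its complement, so a single upset makes
 * the two copies agree (both 0 or both 1) and is detected per bit; one
 * parity bit over the word then tells which copy to trust, giving
 * single-error correction. Circuit details in the paper are not modeled. */
#include <stdint.h>
#include <stdio.h>

struct cdmr_word { uint8_t data, comp, parity; };   /* comp should be ~data */

static uint8_t parity8(uint8_t v) { return (uint8_t)__builtin_parity(v); }

static struct cdmr_word store(uint8_t v) {
    return (struct cdmr_word){ v, (uint8_t)~v, parity8(v) };
}

static uint8_t load_corrected(struct cdmr_word w) {
    uint8_t err = (uint8_t)~(w.data ^ w.comp);      /* 1 where the copies agree */
    if (!err) return w.data;                        /* no error detected        */
    if (parity8(w.data) != w.parity)
        return w.data ^ err;                        /* data copy flipped: fix it */
    return w.data;                                  /* complement copy flipped   */
}

int main(void) {
    struct cdmr_word w = store(0x5A);
    w.data ^= 0x08;                                 /* inject an upset in data   */
    printf("corrected: 0x%02X\n", load_corrected(w));   /* 0x5A again */
    struct cdmr_word u = store(0x5A);
    u.comp ^= 0x20;                                 /* upset in the complement   */
    printf("corrected: 0x%02X\n", load_corrected(u));   /* still 0x5A */
    return 0;
}
```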

Journal ArticleDOI
TL;DR: This paper explores the interactions between DRAM and PCM to improve both the performance and the endurance of a DRAM-PCM hybrid main memory and develops a proactive strategy to allocate pages taking both program segments and DRAM conflict misses into consideration.
Abstract: Phase change memory (PCM), given its nonvolatility, potential high density, and low standby power, is a promising candidate to be used as main memory in next generation computer systems. However, to hide its shortcomings of limited endurance and slow write performance, state-of-the-art solutions tend to construct a dynamic RAM (DRAM)-PCM hybrid memory and place write-intensive pages in DRAM. While existing optimizations to this hybrid architecture focus on tuning DRAM configurations to reduce the number of write operations to PCM, this paper explores the interactions between DRAM and PCM to improve both the performance and the endurance of a DRAM-PCM hybrid main memory. Specifically, it exploits the flexibility of mapping virtual pages to physical pages, and develops a proactive strategy to allocate pages taking both program segments and DRAM conflict misses into consideration, thus distributing those heavily written pages across different DRAM sets. Meanwhile, a lifetime-aware DRAM replacement algorithm and a conflict-aware page remapping strategy are proposed to further reduce DRAM misses and PCM writes. Experiments confirm that the proposed techniques are able to improve average memory hit time and reduce maximum PCM write counts thus enhancing both performance and lifetime of a DRAM-PCM hybrid main memory.

Proceedings ArticleDOI
04 Apr 2017
TL;DR: This work designs and implements page fault support for InfiniBand and Ethernet NICs, and demonstrates that the solution provides all the benefits associated with "regular" virtual memory, notably a simpler programming model that rids users of the need to pin and the ability to employ all the canonical memory optimizations, such as memory overcommitment and demand-paging based on actual use.
Abstract: Direct network I/O allows network controllers (NICs) to expose multiple instances of themselves, to be used by untrusted software without a trusted intermediary. Direct I/O thus frees researchers from legacy software, fueling studies that innovate in multitenant setups. Such studies, however, overwhelmingly ignore one serious problem: direct memory accesses (DMAs) of NICs disallow page faults, forcing systems to either pin entire address spaces to physical memory and thereby hinder memory utilization, or resort to APIs that pin/unpin memory buffers before/after they are DMAed, which complicates the programming model and hampers performance. We solve this problem by designing and implementing page fault support for InfiniBand and Ethernet NICs. A main challenge we tackle, unique to NICs, is handling receive DMAs that trigger page faults, leaving the NIC without memory to store the incoming data. We demonstrate that our solution provides all the benefits associated with "regular" virtual memory, notably (1) a simpler programming model that rids users of the need to pin, and (2) the ability to employ all the canonical memory optimizations, such as memory overcommitment and demand-paging based on actual use. We show that, as a result, benchmark performance improves by up to 1.9×.

Proceedings ArticleDOI
02 Oct 2017
TL;DR: A near memory accelerator combining simple hardware building blocks to accelerate lookup in a hash table based key/value store is designed and shows 12.8×–2.9× speedup compared to conventional CPU lookup depending on workload characteristics.
Abstract: In the "Big Data" era, fast lookup of keys in a key/value store is a ubiquitous operation. We have designed a near memory accelerator combining simple hardware building blocks to accelerate lookup in a hash table based key/value store. We report on the co-design of hardware and software to accomplish fast lookup using open addressing. The accelerator implements a batch get command to look up a set of keys in a single request. Using an FPGA emulator, we evaluate the performance of a query workload under a comprehensive range of conditions such as hash table load factor (fill) and query key repeat distribution (likelihood of a key to reappear in a query workload). We emulate two memory configurations: Hybrid Memory Cube (or High Bandwidth Memory), and Storage Class Memory. Our design shows 12.8X - 2.9X speedup compared to conventional CPU lookup depending on workload characteristics.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: CP-ORAM is proposed, a Cooperative Path ORAM design, to effectively schedule the memory requests from both types of applications and to achieve 20% performance improvement on average over the baseline Path ORAM for the secure application in a four-channel server setting.
Abstract: Path ORAM (Oblivious RAM) is a recently proposed ORAM protocol for preventing information leakage from memory access sequences. It receives wide adoption due to its simplicity, practical efficiency and asymptotic efficiency. However, Path ORAM has extremely large memory bandwidth demand, leading to severe memory competition in server settings, e.g., a server may service one application that uses Path ORAM and one or multiple applications that do not. While Path ORAM synchronously and intensively uses all memory channels, the non-secure applications often exhibit low access intensity and large channel level imbalance. Traditional memory scheduling schemes lead to wasted memory bandwidth to the system and large performance degradation to both types of applications. In this paper, we propose CP-ORAM, a Cooperative Path ORAM design, to effectively schedule the memory requests from both types of applications. CP-ORAM consists of three schemes: P-Path, R-Path, and W-Path. P-Path assigns and enforces scheduling priority for effective memory bandwidth sharing. R-Path maximizes bandwidth utilization by proactively scheduling read operations from the next Path ORAM access. W-Path mitigates contention on busy memory channels with write redirection. We evaluate CP-ORAM and compare it to the state-of-the-art. Our results show that CP-ORAM helps to achieve 20% performance improvement on average over the baseline Path ORAM for the secure application in a four-channel server setting.

Journal ArticleDOI
TL;DR: The performance of GAMT is compared with centralized implementations, and it is shown that GAMT can run up to four times faster and achieve over 51 and 37 percent reductions in area and power consumption, respectively, for a given bandwidth.
Abstract: Embedded systems are increasingly based on multi-core platforms to accommodate a growing number of applications, some of which have real-time requirements. Resources, such as off-chip DRAM, are typically shared between the applications using memory interconnects with different arbitration policies to cater to diverse bandwidth and latency requirements. However, traditional centralized interconnects are not scalable as the number of clients increases. Similarly, current distributed interconnects either cannot satisfy the diverse requirements or have decoupled arbitration stages, resulting in larger area, power and worst-case latency. The four main contributions of this article are: 1) a Globally Arbitrated Memory Tree (GAMT) with a distributed architecture that scales well with the number of cores, 2) an RTL-level implementation that can be configured with five arbitration policies (three distinct and two as special cases), 3) the concept of mixed arbitration policies that allows the policy to be selected individually per core, and 4) a worst-case analysis for a mixed arbitration policy that combines TDM and FBSP arbitration. We compare the performance of GAMT with centralized implementations and show that it can run up to four times faster and achieve over 51 and 37 percent reductions in area and power consumption, respectively, for a given bandwidth.

Proceedings ArticleDOI
04 Apr 2017
TL;DR: Mallacc is proposed, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators, which accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage.
Abstract: Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm² of silicon area, less than 0.006% of a typical high-performance processor core.
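
The first of the three accelerated operations, size class computation, is easy to show in software: map a requested size to the smallest size class that covers it. The sketch below does this with a toy class table in the style of tcmalloc-like allocators; real allocators use many more, carefully tuned classes and a faster lookup than this linear scan.

```c
/* A software sketch of the first operation Mallacc accelerates: mapping a
 * requested allocation size to a size class (and its rounded-up size).
 * The class table here is a toy chosen for illustration. */
#include <stddef.h>
#include <stdio.h>

static const size_t class_size[] = { 8, 16, 32, 48, 64, 96, 128, 256 };
#define NUM_CLASSES (sizeof class_size / sizeof class_size[0])

/* Return the smallest class whose size covers the request, or -1 if too big. */
static int size_to_class(size_t request) {
    for (unsigned c = 0; c < NUM_CLASSES; c++)
        if (request <= class_size[c]) return (int)c;
    return -1;                       /* falls through to a large-object path */
}

int main(void) {
    size_t asks[] = { 1, 24, 100, 300 };
    for (unsigned i = 0; i < sizeof asks / sizeof asks[0]; i++) {
        int c = size_to_class(asks[i]);
        if (c >= 0)
            printf("request %3zu -> class %d (%zu bytes)\n", asks[i], c, class_size[c]);
        else
            printf("request %3zu -> large object\n", asks[i]);
    }
    return 0;
}
```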

Proceedings ArticleDOI
24 Jul 2017
TL;DR: A novel Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) array design is proposed that could simultaneously work as non-volatile memory and implement reconfigurable in-memory logic (AND, OR) without add-on logic circuits on the memory chip, as required in traditional logic-in-memory designs.
Abstract: In this paper, we propose a novel Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) array design that could simultaneously work as non-volatile memory and implement reconfigurable in-memory logic (AND, OR) without add-on logic circuits on the memory chip, as required in traditional logic-in-memory designs. The computed logic output could be simply read out like a normal MRAM bit-cell using the shared memory peripheral circuits. Such intrinsic in-memory logic could be used to process data within memory to greatly reduce power-hungry and long distance data communication in conventional von Neumann computing systems. We further employ in-memory data encryption using the Advanced Encryption Standard (AES) algorithm as a case study to demonstrate the efficiency of the proposed design. The device-to-architecture co-simulation results show that the proposed design can achieve 70.15% and 80.87% lower energy consumption compared to CMOS-ASIC and CMOL-AES implementations, respectively. It offers almost the same energy consumption as a recent DW-AES implementation, but with 60.65% less area overhead.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work designs a software interface that programmers can use to identify data structures that are resilient to approximations and proposes a runtime quality control framework that automatically determines the error constraints for the identified data structures such that a given target application-level quality is maintained.
Abstract: Memory subsystems are a major energy bottleneck in computing platforms due to frequent transfers between processors and off-chip memory. We propose approximate memory compression, a technique that leverages the intrinsic resilience of emerging workloads such as machine learning and data analytics to reduce off-chip memory traffic and energy. To realize approximate memory compression, we enhance the memory controller to be aware of memory regions that contain approximation-resilient data, and to transparently compress/decompress the data written to/read from these regions. To provide control over approximations, the quality-aware memory controller conforms to a specified error constraint for each approximate memory region. We design a software interface that programmers can use to identify data structures that are resilient to approximations. We also propose a runtime quality control framework that automatically determines the error constraints for the identified data structures such that a given target application-level quality is maintained. We evaluate our proposal by implementing a hardware prototype using the Intel UniPHY-DDR3 memory controller and NIOS-II processor, a Hynix DDR3 DRAM module, and a Stratix-IV FPGA development board. Across a suite of 8 machine learning benchmarks, approximate memory compression obtains a 1.28× benefit in DRAM energy and a simultaneous 11.5% improvement in execution time for a small (< 1.5%) loss in output quality.
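
One simple transform in the spirit of approximate memory compression is to drop low-order mantissa bits of approximation-resilient floats as long as the worst-case relative error stays under the region's constraint, which leaves long runs of zeros for a compressor to exploit. The sketch below shows that transform only; the paper's actual compression scheme, hardware, and quality-control loop are not reproduced.

```c
/* A toy transform in the spirit of approximate memory compression: for data
 * marked approximation-resilient, clear low-order mantissa bits of each
 * float as long as the worst-case relative error stays under the region's
 * error constraint; the resulting repeated zero bits make the block far more
 * compressible. The constraint value and helper names are illustrative. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Largest number of mantissa bits (out of 23) we may drop so that the
 * worst-case relative error of roughly 2^(k-23) stays below max_rel_err. */
static int bits_to_drop(double max_rel_err) {
    int k = (int)floor(log2(max_rel_err) + 23.0);
    if (k < 0) k = 0;
    if (k > 23) k = 23;
    return k;
}

static float truncate_float(float v, int k) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits &= ~((1u << k) - 1);        /* zero the k lowest mantissa bits */
    memcpy(&v, &bits, sizeof v);
    return v;
}

int main(void) {
    double constraint = 0.001;                       /* 0.1% relative error */
    int k = bits_to_drop(constraint);
    float data[4] = { 3.14159f, -0.007812f, 1234.56f, 0.333333f };
    printf("dropping %d mantissa bits per value\n", k);
    for (int i = 0; i < 4; i++) {
        float t = truncate_float(data[i], k);
        printf("%12.6f -> %12.6f (rel err %.2e)\n",
               data[i], t, fabsf((t - data[i]) / data[i]));
    }
    return 0;
}
```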

Patent
Kang Kyu-Chang, Yang Hui-Kap
27 Jul 2017
TL;DR: In this article, the row selection circuit performs an access operation with respect to the memory bank and a hammer refresh operation on a row that is physically adjacent to a row accessed intensively.
Abstract: A memory device includes a memory bank, a row selection circuit and a refresh controller. The memory bank includes a plurality of memory blocks, and each memory block includes a plurality of memory cells arranged in rows and columns. The row selection circuit performs an access operation with respect to the memory bank and a hammer refresh operation with respect to a row that is physically adjacent to a row that is accessed intensively. The refresh controller controls the row selection circuit such that the hammer refresh operation is performed during a row active time for the access operation. The hammer refresh operation may be performed efficiently and performance of the memory device may be enhanced by performing the hammer refresh operation during the row active time for the access operation.
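
Behaviorally, the mechanism amounts to spotting rows that are activated unusually often and refreshing their physical neighbors. The sketch below models that with per-row counters and a fixed threshold, both of which are illustrative; the patent's specific point of squeezing the extra refresh into the row active time of the normal access is not modeled.

```c
/* A behavioral sketch of the hammer-refresh idea in the patent abstract:
 * count row activations, and once a row looks like a row-hammer aggressor,
 * refresh its physically adjacent rows. Counters and threshold are
 * illustrative stand-ins for whatever the real device implements. */
#include <stdio.h>

#define NUM_ROWS 16
#define HAMMER_THRESHOLD 5

static unsigned act_count[NUM_ROWS];

static void refresh_row(int row) {
    if (row >= 0 && row < NUM_ROWS)
        printf("  hammer refresh of adjacent row %d\n", row);
}

static void activate(int row) {
    printf("activate row %d\n", row);
    if (++act_count[row] >= HAMMER_THRESHOLD) {
        refresh_row(row - 1);       /* physical neighbors of the aggressor */
        refresh_row(row + 1);
        act_count[row] = 0;         /* counter reset after mitigation      */
    }
}

int main(void) {
    for (int i = 0; i < 6; i++) activate(7);   /* hammer row 7 */
    activate(3);
    return 0;
}
```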

Journal ArticleDOI
TL;DR: A novel approach to schedule memory requests in Mixed Criticality Systems by enabling the MCS designer to specify memory requirements per task is proposed, and a compact time-division-multiplexing scheduler and framework that constructs optimal schedules to manage requests to off-chip memory are introduced.
Abstract: We propose a novel approach to schedule memory requests in Mixed Criticality Systems (MCS). This approach supports an arbitrary number of criticality levels by enabling the MCS designer to specify memory requirements per task. It retains locality within large-size requests to satisfy memory requirements of all tasks. To achieve this target, we introduce a compact time-division-multiplexing scheduler, and a framework that constructs optimal schedules to manage requests to off-chip memory. We also present a static analysis that guarantees meeting requirements of all tasks. We compare the proposed controller against state-of-the-art memory controllers using both a case study and synthetic experiments.
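
The building block here is a time-division-multiplexing slot table: each requestor owns fixed slots, so its worst-case waiting time follows from the table alone. The sketch below walks such a table; the table contents, the number of requestors, and the non-work-conserving idle-slot policy are illustrative choices, not the paper's constructed schedules.

```c
/* A minimal sketch of time-division-multiplexed (TDM) scheduling of memory
 * requestors, the kind of slot-table arbitration the abstract's compact
 * scheduler is built around. Table contents and requestor counts are toys. */
#include <stdio.h>

#define SLOTS 6
#define REQUESTORS 3

static const int slot_table[SLOTS] = { 0, 1, 0, 2, 0, 1 };  /* slot -> owner   */
static int pending[REQUESTORS]     = { 4, 2, 1 };           /* queued requests */

int main(void) {
    for (int t = 0; t < 12; t++) {                 /* walk the table twice */
        int owner = slot_table[t % SLOTS];
        if (pending[owner] > 0) {
            pending[owner]--;
            printf("slot %2d: serve requestor %d\n", t, owner);
        } else {
            printf("slot %2d: owner %d idle, slot wasted\n", t, owner);
        }
    }
    return 0;
}
```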

Journal ArticleDOI
TL;DR: The authors summarize the latest research progress of phase change memory, spin-transfer torque random access memory, and resistive random access memory in device engineering, circuit design, computer architecture, and application.
Abstract: Editor’s note: Phase change memory, spin-transfer torque random access memory, and resistive random access memory are three major emerging memory technologies that receive tremendous attention from both academia and industry. In this survey article, the authors summarize the latest research progress of these technologies in device engineering, circuit design, computer architecture, and application. —Tei-Wei Kuo, National Taiwan University

Journal ArticleDOI
TL;DR: A hierarchical design of 4R1W memory is introduced that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1w module and can achieve higher clock frequencies by alleviating the complex routing in an FPGA.
Abstract: The utilization of block RAMs (BRAMs) is a critical performance factor for multiported memory designs on field-programmable gate arrays (FPGAs). Not only does the excessive demand on BRAMs block the usage of BRAMs from other parts of a design, but the complex routing between BRAMs and logic also limits the operating frequency. This paper first introduces a brand new perspective and a more efficient way of using a conventional two reads one write (2R1W) memory as a 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1W module. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with previous xor-based and live value table-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth. For complex multiported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs and enhance the operating frequency by 20%.