
Showing papers on "Registered memory" published in 2017


Proceedings ArticleDOI
14 Oct 2017
TL;DR: DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, is proposed to provide both powerful computing capability and large memory capacity/bandwidth to address the memory wall problem in traditional von Neumann architecture.
Abstract: Data movement between the processing units and the memory in traditional von Neumann architecture is creating the “memory wall” problem. To bridge the gap, two approaches, the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory), have been studied. However, the first one has strong computing capability but limited memory capacity/bandwidth, whereas the second one is the exact opposite. To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions with the combination of the functionally complete Boolean logic operations and the proposed hierarchical internal data movement designs. We further optimize DRISA to achieve high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations.

CCS Concepts: • Hardware → Dynamic memory; • Computer systems organization → Reconfigurable computing; Neural networks.
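
The abstract's claim that bitline NOR is functionally complete is easy to see on the host side. The sketch below (plain C on 64-bit words standing in for bitline vectors, not DRISA's actual in-DRAM microcode) builds NOT, OR, AND, and XOR out of nothing but NOR.

```c
/* Illustration only: composing other Boolean ops from bitwise NOR,
 * the functionally complete primitive the DRISA abstract describes.
 * Operates on 64-bit words standing in for DRAM bitline vectors. */
#include <stdint.h>
#include <stdio.h>

static uint64_t nor(uint64_t a, uint64_t b)  { return ~(a | b); }

static uint64_t not_(uint64_t a)             { return nor(a, a); }
static uint64_t or_(uint64_t a, uint64_t b)  { return not_(nor(a, b)); }
static uint64_t and_(uint64_t a, uint64_t b) { return nor(not_(a), not_(b)); }
static uint64_t xor_(uint64_t a, uint64_t b) { return or_(and_(a, not_(b)), and_(not_(a), b)); }

int main(void) {
    uint64_t a = 0xF0F0F0F0F0F0F0F0ULL, b = 0xFF00FF00FF00FF00ULL;
    printf("AND %016llx\n", (unsigned long long)and_(a, b));
    printf("OR  %016llx\n", (unsigned long long)or_(a, b));
    printf("XOR %016llx\n", (unsigned long long)xor_(a, b));
    return 0;
}
```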

315 citations


Journal ArticleDOI
TL;DR: A tool is designed that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies, and a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels is introduced.
Abstract: Historically, server designers have opted for simple memory systems by picking one of a few commoditized DDR memory products. We are already witnessing a major upheaval in the off-chip memory hierarchy, with the introduction of many new memory products—buffer-on-board, LRDIMM, HMC, HBM, and NVMs, to name a few. Given the plethora of choices, it is expected that different vendors will adopt different strategies for their high-capacity memory systems, often deviating from DDR standards and/or integrating new functionality within memory systems. These strategies will likely differ in their choice of interconnect and topology, with a significant fraction of memory energy being dissipated in I/O and data movement. To make the case for memory interconnect specialization, this paper makes three contributions. First, we design a tool that carefully models I/O power in the memory system, explores the design space, and gives the user the ability to define new types of memory interconnects/topologies. The tool is validated against SPICE models, and is integrated into version 7 of the popular CACTI package. Our analysis with the tool shows that several design parameters have a significant impact on I/O power. We then use the tool to help craft novel specialized memory system channels. We introduce a new relay-on-board chip that partitions a DDR channel into multiple cascaded channels. We show that this simple change to the channel topology can improve performance by 22% for DDR DRAM and lower cost by up to 65% for DDR DRAM. This new architecture does not require any changes to DIMMs, and it efficiently supports hybrid DRAM/NVM systems. Finally, as an example of a more disruptive architecture, we design a custom DIMM and parallel bus that moves away from the DDR3/DDR4 standards. To reduce energy and improve performance, the baseline data channel is split into three narrow parallel channels and the on-DIMM interconnects are operated at a lower frequency. In addition, this allows us to design a two-tier error protection strategy that reduces data transfers on the interconnect. This architecture yields a performance improvement of 18% and a memory power reduction of 23%. The cascaded channel and narrow channel architectures serve as case studies for the new tool and show the potential for benefit from re-organizing basic memory interconnects.

217 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: DUDETM is presented, a crash-consistent durable transaction system that avoids the drawbacks of both undo logging and redo logging and can be implemented with existing hardware TMs with minor hardware modifications, leading to a further 1.7× speedup.
Abstract: Emerging non-volatile memory (NVM) offers non-volatility, byte-addressability and fast access at the same time. To make the best use of these properties, it has been shown by empirical evidence that programs should access NVM directly through CPU load and store instructions, so that the overhead of a traditional file system or database can be avoided. Thus, durable transactions become a common choice of applications for accessing persistent memory data in a crash consistent manner. However, existing durable transaction systems employ either undo logging, which requires a fence for every memory write, or redo logging, which requires intercepting all memory reads within transactions. This paper presents DUDETM, a crash-consistent durable transaction system that avoids the drawbacks of both undo logging and redo logging. DUDETM uses shadow DRAM to decouple the execution of a durable transaction into three fully asynchronous steps. The advantage is that only minimal fences and no memory read instrumentation are required. This design also enables an out-of-the-box transactional memory (TM) to be used as an independent component in our system. The evaluation results show that DUDETM adds durability to a TM system with only 7.4%–24.6% throughput degradation. Compared to the existing durable transaction systems, DUDETM provides 1.7× to 4.4× higher throughput. Moreover, DUDETM can be implemented with existing hardware TMs with minor hardware modifications, leading to a further 1.7× speedup.
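
The abstract describes decoupling a durable transaction into three fully asynchronous steps via a DRAM shadow, but does not spell them out. A minimal single-threaded sketch of one plausible decomposition (run the transaction against the DRAM shadow while building a redo log, persist the log, then replay it into the persistent image) follows; the arrays, names, and the absence of real flushes and fences are all simplifications.

```c
/* A minimal, single-threaded sketch of a redo-log-based durable transaction
 * decoupled into three steps, loosely following the DUDETM abstract:
 * (1) execute in a DRAM shadow while appending a redo log,
 * (2) persist the log, (3) replay the log into the persistent image.
 * NVM, fences, and asynchrony are all simulated; names are illustrative. */
#include <stdio.h>
#include <string.h>

#define MEM_WORDS 16
#define LOG_CAP   64

static long nvm_image[MEM_WORDS];   /* stands in for persistent memory */
static long dram_shadow[MEM_WORDS]; /* stands in for the DRAM shadow   */

struct log_entry { int addr; long val; };
static struct log_entry redo_log[LOG_CAP];
static int log_len = 0;

/* Step 1: run the transaction against the shadow only. */
static void tx_write(int addr, long val) {
    dram_shadow[addr] = val;
    redo_log[log_len].addr = addr;
    redo_log[log_len].val  = val;
    log_len++;
}

/* Step 2: persist the redo log (a real system would flush + fence here). */
static void persist_log(void) {
    /* placeholder: the log array itself serves as our "persistent" log */
}

/* Step 3: replay the persisted log into the persistent image. */
static void reproduce(void) {
    for (int i = 0; i < log_len; i++)
        nvm_image[redo_log[i].addr] = redo_log[i].val;
    log_len = 0;
}

int main(void) {
    memcpy(dram_shadow, nvm_image, sizeof nvm_image);
    tx_write(3, 42);
    tx_write(7, 99);
    persist_log();
    reproduce();
    printf("nvm[3]=%ld nvm[7]=%ld\n", nvm_image[3], nvm_image[7]);
    return 0;
}
```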

179 citations


Journal ArticleDOI
07 Aug 2017
TL;DR: With a combination of high performance and nonvolatility, the arrival of 3D XPoint memory promises to fundamentally change the memory-storage hierarchy at the hardware, system software, and application levels.
Abstract: With a combination of high performance and nonvolatility, the arrival of 3D XPoint memory promises to fundamentally change the memory-storage hierarchy at the hardware, system software, and application levels. This memory will be deployed first as a block addressable storage device, known as the Intel Optane SSD, and even in this familiar form it will drive basic system change. Access times consistently as fast, or faster, than the rest of the system will blur the line between storage and memory. The low latencies from these solid-state drives (SSDs) allow rethinking even basic storage methodologies to be more memory-like. For example, the manner in which storage performance is measured shifts from input–output operations (IOs) at a given queue depth to response time for a given load, like memory is typically measured. System changes to match the low latency of these SSDs are already advanced, and in many cases they enable the application to utilize the SSD’s performance. In other cases, additional work is required, particularly on policies set originally with slow storage in mind. On top of these already-capable systems are real applications. System-level tests show that applications such as key–value stores and real-time analytics can benefit immediately. These application benefits include significantly faster runtime (up to 3×) and access to larger data sets than supported in DRAM. Newly viable mechanisms for expanding application memory footprint include native application support or native operating system paging, a significant change in the use of SSDs. The next step in this convergence is 3D XPoint memory accessed through processor load/store operations. Significant operating system support is already in place. The implications of consistently low latency storage and fast persistent memory on computing are great, with applications and systems that take advantage of this new technology as storage being the first to benefit.

179 citations


Journal ArticleDOI
TL;DR: A novel multidimensional dynamic programming data allocation (MDPDA) algorithm is proposed to strategically allocate data blocks to each memory, and experimental results show that MDPDA can efficiently reduce the memory access cost and extend the lifetime of MRAM.
Abstract: Resource scheduling is one of the most important issues in mobile cloud computing due to the constraints in memory, CPU, and bandwidth. High energy consumption and low performance of memory accesses have become overwhelming obstacles for chip multiprocessor (CMP) systems used in cloud systems. In order to address the daunting “memory wall” problem, hybrid on-chip memory architecture has been widely investigated recently. Due to its advantages in size, real-time predictability, power, and software controllability, scratchpad memory (SPM) is a promising technique to replace the hardware cache and bridge the processor–memory gap for CMP systems. In this paper, we present a novel hybrid on-chip SPM that consists of a static random access memory (RAM), a magnetic RAM (MRAM), and a zero-capacitor RAM for CMP systems by fully taking advantage of the benefits of each type of memory. To reduce memory access latency, energy consumption, and the number of write operations to MRAM, we also propose a novel multidimensional dynamic programming data allocation (MDPDA) algorithm to strategically allocate data blocks to each memory. Experimental results show that the proposed MDPDA algorithm can efficiently reduce the memory access cost and extend the lifetime of MRAM.
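
The MDPDA algorithm itself is not given in the abstract; the toy below only illustrates the general shape of a dynamic-programming data allocation: each block goes to SRAM, MRAM, or ZRAM, capacities are enforced in the DP state, and the recurrence picks the placement with minimum total access cost. All costs, capacities, and the cost model are invented for the example.

```c
/* A toy dynamic-programming allocator in the spirit of the MDPDA idea:
 * assign each data block to SRAM, MRAM, or ZRAM to minimize total access
 * cost under per-memory capacity limits. Costs and capacities are made up
 * for illustration; the real algorithm and cost model are in the paper. */
#include <stdio.h>

#define NBLK 5
#define SRAM_CAP 2
#define MRAM_CAP 2

int main(void) {
    /* cost[b][m]: access cost of block b if placed in memory m
     * (m = 0 SRAM, 1 MRAM, 2 ZRAM); e.g. a write-heavy block is
     * made expensive in MRAM because writes wear it out. */
    int cost[NBLK][3] = {
        { 2, 9, 5 }, { 1, 3, 4 }, { 2, 8, 6 }, { 1, 2, 3 }, { 3, 7, 5 }
    };
    static int dp[NBLK + 1][SRAM_CAP + 1][MRAM_CAP + 1];

    for (int s = 0; s <= SRAM_CAP; s++)          /* base case: no blocks left */
        for (int m = 0; m <= MRAM_CAP; m++)
            dp[NBLK][s][m] = 0;

    for (int b = NBLK - 1; b >= 0; b--)
        for (int s = 0; s <= SRAM_CAP; s++)
            for (int m = 0; m <= MRAM_CAP; m++) {
                int best = cost[b][2] + dp[b + 1][s][m];       /* ZRAM */
                if (s < SRAM_CAP) {
                    int c = cost[b][0] + dp[b + 1][s + 1][m];  /* SRAM */
                    if (c < best) best = c;
                }
                if (m < MRAM_CAP) {
                    int c = cost[b][1] + dp[b + 1][s][m + 1];  /* MRAM */
                    if (c < best) best = c;
                }
                dp[b][s][m] = best;
            }

    printf("minimum total access cost: %d\n", dp[0][0][0]);
    return 0;
}
```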

142 citations


Proceedings ArticleDOI
18 Jun 2017
TL;DR: This paper proposes an ultra-efficient approximate processing in-memory architecture, called APIM, which exploits the analog characteristics of non-volatile memories to support addition and multiplication inside the crossbar memory, while storing the data.
Abstract: Recent years have witnessed a rapid growth in the domain of Internet of Things (IoT). This network of billions of devices generates and exchanges huge amounts of data. The limited cache capacity and memory bandwidth make transferring and processing such data on traditional CPUs and GPUs highly inefficient, both in terms of energy consumption and delay. However, many IoT applications are statistical at heart and can tolerate some inaccuracy in their computation. This enables designers to reduce the complexity of processing by approximating the results to a desired accuracy. In this paper, we propose an ultra-efficient approximate processing in-memory architecture, called APIM, which exploits the analog characteristics of non-volatile memories to support addition and multiplication inside the crossbar memory, while storing the data. The proposed design eliminates the overhead involved in transferring data to the processor by virtually bringing the processor inside memory. APIM dynamically configures the precision of computation for each application in order to tune the level of accuracy during runtime. Our experimental evaluation running six general OpenCL applications shows that the proposed design achieves up to 20× performance improvement and provides 480× improvement in energy-delay product, ensuring acceptable quality of service. In exact mode, it achieves 28× energy savings and 4.8× speedup compared to the state-of-the-art GPU cores.

79 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: UH-MEM as discussed by the authors is a page management mechanism for various hybrid memories that systematically estimates the utility of migrating a page between different memory types, and uses this information to guide data placement.
Abstract: While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance. In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.
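
As a rough illustration of the utility estimate the abstract describes, the sketch below combines the three per-page factors (access frequency, row buffer locality, and memory-level parallelism) into a migration benefit and scales it by an application weight to approximate the system-level benefit. The formula and constants are illustrative stand-ins, not UH-MEM's actual model.

```c
/* A simplified utility estimate in the spirit of UH-MEM: the benefit of
 * migrating a page to fast memory grows with its access count and with the
 * latency gap, shrinks when accesses hit the row buffer (both memories then
 * behave similarly) or overlap with other misses (high MLP), and is finally
 * scaled by how much the application contributes to system performance.
 * All constants and the exact formula here are illustrative, not the paper's. */
#include <stdio.h>

struct page_stats {
    double accesses;        /* accesses to the page in the last interval     */
    double row_hit_rate;    /* fraction of accesses that hit the row buffer  */
    double mlp;             /* average number of overlapping misses (>= 1)   */
};

static double migration_utility(struct page_stats p,
                                double slow_miss_lat, double fast_miss_lat,
                                double row_hit_lat, double app_weight)
{
    double slow = p.row_hit_rate * row_hit_lat + (1 - p.row_hit_rate) * slow_miss_lat;
    double fast = p.row_hit_rate * row_hit_lat + (1 - p.row_hit_rate) * fast_miss_lat;
    double stall_saved = p.accesses * (slow - fast) / p.mlp; /* overlap hides latency */
    return app_weight * stall_saved;                         /* system-level benefit  */
}

int main(void) {
    struct page_stats hot  = { 500, 0.2, 1.5 };
    struct page_stats cold = {  20, 0.8, 4.0 };
    printf("hot page utility:  %.1f\n", migration_utility(hot,  300, 100, 40, 0.7));
    printf("cold page utility: %.1f\n", migration_utility(cold, 300, 100, 40, 0.7));
    return 0;
}
```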

78 citations


Book ChapterDOI
TL;DR: RowClone, a mechanism that exploits DRAM technology to perform bulk copy and initialization operations completely inside main memory, and a complementary work that uses DRAM to perform bulk bitwise AND and OR operations inside main memory significantly improve the performance and energy efficiency of the respective operations.
Abstract: In existing systems, the off-chip memory interface allows the memory controller to perform only read or write operations. Therefore, to perform any operation, the processor must first read the source data and then write the result back to memory after performing the operation. This approach incurs high latency and consumes significant bandwidth and energy for operations that work on a large amount of data. Several works have proposed techniques to process data near memory by adding a small amount of compute logic closer to the main memory chips. In this chapter, we describe two techniques proposed by recent works that take this approach of processing in memory further by exploiting the underlying operation of the main memory technology to perform more complex tasks. First, we describe RowClone, a mechanism that exploits DRAM technology to perform bulk copy and initialization operations completely inside main memory. We then describe a complementary work that uses DRAM to perform bulk bitwise AND and OR operations inside main memory. These two techniques significantly improve the performance and energy efficiency of the respective operations.

75 citations


Journal ArticleDOI
Wang Kang, Haotian Wang, Zhaohao Wang, Youguang Zhang, Weisheng Zhao
12 May 2017
TL;DR: A cost-efficient IMP/NMP solution in spin-transfer torque magnetic random access memory (STT–MRAM) without adding any processing units on the memory chip is proposed.
Abstract: In the current big data era, the memory wall issue between the processor and the memory becomes one of the most critical bottlenecks for the conventional von Neumann computer architecture. In-memory processing (IMP) or near-memory processing (NMP) paradigms have been proposed to address this problem by adding a small amount of processing units inside/near the memory. Unfortunately, although intensively studied, prior IMP/NMP platforms are practically unsuccessful because of the fabrication complexity and cost inefficiency of integrating the processing units and memory on the same chip. Recently, emerging nonvolatile memories provide new possibilities for efficiently implementing the IMP/NMP paradigm. In this paper, we propose a cost-efficient IMP/NMP solution in spin-transfer torque magnetic random access memory (STT–MRAM) without adding any processing units on the memory chip. The key idea behind the proposed IMP/NMP solution is to exploit the peripheral circuitry already existing inside memory (or with minimal changes) to perform bitwise logic operations. Such an IMP/NMP platform enables rather fast logic operations, as the logic results can be obtained immediately through just a memory-like readout operation. Memory read and logic NOT, AND/NAND, and OR/NOR operations can be achieved and dynamically configured within the same STT–MRAM chip. Functionality and performance are evaluated with hybrid simulations under the 40 nm technology node.

67 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: TEMPO as mentioned in this paper is a low-overhead hardware mechanism to boost memory performance by exploiting the operating system's (OS) virtual memory subsystem, prefetching into DRAM row buffers and on-chip caches the data that page table entries point to.
Abstract: We propose translation-enabled memory prefetching optimizations or TEMPO, a low-overhead hardware mechanism to boost memory performance by exploiting the operating system's (OS) virtual memory subsystem. We are the first to make the following observations: (1) a substantial fraction (20-40%) of DRAM references in modern big-data workloads are devoted to accessing page tables; and (2) when memory references require page table lookups in DRAM, the vast majority of them (98%+) also look up DRAM for the subsequent data access. TEMPO exploits these observations to enable DRAM row-buffer and on-chip cache prefetching of the data that page tables point to. TEMPO requires trivial changes to the memory controller (under 3% additional area), no OS or application changes, and improves performance by 10-30% and energy by 1-14%.
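
The core trigger is simple to express in software: when a page-table-entry fetch goes to DRAM, the controller already holds the frame number the PTE names and can prefetch the data line that the faulting virtual address will touch. The sketch below models just that decision; the PTE layout, 64 B line granularity, and function names are simplified assumptions, not TEMPO's hardware.

```c
/* A schematic sketch of the TEMPO idea: when the memory controller services
 * a page-table-entry (PTE) fetch from DRAM, it already sees the physical
 * frame that the PTE names, so it can prefetch the data line that the
 * original virtual address will touch. Structures and field layouts are
 * simplified stand-ins, not the paper's hardware. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/* Pretend prefetch hook: open the DRAM row / fill the cache for this line. */
static void prefetch_line(uint64_t phys_addr) {
    printf("prefetch data line at 0x%llx\n", (unsigned long long)phys_addr);
}

/* Called when a PTE fetch misses to DRAM during a page walk. */
static void on_pte_fetch_from_dram(uint64_t pte, uint64_t faulting_vaddr) {
    uint64_t frame  = pte >> PAGE_SHIFT;                      /* frame number in PTE */
    uint64_t offset = faulting_vaddr & ((1ULL << PAGE_SHIFT) - 1);
    prefetch_line((frame << PAGE_SHIFT) | (offset & ~63ULL)); /* 64 B line */
}

int main(void) {
    uint64_t pte = (0x1a2b3ULL << PAGE_SHIFT) | 0x1;          /* frame + present bit */
    on_pte_fetch_from_dram(pte, 0x7f0000123a48ULL);
    return 0;
}
```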

58 citations


Proceedings ArticleDOI
12 Nov 2017
TL;DR: In this article, a lightweight runtime solution is presented that automatically and transparently manages data placement on HMS, without requiring hardware modifications or disruptive changes to applications, to bridge the performance gap between NVM and DRAM.
Abstract: Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace DRAM as main memory. However, because of the relatively high latency and low bandwidth of NVM, NVM is often paired with DRAM to build a heterogeneous memory system (HMS). As a result, data objects of the application must be carefully placed in NVM and DRAM for best performance. In this paper, we introduce a lightweight runtime solution that automatically and transparently manages data placement on HMS without requiring hardware modifications or disruptive changes to applications. Leveraging online profiling and performance models, the runtime characterizes memory access patterns associated with data objects, and minimizes unnecessary data movement. Our runtime solution effectively bridges the performance gap between NVM and DRAM. We demonstrate that using NVM to replace the majority of DRAM can be a feasible solution for future HPC systems with the assistance of software-based data management.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: It is demonstrated that smart memory, memory with compute capability and a packetized interface, can dramatically simplify this problem and achieve one to two orders of magnitude lower overheads for performance, space, energy, and memory bandwidth, compared to prior solutions.
Abstract: A practically feasible low-overhead hardware design that provides strong defenses against memory bus side channels remains elusive. This paper observes that smart memory, memory with compute capability and a packetized interface, can dramatically simplify this problem. InvisiMem expands the trust base to include the logic layer in the smart memory to implement cryptographic primitives, which aid in addressing several memory bus side channel vulnerabilities efficiently. This allows the secure host processor to send encrypted addresses over the untrusted memory bus, and thereby eliminates the need for expensive address obfuscation techniques based on Oblivious RAM (ORAM). In addition, smart memory enables efficient solutions for ensuring freshness without using expensive Merkle trees, and mitigates the memory bus timing channel using constant heart-beat packets. We demonstrate that InvisiMem designs have one to two orders of magnitude lower overheads for performance, space, energy, and memory bandwidth, compared to prior solutions.

Journal ArticleDOI
TL;DR: A dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed and results have shown that the DARP improved the memory access efficiency by 25.4%.
Abstract: The increasing demand on main memory capacity is one of the main big data challenges. Dynamic random access memory (DRAM) does not represent the best choice for a main memory, due to high power consumption and low density. However, nonvolatile memory, such as phase-change memory (PCM), represents an additional choice because of its low power consumption and high-density characteristics. Nevertheless, high access latency and limited write endurance currently prevent PCM from replacing DRAM. Therefore, a hybrid memory, which combines both DRAM and PCM, has become a good alternative to the traditional DRAM memory. The disadvantages of both DRAM and PCM are challenges for the hybrid memory. In this paper, a dynamic adaptive replacement policy (DARP) in the shared last-level cache for the DRAM/PCM hybrid main memory is proposed. The DARP distinguishes the cache data into the PCM data and the DRAM data; then, the algorithm adopts different replacement policies for each data type. Specifically, for the PCM data, the least recently used (LRU) replacement policy is adopted, and for the DRAM data, the DARP is employed according to the process behavior. Experimental results have shown that the DARP improved the memory access efficiency by 25.4%.
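
The abstract only states that PCM-backed and DRAM-backed cache data are treated with different replacement policies. The sketch below shows one plausible simplification of such a hybrid-aware victim choice (prefer evicting DRAM-backed lines, fall back to LRU among PCM-backed lines), which is not necessarily the DARP policy itself.

```c
/* A hedged sketch of a hybrid-memory-aware LLC victim choice in the spirit
 * of DARP: lines backed by PCM and lines backed by DRAM are treated
 * differently, because evicting a dirty PCM-backed line costs a slow,
 * endurance-limited PCM write while a DRAM writeback is cheap. The exact
 * policy in the paper differs; this is one plausible simplification. */
#include <stdio.h>

enum backing { DRAM_DATA, PCM_DATA };

struct line {
    int valid, dirty;
    enum backing mem;
    unsigned lru_age;     /* larger = older */
};

/* Pick a victim in one set: prefer the oldest DRAM-backed line; fall back to
 * LRU among PCM-backed lines only if no DRAM-backed candidate exists. */
static int pick_victim(struct line set[], int ways) {
    int victim = -1;
    for (int pass = 0; pass < 2 && victim < 0; pass++) {
        enum backing want = (pass == 0) ? DRAM_DATA : PCM_DATA;
        unsigned best_age = 0;
        for (int w = 0; w < ways; w++)
            if (set[w].valid && set[w].mem == want && set[w].lru_age >= best_age) {
                best_age = set[w].lru_age;
                victim = w;
            }
    }
    return victim;
}

int main(void) {
    struct line set[4] = {
        {1, 1, PCM_DATA, 9}, {1, 0, DRAM_DATA, 3},
        {1, 1, DRAM_DATA, 7}, {1, 0, PCM_DATA, 5},
    };
    printf("victim way: %d\n", pick_victim(set, 4)); /* expect way 2 */
    return 0;
}
```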

Journal ArticleDOI
TL;DR: A novel memory architecture called a resource-efficient SRAM-based TCAM (REST), which emulates TCAM functionality using optimal resources and increases the overall emulated TCAM bits/SRAM at the cost of reduced throughput.
Abstract: Static random access memory (SRAM)-based ternary content addressable memory (TCAM) offers TCAM functionality by emulating it with SRAM. However, this emulation suffers from reduced memory efficiency while mapping the TCAM table on SRAM units. This is due to the limited capacity of the physical addresses in the SRAM unit. This brief offers a novel memory architecture called a resource-efficient SRAM-based TCAM (REST), which emulates TCAM functionality using optimal resources. The SRAM unit is divided into multiple virtual blocks to store the address information presented in the TCAM table. This approach virtually increases the overall address space of the SRAM unit, mapping a greater portion of the TCAM table in SRAM and increasing the overall emulated TCAM bits/SRAM at the cost of reduced throughput. A 72 × 28-bit REST consumes only one 36-kbit SRAM and a few distributed RAMs via implementation on a Xilinx Kintex-7 field-programmable gate array. It uses only 3.5% of the memory resources compared with a conventional SRAM-based TCAM (hybrid-partitioned TCAM).
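
REST builds on the standard way SRAM emulates TCAM, which the sketch below reproduces in miniature: each stride of the search key indexes an SRAM word holding a bitvector of entries whose pattern matches that stride, and ANDing the per-stride vectors yields the matching entry. REST's virtual-block partitioning of the SRAM is not modeled; sizes and rules are toy values.

```c
/* A generic sketch of how SRAM emulates TCAM, the mechanism REST builds on:
 * the search key is cut into strides, each stride value indexes an SRAM word
 * holding a bitvector of TCAM entries whose pattern (value + don't-care mask)
 * matches that stride, and the per-stride bitvectors are ANDed to find the
 * matching entry. */
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 4
#define STRIDES 2                       /* 8-bit key as two 4-bit strides */

struct rule { uint8_t value, care; };   /* care bit 1 = must match */

static uint32_t sram[STRIDES][16];      /* bitvector of matching entries */

static void build_tables(const struct rule r[ENTRIES]) {
    for (int s = 0; s < STRIDES; s++)
        for (int v = 0; v < 16; v++) {
            uint32_t vec = 0;
            for (int e = 0; e < ENTRIES; e++) {
                uint8_t rv = (r[e].value >> (4 * s)) & 0xF;
                uint8_t rc = (r[e].care  >> (4 * s)) & 0xF;
                if (((v ^ rv) & rc) == 0) vec |= 1u << e;
            }
            sram[s][v] = vec;
        }
}

static int lookup(uint8_t key) {
    uint32_t match = ~0u;
    for (int s = 0; s < STRIDES; s++)
        match &= sram[s][(key >> (4 * s)) & 0xF];
    for (int e = 0; e < ENTRIES; e++)   /* lowest index = highest priority */
        if (match & (1u << e)) return e;
    return -1;
}

int main(void) {
    struct rule rules[ENTRIES] = {
        { 0xA5, 0xFF },   /* exact match on 0xA5 */
        { 0xA0, 0xF0 },   /* 0xAx, low nibble is don't-care */
        { 0x00, 0x00 },   /* matches everything */
        { 0x3C, 0xFF },
    };
    build_tables(rules);
    printf("lookup 0xA5 -> entry %d\n", lookup(0xA5)); /* 0 */
    printf("lookup 0xA7 -> entry %d\n", lookup(0xA7)); /* 1 */
    printf("lookup 0x11 -> entry %d\n", lookup(0x11)); /* 2 */
    return 0;
}
```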

Journal ArticleDOI
TL;DR: A cycle-accurate simulator for hybrid memory cube called CasHMC provides a cycle-by-cycle simulation of every module in an HMC and generates analysis results including a bandwidth graph and statistical data.
Abstract: 3D-stacked DRAM has been actively studied to overcome the limits of conventional DRAM. The Hybrid Memory Cube (HMC) is a type of 3D-stacked DRAM that has drawn great attention because of its usability for server systems and processing-in-memory (PIM) architecture. Since HMC is not directly stacked on the processor die where the central processing units (CPUs) and graphic processing units (GPUs) are integrated, HMC has to be linked to other processor components through high speed serial links. Therefore, the communication bandwidth and latency should be carefully estimated to evaluate the performance of HMC. However, most existing HMC simulators employ only simple HMC modeling. In this paper, we propose a cycle-accurate simulator for hybrid memory cube called CasHMC. It provides a cycle-by-cycle simulation of every module in an HMC and generates analysis results including a bandwidth graph and statistical data. Furthermore, CasHMC is implemented in C++ as a single wrapped object that includes an HMC controller, communication links, and HMC memory. Instantiating this single wrapped object facilitates simultaneous simulation in parallel with other simulators that generate memory access patterns such as a processor simulator or a memory trace generator.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: MemPod is proposed, a scalable and efficient memory management mechanism for flat address space hybrid memories that improves the average main memory access time of multi-programmed workloads, by up to 29% compared to the state of the art.
Abstract: In the near future, die-stacked DRAM will be increasingly present in conjunction with off-chip memories in hybrid memory systems. Research on this subject revolves around using the stacked memory as a cache or as part of a flat address space. This paper proposes MemPod, a scalable and efficient memory management mechanism for flat address space hybrid memories. MemPod monitors memory activity and periodically migrates the most frequently accessed memory pages to the faster on-chip memory. MemPod's partitioned architectural organization allows for efficient scaling with memory system capabilities. Further, a big data analytics algorithm is adapted to develop an efficient, low-cost activity tracking technique. MemPod improves the average main memory access time of multi-programmed workloads by up to 29% (9% on average) compared to the state of the art, and that will increase as the differential between memory speeds widens. MemPod's novel activity tracking approach leads to significant cost reduction (12,800× lower storage space requirements) and improved future prediction accuracy over prior work which maintains a separate counter per page.
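
The abstract says MemPod adapts a big-data analytics algorithm for low-cost activity tracking but does not name it here. The sketch below uses a Misra-Gries style frequent-items summary as one standard low-cost choice: a handful of counters approximate the hottest pages over an interval, and the survivors are migrated to fast memory when the interval ends. The counter count, interval handling, and migration call are illustrative.

```c
/* A sketch of interval-based hot-page tracking with a small, fixed set of
 * counters (a Misra-Gries style "frequent items" summary), followed by
 * migrating the surviving pages to fast memory at the end of the interval.
 * The specific algorithm, parameters, and migration hook are stand-ins. */
#include <stdio.h>
#include <string.h>

#define COUNTERS 4

struct slot { long page; unsigned count; int used; };
static struct slot slots[COUNTERS];

static void track_access(long page) {
    for (int i = 0; i < COUNTERS; i++)             /* already tracked?   */
        if (slots[i].used && slots[i].page == page) { slots[i].count++; return; }
    for (int i = 0; i < COUNTERS; i++)             /* free counter?      */
        if (!slots[i].used) { slots[i] = (struct slot){ page, 1, 1 }; return; }
    for (int i = 0; i < COUNTERS; i++)             /* else decrement all */
        if (--slots[i].count == 0) slots[i].used = 0;
}

static void end_of_interval_migrate(void) {
    for (int i = 0; i < COUNTERS; i++)
        if (slots[i].used)
            printf("migrate page %ld (approx count %u) to fast memory\n",
                   slots[i].page, slots[i].count);
    memset(slots, 0, sizeof slots);                /* start a new interval */
}

int main(void) {
    long trace[] = { 7, 7, 3, 7, 9, 3, 7, 1, 2, 3, 7, 4 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        track_access(trace[i]);
    end_of_interval_migrate();
    return 0;
}
```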

Journal ArticleDOI
TL;DR: This paper proposes to partially replace DRAM using PCM to optimize the management of flash memory metadata for better system reliability in the presence of power failure and system crash, and presents a write-activity-aware PCM-assisted flash memory management scheme, called PCM-FTL.
Abstract: Phase change memory (PCM) is a promising DRAM alternative because of its non-volatility, high density, low standby power and close-to-DRAM performance. These features make PCM an attractive solution to optimize the management of NAND flash memory in embedded systems. However, PCM's limited write endurance hinders its application in embedded systems. Therefore, how to manage flash memory with PCM—particularly guarantee PCM a reasonable lifetime—becomes a challenging issue. In this paper, we propose to partially replace DRAM using PCM to optimize the management of flash memory metadata for better system reliability in the presence of power failure and system crash. To prolong PCM's lifetime, we present a write-activity-aware PCM-assisted flash memory management scheme, called PCM-FTL. By differentiating sequential and random I/O behaviors, a novel two-level mapping mechanism and a customized wear-leveling scheme are developed to reduce writes to PCM and extend its lifetime. We evaluate PCM-FTL with a variety of general-purpose and mobile I/O workloads. Experimental results show that PCM-FTL can significantly reduce write activities and achieve an even distribution of writes in PCM with very low overhead.

Journal ArticleDOI
TL;DR: This paper proposes the smallest solution for soft-error tolerant embedded memory yet to be presented, based on a four-transistor dynamic memory core that internally stores complementary data values to provide an inherent per-bit error detection capability.
Abstract: The limited size and power budgets of space-bound systems often contradict the requirements for reliable circuit operation within high-radiation environments. In this paper, we propose the smallest solution for soft-error tolerant embedded memory yet to be presented. The proposed complementary dual-modular redundancy (CDMR) memory is based on a four-transistor dynamic memory core that internally stores complementary data values to provide an inherent per-bit error detection capability. By adding simple, low-overhead parity, an error-correction capability is added to the memory architecture for robust soft-error protection. The proposed memory was implemented in a 65-nm CMOS technology, displaying as much as a 3.5× smaller silicon footprint than other radiation-hardened bitcells. In addition, the CDMR memory consumes between 48% and 87% less standby power than other considered solutions across the entire operating region.
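
A software reading of the CDMR scheme in the abstract: storing each bit with its complement makes a single upset detectable per bit (the two copies agree), and one parity bit over the word indicates which copy flipped, so the word can be corrected. The model below demonstrates that logic; the actual 4T circuit and array organization are of course not captured.

```c
/* A simplified software model of the CDMR idea described in the abstract:
 * each bit is stored together with its complement, so a single upset makes
 * the two copies agree (both 0 or both 1) and is detected per bit; one
 * parity bit over the word then tells which copy to trust, giving
 * single-error correction. Circuit details in the paper are not modeled. */
#include <stdint.h>
#include <stdio.h>

struct cdmr_word { uint8_t data, comp, parity; };   /* comp should be ~data */

static uint8_t parity8(uint8_t v) { return (uint8_t)__builtin_parity(v); }

static struct cdmr_word store(uint8_t v) {
    return (struct cdmr_word){ v, (uint8_t)~v, parity8(v) };
}

static uint8_t load_corrected(struct cdmr_word w) {
    uint8_t err = (uint8_t)~(w.data ^ w.comp);      /* 1 where the copies agree */
    if (!err) return w.data;                        /* no error detected        */
    if (parity8(w.data) != w.parity)
        return w.data ^ err;                        /* data copy flipped: fix it */
    return w.data;                                  /* complement copy flipped   */
}

int main(void) {
    struct cdmr_word w = store(0x5A);
    w.data ^= 0x08;                                 /* inject an upset in data   */
    printf("corrected: 0x%02X\n", load_corrected(w));   /* 0x5A again */
    struct cdmr_word u = store(0x5A);
    u.comp ^= 0x20;                                 /* upset in the complement   */
    printf("corrected: 0x%02X\n", load_corrected(u));   /* still 0x5A */
    return 0;
}
```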

Journal ArticleDOI
TL;DR: This paper explores the interactions between DRAM and PCM to improve both the performance and the endurance of a DRAM-PCM hybrid main memory and develops a proactive strategy to allocate pages taking both program segments and DRAM conflict misses into consideration.
Abstract: Phase change memory (PCM), given its nonvolatility, potential high density, and low standby power, is a promising candidate to be used as main memory in next generation computer systems. However, to hide its shortcomings of limited endurance and slow write performance, state-of-the-art solutions tend to construct a dynamic RAM (DRAM)-PCM hybrid memory and place write-intensive pages in DRAM. While existing optimizations to this hybrid architecture focus on tuning DRAM configurations to reduce the number of write operations to PCM, this paper explores the interactions between DRAM and PCM to improve both the performance and the endurance of a DRAM-PCM hybrid main memory. Specifically, it exploits the flexibility of mapping virtual pages to physical pages, and develops a proactive strategy to allocate pages taking both program segments and DRAM conflict misses into consideration, thus distributing those heavily written pages across different DRAM sets. Meanwhile, a lifetime-aware DRAM replacement algorithm and a conflict-aware page remapping strategy are proposed to further reduce DRAM misses and PCM writes. Experiments confirm that the proposed techniques are able to improve average memory hit time and reduce maximum PCM write counts thus enhancing both performance and lifetime of a DRAM-PCM hybrid main memory.

Proceedings ArticleDOI
04 Apr 2017
TL;DR: This work designs and implements page fault support for InfiniBand and Ethernet NICs, and demonstrates that the solution provides all the benefits associated with "regular" virtual memory, notably a simpler programming model that rids users of the need to pin and the ability to employ all the canonical memory optimizations, such as memory overcommitment and demand-paging based on actual use.
Abstract: Direct network I/O allows network controllers (NICs) to expose multiple instances of themselves, to be used by untrusted software without a trusted intermediary. Direct I/O thus frees researchers from legacy software, fueling studies that innovate in multitenant setups. Such studies, however, overwhelmingly ignore one serious problem: direct memory accesses (DMAs) of NICs disallow page faults, forcing systems to either pin entire address spaces to physical memory and thereby hinder memory utilization, or resort to APIs that pin/unpin memory buffers before/after they are DMAed, which complicates the programming model and hampers performance. We solve this problem by designing and implementing page fault support for InfiniBand and Ethernet NICs. A main challenge we tackle, unique to NICs, is handling receive DMAs that trigger page faults, leaving the NIC without memory to store the incoming data. We demonstrate that our solution provides all the benefits associated with "regular" virtual memory, notably (1) a simpler programming model that rids users of the need to pin, and (2) the ability to employ all the canonical memory optimizations, such as memory overcommitment and demand-paging based on actual use. We show that, as a result, benchmark performance improves by up to 1.9×.

Proceedings ArticleDOI
02 Oct 2017
TL;DR: A near memory accelerator combining simple hardware building blocks to accelerate lookup in a hash table based key/value store is designed and shows 12.8×–2.9× speedup compared to conventional CPU lookup depending on workload characteristics.
Abstract: In the "Big Data" era, fast lookup of keys in a key/value store is a ubiquitous operation. We have designed a near memory accelerator combining simple hardware building blocks to accelerate lookup in a hash table based key/value store. We report on the co-design of hardware and software to accomplish fast lookup using open addressing. The accelerator implements a batch get command to look up a set of keys in a single request. Using an FPGA emulator, we evaluate the performance of a query workload under a comprehensive range of conditions such as hash table load factor (fill) and query key repeat distribution (likelihood of a key to reappear in a query workload). We emulate two memory configurations: Hybrid Memory Cube (or High Bandwidth Memory), and Storage Class Memory. Our design shows 12.8X - 2.9X speedup compared to conventional CPU lookup depending on workload characteristics.

Proceedings ArticleDOI
01 Feb 2017
TL;DR: CP-ORAM is proposed, a Cooperative Path ORAM design, to effectively schedule the memory requests from both types of applications and to achieve 20% performance improvement on average over the baseline Path ORAM for the secure application in a four-channel server setting.
Abstract: Path ORAM (Oblivious RAM) is a recently proposed ORAM protocol for preventing information leakage from memory access sequences. It receives wide adoption due to its simplicity, practical efficiency and asymptotic efficiency. However, Path ORAM has extremely large memory bandwidth demand, leading to severe memory competition in server settings, e.g., a server may service one application that uses Path ORAM and one or multiple applications that do not. While Path ORAM synchronously and intensively uses all memory channels, the non-secure applications often exhibit low access intensity and large channel level imbalance. Traditional memory scheduling schemes lead to wasted memory bandwidth to the system and large performance degradation to both types of applications. In this paper, we propose CP-ORAM, a Cooperative Path ORAM design, to effectively schedule the memory requests from both types of applications. CP-ORAM consists of three schemes: P-Path, R-Path, and W-Path. P-Path assigns and enforces scheduling priority for effective memory bandwidth sharing. R-Path maximizes bandwidth utilization by proactively scheduling read operations from the next Path ORAM access. W-Path mitigates contention on busy memory channels with write redirection. We evaluate CP-ORAM and compare it to the state-of-the-art. Our results show that CP-ORAM helps to achieve 20% performance improvement on average over the baseline Path ORAM for the secure application in a four-channel server setting.

Journal ArticleDOI
TL;DR: The performance of GAMT is compared with centralized implementations, and it is shown that GAMT can run up to four times faster and achieve over 51 and 37 percent reductions in area and power consumption, respectively, for a given bandwidth.
Abstract: Embedded systems are increasingly based on multi-core platforms to accommodate a growing number of applications, some of which have real-time requirements. Resources, such as off-chip DRAM, are typically shared between the applications using memory interconnects with different arbitration policies to cater to diverse bandwidth and latency requirements. However, traditional centralized interconnects are not scalable as the number of clients increases. Similarly, current distributed interconnects either cannot satisfy the diverse requirements or have decoupled arbitration stages, resulting in larger area, power and worst-case latency. The four main contributions of this article are: 1) a Globally Arbitrated Memory Tree (GAMT) with a distributed architecture that scales well with the number of cores, 2) an RTL-level implementation that can be configured with five arbitration policies (three distinct and two as special cases), 3) the concept of mixed arbitration policies that allows the policy to be selected individually per core, and 4) a worst-case analysis for a mixed arbitration policy that combines TDM and FBSP arbitration. We compare the performance of GAMT with centralized implementations and show that it can run up to four times faster and achieve over 51 and 37 percent reductions in area and power consumption, respectively, for a given bandwidth.

Proceedings ArticleDOI
04 Apr 2017
TL;DR: Mallacc is proposed, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators, which accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage.
Abstract: Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm² of silicon area, less than 0.006% of a typical high-performance processor core.
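
The first of the three accelerated operations, size class computation, is easy to show in software: map a requested size to the smallest size class that covers it. The sketch below does this with a toy class table in the style of tcmalloc-like allocators; real allocators use many more, carefully tuned classes and a faster lookup than this linear scan.

```c
/* A software sketch of the first operation Mallacc accelerates: mapping a
 * requested allocation size to a size class (and its rounded-up size).
 * The class table here is a toy chosen for illustration. */
#include <stddef.h>
#include <stdio.h>

static const size_t class_size[] = { 8, 16, 32, 48, 64, 96, 128, 256 };
#define NUM_CLASSES (sizeof class_size / sizeof class_size[0])

/* Return the smallest class whose size covers the request, or -1 if too big. */
static int size_to_class(size_t request) {
    for (unsigned c = 0; c < NUM_CLASSES; c++)
        if (request <= class_size[c]) return (int)c;
    return -1;                       /* falls through to a large-object path */
}

int main(void) {
    size_t asks[] = { 1, 24, 100, 300 };
    for (unsigned i = 0; i < sizeof asks / sizeof asks[0]; i++) {
        int c = size_to_class(asks[i]);
        if (c >= 0)
            printf("request %3zu -> class %d (%zu bytes)\n", asks[i], c, class_size[c]);
        else
            printf("request %3zu -> large object\n", asks[i]);
    }
    return 0;
}
```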

Proceedings ArticleDOI
24 Jul 2017
TL;DR: A novel Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) array design is proposed that could simultaneously work as non-volatile memory and implement reconfigurable in-memory logic (AND, OR) without add-on logic circuits on the memory chip, as required in traditional logic-in-memory designs.
Abstract: In this paper, we propose a novel Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) array design that could simultaneously work as non-volatile memory and implement reconfigurable in-memory logic (AND, OR) without add-on logic circuits on the memory chip, as required in traditional logic-in-memory designs. The computed logic output could be simply read out like a normal MRAM bit-cell using the shared memory peripheral circuits. Such intrinsic in-memory logic could be used to process data within memory to greatly reduce power-hungry and long distance data communication in conventional von Neumann computing systems. We further employ in-memory data encryption using the Advanced Encryption Standard (AES) algorithm as a case study to demonstrate the efficiency of the proposed design. The device-to-architecture co-simulation results show that the proposed design can achieve 70.15% and 80.87% lower energy consumption compared to CMOS-ASIC and CMOL-AES implementations, respectively. It offers almost the same energy consumption as a recent DW-AES implementation, but with 60.65% less area overhead.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work designs a software interface that programmers can use to identify data structures that are resilient to approximations and proposes a runtime quality control framework that automatically determines the error constraints for the identified data structures such that a given target application-level quality is maintained.
Abstract: Memory subsystems are a major energy bottleneck in computing platforms due to frequent transfers between processors and off-chip memory. We propose approximate memory compression, a technique that leverages the intrinsic resilience of emerging workloads such as machine learning and data analytics to reduce off-chip memory traffic and energy. To realize approximate memory compression, we enhance the memory controller to be aware of memory regions that contain approximation-resilient data, and to transparently compress/decompress the data written to/read from these regions. To provide control over approximations, the quality-aware memory controller conforms to a specified error constraint for each approximate memory region. We design a software interface that programmers can use to identify data structures that are resilient to approximations. We also propose a runtime quality control framework that automatically determines the error constraints for the identified data structures such that a given target application-level quality is maintained. We evaluate our proposal by implementing a hardware prototype using the Intel UniPHY-DDR3 memory controller and NIOS-II processor, a Hynix DDR3 DRAM module, and a Stratix-IV FPGA development board. Across a suite of 8 machine learning benchmarks, approximate memory compression obtains a 1.28× benefit in DRAM energy and a simultaneous 11.5% improvement in execution time for a small (< 1.5%) loss in output quality.
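
One simple transform in the spirit of approximate memory compression is to drop low-order mantissa bits of approximation-resilient floats as long as the worst-case relative error stays under the region's constraint, which leaves long runs of zeros for a compressor to exploit. The sketch below shows that transform only; the paper's actual compression scheme, hardware, and quality-control loop are not reproduced.

```c
/* A toy transform in the spirit of approximate memory compression: for data
 * marked approximation-resilient, clear low-order mantissa bits of each
 * float as long as the worst-case relative error stays under the region's
 * error constraint; the resulting repeated zero bits make the block far more
 * compressible. The constraint value and helper names are illustrative. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Largest number of mantissa bits (out of 23) we may drop so that the
 * worst-case relative error of roughly 2^(k-23) stays below max_rel_err. */
static int bits_to_drop(double max_rel_err) {
    int k = (int)floor(log2(max_rel_err) + 23.0);
    if (k < 0) k = 0;
    if (k > 23) k = 23;
    return k;
}

static float truncate_float(float v, int k) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    bits &= ~((1u << k) - 1);        /* zero the k lowest mantissa bits */
    memcpy(&v, &bits, sizeof v);
    return v;
}

int main(void) {
    double constraint = 0.001;                       /* 0.1% relative error */
    int k = bits_to_drop(constraint);
    float data[4] = { 3.14159f, -0.007812f, 1234.56f, 0.333333f };
    printf("dropping %d mantissa bits per value\n", k);
    for (int i = 0; i < 4; i++) {
        float t = truncate_float(data[i], k);
        printf("%12.6f -> %12.6f (rel err %.2e)\n",
               data[i], t, fabsf((t - data[i]) / data[i]));
    }
    return 0;
}
```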

Patent
Kang Kyu-Chang, Yang Hui-Kap
27 Jul 2017
TL;DR: In this article, the row selection circuit performs an access operation with respect to the memory bank and a hammer refresh operation on a row that is physically adjacent to a row accessed intensively.
Abstract: A memory device includes a memory bank, a row selection circuit and a refresh controller. The memory bank includes a plurality of memory blocks, and each memory block includes a plurality of memory cells arranged in rows and columns. The row selection circuit performs an access operation with respect to the memory bank and a hammer refresh operation with respect to a row that is physically adjacent to a row that is accessed intensively. The refresh controller controls the row selection circuit such that the hammer refresh operation is performed during a row active time for the access operation. The hammer refresh operation may be performed efficiently and performance of the memory device may be enhanced by performing the hammer refresh operation during the row active time for the access operation.
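
Behaviorally, the mechanism amounts to spotting rows that are activated unusually often and refreshing their physical neighbors. The sketch below models that with per-row counters and a fixed threshold, both of which are illustrative; the patent's specific point of squeezing the extra refresh into the row active time of the normal access is not modeled.

```c
/* A behavioral sketch of the hammer-refresh idea in the patent abstract:
 * count row activations, and once a row looks like a row-hammer aggressor,
 * refresh its physically adjacent rows. Counters and threshold are
 * illustrative stand-ins for whatever the real device implements. */
#include <stdio.h>

#define NUM_ROWS 16
#define HAMMER_THRESHOLD 5

static unsigned act_count[NUM_ROWS];

static void refresh_row(int row) {
    if (row >= 0 && row < NUM_ROWS)
        printf("  hammer refresh of adjacent row %d\n", row);
}

static void activate(int row) {
    printf("activate row %d\n", row);
    if (++act_count[row] >= HAMMER_THRESHOLD) {
        refresh_row(row - 1);       /* physical neighbors of the aggressor */
        refresh_row(row + 1);
        act_count[row] = 0;         /* counter reset after mitigation      */
    }
}

int main(void) {
    for (int i = 0; i < 6; i++) activate(7);   /* hammer row 7 */
    activate(3);
    return 0;
}
```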

Journal ArticleDOI
TL;DR: A novel approach to schedule memory requests in Mixed Criticality Systems by enabling the MCS designer to specify memory requirements per task is proposed, and a compact time-division-multiplexing scheduler and framework that constructs optimal schedules to manage requests to off-chip memory are introduced.
Abstract: We propose a novel approach to schedule memory requests in Mixed Criticality Systems (MCS). This approach supports an arbitrary number of criticality levels by enabling the MCS designer to specify memory requirements per task. It retains locality within large-size requests to satisfy memory requirements of all tasks. To achieve this target, we introduce a compact time-division-multiplexing scheduler, and a framework that constructs optimal schedules to manage requests to off-chip memory. We also present a static analysis that guarantees meeting requirements of all tasks. We compare the proposed controller against state-of-the-art memory controllers using both a case study and synthetic experiments.
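
The building block here is a time-division-multiplexing slot table: each requestor owns fixed slots, so its worst-case waiting time follows from the table alone. The sketch below walks such a table; the table contents, the number of requestors, and the non-work-conserving idle-slot policy are illustrative choices, not the paper's constructed schedules.

```c
/* A minimal sketch of time-division-multiplexed (TDM) scheduling of memory
 * requestors, the kind of slot-table arbitration the abstract's compact
 * scheduler is built around. Table contents and requestor counts are toys. */
#include <stdio.h>

#define SLOTS 6
#define REQUESTORS 3

static const int slot_table[SLOTS] = { 0, 1, 0, 2, 0, 1 };  /* slot -> owner   */
static int pending[REQUESTORS]     = { 4, 2, 1 };           /* queued requests */

int main(void) {
    for (int t = 0; t < 12; t++) {                 /* walk the table twice */
        int owner = slot_table[t % SLOTS];
        if (pending[owner] > 0) {
            pending[owner]--;
            printf("slot %2d: serve requestor %d\n", t, owner);
        } else {
            printf("slot %2d: owner %d idle, slot wasted\n", t, owner);
        }
    }
    return 0;
}
```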

Journal ArticleDOI
TL;DR: The authors summarize the latest research progress of phase change memory, spin-transfer torque random access memory, and resistive random access memory in device engineering, circuit design, computer architecture, and application.
Abstract: Editor’s note: Phase change memory, spin-transfer torque random access memory, and resistive random access memory are three major emerging memory technologies that receive tremendous attention from both academia and industry. In this survey article, the authors summarize the latest research progress of these technologies in device engineering, circuit design, computer architecture, and application. —Tei-Wei Kuo, National Taiwan University

Journal ArticleDOI
TL;DR: A hierarchical design of 4R1W memory is introduced that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1w module and can achieve higher clock frequencies by alleviating the complex routing in an FPGA.
Abstract: The utilization of block RAMs (BRAMs) is a critical performance factor for multiported memory designs on field-programmable gate arrays (FPGAs). Not only does the excessive demand on BRAMs block the usage of BRAMs from other parts of a design, but the complex routing between BRAMs and logic also limits the operating frequency. This paper first introduces a brand new perspective and a more efficient way of using a conventional two reads one write (2R1W) memory as a 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1W module. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with previous xor-based and live value table-based approaches, the proposed designs can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-depth. For complex multiported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs and enhance the operating frequency by 20%.