
Showing papers on "Registered memory published in 2015"


Proceedings ArticleDOI
01 Feb 2015
TL;DR: This paper proposes near-DRAM acceleration (NDA) architectures, which process data using accelerators 3D-stacked on DRAM devices comprising off-chip main memory modules, substantially reducing energy consumption and improving performance.
Abstract: Energy consumed for transferring data across the processor memory hierarchy constitutes a large fraction of total system energy consumption, and this fraction has steadily increased with technology scaling. In this paper, we propose near-DRAM acceleration (NDA) architectures, which process data using accelerators 3D-stacked on DRAM devices comprising off-chip main memory modules. NDA transfers most data through high-bandwidth and low-energy 3D interconnects between accelerators and DRAM devices instead of low-bandwidth and high-energy off-chip interconnects between a processor and DRAM devices, substantially reducing energy consumption and improving performance. Unlike previous near-memory processing architectures, NDA is built upon commodity DRAM devices; apart from inserting through-silicon vias (TSVs) to 3D-interconnect DRAM devices and accelerators, NDA requires minimal changes to the commodity DRAM device and standard memory module architectures. This allows NDA to be more easily adopted in both existing and emerging systems. Our experiments demonstrate that, on average, our NDA-based system consumes 46% (68%) lower (data transfer) energy at 1.67× higher performance than a system that integrates the same accelerator logic within the processor itself.

251 citations
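
As a rough illustration of the claim that 3D interconnects cut data-transfer energy, the sketch below compares per-bit transfer energies for an off-chip channel versus a TSV link. The pJ/bit figures are assumptions chosen for the illustration, not measurements from the paper.

OFFCHIP_PJ_PER_BIT = 20.0   # assumed energy of an off-chip DDR channel
TSV_PJ_PER_BIT = 2.0        # assumed energy of a short 3D TSV link

def transfer_energy_mj(gigabytes: float, pj_per_bit: float) -> float:
    bits = gigabytes * 8e9
    return bits * pj_per_bit * 1e-9        # pJ -> mJ

for name, cost in [("off-chip", OFFCHIP_PJ_PER_BIT), ("TSV", TSV_PJ_PER_BIT)]:
    print(f"{name:8s}: {transfer_energy_mj(1.0, cost):.0f} mJ per GB moved")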


Proceedings ArticleDOI
22 Jun 2015
TL;DR: This paper analyzes the memory errors in the entire fleet of servers at Facebook over the course of fourteen months, representing billions of device days, and observes several new reliability trends for memory systems that have not been discussed before in the literature.
Abstract: Computing systems use dynamic random-access memory (DRAM) as main memory. As prior works have shown, failures in DRAM devices are an important source of errors in modern servers. To reduce the effects of memory errors, error correcting codes (ECC) have been developed to help detect and correct errors when they occur. In order to develop effective techniques, including new ECC mechanisms, to combat memory errors, it is important to understand the memory reliability trends in modern systems. In this paper, we analyze the memory errors in the entire fleet of servers at Facebook over the course of fourteen months, representing billions of device days. The systems we examine cover a wide range of devices commonly used in modern servers, with DIMMs manufactured by 4 vendors in capacities ranging from 2 GB to 24 GB that use the modern DDR3 communication protocol. We observe several new reliability trends for memory systems that have not been discussed before in the literature. We show that (1) memory errors follow a power-law, specifically, a Pareto distribution with decreasing hazard rate, with average error rate exceeding median error rate by around 55×, (2) non-DRAM memory failures from the memory controller and memory channel cause the majority of errors, and the hardware and software overheads to handle such errors cause a kind of denial of service attack in some servers, (3) using our detailed analysis, we provide the first evidence that more recent DRAM cell fabrication technologies (as indicated by chip density) have substantially higher failure rates, increasing by 1.8× over the previous generation, (4) DIMM architecture decisions affect memory reliability: DIMMs with fewer chips and lower transfer widths have the lowest error rates, likely due to electrical noise reduction, (5) while CPU and memory utilization do not show clear trends with respect to failure rates, workload type can influence failure rate by up to 6.5×, suggesting certain memory access patterns may induce more errors, (6) we develop a model for memory reliability and show how system design choices such as using lower density DIMMs and fewer cores per chip can reduce failure rates of a baseline server by up to 57.7%, and (7) we perform the first implementation and real-system analysis of page offlining at scale, showing that it can reduce memory error rate by 67%, and identify several real-world impediments to the technique.

203 citations
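
Finding (1) is easy to illustrate numerically: for a Pareto (Type I) distribution the hazard rate α/x is decreasing, and when the shape parameter sits just above 1 the mean dwarfs the median. The shape and scale below are assumptions picked to produce a gap of the reported order, not parameters fitted by the study.

import numpy as np

rng = np.random.default_rng(0)
alpha, x_min = 1.01, 1.0                       # assumed shape and scale
samples = x_min * (1.0 + rng.pareto(alpha, size=5_000_000))
# analytic mean/median for Pareto I: (alpha/(alpha-1)) / 2**(1/alpha)
print(f"analytic mean/median: {(alpha / (alpha - 1)) / 2 ** (1 / alpha):.0f}x")
# the sample mean of such a heavy tail converges slowly, so the
# empirical ratio typically lands below the analytic one
print(f"sampled  mean/median: {samples.mean() / np.median(samples):.0f}x")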


Journal ArticleDOI
TL;DR: A flexible memory simulator, NVMain 2.0, is introduced to help the community model not only commodity DRAMs but also emerging memory technologies, such as die-stacked DRAM caches, non-volatile memories including multi-level cells (MLC), and hybrid non-volatile plus DRAM memory systems.
Abstract: In this letter, a flexible memory simulator, NVMain 2.0, is introduced to help the community model not only commodity DRAMs but also emerging memory technologies, such as die-stacked DRAM caches, non-volatile memories (e.g., STT-RAM, PCRAM, and ReRAM) including multi-level cells (MLC), and hybrid non-volatile plus DRAM memory systems. Compared to existing memory simulators, NVMain 2.0 features a flexible user interface with compelling simulation speed and the capability of providing sub-array-level parallelism, fine-grained refresh, MLC and data encoder modeling, and distributed energy profiling.

187 citations


Proceedings ArticleDOI
05 Dec 2015
TL;DR: A hardware-assisted DRAM+NVM hybrid persistent memory design, Transparent Hybrid NVM (ThyNVM), which supports software-transparent crash consistency of memory data in a hybrid memory system and efficiently enforces crash consistency through a new dual-scheme checkpointing mechanism.
Abstract: Emerging byte-addressable nonvolatile memories (NVMs) promise persistent memory, which allows processors to directly access persistent data in main memory. Yet, persistent memory systems need to guarantee a consistent memory state in the event of power loss or a system crash (i.e., crash consistency). To guarantee crash consistency, most prior works rely on programmers to (1) partition persistent and transient memory data and (2) use specialized software interfaces when updating persistent memory data. As a result, taking advantage of persistent memory requires significant programmer effort, e.g., to implement new programs as well as modify legacy programs. Use cases and adoption of persistent memory can therefore be largely limited. In this paper, we propose a hardware-assisted DRAM+NVM hybrid persistent memory design, Transparent Hybrid NVM (ThyNVM), which supports software-transparent crash consistency of memory data in a hybrid memory system. To efficiently enforce crash consistency, we design a new dual-scheme checkpointing mechanism, which efficiently overlaps checkpointing time with application execution time. The key novelty is to enable checkpointing of data at multiple granularities, cache block or page granularity, in a coordinated manner. This design is based on our insight that there is a tradeoff between the application stall time due to checkpointing and the hardware storage overhead of the metadata for checkpointing, both of which are dictated by the granularity of checkpointed data. To get the best of the tradeoff, our technique adapts the checkpointing granularity to the write locality characteristics of the data and coordinates the management of multiple-granularity updates. Our evaluation across a variety of applications shows that ThyNVM performs within 4.9% of an idealized DRAM-only system that can provide crash consistency at no cost.

172 citations
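
The granularity tradeoff at the heart of the dual-scheme mechanism can be phrased in a few lines. The sketch below is a toy software rendering of the idea, not the paper's hardware design; the block/page sizes and the density threshold are assumptions.

BLOCKS_PER_PAGE = 64        # assumed: 4 KB page / 64 B cache blocks
DENSITY_THRESHOLD = 0.5     # assumed tuning knob

def checkpoint_granularity(dirty_blocks_in_page: int) -> str:
    """Densely written pages amortize checkpoint metadata at page
    granularity; sparsely written pages checkpoint per cache block."""
    density = dirty_blocks_in_page / BLOCKS_PER_PAGE
    return "page" if density >= DENSITY_THRESHOLD else "cache-block"

print(checkpoint_granularity(5))    # sparse writes -> 'cache-block'
print(checkpoint_granularity(50))   # dense writes  -> 'page'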


Proceedings ArticleDOI
09 Mar 2015
TL;DR: This work explores the challenges of exposing the stacked DRAM as part of the system's physical address space, and presents an HMA approach with low hardware and software impact that can dynamically tune itself to different application scenarios, achieving performance even better than the (impractical-to-implement) baseline approaches.
Abstract: Die-stacked DRAM is a technology that will soon be integrated in high-performance systems. Recent studies have focused on hardware caching techniques to make use of the stacked memory, but these approaches require complex changes to the processor and also cannot leverage the stacked memory to increase the system's overall memory capacity. In this work, we explore the challenges of exposing the stacked DRAM as part of the system's physical address space. This non-uniform access memory (NUMA) styled approach greatly simplifies the hardware and increases the physical memory capacity of the system, but pushes the burden of managing the heterogeneous memory architecture (HMA) to the software layers. We first explore simple (and somewhat impractical) schemes to manage the HMA, and then refine the mechanisms to address a variety of hardware and software implementation challenges. In the end, we present an HMA approach with low hardware and software impact that can dynamically tune itself to different application scenarios, achieving performance even better than the (impractical-to-implement) baseline approaches.

149 citations
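
A minimal software rendering of the NUMA-style policy: track per-page access counts for an epoch and keep the hottest pages resident in the small stacked DRAM. The capacity and counters below are assumptions for the sketch, not the paper's tuned mechanism.

from collections import Counter

STACKED_CAPACITY_PAGES = 4          # assumed tiny capacity for the demo

def choose_resident_pages(access_counts: Counter) -> set:
    """Pick the hottest pages to migrate into die-stacked DRAM."""
    hottest = access_counts.most_common(STACKED_CAPACITY_PAGES)
    return {page for page, _ in hottest}

counts = Counter({0x1000: 90, 0x2000: 5, 0x3000: 70, 0x4000: 40, 0x5000: 2})
print(sorted(choose_resident_pages(counts)))   # the four hottest pages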


Patent
22 May 2015
TL;DR: In this article, a storage controller for determining an amount of data to be sent to a flash memory apparatus for storage comprises a communications interface for communicating with the flash memory apparatus and a processor.
Abstract: A storage controller for determining an amount of data to be sent to a flash memory apparatus for storage comprises a communications interface for communicating with the flash memory apparatus and a processor. The flash memory apparatus comprises a block including a plurality of pages, at least one of which is unavailable for storage. The processor is configured to receive information about the block sent by the flash memory apparatus, wherein the information includes the capacity of one or more unavailable pages in the block. The processor then determines an available capacity of the block based on the information and a total capacity of the block. Further, the processor obtains data to be sent to the flash memory apparatus, wherein the amount of the data is equal to the available capacity of the block. Finally, the processor sends the data to the flash memory apparatus.

147 citations
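
The claim's arithmetic is straightforward: the controller subtracts the unavailable pages' capacity from the block's total capacity and sends exactly that much data. Page and block geometry below are assumptions for the example.

PAGE_SIZE = 16 * 1024        # bytes per page (assumed)
PAGES_PER_BLOCK = 256        # pages per block (assumed)

def available_capacity(unavailable_pages: int) -> int:
    """Capacity the controller may fill, per the reported scheme."""
    total = PAGES_PER_BLOCK * PAGE_SIZE
    return total - unavailable_pages * PAGE_SIZE

# with 3 bad pages, the controller sends exactly this many bytes
print(available_capacity(unavailable_pages=3))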


Patent
07 May 2015
TL;DR: In this paper, a system includes a plurality of host processors and HMC devices configured as a distributed shared memory for the host processors, and logic circuitry configured to determine memory coherence state information for data stored in the memory of the plurality of memory die, communicate information regarding the access to memory, and include the memory coherence information in the communicated information.
Abstract: A system includes a plurality of host processors and a plurality of HMC devices configured as a distributed shared memory for the host processors. An HMC device includes a plurality of integrated circuit memory die including at least a first memory die arranged on top of a second memory die and at least a portion of the memory of the memory die is mapped to include at least a portion of a memory coherence directory; and a logic base die including at least one memory controller configured to manage three-dimensional (3D) access to memory of the plurality of memory die by at least one second device, and logic circuitry configured to determine memory coherence state information for data stored in the memory of the plurality of memory die, communicate information regarding the access to memory, and include the memory coherence information in the communicated information.

134 citations


Patent
07 May 2015
TL;DR: In this paper, a system includes a plurality of host processors and a plurality of hybrid memory cube (HMC) devices configured as a distributed shared memory for the host processors, where each HMC device includes stacked memory die and a logic base die with at least one memory controller configured to manage three-dimensional (3D) access to memory of the plurality of memory die.
Abstract: A system includes a plurality of host processors and a plurality of hybrid memory cube (HMC) devices configured as a distributed shared memory for the host processors. An HMC device includes a plurality of integrated circuit memory die including at least a first memory die arranged on top of a second memory die, and at least a portion of the memory of the memory die is mapped to include at least a portion of a memory coherence directory; and a logic base die including at least one memory controller configured to manage three-dimensional (3D) access to memory of the plurality of memory die by at least one second device, and logic circuitry configured to implement a memory coherence protocol for data stored in the memory of the plurality of memory die.

126 citations
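
Both of these patents keep a coherence directory inside the stacked memory itself. As a rough illustration of what a directory entry tracks, here is a minimal MSI-style entry; the state names and fields are assumptions, since the patents do not specify an encoding.

from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    state: str = "I"                    # assumed M/S/I states
    sharers: set = field(default_factory=set)

    def read(self, requester: int) -> None:
        self.sharers.add(requester)
        if self.state == "I":
            self.state = "S"

    def write(self, requester: int) -> None:
        self.sharers = {requester}      # others would be invalidated
        self.state = "M"

entry = DirectoryEntry()
entry.read(0); entry.read(1); entry.write(1)
print(entry)    # DirectoryEntry(state='M', sharers={1})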


Proceedings ArticleDOI
19 Mar 2015
TL;DR: STT-MRAM circuit designs are presented: a short read-pulse generator with small overhead using a hierarchical bitline for eliminating read disturbance, a charge-optimization scheme to avoid excessive active charging/discharging power, and ultra-fast power gating and power-on adaptive to RAM status for reducing leakage power.
Abstract: Nonvolatile memory, spin-transfer torque magnetoresistive RAM (STT-MRAM), is being developed to realize nonvolatile working memory because it provides high-speed accesses, high endurance, and CMOS-logic compatibility. Furthermore, programming current has been reduced drastically by developing the advanced perpendicular STT-MRAM [1]. Several-megabit STT-MRAM with sub-5ns operation is demonstrated in [2]. Advanced perpendicular STT-MRAM achieves ∼3× power saving by reducing leakage current in memory cells compared with SRAM for last level cache (LLC) [3]. Such high-speed RAM applications, however, entail several issues: the probability of read disturbance error increases, and the active power of STT-MRAM must be decreased for higher access speed. Moreover, the leakage power of peripheral circuits must be decreased, because the high-speed RAM requires high-performance transistors having high leakage current in peripheral circuitry [4], limiting the energy efficiency of STT-MRAM. To resolve these issues, this paper presents STT-MRAM circuit designs: a short read-pulse generator with small overhead using a hierarchical bitline for eliminating read disturbance, a charge-optimization scheme to avoid excessive active charging/discharging power, and ultra-fast power gating and power-on adaptive to RAM status for reducing leakage power.

104 citations


Proceedings ArticleDOI
01 May 2015
TL;DR: This paper presents SoftWrAP, an open-source framework for Software based Write-Aside Persistence that provides lightweight atomicity and durability for SCM storage transactions, while ensuring fast paths to data in processor caches, DRAM, and persistent memory tiers.
Abstract: In-memory computing is gaining popularity as a means of sidestepping the performance bottlenecks of block storage operations. However, the volatile nature of DRAM makes these systems vulnerable to system crashes, while the need to continuously refresh massive amounts of passive memory-resident data increases power consumption. Emerging storage-class memory (SCM) technologies combine fast DRAM-like cache-line access granularity with the persistence of storage devices like disks or SSDs, resulting in potential 10x-100x performance gains, and low passive power consumption. This unification of storage and memory into a single directly-accessible persistent tier raises significant reliability and programmability challenges. In this paper, we present SoftWrAP, an open-source framework for Software based Write-Aside Persistence. SoftWrAP provides lightweight atomicity and durability for SCM storage transactions, while ensuring fast paths to data in processor caches, DRAM, and persistent memory tiers. We use our framework to evaluate both handcrafted SCM-based microbenchmarks as well as existing applications, specifically the STX B+Tree library and SQLite database, backed by emulated SCM. Our results show significant benefits of SoftWrAP over existing methods such as undo logging and shadow copying, and show that SoftWrAP can match non-atomic durable writes to SCM, thereby gaining atomic consistency almost for free.

94 citations
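
The write-aside idea can be sketched in a few lines: transactional stores land in a log first, and home locations in SCM are updated only after commit, so a crash mid-transaction never exposes partial updates. The class and method names below are hypothetical, not SoftWrAP's actual API, and real hardware would need cache-line flushes and fences where noted.

class WriteAsideTx:
    """Toy write-aside transaction over a dict standing in for SCM."""
    def __init__(self, scm: dict):
        self.scm, self.log = scm, {}

    def store(self, addr: int, value) -> None:
        self.log[addr] = value          # write aside, not in place

    def commit(self) -> None:
        # a real system would persist the log (flush + fence) first,
        # then retire values to their home SCM locations
        self.scm.update(self.log)
        self.log.clear()

scm = {}
tx = WriteAsideTx(scm)
tx.store(0x10, "A"); tx.store(0x18, "B")
tx.commit()
print(scm)      # {16: 'A', 24: 'B'}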


Journal ArticleDOI
TL;DR: This paper proposes a genetic algorithm to perform data allocation to different memory units, thereby reducing memory access cost in terms of power consumption and latency, and shows the merits of the heterogeneous scratchpad architecture over the traditional pure memory system and the effectiveness of the proposed algorithms.
Abstract: The gradually widening speed disparity between CPU and memory has become an overwhelming bottleneck for the development of chip multiprocessor systems. In addition, increasing penalties caused by frequent on-chip memory accesses have raised critical challenges in delivering high memory access performance with tight power and latency budgets. To overcome the daunting memory wall and energy wall issues, this paper focuses on proposing a new heterogeneous scratchpad memory architecture, which is configured from SRAM, MRAM, and Z-RAM. Based on this architecture, we propose a genetic algorithm to perform data allocation to different memory units, thereby reducing memory access cost in terms of power consumption and latency. Extensive experiments are performed to show the merits of the heterogeneous scratchpad architecture over the traditional pure memory system and the effectiveness of the proposed algorithms.
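
To make the allocation problem concrete, here is a small genetic algorithm in the spirit the abstract describes: each data object is assigned to SRAM, MRAM, or Z-RAM to minimize a weighted access cost under capacity limits. All costs, sizes, capacities, and GA parameters are invented for the sketch; the paper's formulation will differ.

import random

MEMS = ["SRAM", "MRAM", "ZRAM"]
COST = {"SRAM": (1, 1), "MRAM": (2, 8), "ZRAM": (3, 3)}   # (read, write), assumed
CAP = {"SRAM": 4, "MRAM": 8, "ZRAM": 8}                   # KB, assumed
ACCESSES = [(100, 10), (5, 80), (40, 40), (70, 5)]        # (reads, writes)
SIZES = [4, 4, 2, 2]                                      # KB per data object

def fitness(alloc):
    cost = sum(r * COST[m][0] + w * COST[m][1]
               for (r, w), m in zip(ACCESSES, alloc))
    for m in MEMS:                                        # capacity penalty
        used = sum(s for s, a in zip(SIZES, alloc) if a == m)
        cost += 10_000 * max(0, used - CAP[m])
    return -cost

def evolve(pop_size=30, generations=60, p_mut=0.2):
    pop = [[random.choice(MEMS) for _ in ACCESSES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(ACCESSES))      # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < p_mut:                   # point mutation
                child[random.randrange(len(child))] = random.choice(MEMS)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

random.seed(1)
best = evolve()
print(best, "cost =", -fitness(best))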

Patent
02 Jun 2015
TL;DR: In this paper, a comparison operation in memory using a logical representation of a first value stored in a first portion of a number of memory cells coupled to a sense line of a memory array is presented.
Abstract: One example of the present disclosure includes performing a comparison operation in memory using a logical representation of a first value stored in a first portion of a number of memory cells coupled to a sense line of a memory array and a logical representation of a second value stored in a second portion of the number of memory cells coupled to the sense line of the memory array. The comparison operation compares the first value to the second value, and the method can include storing a logical representation of a result of the comparison operation in a third portion of the number of memory cells coupled to the sense line of the memory array.

Proceedings ArticleDOI
09 Mar 2015
TL;DR: This work proposes a novel Memory Aware Scheduling and Cache Access Re-execution (Mascar) system on GPUs tailored for better performance for memory intensive workloads and achieves an average of 12% savings in energy for such workloads.
Abstract: With the prevalence of GPUs as throughput engines for data parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide large numbers of compute resources, the resources needed for memory intensive workloads are more scarce. Therefore, managing access to these limited memory resources is a challenge for GPUs. We propose a novel Memory Aware Scheduling and Cache Access Re-execution (Mascar) system on GPUs tailored for better performance for memory intensive workloads. This scheme detects memory saturation and prioritizes memory requests among warps to enable better overlapping of compute and memory accesses. Furthermore, it enables limited re-execution of memory instructions to eliminate structural hazards in the memory subsystem and take advantage of cache locality in cases where requests cannot be sent to the memory due to memory saturation. Our results show that Mascar provides a 34% speedup over the baseline round-robin scheduler and 10% speedup over the state-of-the-art warp schedulers for memory intensive workloads. Mascar also achieves an average of 12% savings in energy for such workloads.
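
A toy rendering of the scheduling idea (the saturation detector and warp model are invented here; Mascar's real scheme works on memory requests inside the GPU pipeline): under memory saturation, stop rotating and let one warp's outstanding memory requests drain so compute can overlap with memory again.

def pick_warp(warps, memory_saturated: bool, rr_ptr: int) -> int:
    """Round-robin normally; prioritize one memory-waiting warp when
    the memory subsystem is saturated."""
    if memory_saturated:
        pending = [i for i, w in enumerate(warps) if w["mem_pending"]]
        if pending:
            return pending[0]
    return rr_ptr % len(warps)

warps = [{"mem_pending": False}, {"mem_pending": True}, {"mem_pending": True}]
print(pick_warp(warps, memory_saturated=True, rr_ptr=0))    # -> 1
print(pick_warp(warps, memory_saturated=False, rr_ptr=0))   # -> 0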

Proceedings ArticleDOI
08 Jul 2015
TL;DR: In this article, the authors present a parallelism-aware worst-case memory interference delay analysis for COTS multicore systems, focusing on LLC and DRAM bank partitioned systems.
Abstract: In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can generate many parallel memory requests at a time. The processing of these parallel requests in the DRAM controller greatly affects the memory interference delay experienced by running tasks on the platform. In this paper, we present a new parallelism-aware worst-case memory interference delay analysis for COTS multicore systems. The analysis considers a COTS processor that can generate multiple outstanding requests and a COTS DRAM controller that has a separate read and write request buffer, prioritizes reads over writes, and supports out-of-order request processing. Focusing on LLC and DRAM bank partitioned systems, our analysis computes worst-case upper bounds on memory-interference delays caused by competing memory requests. We validate our analysis on a Gem5 full-system simulator modeling a realistic COTS multicore platform, with a set of carefully designed synthetic benchmarks as well as SPEC2006 benchmarks. The evaluation results show that our analysis produces safe upper bounds in all tested benchmarks, while the current state-of-the-art analysis significantly underestimates the delays.

Patent
26 Aug 2015
TL;DR: In this article, the present disclosure provides apparatuses and methods related to performing swap operations in a memory, including a first group of memory cells coupled to a first sense line and configured to store a first element.
Abstract: Examples of the present disclosure provide apparatuses and methods related to performing swap operations in a memory. An example apparatus might include a first group of memory cells coupled to a first sense line and configured to store a first element. An example apparatus might also include a second group of memory cells coupled to a second sense line and configured to store a second element. An example apparatus might also include a controller configured to cause the first element to be stored in the second group of memory cells and the second element to be stored in the first group of memory cells by controlling sensing circuitry to perform a number of operations without transferring data via an input/output (I/O) line.

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This work introduces a novel and compact time-division-multiplexing scheduler that is adequate for mixed-time critical systems and presents a novel framework that constructs optimal off-chip DRAM memory controller schedules for multi-core mixed-time critical systems.
Abstract: Mixed-time critical systems are real-time systems that accommodate both hard real-time (HRT) and soft real-time (SRT) tasks. HRT tasks mandate a guarantee on the worst-case latency, while SRT tasks have average-case bandwidth (BW) demands. Memory requests in mixed-time critical systems usually have different transaction sizes based on whether the issuer task is HRT or SRT. For example, HRT tasks often issue requests with a cache line size. On the other hand, SRT tasks may issue requests with a size of KBs. Requests from multimedia cores, cores controlling network interfaces, and direct memory accesses (DMAs) are obvious examples of these large-size requests. Based on these observations, we promote in this work a new approach to schedule memory requests. This approach retains locality within large-size requests to minimize the worst-case latency, while maintaining the average-case BW as high as required. To achieve this target, we introduce a novel and compact time-division-multiplexing scheduler that is adequate for mixed-time critical systems. We also present a novel framework that constructs optimal off-chip DRAM memory controller schedules for multi-core mixed-time critical systems. These schedules are loaded to the memory controller during boot-time. Based on the proposed schedule, we provide a detailed static analysis that guarantees predictability. We compare the proposed controller against state-of-the-art real-time memory controllers using synthetic experiments as well as a practical use-case from multimedia systems.
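
The predictability argument rests on the schedule being fixed at boot: once a requestor owns known slots, its worst-case wait is just the largest gap between consecutive slots. The slot table below is invented to show the computation.

SCHEDULE = ["HRT0", "SRT", "HRT1", "SRT"]      # assumed table, repeats forever

def worst_case_wait(requestor: str) -> int:
    """Largest number of slots a newly arrived request can wait
    before its owner's next slot comes around."""
    n = len(SCHEDULE)
    own = [i for i, s in enumerate(SCHEDULE) if s == requestor]
    return max((own[(k + 1) % len(own)] - own[k]) % n or n
               for k in range(len(own)))

print(worst_case_wait("HRT0"), "slots")        # bounded by construction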

Journal ArticleDOI
TL;DR: The architecture, design, analysis, and simulation and measurement results of the 3D-MAPS (3D massively parallel processor with stacked memory) chip built with a 1.5 V, 130 nm process technology and a two-tier 3D stacking technology are described.
Abstract: This paper describes the architecture, design, analysis, and simulation and measurement results of the 3D-MAPS (3D massively parallel processor with stacked memory) chip built with a 1.5 V, 130 nm process technology and a two-tier 3D stacking technology using 1.2 µm-diameter, 6 µm-height through-silicon vias (TSVs) and 3.4 µm-diameter face-to-face bond pads. 3D-MAPS consists of a core tier containing 64 cores and a memory tier containing 64 memory blocks. Each core communicates with its dedicated 4 KB SRAM block using face-to-face bond pads, which provide negligible data transfer delay between the core and the memory tiers. The maximum operating frequency is 277 MHz and the maximum memory bandwidth is 70.9 GB/s at 277 MHz. The peak measured memory bandwidth usage is 63.8 GB/s and the peak measured power is approximately 4 W based on eight parallel benchmarks.
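
The peak bandwidth figure is consistent with each of the 64 cores moving one 32-bit word to its SRAM block every cycle at 277 MHz; the word width is our assumption, but the arithmetic reproduces the number above.

cores, bytes_per_core_cycle, f_hz = 64, 4, 277e6
bw_gb_s = cores * bytes_per_core_cycle * f_hz / 1e9
print(f"{bw_gb_s:.1f} GB/s")    # ~70.9 GB/s, matching the reported peak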

Patent
03 Nov 2015
TL;DR: In this article, a system that calibrates timing relationships between signals involved in performing write operations is described, where each memory chip includes a phase detector configured to calibrate a phase relationship between a data-strobe signal and a clock signal received at the memory chip from the memory controller during a write operation.
Abstract: A system that calibrates timing relationships between signals involved in performing write operations is described. This system includes a memory controller which is coupled to a set of memory chips, wherein each memory chip includes a phase detector configured to calibrate a phase relationship between a data-strobe signal and a clock signal received at the memory chip from the memory controller during a write operation. Furthermore, the memory controller is configured to perform one or more write-read-validate operations to calibrate a clock-cycle relationship between the data-strobe signal and the clock signal, wherein the write-read-validate operations involve varying a delay on the data-strobe signal relative to the clock signal by a multiple of a clock period.
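
A write-read-validate sweep of the kind the patent describes is simple to express: vary the data-strobe delay in whole clock periods, write a pattern, read it back, and keep the first delay that validates. The device below is a stand-in model with an assumed correct delay, not the patent's hardware.

def write_read_validate(device, pattern=0xA5, max_cycles=4):
    """Find the clock-cycle offset at which writes land correctly."""
    for delay in range(max_cycles):
        device.set_dqs_delay_cycles(delay)
        device.write(0x0, pattern)
        if device.read(0x0) == pattern:
            return delay
    raise RuntimeError("no passing delay found")

class FakeDevice:
    """Stand-in DRAM chip: writes only land at the 'correct' delay."""
    GOOD_DELAY = 2                       # assumed for the demo
    def __init__(self):
        self.delay, self.mem = 0, {}
    def set_dqs_delay_cycles(self, d):
        self.delay = d
    def write(self, addr, val):
        if self.delay == self.GOOD_DELAY:
            self.mem[addr] = val
    def read(self, addr):
        return self.mem.get(addr)

print(write_read_validate(FakeDevice()))    # -> 2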

Patent
Rotem Sela, Miki Sapir, Amir Shaharabany, Hadas Oshinsky, Alon Marcu, Nir Perry
28 Oct 2015
TL;DR: In this paper, the authors present a data buffer management system for non-volatile memory systems, which includes a non-volatile memory with a plurality of data latches, a non-volatile memory array, a data buffer, and a controller configured to free the data buffer for receiving new data as soon as the prior data is transferred to the data latches.
Abstract: Systems and methods for managing a data buffer of a non-volatile memory system are disclosed. The method may include a controller of a storage system retrieving host data, storing the retrieved data in a data buffer and transferring the data to a non-volatile memory. The controller may then overwrite the retrieved data in the data buffer as soon as the retrieved data has been transferred to the non-volatile memory die but prior to sending a command to program that data to the non-volatile memory array of the non-volatile memory. The system includes a non-volatile memory with a plurality of data latches and a non-volatile memory array, a data buffer and a controller configured to free the data buffer for receiving new data as soon as the prior data is transferred to the data latches and prior to any indication on success of programming prior data to the non-volatile memory array.

Patent
10 Nov 2015
TL;DR: In this paper, the memory controller transmits an auto-refresh command to the memory device, which performs refresh operations to refresh the memory cells while its command interface is placed into a calibration mode for the duration of a first time interval.
Abstract: A system includes a memory controller and a memory device having a command interface and a plurality of memory banks, each with a plurality of rows of memory cells. The memory controller transmits an auto-refresh command to the memory device. Responsive to the auto-refresh command, during a first time interval, the memory device performs refresh operations to refresh the memory cells and the command interface of the memory device is placed into a calibration mode for the duration of the first time interval. Concurrently, during at least a portion of the first time interval, the memory controller performs a calibration of the command interface of the memory device. The auto-refresh command may specify an order in which memory banks of the memory device are to be refreshed, such that the memory device sequentially refreshes a respective row in the plurality of memory banks in the specified bank order.

Patent
10 Mar 2015
TL;DR: In this article, a unified memory and network controller for an all-flash array (AFA) storage blade in distributed flash storage clusters over a fabric network is presented.
Abstract: System and method for a unified memory and network controller for an all-flash array (AFA) storage blade in distributed flash storage clusters over a fabric network. The unified memory and network controller has 3-way control functions including unified memory buses to cache memories and DDR4-AFA controllers, a dual-port PCIE interconnection to two host processors of gateway clusters, and four switch fabric ports for interconnections with peer controllers (e.g., AFA blades and/or chassis) in the distributed flash storage network. The AFA storage blade includes dynamic random-access memory (DRAM) and magnetoresistive random-access memory (MRAM) configured as data read/write cache buffers, and flash memory DIMM devices as primary storage. Remote direct memory access (RDMA) for clients via the data caching buffers is enabled and controlled by the host processor interconnection(s), the switch fabric ports, and a unified memory bus from the unified controller to the data buffer and the flash SSDs.

Journal ArticleDOI
TL;DR: This work proposes an architecture, DRAMA, that 3D-stacks coarse-grain reconfigurable accelerators (CGRAs) atop off-chip DRAM devices and can reduce the energy consumption to transfer data across the memory hierarchy by 66-95 percent while achieving speedups of up to 18× over a commodity processor.
Abstract: Improving energy efficiency is crucial for both mobile and high-performance computing systems while a large fraction of total energy is consumed to transfer data between storage and processing units. Thus, reducing data transfers across the memory hierarchy of a processor (i.e., off-chip memory, on-chip caches, and register file) can greatly improve the energy efficiency. To this end, we propose an architecture, DRAMA, that 3D-stacks coarse-grain reconfigurable accelerators (CGRAs) atop off-chip DRAM devices. DRAMA does not require changes to the DRAM device architecture, apart from through-silicon vias (TSVs) that connect the DRAM device's internal I/O bus to the CGRA layer. We demonstrate that DRAMA can reduce the energy consumption to transfer data across the memory hierarchy by 66-95 percent while achieving speedups of up to 18× over a commodity processor.

Patent
09 Feb 2015
TL;DR: A resistive memory structure, for example, phase change memory, includes one access device and two or more resistive memory cells, each coupled to a rectifying device to prevent parallel leak current from flowing through non-selected memory cells.
Abstract: A resistive memory structure, for example, phase change memory structure, includes one access device and two or more resistive memory cells. Each memory cell is coupled to a rectifying device to prevent parallel leak current from flowing through non-selected memory cells. In an array of resistive memory bit structures, resistive memory cells from different memory bit structures are stacked and share rectifying devices.

Patent
22 Sep 2015
TL;DR: In this article, a pair of non-volatile memory devices coupled in series may be placed in complementary memory states in a write operation by controlling a current and a voltage applied to terminals of the nonvolatile device.
Abstract: Disclosed are methods, systems and devices for operation of dual non-volatile memory devices. In one aspect, a pair of non-volatile memory device coupled in series may be placed in complementary memory states any one of multiple memory states in a write operation by controlling a current and a voltage applied to terminals of the non-volatile memory device.

Patent
10 Mar 2015
TL;DR: In this article, a method for controlling a cache having a volatile memory and a non-volatile memory during a power up sequence is provided, which includes receiving, at a controller configured to control the cache and a storage device associated with the cache, a signal indicating whether the non-vatile memory includes dirty data copied from the volatile memory to the nonvivo memory during power down sequence, the dirty data including data that has not been stored in the storage device.
Abstract: In some embodiments, a method for controlling a cache having a volatile memory and a non-volatile memory during a power up sequence is provided. The method includes receiving, at a controller configured to control the cache and a storage device associated with the cache, a signal indicating whether the non-volatile memory includes dirty data copied from the volatile memory to the non-volatile memory during a power down sequence, the dirty data including data that has not been stored in the storage device. In response to the received signal, the dirty data is restored from the non-volatile memory to the volatile memory, and flushed from the volatile memory to the storage device.

Patent
Bryan K. Casper, R. Mooney, Dave Dunning, Mozhgan Mansuri, James E. Jaussi
13 Feb 2015
TL;DR: In this paper, a hybrid memory may include a package substrate, a hybrid memory buffer chip attached to a first side of the package substrate, and one or more memory tiles vertically stacked on the hybrid memory buffer.
Abstract: Embodiments of the invention are generally directed to systems, methods, and apparatuses for hybrid memory. In one embodiment, a hybrid memory may include a package substrate. The hybrid memory may also include a hybrid memory buffer chip attached to a first side of the package substrate, with high-speed input/output (HSIO) logic supporting an HSIO interface with a processor. The hybrid memory also includes packet processing logic to support a packet processing protocol on the HSIO interface. Additionally, the hybrid memory has one or more memory tiles that are vertically stacked on the hybrid memory buffer.

Proceedings ArticleDOI
13 Apr 2015
TL;DR: This work proposes a DRAM memory controller that meets the often conflicting requirements of tightly bounded worst-case latency for critical tasks and high performance for non-critical real-time tasks by using bank-aware address mapping and DRAM command-level priority-based scheduling with preemption.
Abstract: Mixed-criticality systems have tasks with different criticality levels running on the same hardware platform. Today’s DRAM controllers cannot adequately satisfy the often conflicting requirements of tightly bounded worst-case latency for critical tasks and high performance for non-critical real-time tasks. We propose a DRAM memory controller that meets these requirements by using bank-aware address mapping and DRAM command-level priority-based scheduling with preemption. Many standard DRAM controllers can be extended with our approach, incurring no performance penalty when critical tasks are not generating DRAM requests. Our approach is evaluated by replaying memory traces obtained from executing benchmarks on an ARM ISA-based processor with caches, which is simulated on the gem5 architecture simulator. We compare our approach against previous TDM-based approaches, showing that our proposed memory controller achieves dramatically higher performance for non-critical tasks, without any significant impact on the worst-case latency of critical tasks.
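
One way to picture the bank-aware mapping: critical tasks get private banks, so their worst-case latency never depends on other tasks' row-buffer behavior, while non-critical traffic is hashed across the remaining banks. The bank count, the assignment, and the row granularity below are assumptions for the sketch.

NUM_BANKS = 8
CRITICAL_BANK = {"taskA": 0, "taskB": 1}       # assumed private banks

def bank_of(task: str, addr: int) -> int:
    if task in CRITICAL_BANK:
        return CRITICAL_BANK[task]             # private, interference-free
    shared = [b for b in range(NUM_BANKS) if b not in CRITICAL_BANK.values()]
    return shared[(addr >> 13) % len(shared)]  # 8 KB rows (assumed)

print(bank_of("taskA", 0x12345), bank_of("background", 0x12345))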

Patent
13 Aug 2015
TL;DR: In this paper, a non-volatile memory device may be placed in any one of multiple memory states in a write operation by controlling a current and a voltage applied to terminals of the NVRAM device.
Abstract: Disclosed are methods, systems and devices for operation of non-volatile memory devices. In one aspect, a non-volatile memory device may be placed in any one of multiple memory states in a write operation by controlling a current and a voltage applied to terminals of the non-volatile memory device. For example, a write operation may apply a programming signal having a particular current and a particular voltage across the terminals of the non-volatile memory device to place it in a particular memory state.

Patent
Jea Hyun, Robert Wood
09 Jan 2015
TL;DR: In this article, the authors present an on-die buffered non-volatile memory management method for buffered NVRAMs, where data can be copied from a first set of nonvolatile storage cells to a second set based on one or more attributes associated with the data.
Abstract: Apparatuses, systems, methods, and computer program products are disclosed for on-die buffered non-volatile memory management. A method includes storing data in a first set of non-volatile memory cells. A method includes determining whether to perform an error-correcting code (ECC) refresh for data to be copied from a first set of non-volatile memory cells to a second set of non-volatile memory cells based on one or more attributes associated with the data. A method includes storing data in a second set of non-volatile storage cells representing data using more bits per storage cell than a first set of non-volatile storage cells.

Proceedings ArticleDOI
01 Dec 2015
TL;DR: This paper presents an FPGA implementation of the classic PageRank algorithm that dramatically reduces the number of random memory accesses and improves the execution time by at least 70%.
Abstract: Recently, FPGA implementations of graph algorithms arising in many areas such as social networks have been studied. However, the irregular memory access pattern of graph algorithms makes obtaining high performance challenging. In this paper, we present an FPGA implementation of the classic PageRank algorithm. Our goal is to optimize the overall system performance, especially the cost of accessing the off-chip DRAM. We optimize the data layout so that most memory accesses to the DRAM are sequential. Post-place-and-route results show that our design on a state-of-the-art FPGA can achieve a high clock rate of over 200 MHz. Based on a realistic DRAM access model, we build a simulator to estimate the execution time including memory access overheads. The simulation results show that our design achieves at least 96% of the theoretically best performance of the target platform. Compared with a baseline design, our optimized design dramatically reduces the number of random memory accesses and improves the execution time by at least 70%.
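
The layout optimization is the interesting part: storing edges contiguously in CSR form turns each PageRank iteration into a sequential sweep of DRAM rather than pointer chasing. The pure-Python toy below shows the access pattern only; it is not the paper's FPGA design.

import numpy as np

def pagerank_csr(row_ptr, col_idx, n, d=0.85, iters=20):
    """PageRank over a CSR graph; col_idx is scanned sequentially."""
    rank = np.full(n, 1.0 / n)
    out_deg = np.diff(row_ptr)
    for _ in range(iters):
        contrib = np.where(out_deg > 0, rank / np.maximum(out_deg, 1), 0.0)
        new = np.zeros(n)
        for u in range(n):                   # sequential edge-array sweep
            new[col_idx[row_ptr[u]:row_ptr[u + 1]]] += contrib[u]
        rank = (1 - d) / n + d * new
    return rank

# tiny example: edges 0->1, 0->2, 1->2, 2->0
row_ptr = np.array([0, 2, 3, 4])
col_idx = np.array([1, 2, 2, 0])
print(pagerank_csr(row_ptr, col_idx, 3))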