
Showing papers on "Registered memory" published in 2014


Proceedings ArticleDOI
23 Mar 2014
TL;DR: A number of key elements necessary in realizing efficient NDC operation are described and evaluated, including low-EPI cores, long daisy chains of memory devices, and the dynamic activation of cores and SerDes links.
Abstract: While Processing-in-Memory has been investigated for decades, it has not been embraced commercially. A number of emerging technologies have renewed interest in this topic. In particular, the emergence of 3D stacking and the imminent release of Micron's Hybrid Memory Cube device have made it more practical to move computation near memory. However, the literature is missing a detailed analysis of a killer application that can leverage a Near Data Computing (NDC) architecture. This paper focuses on in-memory MapReduce workloads that are commercially important and are especially suitable for NDC because of their embarrassing parallelism and largely localized memory accesses. The NDC architecture incorporates several simple processing cores on a separate, non-memory die in a 3D-stacked memory package; these cores can perform Map operations with efficient memory access and without hitting the bandwidth wall. This paper describes and evaluates a number of key elements necessary in realizing efficient NDC operation: (i) low-EPI cores, (ii) long daisy chains of memory devices, (iii) the dynamic activation of cores and SerDes links. Compared to a baseline that is heavily optimized for MapReduce execution, the NDC design yields up to 15X reduction in execution time and 18X reduction in system energy.

263 citations


Proceedings ArticleDOI
15 Apr 2014
TL;DR: PALLOC is a DRAM bank-aware memory allocator that exploits the page-based virtual memory system to allocate the memory pages of each application to specific banks, thereby improving isolation on COTS multicore platforms without requiring any special hardware support.
Abstract: DRAM consists of multiple resources called banks that can be accessed in parallel and independently maintain state information. In Commercial Off-The-Shelf (COTS) multicore platforms, banks are typically shared among all cores, even though programs running on the cores do not share memory space. In this situation, memory performance is highly unpredictable due to contention in the shared banks.
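As a rough illustration of the bank-aware idea in this abstract, the sketch below shows how an allocator can steer page frames to specific DRAM banks by inspecting physical-address bits. This is not PALLOC's implementation; the bank bit positions, the page pool, and the helper names (bank_of, pick_page) are assumptions chosen for the example.

```python
# Illustrative sketch (not PALLOC itself): a bank-aware allocator can steer
# pages to specific DRAM banks by inspecting physical-address bits.
# The bit positions below are assumptions; real platforms derive them from
# the memory controller's address mapping.

PAGE_SHIFT = 12          # 4 KiB pages
BANK_SHIFT = 13          # assumed: bank index starts at bit 13
BANK_BITS = 3            # assumed: 8 banks

def bank_of(phys_addr: int) -> int:
    """Return the DRAM bank index encoded in a physical address."""
    return (phys_addr >> BANK_SHIFT) & ((1 << BANK_BITS) - 1)

def pick_page(free_frames, allowed_banks):
    """Allocate only frames whose physical address maps to an allowed bank."""
    for frame in free_frames:
        if bank_of(frame << PAGE_SHIFT) in allowed_banks:
            free_frames.remove(frame)
            return frame
    return None  # no frame in this bank partition; caller must fall back

# Example: reserve banks {0, 1} for core 0's application.
pool = list(range(1024))
frame = pick_page(pool, allowed_banks={0, 1})
print(frame, bank_of(frame << PAGE_SHIFT))
```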

235 citations


Proceedings ArticleDOI
15 Apr 2014
TL;DR: This work presents techniques to provide a tight upper bound on the worst-case memory interference in a COTS-based multi-core system, and is the first work bounding the request re-ordering effect of COTS memory controllers.
Abstract: In commercial-off-the-shelf (COTS) multi-core systems, a task running on one core can be delayed by other tasks running simultaneously on other cores due to interference in the shared DRAM main memory. Such memory interference delay can be large and highly variable, thereby posing a significant challenge for the design of predictable real-time systems. In this paper, we present techniques to provide a tight upper bound on the worst-case memory interference in a COTS-based multi-core system. We explicitly model the major resources in the DRAM system, including banks, buses and the memory controller. By considering their timing characteristics, we analyze the worst-case memory interference delay imposed on a task by other tasks running in parallel. To the best of our knowledge, this is the first work bounding the request re-ordering effect of COTS memory controllers. Our work also enables the quantification of the extent by which memory interference can be reduced by partitioning DRAM banks. We evaluate our approach on a commodity multi-core platform running Linux/RK. Experimental results show that our approach provides an upper bound very close to our measured worst-case interference.
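The following toy calculation illustrates the flavor of a request-driven interference bound; it is deliberately coarse and is not the paper's analysis. It assumes each request of the task under analysis can wait behind at most one request per competing core plus a bounded number of reordered requests, all served at a worst-case row-conflict latency, and all timing constants are placeholders.

```python
# A deliberately coarse, illustrative interference bound, NOT the paper's
# model. Timing values (in memory-controller cycles) are placeholders.

T_RC = 34            # assumed worst-case cost of one row-conflict access
N_CORES = 4          # cores sharing the memory controller
REORDER_WINDOW = 12  # assumed max requests promoted ahead by reordering

def per_request_delay_bound() -> int:
    # Each request may wait behind one pending request from every other
    # core, plus requests moved ahead of it by the controller's reordering
    # (e.g., FR-FCFS row-hit prioritization).
    return (N_CORES - 1) * T_RC + REORDER_WINDOW * T_RC

def task_interference_bound(num_requests: int) -> int:
    return num_requests * per_request_delay_bound()

print(task_interference_bound(1000))
```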

234 citations


Patent
08 Jan 2014
TL;DR: In this article, a data-storage method specifies a first over-provisioning overhead at a first time for a memory comprising multiple memory blocks, and stores data while retaining in the blocks memory areas that do not hold valid data.
Abstract: A method for data storage includes, in a memory that includes multiple memory blocks, specifying at a first time a first over-provisioning overhead, and storing data in the memory while retaining in the memory blocks memory areas, which do not hold valid data and whose aggregated size is at least commensurate with the specified first over-provisioning overhead. Portions of the data from one or more previously-programmed memory blocks containing one or more of the retained memory areas are compacted. At a second time subsequent to the first time, a second over-provisioning overhead, different from the first over-provisioning overhead, is specified, and data storage and data portion compaction is continued while complying with the second over-provisioning overhead.
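A minimal sketch of the idea summarized above, under assumed data structures (it is not the patented method): free space is kept at or above an over-provisioning target by compacting blocks that hold invalid data, and the target itself can be changed at a later time.

```python
# Toy model: blocks hold valid, invalid, and free pages. Compaction moves a
# victim's valid pages elsewhere and erases it, reclaiming invalid space
# until the retained free area meets the current over-provisioning target.

class Block:
    def __init__(self, pages=64):
        self.pages = pages
        self.valid = 0
        self.invalid = 0

    @property
    def free(self):
        return self.pages - self.valid - self.invalid

def free_fraction(blocks):
    return sum(b.free for b in blocks) / sum(b.pages for b in blocks)

def compact(blocks, op_target):
    while free_fraction(blocks) < op_target:
        victim = max(blocks, key=lambda b: b.invalid)
        if victim.invalid == 0:
            break                                   # nothing left to reclaim
        # Destination is assumed to have room for the relocated valid pages.
        dest = max((b for b in blocks if b is not victim), key=lambda b: b.free)
        moved = victim.valid
        victim.valid = victim.invalid = 0           # erase the victim block
        dest.valid += moved                         # relocated valid data

blocks = [Block() for _ in range(16)]
for b in blocks[:12]:
    b.valid, b.invalid = 30, 30
compact(blocks, op_target=0.5)    # first over-provisioning setting
compact(blocks, op_target=0.6)    # later, a different setting is specified
print(round(free_fraction(blocks), 2))
```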

227 citations


Patent
Cory J. Reche
03 Feb 2014
TL;DR: In this paper, the authors describe methods for storing the firmware and other data of a flash memory controller, such as using a RAID configuration across multiple flash memory devices or portions of a single memory device.
Abstract: Systems and methods are disclosed for storing the firmware and other data of a flash memory controller, such as using a RAID configuration across multiple flash memory devices or portions of a single memory device. In various embodiments, the firmware and other data used by a controller, and error correction information, such as parity information for RAID configuration, may be stored across multiple flash memory devices, multiple planes of a multi-plane flash memory device, or across multiple blocks or pages of a single flash memory device. The controller may detect the failure of a memory device or a portion thereof, and reconstruct the firmware and/or other data from the other memory devices or portions thereof.
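The sketch below illustrates the general idea of protecting controller firmware with parity across devices, using plain XOR parity as in RAID; the chunk contents and layout are hypothetical, and the patent's specific scheme may differ.

```python
# XOR-parity sketch: firmware is striped across flash devices plus a parity
# stripe, so the chunk on any single failed device can be reconstructed.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(stripes):
    parity = bytes(len(stripes[0]))
    for s in stripes:
        parity = xor_bytes(parity, s)
    return parity

def reconstruct(surviving_stripes, parity):
    data = parity
    for s in surviving_stripes:
        data = xor_bytes(data, s)
    return data

firmware = [b"FW-PART-A", b"FW-PART-B", b"FW-PART-C"]   # hypothetical chunks
parity = make_parity(firmware)
# The device holding "FW-PART-B" fails; rebuild its chunk from the others.
recovered = reconstruct([firmware[0], firmware[2]], parity)
assert recovered == b"FW-PART-B"
print(recovered)
```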

151 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: This paper proposes CAMEO, a hardware-based Cache-like Memory Organization that not only makes stacked DRAM visible as part of the memory address space but also exploits data locality on a fine-grained basis and proposes a low overhead Line Location Table (LLT) that tracks the physical location of all data lines.
Abstract: This paper analyzes the trade-offs in architecting stacked DRAM either as part of main memory or as a hardware-managed cache. Using stacked DRAM as part of main memory increases the effective capacity, but obtaining high performance from such a system requires Operating System (OS) support to migrate data at a page granularity. Using stacked DRAM as a hardware cache has the advantages of being transparent to the OS and performing data management at a line granularity, but suffers from reduced main memory capacity. This is because the stacked DRAM cache is not part of the memory address space. Ideally, we want the stacked DRAM to contribute towards the capacity of main memory, and still maintain the hardware-based fine granularity of a cache. We propose CAMEO, a hardware-based Cache-like Memory Organization that not only makes stacked DRAM visible as part of the memory address space but also exploits data locality on a fine-grained basis. CAMEO retains recently accessed data lines in stacked DRAM and swaps out the victim line to off-chip memory. Since CAMEO can change the physical location of a line dynamically, we propose a low-overhead Line Location Table (LLT) that tracks the physical location of all data lines. We also propose an accurate Line Location Predictor (LLP) to avoid the serialization of the LLT look-up and memory access. We evaluate a system that has 4GB stacked memory and 12GB off-chip memory. Using stacked DRAM as a cache improves performance by 50%, using it as part of main memory improves performance by 33%, whereas CAMEO improves performance by 78%. Our proposed design is very close to an idealized memory system that uses the 4GB stacked DRAM as a hardware-managed cache and also increases the main memory capacity by an additional 4GB.
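A minimal sketch of the CAMEO organization as described in the abstract, with toy sizes (a real system tracks cache-line-sized entries at a 1:3 stacked-to-off-chip ratio): the Line Location Table records where each line of a congruence group currently lives, and a request served from off-chip memory swaps its line into stacked DRAM.

```python
# Toy Line Location Table (LLT): for each congruence group, slot 0 is the
# stacked-DRAM location and slots 1..3 are off-chip locations. The LLT maps
# a logical line id to whichever physical slot currently holds it.

STACKED_LINES = 4          # toy number of congruence groups
WAYS = 4                   # lines per group (1 stacked + 3 off-chip)

llt = {g: list(range(g * WAYS, g * WAYS + WAYS)) for g in range(STACKED_LINES)}

def access(line_id: int) -> str:
    slots = llt[line_id // WAYS]
    slot = slots.index(line_id)            # LLT lookup: where is the line?
    if slot == 0:
        return "hit in stacked DRAM"
    # Off-chip hit: swap the requested line into stacked DRAM; the previous
    # stacked-DRAM resident becomes the victim and moves off-chip.
    slots[0], slots[slot] = slots[slot], slots[0]
    return "served from off-chip, line migrated to stacked DRAM"

print(access(5))   # off-chip at first
print(access(5))   # now resident in stacked DRAM
```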

145 citations


Proceedings ArticleDOI
01 Oct 2014
TL;DR: This paper introduces a new mechanism, called Loose-Ordering Consistency (LOC), that satisfies the ordering requirements of persistent memory writes at significantly lower performance degradation than state-of-the-art mechanisms.
Abstract: Emerging non-volatile memory (NVM) technologies enable data persistence at the main memory level at access speeds close to DRAM. In such persistent memories, memory writes need to be performed in strict order to satisfy storage consistency requirements and enable correct recovery from system crashes. Unfortunately, adhering to a strict order for writes to persistent memory significantly degrades system performance as it requires flushing dirty data blocks from CPU caches and waiting for their completion at the main memory in the order specified by the program. This paper introduces a new mechanism, called Loose-Ordering Consistency (LOC), that satisfies the ordering requirements of persistent memory writes at significantly lower performance degradation than state-of-the-art mechanisms. LOC consists of two key techniques. First, Eager Commit reduces the commit overhead for writes within a transaction by eliminating the need to perform a persistent commit record write at the end of a transaction. We do so by ensuring that we can determine the status of all committed transactions during recovery by storing necessary metadata information statically with blocks of data written to memory. Second, Speculative Persistence relaxes the ordering of writes between transactions by allowing writes to be speculatively written to persistent memory. A speculative write is made visible to software only after its associated transaction commits. To enable this, our mechanism requires the tracking of committed transaction IDs and support for multi-versioning in the CPU cache. Our evaluations show that LOC reduces the average performance overhead of strict write ordering from 66.9% to 34.9% on a variety of workloads.
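To see why strict write ordering is expensive, the toy latency model below compares fully serialized flushes with overlapped intra-transaction flushes. The timing constants are assumptions chosen for illustration; this is not LOC's mechanism, only the motivation for relaxing ordering within a transaction.

```python
# Toy latency model (assumed timings, not LOC itself): strict ordering
# serializes every persistent write, while relaxed intra-transaction
# ordering overlaps the flushes and waits only once at commit time.

FLUSH_LATENCY_NS = 300      # assumed cache-line flush-to-memory latency
WRITES_PER_TX = 8

def strict_ordering_ns() -> int:
    # write -> flush -> wait, serialized for every write in the transaction
    return WRITES_PER_TX * FLUSH_LATENCY_NS

def relaxed_intra_tx_ns() -> int:
    # Flushes of the same transaction are issued back-to-back and overlap;
    # the transaction waits only for the last one before committing.
    ISSUE_GAP_NS = 20       # assumed back-to-back issue interval
    return (WRITES_PER_TX - 1) * ISSUE_GAP_NS + FLUSH_LATENCY_NS

print(strict_ordering_ns(), relaxed_intra_tx_ns())
```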

145 citations


Journal ArticleDOI
TL;DR: A new page replacement algorithm called CLOCK-DWF (CLOCK with Dirty bits and Write Frequency) is presented that significantly reduces the number of write operations that occur on PCM and also increases the lifespan of PCM memory.
Abstract: Phase change memory (PCM) has emerged as one of the most promising technologies to incorporate into the memory hierarchy of future computer systems. However, PCM has two critical weaknesses to substitute DRAM memory in its entirety. First, the number of write operations allowed to each PCM cell is limited. Second, write access time of PCM is about 6–10 times slower than that of DRAM. To cope with this situation, hybrid memory architectures that use a small amount of DRAM together with PCM have been suggested. In this paper, we present a new memory management technique for hybrid PCM and DRAM memory architecture that efficiently hides the slow write performance of PCM. Specifically, we aim to estimate future write references accurately and then absorb frequent memory writes into DRAM. To do this, we analyze the characteristics of memory write references and find two noticeable phenomena. First, using write history alone performs better than using both read and write history in estimating future write references. Second, the frequency characteristic is a better estimator than temporal locality in predicting future memory writes. Based on these two observations, we present a new page replacement algorithm called CLOCK-DWF (CLOCK with Dirty bits and Write Frequency) that significantly reduces the number of write operations that occur on PCM and also increases the lifespan of PCM memory.
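The sketch below is a simplified CLOCK-style policy motivated by the description above, not the exact CLOCK-DWF algorithm: DRAM frames keep a write-frequency counter, the clock hand grants decaying second chances, and write-cold pages are demoted to PCM so that PCM absorbs few writes. The frame count and the workload are toy values.

```python
# Simplified hybrid DRAM/PCM write policy inspired by the abstract above.
# Not the exact CLOCK-DWF algorithm: just write-frequency counters plus a
# CLOCK-like second-chance sweep over a small DRAM region.

from collections import OrderedDict

DRAM_FRAMES = 4

class HybridMemory:
    def __init__(self):
        self.dram = OrderedDict()          # page -> write-frequency counter

    def write(self, page):
        if page in self.dram:
            self.dram[page] += 1           # write hit is absorbed by DRAM
            return
        self._admit(page)                  # new write-hot candidate

    def _admit(self, page):
        while len(self.dram) >= DRAM_FRAMES:
            victim, freq = next(iter(self.dram.items()))
            if freq == 0:
                self.dram.pop(victim)      # write-cold: demote page to PCM
            else:
                self.dram[victim] = freq - 1     # decaying second chance
                self.dram.move_to_end(victim)
        self.dram[page] = 1

mem = HybridMemory()
for p in [1, 2, 1, 1, 3, 4, 5, 1]:
    mem.write(p)
print(dict(mem.dram))   # page 1 (write-hot) survives in DRAM
```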

143 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: System-level simulations incorporating various deterministic errors from the analog signal chain demonstrate that the limited accuracy of analog processing does not significantly degrade system performance; the probability of pattern detection is minimally impacted.
Abstract: In this paper, we propose the concept of compute memory, where computation is deeply embedded into the memory (SRAM). This deep embedding enables multi-row read access and analog signal processing. Compute memory exploits the relaxed precision and linearity requirements of pattern recognition applications. System-level simulations incorporating various deterministic errors from the analog signal chain demonstrate that the limited accuracy of analog processing does not significantly degrade system performance, i.e., the probability of pattern detection is minimally impacted. The estimated energy saving is 63% compared to a conventional system with standard embedded memory and a parallel processing architecture, for a 256×256 target image.

140 citations


Journal ArticleDOI
14 Jun 2014
TL;DR: A novel memory architecture, Half-DRAM, is proposed, in which the DRAM array is reorganized so that only half of a row is activated; it achieves both significant performance improvement and power reduction with negligible design overhead.
Abstract: DRAM memory is a major contributor to the total power consumption of modern computing systems. Consequently, power reduction for DRAM memory is critical to improve system-level power efficiency. Fine-grained DRAM architectures [1, 2] have been proposed to reduce activation/precharge power. However, that prior work either incurs significant performance degradation or introduces large area overhead. In this paper, we propose a novel memory architecture, Half-DRAM, in which the DRAM array is reorganized to enable only half of a row to be activated. The half-row activation can effectively reduce activation power while sustaining the full bandwidth one bank can provide. In addition, the half-row activation in Half-DRAM relaxes the power constraint in DRAM, and opens up opportunities for further performance gain. Furthermore, two half-row accesses can be issued in parallel by integrating sub-array-level parallelism to improve memory-level parallelism. The experimental results show that Half-DRAM can achieve both significant performance improvement and power reduction, with negligible design overhead.

123 citations


Proceedings ArticleDOI
17 Feb 2014
TL;DR: The RAMCloud storage system implements a unified log-structured mechanism both for active information in memory and for backup data on disk; its two-level cleaning policy conserves disk bandwidth and improves performance by up to 6× at high memory utilization.
Abstract: Traditional memory allocation mechanisms are not suitable for new DRAM-based storage systems because they use memory inefficiently, particularly under changing access patterns. In contrast, a log-structured approach to memory management allows 80-90% memory utilization while offering high performance. The RAMCloud storage system implements a unified log-structured mechanism both for active information in memory and backup data on disk. The RAMCloud implementation of log-structured memory uses a two-level cleaning policy, which conserves disk bandwidth and improves performance up to 6× at high memory utilization. The cleaner runs concurrently with normal operations and employs multiple threads to hide most of the cost of cleaning.
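As background for the cleaning discussion, the sketch below shows the classic cost-benefit segment-selection rule used by log-structured cleaners in general; RAMCloud's actual two-level policy differs, and the segment statistics here are made up.

```python
# Classic log-structured-cleaning heuristic: prefer segments with little
# live data and old age, since cleaning them reclaims the most space per
# byte copied. Segment statistics below are illustrative only.

def cost_benefit(live_fraction: float, age: float) -> float:
    # benefit/cost = (space reclaimed * stability) / bytes copied
    if live_fraction >= 1.0:
        return 0.0
    return (1.0 - live_fraction) * age / (1.0 + live_fraction)

segments = [
    {"id": 0, "live": 0.90, "age": 100},
    {"id": 1, "live": 0.30, "age": 10},
    {"id": 2, "live": 0.50, "age": 200},
]
victim = max(segments, key=lambda s: cost_benefit(s["live"], s["age"]))
print("clean segment", victim["id"])
```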

Proceedings ArticleDOI
13 Dec 2014
TL;DR: The goal in this paper is to design a fair and high-performance memory control scheme for a persistent memory based system that runs both persistent and non-persistent applications, and detailed evaluations show that FIRM provides significantly higher system performance and fairness.
Abstract: Byte-addressable nonvolatile memories promise a new technology, persistent memory, which incorporates desirable attributes from both traditional main memory (byte-addressability and fast interface) and traditional storage (data persistence). To support data persistence, a persistent memory system requires sophisticated data duplication and ordering control for write requests. As a result, applications that manipulate persistent memory (persistent applications) have very different memory access characteristics than traditional (non-persistent) applications, as shown in this paper. Persistent applications introduce heavy write traffic to contiguous memory regions at a memory channel, which cannot concurrently service read and write requests, leading to memory bandwidth underutilization due to low bank-level parallelism, frequent write queue drains, and frequent bus turnarounds between reads and writes. These characteristics undermine the high-performance and fairness offered by conventional memory scheduling schemes designed for non-persistent applications. Our goal in this paper is to design a fair and high-performance memory control scheme for a persistent memory based system that runs both persistent and non-persistent applications. Our proposal, FIRM, consists of three key ideas. First, FIRM categorizes request sources as non-intensive, streaming, random and persistent, and forms batches of requests for each source. Second, FIRM strides persistent memory updates across multiple banks, thereby improving bank-level parallelism and hence memory bandwidth utilization of persistent memory accesses. Third, FIRM schedules read and write request batches from different sources in a manner that minimizes bus turnarounds and write queue drains. Our detailed evaluations show that, compared to five previous memory scheduler designs, FIRM provides significantly higher system performance and fairness.
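The sketch below illustrates the bank-striding idea attributed to FIRM in the abstract, under an assumed row size, bank count, and stride: a sequential persistent-memory update that would serialize in one bank is spread across all banks.

```python
# Illustration of striding persistent updates across banks (assumed
# parameters, not FIRM's implementation).

NUM_BANKS = 8
ROW_SIZE = 8 * 1024          # assumed 8 KiB DRAM/NVM rows

def naive_bank(addr: int) -> int:
    # Row-interleaved mapping: a sequential persistent write burst stays in
    # a single bank until it crosses a row boundary.
    return (addr // ROW_SIZE) % NUM_BANKS

def strided_bank(addr: int, stride: int = 256) -> int:
    # Stride persistent updates at a finer granularity across banks.
    return (addr // stride) % NUM_BANKS

burst = list(range(0, 4096, 256))        # a 4 KiB persistent update
print({naive_bank(a) for a in burst})    # {0}: one bank serializes the burst
print({strided_bank(a) for a in burst})  # {0..7}: spread over all banks
```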

Proceedings ArticleDOI
24 Feb 2014
TL;DR: A novel unified working memory and persistent store architecture, NVM Duet, which provides the required consistency and durability guarantees for persistent store while relaxing these constraints if accesses to NVM are for working memory is proposed.
Abstract: Emerging non-volatile memory (NVM) technologies have gained a lot of attention recently. The byte-addressability and high density of NVM enable computer architects to build large-scale main memory systems. NVM has also been shown to be a promising alternative to conventional persistent store. With NVM, programmers can persistently retain in-memory data structures without writing them to disk. Therefore, one can envision that in the future, NVM will play the role of both working memory and persistent store at the same time. Persistent store demands consistency and durability guarantees, thereby imposing new design constraints on the memory system. Consistency is achieved at the expense of serializing multiple write operations. Durability requires memory cells to guarantee non-volatility and thus reduces the write speed. Therefore, a unified architecture oblivious to these two use cases would lead to suboptimal design. In this paper, we propose a novel unified working memory and persistent store architecture, NVM Duet, which provides the required consistency and durability guarantees for persistent store while relaxing these constraints if accesses to NVM are for working memory. A cross-layer design approach is adopted to achieve the design goal. Overall, simulation results demonstrate that NVM Duet achieves up to 1.68x (1.32x on average) speedup compared with the baseline design.

Proceedings ArticleDOI
04 Dec 2014
TL;DR: The Blacklisting Memory Scheduler (BLISS) achieves high system performance and fairness while incurring low hardware cost and complexity, and is evaluated across a wide variety of workloads and system configurations.
Abstract: In a multicore system, applications running on different cores interfere at main memory. This inter-application interference degrades overall system performance and unfairly slows down applications. Prior works have developed application-aware memory request schedulers to tackle this problem. State-of-the-art application-aware memory request schedulers prioritize memory requests of applications that are vulnerable to interference, by ranking individual applications based on their memory access characteristics and enforcing a total rank order.
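A sketch of a blacklisting-style scheduler in the spirit of the BLISS summary above; the threshold, the (omitted) periodic clearing of the blacklist, and the queue layout are assumptions, not the evaluated design.

```python
# Blacklisting-style scheduling sketch: applications that stream many
# consecutive requests are blacklisted and deprioritized, avoiding a full
# per-application ranking. Threshold value is an assumption.

BLACKLIST_THRESHOLD = 4      # consecutive requests before blacklisting

class BlacklistScheduler:
    def __init__(self):
        self.blacklist = set()
        self.last_app = None
        self.streak = 0

    def pick(self, queue):
        # Prefer requests from non-blacklisted applications, oldest first.
        for i, (app, _req) in enumerate(queue):
            if app not in self.blacklist:
                return self._serve(queue, i)
        return self._serve(queue, 0)       # everyone blacklisted: FCFS

    def _serve(self, queue, i):
        app, req = queue.pop(i)
        self.streak = self.streak + 1 if app == self.last_app else 1
        self.last_app = app
        if self.streak >= BLACKLIST_THRESHOLD:
            self.blacklist.add(app)        # cleared periodically in BLISS
        return app, req

sched = BlacklistScheduler()
q = [("A", n) for n in range(6)] + [("B", 0)]
order = [sched.pick(q)[0] for _ in range(7)]
print(order)   # "B" is served ahead of A's remaining streaming requests
```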

Proceedings ArticleDOI
13 Dec 2014
TL;DR: This paper proposes a practical, low-cost architectural solution to efficiently enable using large fast memory as Part-of-Memory (PoM) seamlessly, without the involvement of the OS.
Abstract: Recent technology advancements allow for the integration of large memory structures on-die or as a die-stacked DRAM. Such structures provide higher bandwidth and faster access time than off-chip memory. Prior work has investigated using the large integrated memory as a cache, or using it as part of a heterogeneous memory system under management of the OS. Using this memory as a cache would waste a large fraction of total memory space, especially for the systems where stacked memory could be as large as off-chip memory. An OS managed heterogeneous memory system, on the other hand, requires costly usage-monitoring hardware to migrate frequently-used pages, and is often unable to capture pages that are highly utilized for short periods of time. This paper proposes a practical, low-cost architectural solution to efficiently enable using large fast memory as Part-of-Memory (PoM) seamlessly, without the involvement of the OS. Our PoM architecture effectively manages two different types of memory (slow and fast) combined to create a single physical address space. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits. Our proposed PoM architecture improves performance by 18.4% over static mapping and by 10.5% over an ideal OS-based dynamic remapping policy.

Proceedings ArticleDOI
19 Jun 2014
TL;DR: This paper designs a highly simple compressed memory architecture that focuses on complexity, energy, bandwidth, and reliability, and relies on rank subsetting and a careful placement of compressed data and metadata to achieve these benefits.
Abstract: Memory compression has been proposed and deployed in the past to grow the capacity of a memory system and reduce page fault rates. Compression also has secondary benefits: it can reduce energy and bandwidth demands. However, most prior mechanisms have been designed to focus on the capacity metric and few prior works have attempted to explicitly reduce energy or bandwidth. Further, mechanisms that focus on the capacity metric also require complex logic to locate the requested data in memory. In this paper, we design a highly simple compressed memory architecture that does not target the capacity metric. Instead, it focuses on complexity, energy, bandwidth, and reliability. It relies on rank subsetting and a careful placement of compressed data and metadata to achieve these benefits. Further, the space made available via compression is used to boost other metrics - the space can be used to implement stronger error correction codes or energy-efficient data encodings. The best performing MemZip configuration yields a 45% performance improvement and 57% memory energy reduction, compared to an uncompressed non-sub-ranked baseline. Another energy-optimized configuration yields a 29.8% performance improvement and a 79% memory energy reduction, relative to the same baseline.

Proceedings ArticleDOI
19 Jun 2014
TL;DR: A protection scheme that eliminates interference across security domains through two main changes: a per-security-domain queueing structure and static allocation of time slots in the scheduling algorithm.
Abstract: This paper proposes a new memory controller design that enables secure sharing of main memory among mutually mistrusting parties by eliminating memory timing channels. This study demonstrates that shared memory controllers are vulnerable to both side channel and covert channel attacks that exploit memory interference as timing channels. To address this vulnerability, we identify the sources of interference in a conventional memory controller design, and propose a protection scheme to eliminate the interference across security domains through two main changes: (i) a per security domain based queueing structure, and (ii) static allocation of time slots in the scheduling algorithm. Multi-programmed workloads comprised of SPEC2006 benchmarks were used to evaluate the protection scheme. The results show that the proposed scheme completely eliminates the timing channels in the shared memory with small hardware and performance overheads.
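The sketch below models the two changes named in the abstract, per-domain queues and statically allocated time slots, with an assumed slot length and domain count; it is a model of the idea, not the proposed hardware.

```python
# Per-domain queues plus static time-division scheduling: a request's
# service time never depends on another domain's activity. Slot length and
# domain count are assumptions.

NUM_DOMAINS = 2
SLOT_CYCLES = 16

queues = {d: [] for d in range(NUM_DOMAINS)}    # per-security-domain queues

def scheduled_domain(cycle: int) -> int:
    # Static round-robin allocation of time slots, independent of load.
    return (cycle // SLOT_CYCLES) % NUM_DOMAINS

def tick(cycle: int):
    d = scheduled_domain(cycle)
    if queues[d]:
        return d, queues[d].pop(0)   # serve only the owning domain's queue
    return d, None                   # slot left idle: no cross-domain leak

queues[0] = ["req-a", "req-b"]
queues[1] = ["req-x"]
for c in range(0, 48, 16):
    print(c, tick(c))
```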

Proceedings ArticleDOI
01 Aug 2014
TL;DR: This article consists of a collection of slides from the author's conference presentation on the special features, system design and architectures, memory management and support for processors, and targeted markets for SK Hynix Inc.'s HBM, next generation memory family of products.
Abstract: This article consists of a collection of slides from the author's conference presentation on the special features, system design and architectures, memory management and support for processors, and targeted markets for SK Hynix Inc.'s HBM, next generation memory family of products.

Patent
03 Oct 2014
TL;DR: In this article, the interleaver identifies locations in a memory space supported by the memory channels and is responsive to logic that defines virtual sectors having a desired storage capacity, and accesses the asymmetric storage capacity uniformly across the virtual sectors in response to requests to access the memory space.
Abstract: Systems and methods for uniformly interleaving memory accesses across physical channels of a memory space with a non-uniform storage capacity across the physical channels are disclosed. An interleaver is arranged in communication with one or more processors and a system memory. The interleaver identifies locations in a memory space supported by the memory channels and is responsive to logic that defines virtual sectors having a desired storage capacity. The interleaver accesses the asymmetric storage capacity uniformly across the virtual sectors in response to requests to access the memory space.
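A toy routing function illustrating the virtual-sector idea described above, with assumed channel sizes and interleave granularity (not the patented scheme): the symmetric portion of an asymmetric memory space interleaves across both channels, and the remainder maps to the larger channel only.

```python
# Assumed asymmetric configuration: a 3 GB channel plus a 1 GB channel.
# Virtual sector 0 interleaves over both channels; virtual sector 1 is
# backed only by the larger channel's remaining capacity.

GB = 1 << 30

def route(addr: int):
    if addr < 2 * GB:                       # sector 0: 2-way interleave
        channel = "ch0" if (addr // 256) % 2 == 0 else "ch1"
        return channel, "sector0"
    return "ch0", "sector1"                 # sector 1: ch0 only

for a in (0, 256, 512, 2 * GB + 4096):
    print(hex(a), route(a))
```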

Proceedings ArticleDOI
16 Nov 2014
TL;DR: This work proposes memory scheduling mechanisms that avoid inter-warp interference in the DRAM system to reduce the average memory stall latency experienced by warps and shows that carefully orchestrating the memory scheduling policy can achieve low average latency for warps, without compromising bandwidth utilization.
Abstract: Memory controllers in modern GPUs aggressively reorder requests for high bandwidth usage, often interleaving requests from different warps. This leads to high variance in the latency of different requests issued by the threads of a warp. Since a warp in a SIMT architecture can proceed only when all of its memory requests are returned by memory, such latency divergence causes significant slowdown when running irregular GPGPU applications. To solve this issue, we propose memory scheduling mechanisms that avoid inter-warp interference in the DRAM system to reduce the average memory stall latency experienced by warps. We further reduce latency divergence through mechanisms that coordinate scheduling decisions across multiple independent memory channels. Finally we show that carefully orchestrating the memory scheduling policy can achieve low average latency for warps, without compromising bandwidth utilization. Our combined scheme yields a 10.1% performance improvement for irregular GPGPU workloads relative to a throughput-optimized GPU memory controller.
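The sketch below shows the core intuition of warp-aware scheduling stated above, draining one warp's requests before another's; the data structures and arrival trace are assumptions, and the real scheduler also balances row locality and bandwidth utilization.

```python
# Group pending requests by warp and serve each warp's requests together,
# so a warp's slowest request is not delayed by unrelated traffic.

from collections import defaultdict, deque

def warp_aware_order(requests):
    """requests: list of (warp_id, address) in arrival order."""
    per_warp = defaultdict(deque)
    warp_order = []
    for warp, addr in requests:
        if warp not in per_warp:
            warp_order.append(warp)
        per_warp[warp].append(addr)
    schedule = []
    for warp in warp_order:                 # serve warps one at a time
        schedule.extend((warp, a) for a in per_warp[warp])
    return schedule

arrivals = [(0, 0x100), (1, 0x900), (0, 0x140), (2, 0x500), (1, 0x940)]
print(warp_aware_order(arrivals))
```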

Proceedings ArticleDOI
19 Jun 2014
TL;DR: A comprehensive approach, DBP-TCM, which integrates Dynamic Bank Partitioning with Thread Cluster Memory scheduling (TCM, one of the best-performing memory scheduling policies), is presented to further improve system performance and fairness.
Abstract: Applications running concurrently in CMP systems interfere with each other at DRAM memory, leading to poor system performance and fairness. Memory access scheduling reorders memory requests to improve system throughput and fairness. However, it cannot resolve the interference issue effectively. To reduce interference, memory partitioning divides memory resource among threads. Memory channel partitioning maps the data of threads that are likely to severely interfere with each other to different channels. However, it allocates memory resource unfairly and physically exacerbates memory contention of intensive threads, thus ultimately resulting in the increased slowdown of these threads and high system unfairness. Bank partitioning divides memory banks among cores and eliminates interference. However, previous equal bank partitioning restricts the number of banks available to individual thread and reduces bank level parallelism.

Patent
Derek C. Tao, Bing Wang, Kuoyuan (Peter) Hsu, Jacklyn Chang, Young Suk Kim
16 May 2014
TL;DR: In this paper, a memory array includes a memory segment having at least one memory bank, where at least two first read tracking cells are disposed in a read tracking column of the at least one memory bank.
Abstract: A memory array includes a memory segment having at least one memory bank. The at least one memory bank includes an array of memory cells, and wherein at least two first read tracking cells are disposed in a read tracking column of the at least one memory bank. The memory array further includes a read tracking circuit coupled to the at least two first read tracking cells. Outputs of the at least two first read tracking cells are connected to a tracking bit connection line (TBCL). A tracking circuit connected to the TBCL is configured to output a tracking-cells output signal to generate a global tracking result signal to a memory control circuitry. The memory control circuitry is configured to reset a memory clock based on the global tracking result signal.

Proceedings ArticleDOI
19 Jun 2014
TL;DR: NUAT is a new memory controller focusing on reducing memory access latency without any modification of the existing DRAM structure and exploits DRAM's intrinsic phenomenon: electric charge variation in DRAM cell capacitors.
Abstract: With the rapid development of microprocessors, off-chip memory access has become a system bottleneck. DRAM, the main memory in most computers, has concentrated only on capacity and bandwidth for decades to achieve high-performance computing. However, DRAM access latency should also be considered to sustain this development trend in the multi-core era. Therefore, we propose NUAT, a new memory controller that focuses on reducing memory access latency without any modification of the existing DRAM structure. It only exploits DRAM's intrinsic phenomenon: electric charge variation in DRAM cell capacitors. Given the cost-sensitive DRAM market, this is a big advantage in terms of actual implementation. NUAT gives a score to every memory access request, and the request with the highest score obtains priority. For scoring, we introduce two new concepts: Partitioned Bank Rotation (PBR) and PBR Page Mode (PPM). First, PBR is a mechanism that derives access-speed information from refresh timing and position; a request with faster access speed gains a higher score. Second, PPM selects the better page mode between open- and close-page modes based on the information from PBR. Evaluations show that NUAT decreases memory access latency significantly in various environments.
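A toy version of the scoring intuition described above, with made-up numbers rather than NUAT's actual scoring function: rows refreshed more recently retain more charge and can be sensed faster, so their requests get higher scores and are served first.

```python
# Score pending requests by how recently their row was refreshed (assumed
# numbers; not NUAT's real scoring function).

REFRESH_INTERVAL_MS = 64.0

def score(time_since_refresh_ms: float) -> float:
    # More remaining charge (less elapsed time) -> higher score.
    return 1.0 - time_since_refresh_ms / REFRESH_INTERVAL_MS

pending = [
    {"req": "A", "since_refresh_ms": 60.0},
    {"req": "B", "since_refresh_ms": 5.0},
    {"req": "C", "since_refresh_ms": 30.0},
]
pending.sort(key=lambda r: score(r["since_refresh_ms"]), reverse=True)
print([r["req"] for r in pending])    # B first: its row was refreshed last
```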

Patent
11 Jul 2014
TL;DR: In this article, a method and system for providing dual memory access to a non-volatile memory device using extended memory addresses are disclosed, where the digital processing unit such as a central processing unit (CPU) is capable of accessing storage space in the NVRAM device in accordance with an extended memory address and offset.
Abstract: A method and system for providing dual memory access to a non-volatile memory device using extended memory addresses are disclosed. A digital processing system such as a computer includes a non-volatile memory device, a peripheral bus, and a digital processing unit. The non-volatile memory device, such as a solid state drive, can store data persistently. The peripheral bus, which can be a peripheral component interconnect express (“PCIe”) bus, is used to support memory access to the non-volatile memory device. The digital processing unit, such as a central processing unit (“CPU”), is capable of accessing storage space in the non-volatile memory device in accordance with an extended memory address and offset.

Journal ArticleDOI
TL;DR: This letter proposes a new heterogeneous scratch-pad memory architecture that is configured with SRAM, MRAM, and Z-RAM, along with two algorithms, a dynamic programming algorithm and a genetic algorithm, that allocate data to the different memory banks and thereby reduce memory access cost in terms of power consumption and latency.
Abstract: This letter aims at developing a new memory architecture to overcome the daunting memory wall and energy wall issues in multicore embedded systems. We propose a new heterogeneous scratch-pad memory (SPM) architecture that is configured with SRAM, MRAM, and Z-RAM. Based on this architecture, we propose two algorithms, a dynamic programming algorithm (MDPDA) and a genetic algorithm (AGADA), to allocate data to the different memory banks, thereby reducing memory access cost in terms of power consumption and latency. Extensive experiments are performed to show the merits of the hybrid SPM architecture and the effectiveness of the proposed algorithms.
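For illustration, the brute-force sketch below solves a tiny instance of the allocation problem the letter targets; the per-bank costs and capacities are invented, and the letter's own MDPDA and AGADA algorithms are meant for realistic inputs where brute force is infeasible.

```python
# Tiny brute-force data-to-bank allocation: place each object in SRAM,
# MRAM, or Z-RAM to minimize total access cost under per-bank capacities.
# All numbers are illustrative assumptions.

from itertools import product

BANKS = {"SRAM": 2, "MRAM": 2, "ZRAM": 2}             # capacity in objects
COST = {"SRAM": (1, 1), "MRAM": (2, 10), "ZRAM": (3, 3)}   # (read, write) cost
DATA = [("a", 100, 5), ("b", 10, 80), ("c", 50, 50), ("d", 5, 5)]  # (name, reads, writes)

def total_cost(assignment):
    return sum(r * COST[bank][0] + w * COST[bank][1]
               for (_name, r, w), bank in zip(DATA, assignment))

best = None
for assignment in product(BANKS, repeat=len(DATA)):
    if any(assignment.count(b) > cap for b, cap in BANKS.items()):
        continue                                       # capacity exceeded
    if best is None or total_cost(assignment) < total_cost(best):
        best = assignment

print(dict(zip((d[0] for d in DATA), best)))
```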

Proceedings ArticleDOI
09 Jun 2014
TL;DR: This paper demonstrates a Cu-based resistive random access memory (ReRAM) cell that meets the SCM performance specifications for a 16Gb ReRAM with 200MB/s write and 1GB/s read.
Abstract: Hybrid memory systems that incorporate Storage Class Memory (SCM) as non-volatile cache or DRAM data backup are expected to bolster system efficiency and cost because SCM promises higher density than DRAM cache and higher speed than the storage I/F. This paper demonstrates a Cu-based resistive random access memory (ReRAM) cell that meets the SCM performance specifications for a 16Gb ReRAM with 200MB/s write and 1GB/s read [1].

Patent
06 Mar 2014
TL;DR: In this paper, a memory controller receives a memory allocation request, which may include a logical memory address, and maps the logical memory address to an address in a memory region of the memory system based on thermal data for the memory regions of the system.
Abstract: Methods of mapping memory regions to processes based on thermal data of memory regions are described. In some embodiments, a memory controller may receive a memory allocation request. The memory allocation request may include a logical memory address. The method may further include mapping the logical memory address to an address in a memory region of the memory system based on thermal data for memory regions of the memory system. Additional methods and systems are also described.
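A minimal sketch of the mapping step described above, with hypothetical thermal readings and region names (not the patent's method): an allocation is steered to the coolest region that has room.

```python
# Map an allocation request to the coolest memory region with free space.
# Region names, temperatures, and sizes are hypothetical.

regions = [
    {"name": "rank0", "temp_c": 71.0, "free_pages": 120},
    {"name": "rank1", "temp_c": 55.5, "free_pages": 4},
    {"name": "rank2", "temp_c": 62.0, "free_pages": 900},
]

def map_allocation(logical_addr: int, pages_needed: int) -> str:
    candidates = [r for r in regions if r["free_pages"] >= pages_needed]
    target = min(candidates, key=lambda r: r["temp_c"])   # coolest region
    target["free_pages"] -= pages_needed
    return f"0x{logical_addr:x} -> {target['name']}"

print(map_allocation(0x7f000000, pages_needed=16))   # lands in rank2
```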

Patent
Paul A. Lassa
04 Jun 2014
TL;DR: In this paper, a storage device such as a solid state disk (SSD), a central controller communicates with a plurality of multi-chip memory packages, where the central controller may handle management of the virtual address space while the local processor in each MCP manages the storage of data within memory tiers in the memory dies of its respective MCP.
Abstract: In a storage device such as a solid state disk (SSD), a central controller communicates with a plurality of multi-chip memory packages. Each multi-chip memory package comprises a plurality of memory dies and a local processor, wherein the plurality of memory dies includes different memory tiers. The central controller may handle management of the virtual address space while the local processor in each MCP manages the storage of data within memory tiers in the memory dies of its respective MCP.

Patent
21 Aug 2014
TL;DR: In this paper, the authors describe a memory device with volatile and non-volatile media on a circuit board and a mapping module is configured to selectively store data in either the volatile memory medium or the nonvolatile memory medium.
Abstract: Apparatuses, systems, methods, and computer program products are disclosed for providing a memory device with volatile and non-volatile media. A volatile memory medium is on a circuit board configured to be installed on a memory bus of a processor. A non-volatile memory medium is on the same circuit board. A mapping module is configured to selectively store data in either the volatile memory medium or the non-volatile memory medium. The data is provided by way of one or more commands from the processor.

Patent
09 Apr 2014
TL;DR: In this paper, a collection of central processing units are connected to at least one other central processing unit and a root path into at least 10 Tera Bytes of solid state memory resources.
Abstract: A system includes a collection of central processing units, where each central processing unit is connected to at least one other central processing unit and a root path into at least 10 Tera Bytes of solid state memory resources. Each central processing unit directly accesses solid state memory resources without swapping solid state memory contents into main memory.