Proceedings ArticleDOI

Architecting Phase Change Memory as a Scalable DRAM Alternative

20 Jun 2009 - Vol. 37, Iss. 3, pp. 2-13
TL;DR: This work proposes area-neutral architectural enhancements, crafted from a fundamental understanding of PCM technology parameters, that address these limitations and make PCM competitive with DRAM.
Abstract: Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as DRAM. In contrast, phase change memory (PCM) storage relies on scalable current and thermal mechanisms. To exploit PCM's scalability as a DRAM alternative, PCM must be architected to address relatively long latencies, high energy writes, and finite endurance. We propose, crafted from a fundamental understanding of PCM technology parameters, area-neutral architectural enhancements that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. Partial writes enhance memory endurance, providing 5.6 years of lifetime. Process scaling will further reduce PCM energy costs and improve endurance.
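
The buffer reorganization and partial-write mechanisms described above are easy to illustrate. The sketch below is a minimal Python model, with hypothetical names and sizes (BufferedRow, a 4-byte word), of a buffered row that coalesces stores via per-word dirty bits and writes back only the modified words; it is not the paper's implementation, only the idea behind the reduced write energy and the endurance gain from partial writes.

```python
# Hypothetical model of a buffered PCM row with per-word dirty tracking.
class BufferedRow:
    def __init__(self, row_addr, data, word_size=4):
        self.row_addr = row_addr
        self.data = list(data)                           # row cached in the buffer
        self.word_size = word_size
        self.dirty = [False] * (len(data) // word_size)  # one dirty bit per word

    def write(self, offset, value):
        """Coalesce a store into the buffer; only its word is marked dirty."""
        self.data[offset] = value
        self.dirty[offset // self.word_size] = True

    def writeback(self):
        """Return only the dirty words; clean words never touch the array."""
        updates = []
        for w, is_dirty in enumerate(self.dirty):
            if is_dirty:
                start = w * self.word_size
                updates.append((w, self.data[start:start + self.word_size]))
        self.dirty = [False] * len(self.dirty)
        return updates


row = BufferedRow(row_addr=0x100, data=[0] * 16)
row.write(5, 0xAB)          # two stores to the same buffered row coalesce
row.write(6, 0xCD)
print(row.writeback())      # only word 1 (bytes 4..7) is written back
```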


Citations
Proceedings ArticleDOI
11 Oct 2009
TL;DR: A file system and a hardware architecture that are designed around the properties of persistent, byte-addressable memory, which provides strong reliability guarantees and offers better performance than traditional file systems, even when both are run on top of byte-addressable, persistent memory.
Abstract: Modern computer systems have been built around the assumption that persistent storage is accessed via a slow, block-based interface. However, new byte-addressable, persistent memory technologies such as phase change memory (PCM) offer fast, fine-grained access to persistent storage. In this paper, we present a file system and a hardware architecture that are designed around the properties of persistent, byte-addressable memory. Our file system, BPFS, uses a new technique called short-circuit shadow paging to provide atomic, fine-grained updates to persistent storage. As a result, BPFS provides strong reliability guarantees and offers better performance than traditional file systems, even when both are run on top of byte-addressable, persistent memory. Our hardware architecture enforces atomicity and ordering guarantees required by BPFS while still providing the performance benefits of the L1 and L2 caches. Since these memory technologies are not yet widely available, we evaluate BPFS on DRAM against NTFS on both a RAM disk and a traditional disk. Then, we use microarchitectural simulations to estimate the performance of BPFS on PCM. Despite providing strong safety and consistency guarantees, BPFS on DRAM is typically twice as fast as NTFS on a RAM disk and 4-10 times faster than NTFS on disk. We also show that BPFS on PCM should be significantly faster than a traditional disk-based file system.
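
The copy-on-write idea behind short-circuit shadow paging can be sketched as follows. This is an illustration under simplifying assumptions, not BPFS's code: an in-memory tree stands in for persistent blocks, Node, shadow_update, and ATOMIC_WRITE_BYTES are hypothetical names, and the commit is modeled as returning a new root pointer, which in the real system would be a single atomic update in persistent memory.

```python
ATOMIC_WRITE_BYTES = 8   # assumed size of an update that can commit in place

class Node:
    """A file-system block: named children plus a data payload."""
    def __init__(self, children=None, data=b""):
        self.children = dict(children or {})
        self.data = data

def shadow_update(root, path, new_data):
    """Return a root reflecting the write, sharing every untouched subtree.

    Small same-sized updates are done in place ("short circuit"), since they
    are already atomic; larger ones copy only the nodes along the path, and
    publishing the returned root pointer is the single atomic commit point.
    """
    node = root
    for name in path:                        # walk to the leaf being written
        node = node.children[name]
    if len(new_data) <= ATOMIC_WRITE_BYTES and len(node.data) <= ATOMIC_WRITE_BYTES:
        node.data = new_data                 # in-place, already-atomic update
        return root
    # Copy-on-write along the path only; siblings stay shared.
    new_root = Node(root.children, root.data)
    parent_copy, orig = new_root, root
    for name in path[:-1]:
        orig = orig.children[name]
        copy = Node(orig.children, orig.data)
        parent_copy.children[name] = copy
        parent_copy = copy
    parent_copy.children[path[-1]] = Node(node.children, new_data)
    return new_root

# Example: update /dir/file with a payload too large for one atomic write.
root = Node({"dir": Node({"file": Node(data=b"old")})})
new_root = shadow_update(root, ["dir", "file"], b"a larger replacement payload")
assert new_root.children["dir"].children["file"].data.startswith(b"a larger")
assert root.children["dir"].children["file"].data == b"old"   # old tree intact
```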

935 citations


Cites background or methods from "Architecting phase change memory as..."

  • ...For example, a 2008 Samsung prototype PCM chip holds 512 Mb [9], so with 16 chips on a high-capacity DIMM, we could reach a capacity of 1 GB per DIMM right now....

  • ...Phase change memory, or PCM, is a new memory technology that is both non-volatile and byte-addressable; in addition, PCM provides these features at speeds within an order of magnitude of DRAM [1, 9]....

  • ...As we will discuss later, implementing failure atomicity requires having as little as 300 nanojoules of energy available in a capacitor [9]....

  • ...However, since DRAM and PCM have similar capacities at the same technology node [1, 9], we expect to have 32 GB PCM DIMMs at the 45 nm node, which is compara-...

  • ...Finally, several papers have explored the use of PCM as a scalable replacement for DRAM [9, 18, 27] as well as possible wear leveling strategies [18, 27]....

Proceedings ArticleDOI
05 Mar 2011
TL;DR: In tests emulating the performance characteristics of forthcoming SCMs, Mnemosyne can persist data as fast as 3 microseconds and can be up to 1400% faster than alternative persistence strategies, such as Berkeley DB or Boost serialization, that are designed for disks.
Abstract: New storage-class memory (SCM) technologies, such as phase-change memory, STT-RAM, and memristors, promise user-level access to non-volatile storage through regular memory instructions. These memory devices enable fast user-mode access to persistence, allowing regular in-memory data structures to survive system crashes. In this paper, we present Mnemosyne, a simple interface for programming with persistent memory. Mnemosyne addresses two challenges: how to create and manage such memory, and how to ensure consistency in the presence of failures. Without additional mechanisms, a system failure may leave data structures in SCM in an invalid state, crashing the program the next time it starts. In Mnemosyne, programmers declare global persistent data with the keyword "pstatic" or allocate it dynamically. Mnemosyne provides primitives for directly modifying persistent variables and supports consistent updates through a lightweight transaction mechanism. Compared to past work on disk-based persistent memory, Mnemosyne reduces latency to storage by writing data directly to memory at the granularity of an update rather than writing memory pages back to disk through the file system. In tests emulating the performance characteristics of forthcoming SCMs, we show that Mnemosyne can persist data as fast as 3 microseconds. Furthermore, it provides a 35 percent performance increase when applied in the OpenLDAP directory server. In microbenchmark studies we find that Mnemosyne can be up to 1400% faster than alternative persistence strategies, such as Berkeley DB or Boost serialization, that are designed for disks.
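
The consistent-update mechanism can be approximated with a redo log, one common way to build lightweight persistent-memory transactions; the sketch below illustrates that pattern and is not Mnemosyne's actual interface. A Python dict stands in for persistent memory, RedoLogTx and _flush are hypothetical names, and _flush is a placeholder for the cache-flush and fence instructions real hardware would require.

```python
class RedoLogTx:
    """Toy redo-log transaction over a dict emulating persistent memory.

    Writes are buffered in a log; commit makes the log and a commit marker
    durable before any in-place update, so a crash either replays the whole
    log or discards it entirely.
    """
    def __init__(self, pmem):
        self.pmem = pmem
        self.log = []                           # (addr, value) pairs

    def write(self, addr, value):
        self.log.append((addr, value))          # no in-place update yet

    def _flush(self):
        pass                                    # stand-in for clwb/sfence

    def commit(self):
        self.pmem["__log__"] = list(self.log)   # the log itself is persistent
        self._flush()
        self.pmem["__committed__"] = True       # single durable commit point
        self._flush()
        for addr, value in self.log:            # apply in place, any order
            self.pmem[addr] = value
        self._flush()
        self.pmem["__committed__"] = False      # retire the log
        self.log.clear()

def recover(pmem):
    """After a crash, replay a committed log; otherwise discard it."""
    if pmem.get("__committed__"):
        for addr, value in pmem.get("__log__", []):
            pmem[addr] = value
        pmem["__committed__"] = False

pmem = {}
tx = RedoLogTx(pmem)
tx.write("balance", 100)
tx.write("version", 2)
tx.commit()
assert pmem["balance"] == 100 and pmem["version"] == 2
```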

873 citations

Proceedings ArticleDOI
05 Mar 2011
TL;DR: A lightweight, high-performance persistent object system called NV-heaps is implemented that provides transactional semantics while preventing these errors and providing a model for persistence that is easy to use and reason about.
Abstract: Persistent, user-defined objects present an attractive abstraction for working with non-volatile program state. However, the slow speed of persistent storage (i.e., disk) has restricted their design and limited their performance. Fast, byte-addressable, non-volatile technologies, such as phase change memory, will remove this constraint and allow programmers to build high-performance, persistent data structures in non-volatile storage that is almost as fast as DRAM. Creating these data structures requires a system that is lightweight enough to expose the performance of the underlying memories but also ensures safety in the presence of application and system failures by avoiding familiar bugs such as dangling pointers, multiple free()s, and locking errors. In addition, the system must prevent new types of hard-to-find pointer safety bugs that only arise with persistent objects. These bugs are especially dangerous since any corruption they cause will be permanent. We have implemented a lightweight, high-performance persistent object system called NV-heaps that provides transactional semantics while preventing these errors and providing a model for persistence that is easy to use and reason about. We implement search trees, hash tables, sparse graphs, and arrays using NV-heaps, BerkeleyDB, and Stasis. Our results show that NV-heap performance scales with thread count and that data structures implemented using NV-heaps out-perform BerkeleyDB and Stasis implementations by 32x and 244x, respectively, by avoiding the operating system and minimizing other software overheads. We also quantify the cost of enforcing the safety guarantees that NV-heaps provide and measure the costs of NV-heap primitive operations.
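
One class of pointer-safety bug the abstract mentions, a persistent object holding a reference that cannot survive a restart, can be shown with a toy handle type. The names below (NVHeap, PRef) are hypothetical and are not the NV-heaps API; the point is only that stores of cross-heap references can be rejected at assignment time.

```python
class NVHeap:
    """Toy persistent heap; objects are keyed records owned by one heap."""
    def __init__(self, name):
        self.name = name
        self.objects = {}

    def alloc(self, key, **fields):
        self.objects[key] = dict(fields)
        return PRef(self, key)

class PRef:
    """Handle to a persistent object that vets pointer stores."""
    def __init__(self, heap, key):
        self.heap, self.key = heap, key

    def set(self, field, value):
        # A reference pointing outside this heap (into volatile memory or
        # another heap) would become a permanent dangling pointer on restart.
        if isinstance(value, PRef) and value.heap is not self.heap:
            raise ValueError("refusing cross-heap pointer: it would become "
                             "a permanent dangling reference")
        self.heap.objects[self.key][field] = value

    def get(self, field):
        return self.heap.objects[self.key][field]

docs = NVHeap("documents")
cache = NVHeap("scratch")
report = docs.alloc("report", title="Q3")
tmp = cache.alloc("tmp")
report.set("owner", docs.alloc("alice"))   # same-heap pointer: allowed
try:
    report.set("tmp", tmp)                 # cross-heap pointer: rejected
except ValueError as err:
    print(err)
```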

850 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Start-Gap is proposed, a simple, novel, and effective wear-leveling technique that uses only two registers and boosts the achievable lifetime of the baseline 16 GB PCM-based system from 5% to 97% of the theoretical maximum, while incurring a total storage overhead of less than 13 bytes and obviating the latency overhead of accessing large tables.
Abstract: Phase Change Memory (PCM) is an emerging memory technology that can increase main memory capacity in a cost-effective and power-efficient manner. However, PCM cells can endure only a maximum of 10^7 - 10^8 writes, making a PCM-based system have a lifetime of only a few years under ideal conditions. Furthermore, we show that non-uniformity in writes to different cells reduces the achievable lifetime of a PCM system by 20x. Writes to PCM cells can be made uniform with Wear-Leveling. Unfortunately, existing wear-leveling techniques require large storage tables and indirection, resulting in significant area and latency overheads. We propose Start-Gap, a simple, novel, and effective wear-leveling technique that uses only two registers. By combining Start-Gap with simple address-space randomization techniques we show that the achievable lifetime of the baseline 16 GB PCM-based system is boosted from 5% (with no wear-leveling) to 97% of the theoretical maximum, while incurring a total storage overhead of less than 13 bytes and obviating the latency overhead of accessing large tables. We also analyze the security vulnerabilities for memory systems that have limited write endurance, showing that under adversarial settings, a PCM-based system can fail in less than one minute. We provide a simple extension to Start-Gap that makes PCM-based systems robust to such malicious attacks.
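
The two-register mapping is simple enough to sketch directly. The code below follows the spirit of Start-Gap (a Start register, a Gap register, one spare line, and a gap movement every few writes), but the constants and helper names are illustrative rather than the paper's exact pseudocode.

```python
N = 8                        # logical lines; the physical array has N + 1
GAP_INTERVAL = 4             # illustrative number of writes between gap moves

phys = [None] * (N + 1)      # stands in for the PCM line array
start, gap, writes = 0, N, 0

def map_addr(la):
    """Logical line -> physical line, computed from two registers only."""
    pa = (la + start) % N
    return pa + 1 if pa >= gap else pa

def move_gap():
    """Slide one line into the gap so writes rotate over all physical lines."""
    global start, gap
    if gap == 0:                     # gap wraps around the array
        phys[0] = phys[N]
        gap = N
        start = (start + 1) % N      # Start advances once per full rotation
    else:
        phys[gap] = phys[gap - 1]
        gap -= 1

def write(la, value):
    global writes
    phys[map_addr(la)] = value
    writes += 1
    if writes % GAP_INTERVAL == 0:
        move_gap()

# The same logical line lands on different physical lines as the gap rotates.
locations = set()
for i in range(200):
    write(3, i)
    locations.add(map_addr(3))
print(sorted(locations))             # wear on line 3 spreads over many lines
```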

782 citations


Cites background from "Architecting phase change memory as..."

  • ...For example, partial writes [8], line-level writeback [15], lazy write [15] and silent store removal [21]....

Proceedings ArticleDOI
12 Dec 2009
TL;DR: This paper proposes and evaluates Flip-N-Write, a simple microarchitectural technique to replace a PRAM write operation with a more efficient read-modify-write operation, and achieves commensurate savings in write energy under the same instantaneous write power constraint.
Abstract: The phase-change random access memory (PRAM) technology is fast maturing to production levels. Main advantages of PRAM are non-volatility, byte addressability, in-place programmability, low-power operation, and higher write endurance than that of current flash memories. However, the relatively low write bandwidth and the less-than-desirable write endurance of PRAM leave room for improvement. This paper proposes and evaluates Flip-N-Write, a simple microarchitectural technique to replace a PRAM write operation with a more efficient read-modify-write operation. On a write, after quick bit-by-bit inspection of the original data word and the new data word, Flip-N-Write writes either the new data word or the "flipped" value of it. Flip-N-Write introduces a single bit associated with each PRAM word to indicate whether the PRAM word has been flipped or not. We analytically and experimentally show that the proposed technique reduces the PRAM write time by half, more than doubles the write endurance, and achieves commensurate savings in write energy under the same instantaneous write power constraint. Due to its simplicity, Flip-N-Write is straightforward to implement within a PRAM device.
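
The read-modify-write decision itself is a small comparison: write whichever of the new word or its bitwise complement differs from the stored word in fewer bit positions, and record the choice in the per-word flip bit. The sketch below illustrates this with an assumed 32-bit word; the function names are ours, not the paper's.

```python
WORD_BITS = 32                      # illustrative word width
MASK = (1 << WORD_BITS) - 1

def popcount(x):
    return bin(x & MASK).count("1")

def flip_n_write(stored, flip_bit, new):
    """Decide what to program for `new`, given the word currently stored.

    Returns (value_to_store, new_flip_bit, cells_programmed). Whichever of
    `new` and its complement differs from the stored word in fewer bits is
    written, so no more than about half the cells change on any write.
    """
    direct  = popcount(stored ^ new)            # cost of storing new as-is
    flipped = popcount(stored ^ (~new & MASK))  # cost of storing its complement
    if flipped < direct:
        return (~new & MASK), 1, flipped + (0 if flip_bit else 1)
    return (new & MASK), 0, direct + (1 if flip_bit else 0)

def read(stored, flip_bit):
    """Recover the logical value from the stored word and its flip bit."""
    return stored ^ (MASK if flip_bit else 0)

stored, flip = 0x00000000, 0
stored, flip, cost = flip_n_write(stored, flip, 0xFFFFFF0F)  # 28 of 32 bits differ
assert read(stored, flip) == 0xFFFFFF0F
print(cost)    # 5 cells programmed (4 data bits plus the flip bit), not 28
```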

558 citations


Cites background or methods from "Architecting phase change memory as..."

  • ...Qureshi et al. examine a hybrid main memory architecture having a DRAM buffer and PRAM main memory....

  • ...Lee et al. show in their experiment with a baseline design using a DDR2-like configuration and a set of parallel workloads that PRAM gives a 1.6× delay penalty and a 2.2× energy penalty compared with DRAM....

  • ...Further note that for PRAM, write energy is much higher than read energy (6–10 times, as reported in [11])....

  • ...However, when PRAM is used for data storage (e.g., to replace a NAND flash chip or even a DRAM chip), writing actions will occur much more frequently and increasing the write bandwidth of PRAM and improving the write-related energy and endurance become critical design issues....

  • ...Lastly, the scalability of the PRAM technology potentially surpasses that of the DRAM and the NAND flash technologies [9, 20]....

References
Journal ArticleDOI
TL;DR: The author was led to the study given in this paper from a consideration of large scale computing machines in which a large number of operations must be performed without a single error in the end result.
Abstract: The author was led to the study given in this paper from a consideration of large scale computing machines in which a large number of operations must be performed without a single error in the end result. This problem of “doing things right” on a large scale is not essentially new; in a telephone central office, for example, a very large number of operations are performed while the errors leading to wrong numbers are kept well under control, though they have not been completely eliminated. This has been achieved, in part, through the use of self-checking circuits. The occasional failure that escapes routine checking is still detected by the customer and will, if it persists, result in customer complaint, while if it is transient it will produce only occasional wrong numbers. At the same time the rest of the central office functions satisfactorily. In a digital computer, on the other hand, a single failure usually means the complete failure, in the sense that if it is detected no more computing can be done until the failure is located and corrected, while if it escapes detection then it invalidates all subsequent operations of the machine. Put in other words, in a telephone central office there are a number of parallel paths which are more or less independent of each other; in a digital machine there is usually a single long path which passes through the same piece of equipment many, many times before the answer is obtained.

5,408 citations

Proceedings ArticleDOI
01 May 1995
TL;DR: This paper quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Abstract: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

4,002 citations


"Architecting phase change memory as..." refers methods in this paper

  • ...We consider parallel workloads from the SPLASH-2 suite (fft, radix, ocean), SPEC OpenMP suite (art, equake, swim), and NAS parallel benchmarks (cg, is, mg) [3, 4, 27]....

Journal ArticleDOI
01 Sep 1991
TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers that mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.
Abstract: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers. These consist of five "parallel kernel" benchmarks and three "simulated application" benchmarks. Together they mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications. The principal distinguishing feature of these benchmarks is their "pencil and paper" specification: all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

2,246 citations

Journal ArticleDOI
TL;DR: This work discusses the critical aspects that may affect the scaling of PCRAM, including materials properties, power consumption during programming and read operations, thermal cross-talk between memory cells, and failure mechanisms, and discusses experiments that directly address the scaling properties of the phase-change materials themselves.
Abstract: Nonvolatile RAM using resistance contrast in phase-change materials [or phase-change RAM (PCRAM)] is a promising technology for future storage-class memory. However, such a technology can succeed only if it can scale smaller in size, given the increasingly tiny memory cells that are projected for future technology nodes (i.e., generations). We first discuss the critical aspects that may affect the scaling of PCRAM, including materials properties, power consumption during programming and read operations, thermal cross-talk between memory cells, and failure mechanisms. We then discuss experiments that directly address the scaling properties of the phase-change materials themselves, including studies of phase transitions in both nanoparticles and ultrathin films as a function of particle size and film thickness. This work in materials directly motivated the successful creation of a series of prototype PCRAM devices, which have been fabricated and tested at phase-change material cross-sections with extremely small dimensions as low as 3 nm × 20 nm. These device measurements provide a clear demonstration of the excellent scaling potential offered by this technology, and they are also consistent with the scaling behavior predicted by extensive device simulations. Finally, we discuss issues of device integration and cell design, manufacturability, and reliability.

1,018 citations

Journal Article
TL;DR: The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world computing applications.
Abstract: TITLE: The NAS Parallel Benchmarks AUTHOR: David H. Bailey ACRONYMS: NAS, NPB DEFINITION: The NAS Parallel Benchmarks (NPB) are a suite of parallel computer performance benchmarks. They were originally developed at the NASA Ames Research Center in 1991 to assess high-end parallel supercomputers [?]. Although they are no longer used as widely as they once were for comparing high-end system performance, they continue to be studied and analyzed a great deal in the high-performance computing community. The acronym "NAS" originally stood for the Numerical Aeronautical Simulation Program at NASA Ames. The name of this organization was subsequently changed to the Numerical Aerospace Simulation Program, and more recently to the NASA Advanced Supercomputing Center, although the acronym remains "NAS." The developers of the original NPB suite were David H. Bailey, Eric Barszcz, John Barton, David Browning, Russell Carter, Leo Dagum, Rod Fatoohi, Samuel Fineberg, Paul Frederickson, Thomas Lasinski, Rob Schreiber, Horst Simon, V. Venkatakrishnan and Sisira Weeratunga. DISCUSSION: The original NAS Parallel Benchmarks consisted of eight individual benchmark problems, each of which focused on some aspect of scientific computing. The principal focus was in computational aerophysics, although most of these benchmarks have much broader relevance, since in a much larger sense they are typical of many real-world scientific computing applications. The NPB suite grew out of the need for a more rational procedure to select new supercomputers for acquisition by NASA. The emergence of commercially available highly parallel computer systems in the late 1980s offered an attractive alternative to parallel vector supercomputers that had been the mainstay of high-end scientific computing. However, the introduction of highly parallel systems was accompanied by a regrettable level of hype, not only on the part of the commercial vendors but even, in some cases, by scientists using the systems. As a result, it was difficult to discern whether the new systems offered any fundamental performance advantage over vector supercomputers, and, if so, which of the parallel offerings would be most useful in real-world scientific computation. AFFILIATION: Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA, dhbailey@lbl.gov. Supported in part by the Director, Office of Computational and Technology Research, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy, under contract number DE-AC02-05CH11231.

875 citations


"Architecting phase change memory as..." refers methods in this paper

  • ...We consider parallel workloads from the SPLASH-2 suite (fft, radix, ocean), SPEC OpenMP suite (art, equake, swim), and NAS parallel benchmarks (cg, is, mg) [3, 4, 27]....
