
Showing papers on "Memory controller published in 2009"


Proceedings ArticleDOI
26 Apr 2009
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Abstract: Modern Graphic Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight in designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth rather than latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
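The memory request coalescing hardware evaluated above merges the per-thread accesses of a SIMD group into as few wide transactions as possible. A minimal sketch of the idea (the function name and the 128-byte segment size are illustrative assumptions, not the simulator's actual logic):

```python
def coalesce(addresses, segment=128):
    """Merge per-thread byte addresses from one warp into the set of
    aligned memory segments (transactions) they touch."""
    return sorted({addr // segment * segment for addr in addresses})

# 32 threads reading consecutive 4-byte words: one 128-byte transaction.
unit = coalesce([4 * t for t in range(32)])
# 32 threads striding by 128 bytes: one transaction per thread.
scattered = coalesce([128 * t for t in range(32)])
```

Contiguous accesses collapse into a single transaction while strided accesses generate one per thread, which is one reason applications with abundant data-level parallelism can still fall short of peak bandwidth.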

1,558 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper analyzes a PCM-based hybrid main memory system using an architecture-level model of PCM and proposes simple organizational and management solutions for the hybrid memory that reduce the write traffic to PCM, boosting its lifetime from 3 years to 9.7 years.
Abstract: The memory subsystem accounts for a significant cost and power budget of a computer system. Current DRAM-based main memory systems are starting to hit the power and cost limit. An alternative memory technology that uses resistance contrast in phase-change materials is being actively investigated in the circuits community. Phase Change Memory (PCM) devices offer more density relative to DRAM, and can help increase main memory capacity of future systems while remaining within the cost and power constraints. In this paper, we analyze a PCM-based hybrid main memory system using an architecture-level model of PCM. We explore the trade-offs for a main memory system consisting of PCM storage coupled with a small DRAM buffer. Such an architecture has the latency benefits of DRAM and the capacity benefits of PCM. Our evaluations for a baseline system of 16 cores with 8GB DRAM show that, on average, PCM can reduce page faults by 5X and provide a speedup of 3X. As PCM is projected to have limited write endurance, we also propose simple organizational and management solutions for the hybrid memory that reduce the write traffic to PCM, boosting its lifetime from 3 years to 9.7 years.
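The write-filtering effect of a DRAM buffer in front of PCM can be illustrated with a toy model; this is a generic LRU write-back cache sketch under assumed parameters, not the paper's actual organization:

```python
from collections import OrderedDict

class HybridMemory:
    """DRAM buffer (LRU, write-back) in front of PCM: PCM is written only
    when a dirty page is evicted, filtering out repeated writes."""
    def __init__(self, dram_pages):
        self.dram = OrderedDict()      # page -> dirty flag, LRU order
        self.capacity = dram_pages
        self.pcm_writes = 0

    def write(self, page):
        if page in self.dram:
            self.dram.pop(page)        # refresh LRU position
        elif len(self.dram) >= self.capacity:
            victim, was_dirty = self.dram.popitem(last=False)
            if was_dirty:
                self.pcm_writes += 1   # write-back to PCM on eviction
        self.dram[page] = True         # page is now dirty in DRAM

mem = HybridMemory(dram_pages=4)
for _ in range(100):                   # 400 writes to a hot working set
    for page in range(4):
        mem.write(page)
writes_after_hot = mem.pcm_writes      # the buffer absorbed every rewrite
for page in range(4, 9):               # cold pages force 5 dirty evictions
    mem.write(page)
```

Repeated writes to a working set that fits in the buffer never reach PCM; only evictions do, which is the basic lever for extending PCM lifetime.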

1,451 citations


18 May 2009
TL;DR: Preliminary experiments are described suggesting that building main memory as a hybrid between DRAM and non-volatile memory, such as flash or PC-RAM, is a viable approach.
Abstract: Technology trends may soon favor building main memory as a hybrid between DRAM and non-volatile memory, such as flash or PC-RAM. We describe how the operating system might manage such hybrid memories, using semantic information not available in other layers. We describe preliminary experiments suggesting that this approach is viable.

248 citations


Proceedings ArticleDOI
12 Sep 2009
TL;DR: This paper presents fundamental details of the newly introduced Intel Nehalem microarchitecture with its integrated memory controller, Quick Path Interconnect, and ccNUMA architecture, based on sophisticated benchmarks that measure the latency and bandwidth between different locations in the memory subsystem.
Abstract: Today's microprocessors have complex memory subsystems with several cache levels. The efficient use of this memory hierarchy is crucial to gain optimal performance, especially on multicore processors. Unfortunately, many implementation details of these processors are not publicly available. In this paper we present such fundamental details of the newly introduced Intel Nehalem microarchitecture with its integrated memory controller, Quick Path Interconnect, and ccNUMA architecture. Our analysis is based on sophisticated benchmarks to measure the latency and bandwidth between different locations in the memory subsystem. Special care is taken to control the coherency state of the data to gain insight into performance relevant implementation details of the cache coherency protocol. Based on these benchmarks we present undocumented performance data and architectural properties.
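Latency benchmarks of this kind typically pointer-chase a random cyclic permutation so that hardware prefetchers cannot hide memory latency. A sketch of the access-pattern construction (illustrative only; an actual benchmark would run the chase in native code over buffers sized to each cache level):

```python
import random

def make_chain(n, seed=1):
    """Build a single random cycle: slot i stores the next index to visit.
    Chasing it defeats hardware prefetching, exposing raw memory latency."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    chain = [0] * n
    for here, nxt in zip(order, order[1:] + order[:1]):
        chain[here] = nxt
    return chain

def chase(chain, steps):
    """Follow the chain for a fixed number of dependent loads."""
    idx = 0
    for _ in range(steps):
        idx = chain[idx]
    return idx

chain = make_chain(1 << 10)
end = chase(chain, 1 << 10)   # one full lap of the cycle returns to index 0
```

Because every load depends on the previous one, the time per step approaches the access latency of whichever level of the hierarchy holds the buffer.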

243 citations


Patent
04 Mar 2009
TL;DR: In this article, a method for data storage includes storing data in a group of analog memory cells by writing respective input storage values to the memory cells in the group, and then reading the output storage values from the analog memory cells in the group.
Abstract: A method for data storage includes storing data in a group of analog memory cells by writing respective input storage values to the memory cells in the group. After storing the data, respective output storage values are read from the analog memory cells in the group. Respective confidence levels of the output storage values are estimated, and the confidence levels are compressed. The output storage values and the compressed confidence levels are transferred from the memory cells over an interface to a memory controller.

238 citations


Patent
Robert Haas, Xiao-Yu Hu, Roman A. Pletka
30 Nov 2009
TL;DR: In this paper, a method for fast reconstruction of metadata structures on a memory storage device includes writing a plurality of checkpoints holding a root of the metadata structures, in increasing order of timestamps.
Abstract: A method for facilitating fast reconstruction of metadata structures on a memory storage device includes writing a plurality of checkpoints holding a root of metadata structures in an increasing order of timestamps to a plurality of blocks respectively on the memory storage device utilizing a memory controller, where each checkpoint is associated with a timestamp, and wherein the last-written checkpoint contains a root to the latest metadata information from where metadata structures are reconstructed.

199 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper shows how the location of the memory controllers can reduce contention (hot spots) in the on-chip fabric and lower the variance in reference latency, which provides predictable performance for memory-intensive applications regardless of the processing core on which a thread is scheduled.
Abstract: In the near term, Moore's law will continue to provide an increasing number of transistors and therefore an increasing number of on-chip cores. Limited pin bandwidth prevents the integration of a large number of memory controllers on-chip. With many cores, and few memory controllers, where to locate the memory controllers in the on-chip interconnection fabric becomes an important and as yet unexplored question. In this paper we show how the location of the memory controllers can reduce contention (hot spots) in the on-chip fabric and lower the variance in reference latency. This in turn provides predictable performance for memory-intensive applications regardless of the processing core on which a thread is scheduled. We explore the design space of on-chip fabrics to find optimal memory controller placement relative to different topologies (i.e. mesh and torus), routing algorithms, and workloads.
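The placement question can be illustrated by comparing the mean and variance of hop counts to the nearest controller under two hypothetical placements on an 8x8 mesh (the coordinates below are assumptions for illustration, not the paper's evaluated configurations):

```python
from statistics import mean, pvariance

def hop_stats(n, controllers):
    """Mean and variance of the Manhattan distance from each tile of an
    n x n mesh to its nearest memory controller."""
    dists = [min(abs(x - cx) + abs(y - cy) for cx, cy in controllers)
             for x in range(n) for y in range(n)]
    return mean(dists), pvariance(dists)

# Four controllers crowded into one corner vs. spread across the die.
corner = hop_stats(8, [(0, 0), (0, 1), (1, 0), (1, 1)])
spread = hop_stats(8, [(1, 1), (1, 6), (6, 1), (6, 6)])
```

Spreading the controllers lowers both the average distance and its variance, which is the predictable-latency property the paper targets, regardless of which core a thread runs on.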

167 citations


Patent
23 Apr 2009
TL;DR: In this paper, the authors propose a semiconductor memory device whose memory controller manages the memory using each memory block as a deletion unit, allowing the device to serve a plurality of storage regions whose requirement levels differ from one another.
Abstract: PROBLEM TO BE SOLVED: To provide a highly reliable semiconductor memory device adapted to a plurality of storage regions whose requirement levels differ from one another. SOLUTION: The semiconductor memory device 20 comprises: a memory 21, which has a plurality of memory blocks with memory cells capable of storing a plurality of different kinds of data that require memory areas with different characteristics; and a memory controller 22 that manages the memory using each memory block as a deletion unit. The memory controller 22 converts logical addresses of the memory 21 into physical addresses identifying memory blocks, and replaces a memory block with a preregistered free block when rewriting it. The memory controller 22 manages the different kinds of data stored in the memory 21 so that each memory block and free block stores the same kind of data as before, even after being rewritten.

161 citations


Journal ArticleDOI
TL;DR: An analyzable JEDEC-compliant DDRx SDRAM memory controller (AMC) for hard real-time CMPs is proposed, that reduces the impact of memory interferences caused by other tasks on WCET estimation, providing a predictable memory access time and allowing the computation of tight WCET estimations.
Abstract: Multicore processors (CMPs) represent a good solution to provide the performance required by current and future hard real-time systems. However, it is difficult to compute a tight WCET estimation for CMPs due to interferences that tasks suffer when accessing shared hardware resources. We propose an analyzable JEDEC-compliant DDRx SDRAM memory controller (AMC) for hard real-time CMPs, that reduces the impact of memory interferences caused by other tasks on WCET estimation, providing a predictable memory access time and allowing the computation of tight WCET estimations.
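The predictability argument can be sketched with a simple time-division-multiplexing bound: if each requestor owns a fixed bus slot, a request waits at most for every other requestor's slot before being serviced, regardless of what the other cores are doing. The numbers below are illustrative, not the AMC's actual timing analysis:

```python
def worst_case_latency(requestors, slot_cycles, service_cycles):
    """Upper bound on one memory request's latency under TDM arbitration:
    wait for every other requestor's slot, then get serviced."""
    return (requestors - 1) * slot_cycles + service_cycles

# With 4 cores, a 10-cycle slot, and a 10-cycle service time, a request
# completes within 40 cycles independently of co-running tasks.
bound = worst_case_latency(4, 10, 10)
```

A bound of this form is what makes the memory access time composable into a tight WCET estimate, at the cost of some average-case bandwidth.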

161 citations


Patent
29 Sep 2009
TL;DR: In this paper, loading a boot image from a solid state drive to an operating memory of a computing system during an initialization operation is described, where the initialization operation initializes components of the computing system.
Abstract: One embodiment of a method includes loading, by a memory controller, a boot image from a solid state drive to an operating memory of a computing system during an initialization operation of the computing system. The initialization operation initializes components of the computing system.

153 citations


Patent
Mei-Man L. Syu
04 Mar 2009
TL;DR: In this paper, it is determined that a second erase counter associated with a second zip code is low relative to at least one other erase counter; based on this determination, data from blocks in the second zip code may be written to new blocks as part of a wear-leveling operation.
Abstract: A solid state drive includes a plurality of flash memory devices, and a memory controller coupled to the plurality of flash memory devices. The memory controller is configured to logically associate blocks from the plurality of flash memory devices to form zip codes, the zip codes associated with corresponding erase counters. The solid state drive further includes a processor and a computer-readable memory having instructions stored thereon. The processor may perform a wear-leveling operation by determining that blocks in a first zip code have been erased and incrementing a first erase counter associated with the first zip code. It may then be determined that a second erase counter associated with a second zip code is low relative to at least one other erase counter, and based on this determination, data from blocks in the second zip code may be written to new blocks as part of a wear-leveling operation.
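A minimal sketch of the relative-wear test described above (the names and the threshold are hypothetical, not taken from the patent):

```python
def stale_zip_codes(erase_counters, threshold):
    """Zip codes whose erase count lags the most-worn zip code by at least
    `threshold`; their (cold) data is relocated so the blocks rejoin
    the pool of blocks available for fresh writes."""
    hottest = max(erase_counters.values())
    return [z for z, c in erase_counters.items() if hottest - c >= threshold]

counters = {"zip0": 120, "zip1": 30, "zip2": 110, "zip3": 25}
to_relocate = stale_zip_codes(counters, threshold=50)
```

Grouping blocks into zip codes means the controller tracks one counter per group rather than one per block, trading wear-leveling granularity for bookkeeping cost.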

Proceedings ArticleDOI
12 Dec 2009
TL;DR: This paper proposes a complexity-effective solution to DRAM request scheduling which recovers most of the performance loss incurred by a naive in-order first-in first-out (FIFO) DRAM Scheduler compared to an aggressive out-of-order DRAM scheduler.
Abstract: Modern DRAM systems rely on memory controllers that employ out-of-order scheduling to maximize row access locality and bank-level parallelism, which in turn maximizes DRAM bandwidth. This is especially important in graphics processing unit (GPU) architectures, where the large quantity of parallelism places a heavy demand on the memory system. The logic needed for out-of-order scheduling can be expensive in terms of area, especially when compared to an in-order scheduling approach. In this paper, we propose a complexity-effective solution to DRAM request scheduling which recovers most of the performance loss incurred by a naive in-order first-in first-out (FIFO) DRAM scheduler compared to an aggressive out-of-order DRAM scheduler. We observe that the memory request stream from individual GPU "shader cores" tends to have sufficient row access locality to maximize DRAM efficiency in most applications without significant reordering. However, the interconnection network across which memory requests are sent from the shader cores to the DRAM controller tends to finely interleave the numerous memory request streams in a way that destroys the row access locality of the resultant stream seen at the DRAM controller. To address this, we employ an interconnection network arbitration scheme that preserves the row access locality of individual memory request streams and, in doing so, achieves DRAM efficiency and system performance close to that achievable by using out-of-order memory request scheduling while doing so with a simpler design. We evaluate our interconnection network arbitration scheme using crossbar, mesh, and ring networks for a baseline architecture of 8 memory channels, each controlled by its own DRAM controller and 28 shader cores (224 ALUs), supporting up to 1,792 in-flight memory requests. 
Our results show that our interconnect arbitration scheme coupled with a banked FIFO in-order scheduler obtains up to 91% of the performance obtainable with an out-of-order memory scheduler for a crossbar network with eight-entry DRAM controller queues.
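The row-access-locality argument can be demonstrated with a toy row-buffer model: finely interleaving two streams that are individually local destroys every row hit, while keeping each stream intact preserves them. This is an illustrative sketch, not the paper's arbitration scheme:

```python
def row_hits(stream):
    """Count row-buffer hits: a request hits when it targets the row
    that the previous request left open."""
    hits, open_row = 0, None
    for row in stream:
        hits += (row == open_row)
        open_row = row
    return hits

core_a = [1, 1, 1, 1]                 # each core's stream has locality
core_b = [2, 2, 2, 2]
interleaved = [r for pair in zip(core_a, core_b) for r in pair]
preserved = core_a + core_b           # arbitration keeps streams intact
hits_interleaved = row_hits(interleaved)
hits_preserved = row_hits(preserved)
```

An arbitration policy that forwards each core's requests in runs therefore lets even a simple in-order FIFO scheduler achieve high row-buffer hit rates.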


Journal ArticleDOI
TL;DR: The Multicore DIMM is designed to improve the energy efficiency of memory systems with small impact on system performance, where DRAM chips are grouped into multiple virtual memory devices, each of which has its own data path and receives separate commands.
Abstract: Demand for memory capacity and bandwidth keeps increasing rapidly in modern computer systems, and memory power consumption is becoming a considerable portion of the system power budget. However, the current DDR DIMM standard is not well suited to effectively serve CMP memory requests from both a power and performance perspective. We propose a new memory module called a Multicore DIMM, where DRAM chips are grouped into multiple virtual memory devices, each of which has its own data path and receives separate commands. The Multicore DIMM is designed to improve the energy efficiency of memory systems with small impact on system performance. Dividing each memory module into 4 virtual memory devices brings a simultaneous 22%, 7.6%, and 18% improvement in memory power, IPC, and system energy-delay product, respectively, on a set of multithreaded applications and consolidated workloads.
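One way such a module could steer requests is by interleaving cache lines across the virtual memory devices; the line size and device count below are assumptions for illustration, not the Multicore DIMM specification:

```python
def virtual_device(addr, line_bytes=64, devices=4):
    """Which virtual memory device of the module serves a given byte
    address, with consecutive lines interleaved across the devices."""
    return (addr // line_bytes) % devices

first = virtual_device(0)        # line 0 -> device 0
second = virtual_device(64)      # the next line lands on device 1
wrap = virtual_device(64 * 4)    # line 4 wraps back to device 0
```

Interleaving spreads independent requests over the narrower per-device data paths, which is how the module can keep performance up while activating fewer DRAM chips per access.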

Patent
17 Feb 2009
TL;DR: In this article, a non-volatile memory storage system with two-stage controller is described, consisting of a plurality of flash memory devices, a plurality first-stage controllers coupled to the plurality of devices, and a storage adapter communicating with the first stage controllers through one or more internal interfaces.
Abstract: The present invention discloses a non-volatile memory storage system with two-stage controller, comprising: a plurality of flash memory devices; a plurality of first stage controllers coupled to the plurality of flash memory devices, respectively, wherein each of the first stage controllers performs data integrity management as well as writes and reads data to and from a corresponding flash memory device; and a storage adapter communicating with the plurality of first stage controllers through one or more internal interfaces.

Patent
10 Sep 2009
TL;DR: A flash memory device and method as discussed by the authors include a memory having a plurality of nonvolatile memory cells for storing stored values of user data, and a memory controller with an encoder for encoding user write data into code values stored in the memory.
Abstract: A memory device and method, such as a flash memory device and method, includes a memory having a plurality of nonvolatile memory cells for storing stored values of user data. The memory device and method includes a memory controller for controlling the memory. The memory controller includes an encoder for encoding user write data for storage of code values as the stored values in the memory. The encoder includes an inserter for insertion of an indicator as part of the stored values for use in determining when the stored values are or are not in an erased state. The memory controller includes a decoder for reading the stored values from the memory to form user read data values when the stored values are not in the erased state.

Patent
27 Jul 2009
TL;DR: In this paper, a memory device comprises first and second integrated circuit dies. A speed test is conducted on the memory core integrated circuit die, and the interface integrated circuit die is electrically coupled to the memory core integrated circuit die based on the speed of the memory core.
Abstract: A memory device comprises first and second integrated circuit dies. The first integrated circuit die comprises a memory core as well as a first interface circuit. The first interface circuit permits full access to the memory cells (e.g., reading, writing, activating, pre-charging and refreshing operations). The second integrated circuit die comprises a second interface circuit that interfaces the memory core, via the first interface circuit, to an external bus, for example through a synchronous interface. A technique combines memory core integrated circuit dies with interface integrated circuit dies to configure a memory device. A speed test on the memory core integrated circuit die is conducted, and the interface integrated circuit die is electrically coupled to the memory core integrated circuit die based on the speed of the memory core integrated circuit die.

Patent
17 Sep 2009
TL;DR: One embodiment described in this paper is main memory that includes a combination of non-volatile memory (NVM) and dynamic random access memory (DRAM), with an operating system that migrates data between the NVM and the DRAM.
Abstract: One embodiment is main memory that includes a combination of non-volatile memory (NVM) and dynamic random access memory (DRAM). An operating system migrates data between the NVM and the DRAM.

Patent
12 May 2009
TL;DR: In this article, a three dimensional memory module and system are formed with at least one slave chip stacked over a master chip, which includes a memory core for increased capacity of the memory module/system.
Abstract: A three dimensional memory module and system are formed with at least one slave chip stacked over a master chip. Through semiconductor vias (TSVs) are formed through at least one of the master and slave chips. The master chip includes a memory core for increased capacity of the memory module/system. In addition, capacity organizations of the three dimensional memory module/system that result in efficient wiring are disclosed for forming multiple memory banks, multiple bank groups, and/or multiple ranks of the three dimensional memory module/system.

Proceedings ArticleDOI
20 Jun 2009
TL;DR: A new memory system design called decoupled DIMM is proposed that allows the memory bus to operate at a data rate much higher than that of the DRAM devices, and improves reliability, power efficiency, and cost effectiveness by using relatively slow memory devices.
Abstract: The widespread use of multicore processors has dramatically increased the demands on high bandwidth and large capacity from memory systems. In a conventional DDR2/DDR3 DRAM memory system, the memory bus and DRAM devices run at the same data rate. To improve memory bandwidth, we propose a new memory system design called decoupled DIMM that allows the memory bus to operate at a data rate much higher than that of the DRAM devices. In the design, a synchronization buffer is added to relay data between the slow DRAM devices and the fast memory bus; and memory access scheduling is revised to avoid access conflicts on memory ranks. The design not only improves memory bandwidth beyond what can be supported by current memory devices, but also improves reliability, power efficiency, and cost effectiveness by using relatively slow memory devices. The idea of decoupling, namely decoupling the bandwidth match between the memory bus and a single rank of devices, can also be applied to other types of memory systems, including FB-DIMM. Our experimental results show that a decoupled DIMM system of 2667MT/s bus data rate and 1333MT/s device data rate improves the performance of memory-intensive workloads by 51% on average over a conventional memory system of 1333MT/s data rate. Alternatively, a decoupled DIMM system of 1600MT/s bus data rate and 800MT/s device data rate incurs only 8% performance loss when compared with a conventional system of 1600MT/s data rate, with a 16% reduction in memory power consumption and 9% saving in memory energy.

Patent
11 Feb 2009
TL;DR: In this paper, a non-volatile memory is used to move data from a volatile memory to a nonvolatile one upon a loss of power of a primary power source of the volatile memory.
Abstract: A device includes: non-volatile memory; a controller in communication with the non-volatile memory, wherein the controller is programmed to move data from a volatile memory to the non-volatile memory upon a loss of power of a primary power source of the volatile memory; and a backup power supply providing temporary power to the controller and the volatile memory upon the loss of power of the primary power source, including: a capacitor bank with an output terminal; a connection to a voltage source that charges the capacitor bank to a normal operating voltage; and a state-of-health monitor that is programmed to generate a failure signal based on a voltage at the output terminal of the capacitor bank.

Proceedings ArticleDOI
03 May 2009
TL;DR: This work proposes an adaptive-rate ECC scheme with BCH codes, implemented on the flash memory controller, that lets flash memory trade storage space for higher error correction capability so it remains usable even at high noise levels.
Abstract: ECC has been widely used to enhance flash memory endurance and reliability. In this work, we propose an adaptive-rate ECC scheme with BCH codes that is implemented on the flash memory controller. With this scheme, flash memory can trade storage space for higher error correction capability to keep it usable even when there is a high noise level.
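The rate-adaptation idea can be sketched by choosing the smallest correction strength t whose residual page-failure probability, under an independent-error binomial model, meets a target. This is a generic model with assumed parameters (codeword size, error rates, target), not the paper's BCH construction:

```python
from math import comb

def page_fail_prob(n_bits, ber, t):
    """Probability that a codeword of n_bits suffers more than t bit
    errors, assuming independent errors at raw bit error rate `ber`."""
    ok = sum(comb(n_bits, k) * ber**k * (1 - ber)**(n_bits - k)
             for k in range(t + 1))
    return 1 - ok

def pick_strength(n_bits, ber, target, t_max=40):
    """Smallest correction strength t meeting the target failure rate;
    a worn, noisier block demands a stronger (higher-overhead) code."""
    for t in range(t_max + 1):
        if page_fail_prob(n_bits, ber, t) <= target:
            return t
    return None

fresh = pick_strength(4096, 1e-5, 1e-9)   # lightly worn block
worn = pick_strength(4096, 1e-3, 1e-9)    # heavily worn, noisier block
```

As the raw bit error rate climbs with wear, the selected t grows, and with it the parity overhead, which is exactly the storage-for-reliability trade the scheme exploits.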

Patent
14 May 2009
TL;DR: A memory controller for phase change memory (PCM) that can be used on a storage bus interface is described. In one example, the memory controller includes an external bus interface coupled to an external bus to communicate read and write instructions with an external device.
Abstract: A memory controller for a phase change memory (PCM) that can be used on a storage bus interface is described. In one example, the memory controller includes an external bus interface coupled to an external bus to communicate read and write instructions with an external device, a memory array interface coupled to a memory array to perform reads and writes on a memory array, and an overwrite module to write a desired value to a desired address of the memory array.

Patent
13 Apr 2009
TL;DR: A self-testing memory module as discussed by the authors includes a printed circuit board configured to be operatively coupled to a memory controller of a computer system and includes a plurality of memory devices, each memory device of the plurality comprising data, address, and control ports.
Abstract: A self-testing memory module includes a printed circuit board configured to be operatively coupled to a memory controller of a computer system and includes a plurality of memory devices on the printed circuit board, each memory device of the plurality of memory devices comprising data, address, and control ports. The memory module also includes a control module configured to generate address and control signals for testing the memory devices. The memory module includes a data module comprising a plurality of data handlers. Each data handler is operable independently from each of the other data handlers of the plurality of data handlers. Each data handler is operatively coupled to a corresponding plurality of the data ports of one or more of the memory devices and is configured to generate data for writing to the corresponding plurality of data ports.

Proceedings ArticleDOI
29 May 2009
TL;DR: A floating-body Z-RAM® memory cell is presented to fabricate a high-density, low-latency, and high-bandwidth 4Mb memory macro building block, targeted at the requirements of microprocessor caches.
Abstract: To meet advancing market demands, microprocessor embedded memory applications require denser and faster memory arrays with each process generation. Recent work presented an 18.5ns 128Mb DRAM with a floating body cell for conventional DRAM products [1] and a 4Mb memory macro using a memory cell built with two floating body transistors [2]. This paper presents a floating-body Z-RAM® memory cell [3] to fabricate a high-density low-latency and high-bandwidth 4Mb memory macro building block, targeted at the requirements of microprocessor caches. It uses a single transistor (1T), unlike traditional 1T1C DRAM [4], or six transistor 6T-SRAM memory cells [5].

Patent
16 Jun 2009
TL;DR: In this article, a file system that is supported by a nonvolatile memory that is directly connected to a memory bus, and placed side by side with a dynamic random access memory (DRAM), is described.
Abstract: Implementations of a file system that is supported by a non-volatile memory that is directly connected to a memory bus, and placed side by side with a dynamic random access memory (DRAM), are described.

Patent
Chris Nga Yee Avila, Jonathan Hsu, Alexander Kwok-Tung Mak, Jian Chen, Grishma Shah
16 Jun 2009
TL;DR: In a nonvolatile memory system, as described in this paper, data received from a host by a memory controller is transferred to an on-chip cache, and new data from the host displaces the previous data before it is written to the nonvolatile memory array.
Abstract: In a nonvolatile memory system, data received from a host by a memory controller is transferred to an on-chip cache, and new data from the host displaces the previous data before it is written to the nonvolatile memory array. A safe copy is maintained in on-chip cache so that if a program failure occurs, the data can be recovered and written to an alternative location in the nonvolatile memory array.
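A toy sketch of the recovery path: the controller keeps the safe copy and, on a program failure, retries at an alternative location so the data is not lost (all names, the failure model, and the block numbers are hypothetical):

```python
class FlakyArray:
    """Toy flash array in which block 7 always fails to program."""
    def __init__(self):
        self.cells = {}

    def program(self, block, data):
        if block == 7:
            return False          # simulated program failure
        self.cells[block] = data
        return True

def program_with_backup(cache, array, block, data, alternates):
    """Keep a safe copy in the controller's cache; on program failure,
    retry at an alternative block instead of losing the data."""
    cache[block] = data           # safe copy survives a failed program
    for target in [block] + list(alternates):
        if array.program(target, data):
            return target
    raise IOError("all candidate blocks failed to program")

arr = FlakyArray()
written_at = program_with_backup({}, arr, block=7, data=b"payload",
                                 alternates=[9])
```

The safe copy is only discarded once some location reports a successful program, so a failure mid-write never leaves the host's data unrecoverable.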

Patent
20 May 2009
TL;DR: In this article, a method is described comprising: determining that a minimum operation level of an integrated circuit (100) has been reached and that a sleep mode is therefore allowable; storing minimum operation context information to a RAM (115) in response to that determination; switching to sleep mode code (116) in the RAM; and transferring memory control from a primary memory controller (104) to a secondary memory controller (112), where only the secondary memory controller can control the RAM.
Abstract: A method comprising determining that a minimum operation level of an integrated circuit (100) has been reached and that a sleep mode is therefore allowable; storing minimum operation context information to a RAM (115) in response to determining that the minimum operation level has been reached; switching to a sleep mode code (116) in the RAM (115); and transferring memory control from a primary memory controller (104) to a secondary memory controller (112) wherein only the secondary memory controller (112) controls the RAM (115). The method may include storing the sleep mode code (116) and a wakeup code (117) in the RAM (115) in response to determining that sleep mode is allowable, where the wakeup code (117) restores a minimum operation context using the minimum operation context information stored in the RAM (115). The method may also include placing a plurality of integrated circuit power islands into a sleep mode and leaving a secondary memory controller power island (109) in a normal power mode.

Patent
16 Oct 2009
TL;DR: In this paper, the authors propose to use bit-map memory to reduce the amount of memory-to-memory copying required to establish a checkpoint in a post-image checkpointing scenario.
Abstract: System-directed checkpointing is enabled in otherwise standard computers through relatively straightforward augmentations to the computer's memory controller hub. Firmware routines executed by a control and dispatch unit that is normally part of any memory controller hub enable it to implement any of six different checkpointing strategies: post-image checkpointing in which an image of the system state at the time of the last checkpoint is maintained in a local shadow memory; post-image checkpointing in which an image of the system state at the time of the last checkpoint is maintained in a shadow memory located in a second, backup computer; post-image checkpointing using a bit-map memory, having one bit representing each data block in system memory, to reduce the amount of memory-to-memory copying required to establish a checkpoint; post-image checkpointing to a local shadow memory using two bit-map memories to enable normal processing to continue while the shadow is being updated; post-image checkpointing to a local shadow memory using a block-state memory that eliminates the need for any memory-to-memory copying; and local pre-image checkpointing that does not require a shadow memory. Since each of these implementations has advantages and disadvantages relative to the others, and since similar mechanisms are used in the memory controller hub for all of these options, it can be designed to support all of them, with hardwired or settable status bits defining which is to be supported in a given situation.
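The bit-map strategy copies only blocks written since the last checkpoint. A minimal sketch of that mechanism (using dictionaries as stand-ins for system and shadow memory is an illustrative simplification):

```python
def take_checkpoint(main_mem, shadow, dirty):
    """Copy only the blocks whose dirty bit was set since the last
    checkpoint into the shadow, then clear the bitmap."""
    copied = 0
    for blk in sorted(dirty):
        shadow[blk] = main_mem[blk]
        copied += 1
    dirty.clear()
    return copied

main_mem = {i: f"data{i}" for i in range(8)}
shadow = dict(main_mem)      # shadow holds the previous checkpoint's image
dirty = {2, 5}               # the controller hub marked two writes
main_mem[2], main_mem[5] = "new2", "new5"
copied = take_checkpoint(main_mem, shadow, dirty)
```

Only the two written blocks cross to the shadow, rather than the whole memory image, which is the copying reduction the bit-map memory buys.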