
Showing papers on "Memory controller" published in 2013


Patent
14 Feb 2013
TL;DR: This disclosure describes host-controller cooperation in managing NAND flash memory, where metadata can be provided to the host identifying whether each page of an erase unit has been released, and the host can then specifically command each of consolidation and erase using direct addressing.
Abstract: This disclosure provides for host-controller cooperation in managing NAND flash memory. The controller maintains information for each erase unit which tracks memory usage. This information assists the host in making decisions about specific operations, for example, initiating garbage collection, space reclamation, wear leveling or other operations. For example, metadata can be provided to the host identifying whether each page of an erase unit has been released, and the host can then specifically command each of consolidation and erase using direct addressing. By redefining host-controller responsibilities in this manner, much of the overhead associated with FTL functions can be substantially removed from the memory controller, with the host directly specifying physical addresses. This reduces performance unpredictability and overhead, thereby facilitating integration of solid state drives (SSDs) with other forms of storage. The disclosed techniques are especially useful for direct-attached and/or network-attached storage.

256 citations
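
The per-erase-unit bookkeeping described in the abstract above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the patented design: the `EraseUnit` class, its fields, and the "most released pages first" selection rule are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class EraseUnit:
    """Hypothetical controller-side record for one erase unit."""
    unit_id: int
    pages: int
    # Indexes of pages the host has released (their data is no longer needed).
    released: set[int] = field(default_factory=set)

    def release(self, page: int) -> None:
        if not 0 <= page < self.pages:
            raise ValueError("page index out of range")
        self.released.add(page)

    def valid_pages(self) -> list[int]:
        """Pages that still hold live data and must be copied before an erase."""
        return [p for p in range(self.pages) if p not in self.released]


def pick_reclaim_candidate(units: list[EraseUnit]) -> EraseUnit:
    """Assumed host policy: the unit with the most released pages is cheapest to consolidate."""
    return max(units, key=lambda u: len(u.released))


units = [EraseUnit(0, pages=4), EraseUnit(1, pages=4)]
units[0].release(1)
for page in (0, 2, 3):
    units[1].release(page)

victim = pick_reclaim_candidate(units)
print(victim.unit_id, victim.valid_pages())  # prints: 1 [1]
```

In this toy run the host would copy page 1 of unit 1 elsewhere and then erase the unit, addressing both operations directly.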


Proceedings ArticleDOI
04 Nov 2013
TL;DR: PHANTOM is the first demonstration of a practical, oblivious processor that can provide strong confidentiality guarantees when offloading computation to the cloud and is efficient in both area and performance.
Abstract: We introduce PHANTOM [1], a new secure processor that obfuscates its memory access trace. To an adversary who can observe the processor's output pins, all memory access traces are computationally indistinguishable (a property known as obliviousness). We achieve obliviousness through a cryptographic construct known as Oblivious RAM or ORAM. We first improve an existing ORAM algorithm and construct an empirical model for its trusted storage requirement. We then present PHANTOM, an oblivious processor whose novel memory controller aggressively exploits DRAM bank parallelism to reduce ORAM access latency and scales well to a large number of memory channels. Finally, we build a complete hardware implementation of PHANTOM on a commercially available FPGA-based server, and through detailed experiments show that PHANTOM is efficient in both area and performance. Accessing 4KB of data from a 1GB ORAM takes 26.2 µs (13.5 µs for the data to be available), a 32x slowdown over accessing 4KB from regular memory, while SQLite queries on a population database see 1.2-6x slowdown. PHANTOM is the first demonstration of a practical, oblivious processor and can provide strong confidentiality guarantees when offloading computation to the cloud.

249 citations


Proceedings ArticleDOI
07 Dec 2013
TL;DR: It is shown that any compression algorithm can be adapted to fit the requirements of LCP, and two previously proposed compression algorithms are adapted to LCP: Frequent Pattern Compression and Base-Delta-Immediate Compression.
Abstract: Data compression is a promising approach for meeting the increasing memory capacity demands expected in future systems. Unfortunately, existing compression algorithms do not translate well when directly applied to main memory because they require the memory controller to perform non-trivial computation to locate a cache line within a compressed memory page, thereby increasing access latency and degrading system performance. Prior proposals for addressing this performance degradation problem are either costly or energy inefficient. By leveraging the key insight that all cache lines within a page should be compressed to the same size, this paper proposes a new approach to main memory compression — Linearly Compressed Pages (LCP) — that avoids the performance degradation problem without requiring costly or energy-inefficient hardware. We show that any compression algorithm can be adapted to fit the requirements of LCP, and we specifically adapt two previously-proposed compression algorithms to LCP: Frequent Pattern Compression and Base-Delta-Immediate Compression. Evaluations using benchmarks from SPEC CPU2006 and five server benchmarks show that our approach can significantly increase the effective memory capacity (by 69% on average). In addition to the capacity gains, we evaluate the benefit of transferring consecutive compressed cache lines between the memory controller and main memory. Our new mechanism considerably reduces the memory bandwidth requirements of most of the evaluated benchmarks (by 24% on average), and improves overall performance (by 6.1%/13.9%/10.7% for single-/two-/four-core workloads on average) compared to a baseline system that does not employ main memory compression. LCP also decreases energy consumed by the main memory subsystem (by 9.5% on average over the best prior mechanism).

153 citations
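
The key insight in the abstract above, that compressing every cache line in a page to the same size keeps line lookup cheap, comes down to simple address arithmetic. The sketch below is a hedged illustration only: the function names and the example sizes are assumptions, and it ignores how lines that do not fit the target size would be handled.

```python
LINES_PER_PAGE = 64  # assumed: 4 KB page with 64-byte uncompressed lines


def lcp_offset(line_index: int, compressed_line_bytes: int) -> int:
    """Fixed compressed size per page: the offset is a single multiplication."""
    if not 0 <= line_index < LINES_PER_PAGE:
        raise ValueError("line index out of range")
    return line_index * compressed_line_bytes


def variable_size_offset(line_index: int, sizes: list[int]) -> int:
    """Contrast case: with per-line sizes, all preceding sizes must be summed."""
    return sum(sizes[:line_index])


# Line 5 in a page whose lines are all compressed to 16 bytes.
print(lcp_offset(5, 16))

# The same lookup when every line has its own compressed size.
print(variable_size_offset(5, [16, 32, 8, 64, 16, 24]))
```

The first lookup needs only the page's compressed line size, while the second needs the sizes of every earlier line, which is the kind of non-trivial computation the abstract says existing schemes impose on the memory controller.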


Proceedings ArticleDOI
03 Dec 2013
TL;DR: A novel, composable worst case analysis for DDR DRAM is presented that provides improved latency bounds compared to existing works by explicitly modeling the DRAM state and scales better with increasing number of requestors and memory speed.
Abstract: As multi-core systems are becoming more popular in real-time embedded systems, strict timing requirements for accessing shared resources must be met. In particular, a detailed latency analysis for Double Data Rate Dynamic RAM (DDR DRAM) is highly desirable. Several researchers have proposed predictable memory controllers to provide guaranteed memory access latency. However, the performance of such controllers sharply decreases as DDR devices become faster and the width of memory buses is increased. In this paper, we present a novel, composable worst case analysis for DDR DRAM that provides improved latency bounds compared to existing works by explicitly modeling the DRAM state. In particular, our approach scales better with increasing number of requestors and memory speed. Benchmark evaluations show up to 62% improvement in worst case task execution time compared to a competing predictable memory controller for a system with 8 requestors.

115 citations


Patent
31 Oct 2013
TL;DR: In this patent, the memory controller receives an indication of a row hammer event, identifies the row associated with the row hammer event, and sends one or more commands to the memory device to cause the device to perform a targeted refresh that will refresh the victim row.
Abstract: A memory controller issues a targeted refresh command. A specific row of a memory device can be the target of repeated accesses. When the row is accessed repeatedly within a time threshold (also referred to as “hammered” or a “row hammer event”), a physically adjacent row (a “victim” row) may experience data corruption. The memory controller receives an indication of a row hammer event, identifies the row associated with the row hammer event, and sends one or more commands to the memory device to cause the memory device to perform a targeted refresh that will refresh the victim row.

110 citations
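
The row hammer flow in the abstract above can be sketched as a small monitor. Several things here are assumptions made for illustration and are not taken from the patent: detection is done by counting activations in a sliding time window, the class and parameter names are hypothetical, and victims are taken to be the rows at index ±1, which may not match physical adjacency in a real device.

```python
from collections import defaultdict


class RowHammerMonitor:
    """Hypothetical detector: flags a row activated `threshold` times within `window_ns`."""

    def __init__(self, threshold: int, window_ns: int, num_rows: int) -> None:
        self.threshold = threshold
        self.window_ns = window_ns
        self.num_rows = num_rows
        self.activations: dict[int, list[int]] = defaultdict(list)

    def activate(self, row: int, now_ns: int) -> list[int]:
        """Record an activation; return the victim rows to refresh (empty if none)."""
        # Keep only activations that are still inside the time window.
        recent = [t for t in self.activations[row] if now_ns - t < self.window_ns]
        recent.append(now_ns)
        if len(recent) >= self.threshold:
            self.activations[row] = []  # restart counting after the targeted refresh
            # Assumed adjacency: neighbours by row index.
            return [r for r in (row - 1, row + 1) if 0 <= r < self.num_rows]
        self.activations[row] = recent
        return []


monitor = RowHammerMonitor(threshold=3, window_ns=1000, num_rows=8)
first = monitor.activate(5, now_ns=0)
second = monitor.activate(5, now_ns=10)
third = monitor.activate(5, now_ns=20)
print(first, second, third)
```

The third activation inside the window returns the neighbouring rows, which is the point at which a controller would issue the targeted refresh commands.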


Proceedings ArticleDOI
23 Jun 2013
TL;DR: An analysis of DDR4 DRAM's FGR feature shows that there is no one-size-fits-all option across a variety of applications, and Adaptive Refresh is presented, a simple yet effective mechanism that dynamically chooses the best FGR mode for each application and phase within the application.
Abstract: Recent DRAM specifications exhibit increasing refresh latencies. A refresh command blocks a full rank, decreasing available parallelism in the memory subsystem significantly, thus decreasing performance. Fine Granularity Refresh (FGR) is a feature recently announced as part of JEDEC's DDR4 DRAM specification that attempts to tackle this problem by creating a range of refresh options that provide a trade-off between refresh latency and frequency. In this paper, we first conduct an analysis of DDR4 DRAM's FGR feature, and show that there is no one-size-fits-all option across a variety of applications. We then present Adaptive Refresh (AR), a simple yet effective mechanism that dynamically chooses the best FGR mode for each application and phase within the application. When looking at the refresh problem more closely, we identify in high-density DRAM systems a phenomenon that we call command queue seizure, whereby the memory controller's command queue seizes up temporarily because it is full with commands to a rank that is being refreshed. To attack this problem, we propose two complementary mechanisms called Delayed Command Expansion (DCE) and Preemptive Command Drain (PCD). Our results show that AR does exploit DDR4's FGR effectively. However, once our proposed DCE and PCD mechanisms are added, DDR4's FGR becomes redundant in most cases, except in a few highly memory-sensitive applications, where the use of AR does provide some additional benefit. In all, our simulations show that the proposed mechanisms yield 8% (14%) mean speedup with respect to traditional refresh, at normal (extended) DRAM operating temperatures, for a set of diverse parallel applications.

108 citations


Patent
07 Oct 2013
TL;DR: In this paper, the host directly assigns physical addresses and performs logical-to-physical address translation in a manner that reduces or eliminates the need for a memory controller to handle these functions.
Abstract: This disclosure provides for improvements in managing multi-drive, multi-die or multi-plane NAND flash memory. In one embodiment, the host directly assigns physical addresses and performs logical-to-physical address translation in a manner that reduces or eliminates the need for a memory controller to handle these functions, and initiates functions such as wear leveling in a manner that avoids competition with host data accesses. A memory controller optionally educates the host on array composition, capabilities and addressing restrictions. Host software can therefore interleave write and read requests across dies in a manner unencumbered by memory controller address translation. For multi-plane designs, the host writes related data in a manner consistent with multi-plane device addressing limitations. The host is therefore able to “plan ahead” in a manner supporting host issuance of true multi-plane read commands.

95 citations


Patent
08 Nov 2013
TL;DR: In this article, a nonvolatile storage or memory device is accessed over a memory bus, and a controller coupled to the bus sends synchronous data access commands to the nonvolatile memory device and reads the response from the device bus based on an expected timing of a reply from the nonvolatile memory device.
Abstract: A nonvolatile storage or memory device is accessed over a memory bus. The memory bus has an electrical interface typically used for volatile memory devices. A controller coupled to the bus sends synchronous data access commands to the nonvolatile memory device, and reads the response from the device bus based on an expected timing of a reply from the nonvolatile memory device. The controller determines the expected timing based on when the command was sent, and characteristics of the nonvolatile memory device. The controller may not need all the electrical signal lines available on the memory bus, and could issue data access commands to different groups of nonvolatile memory devices over different groups of electrical signal lines. The memory bus may be available and configured for either use with a memory controller and volatile memory devices, or a storage controller and nonvolatile memory devices.

83 citations


Journal ArticleDOI
TL;DR: A novel hybrid SPM consisting of static random-access memory (SRAM) and nonvolatile memory (NVM) is proposed to take advantage of the ultralow leakage power and high density of the latter, together with a novel dynamic data management algorithm that makes use of the full potential of NVM.
Abstract: Embedded systems normally have a tight energy budget. Since the on-chip cache typically consumes 25%-50% of the processor's area and energy consumption, scratch pad memory (SPM), which is a software-controlled on-chip memory, has been widely adopted in many embedded systems due to its smaller area and lower power consumption. However, as the speed of the CMOS transistors increases along with density, leakage power consumption is becoming a critical issue for memory components with a large number of transistors. In this paper, we propose a novel hybrid SPM which consists of static random-access memory (SRAM) and nonvolatile memory (NVM) to take advantage of the ultralow leakage power and high density of the latter. A novel dynamic data management algorithm is also proposed to make use of the full potential of NVM. According to the experimental results, with the help of the proposed algorithm, the novel hybrid SPM architecture can reduce the memory access time by 18.17%, the dynamic energy by 24.29%, and the leakage power by 37.34% compared with a baseline pure SRAM SPM with the same area.

77 citations


Patent
20 Dec 2013
TL;DR: In this article, a storage controller is coupled with a flash memory module having multiple flash memory groups, each flash memory group corresponding to a distinct flash port in the storage controller, with each flash port comprising an associated processor.
Abstract: A storage controller is provided that contains multiple processors. In some embodiments, the storage controller is coupled to a flash memory module having multiple flash memory groups, each flash memory group corresponding to a distinct flash port in the storage controller, each flash port comprising an associated processor. Each processor handles a portion of one or more host commands, including reads and writes, allowing multiple parallel pipelines to handle one or more host commands simultaneously.

77 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper proposes a mechanism, pairing up a processor-side load criticality predictor with a lean memory controller that prioritizes load requests based on ranking information supplied from the processor side, and demonstrates that this mechanism can improve performance significantly on a CMP, with minimal overhead and virtually no changes to the processor itself.
Abstract: We hypothesize that performing processor-side analysis of load instructions, and providing this pre-digested information to memory schedulers judiciously, can increase the sophistication of memory decisions while maintaining a lean memory controller that can take scheduling actions quickly. This is increasingly important as DRAM frequencies continue to increase relative to processor speed. In this paper we propose one such mechanism, pairing up a processor-side load criticality predictor with a lean memory controller that prioritizes load requests based on ranking information supplied from the processor side. Using a sophisticated multi-core simulator that includes a detailed quad-channel DDR3 DRAM model, we demonstrate that this mechanism can improve performance significantly on a CMP, with minimal overhead and virtually no changes to the processor itself. We show that our design compares favorably to several state-of-the-art schedulers.

Journal ArticleDOI
TL;DR: An energy profiler tool for the systems that use ARM7TDMI processors is developed by embedding the model parameters in an instruction-level profiler from the SimpleScalar toolset which provides valuable information and guidelines for software energy optimization.
Abstract: Estimating the energy consumption of applications is a key aspect in optimizing the energy consumption of embedded systems. This paper proposes a simple yet accurate instruction-level energy estimation model for embedded systems. As a case study, the model parameters were determined for a commonly used ARM7TDMI-based microcontroller. The total energy includes the energy consumption of the processor core, Flash memory, memory controller, and SRAM. The model parameters are instruction opcode, number of shift operations, register bank bit flips, instruction weights and their Hamming distance, and different types of memory accesses. Also, the effect of pipeline stalls has been considered. In order to validate the proposed model, a physical hardware platform equipped with energy measurement capabilities was developed. We have conducted experiments on several embedded applications from the MiBench benchmark suite, and the results show less than 6% error in the energy consumption estimation. We have also developed an energy profiler tool for systems that use ARM7TDMI processors by embedding the model parameters in an instruction-level profiler from the SimpleScalar toolset, which provides valuable information and guidelines for software energy optimization.

Proceedings ArticleDOI
18 Mar 2013
TL;DR: A conservative open-page policy is proposed that improves average-case performance of a FRT controller in terms of bandwidth and latency without sacrificing real-time guarantees and shows that the overall average-case performance improves if there is at least one FRT or SRT application that exploits locality.
Abstract: Complex Systems-on-Chips (SoC) are mixed time-criticality systems that have to support firm real-time (FRT) and soft real-time (SRT) applications running in parallel. This is challenging for critical SoC components, such as memory controllers. Existing memory controllers focus on either firm real-time or soft real-time applications. FRT controllers use a close-page policy that maximizes worst-case performance and ignore opportunities to exploit locality, since it cannot be guaranteed. Conversely, SRT controllers try to reduce latency and consequently processor stalling by speculating on locality. They often use an open-page policy that sacrifices guaranteed performance, but is beneficial in the average case. This paper proposes a conservative open-page policy that improves average-case performance of a FRT controller in terms of bandwidth and latency without sacrificing real-time guarantees. As a result, the memory controller efficiently handles both FRT and SRT applications. The policy keeps pages open as long as possible without sacrificing guarantees and captures locality in this window. Experimental results show that on average 70% of the locality is captured for applications in the CHStone benchmark, reducing the execution time by 17% compared to a close-page policy. The effectiveness of the policy is also evaluated in a multi-application use-case, and we show that the overall average-case performance improves if there is at least one FRT or SRT application that exploits locality.

Patent
Ravi H. Motwani, Kiran Pangal
24 Sep 2013
TL;DR: In this patent, a read module of a memory controller may read a codeword stored in a memory, determine a number of hard errors in the codeword, and store ECP information associated with the hard errors.
Abstract: Methods, apparatuses, and systems related to use of error correction pointers (ECPs) to handle hard errors in memory are described herein. In embodiments, a read module of a memory controller may read a codeword stored in a memory. The read module may determine a number of hard errors in the codeword. Responsive to a determination that the number of hard errors exceeds a threshold, the read module may store ECP information associated with the hard errors. The read module may include an error correction code (ECC) module to perform an ECC process on the codeword. The read module may use the ECP information to decode the codeword to recover the data responsive to a determination that the ECC process failed. Other embodiments may be described and claimed.
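
The read flow in the abstract above (try ECC first, fall back to the stored ECP information if ECC fails) can be sketched as follows. This is a hedged illustration, not the claimed method: an ECP entry is modelled here as a (bit position, replacement value) pair, all function names are hypothetical, and the "ECC" is a toy even-parity check standing in for a real error correction code.

```python
from typing import Callable, Optional

Bits = list[int]
EcpEntry = tuple[int, int]  # assumed layout: (bit position, replacement value)


def apply_ecp(codeword: Bits, entries: list[EcpEntry]) -> Bits:
    """Overwrite the bits at known-bad positions with their recorded values."""
    patched = list(codeword)
    for position, value in entries:
        patched[position] = value
    return patched


def read_with_ecp(
    raw: Bits,
    entries: list[EcpEntry],
    ecc_decode: Callable[[Bits], Optional[Bits]],
) -> Optional[Bits]:
    """Try plain ECC decoding first; use the ECP entries only if that fails."""
    data = ecc_decode(raw)
    if data is not None:
        return data
    return ecc_decode(apply_ecp(raw, entries))


def toy_parity_decode(codeword: Bits) -> Optional[Bits]:
    """Stand-in for a real ECC: last bit is even parity; detects but cannot correct."""
    if sum(codeword) % 2 != 0:
        return None
    return codeword[:-1]


# Intended codeword is [1, 0, 1, 1, 0, 0, 1, 0]; the cell at position 2 is stuck at 0.
raw = [1, 0, 0, 1, 0, 0, 1, 0]
recovered = read_with_ecp(raw, [(2, 1)], toy_parity_decode)
print(recovered)
```

Passing the decoder in as a callable keeps the sketch independent of any particular ECC scheme.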

Patent
Yan Li
14 Mar 2013
TL;DR: In this article, a mechanism is presented by which memory circuits, such as NAND-type flash memories, protect themselves from temporary and short power drops by detecting the supply voltage dropping below a function voltage for a period of time.
Abstract: A mechanism is presented for memory circuits, such as NAND-type flash memories, to autonomously protect themselves from temporary and short power drops. A detection mechanism looks for the supply voltage to drop below a function voltage for a period of time. When such an event occurs, a suspend mechanism is activated, and after completing the last micro-operation (such as a program pulse) the memory freezes. When power is again stable at an operational level, the suspended operation is resumed. The memory controller can then be notified upon occurrence of such voltage drop by polling a special status bit. Examples of how the pausing can be implemented include altering of clock signals and suspending sub-phases of larger operations.

Patent
30 Apr 2013
TL;DR: In this paper, a load reduction dual in-line memory module (LRDIMM) is used to compensate for the delay introduced by the load reduction buffer in the data path.
Abstract: A load reduction dual in-line memory module (LRDIMM) is similar to a registered dual in-line memory module (RDIMM) in which control signals are synchronously buffered, but the LRDIMM includes a load reduction buffer (LRB) in the data path as well. To make an LRDIMM which appears compatible with RDIMMs on a system memory bus, the serial presence detector (SPD) of the LRDIMM is programmed with modified latency support and minimum delay values. When the dynamic random access memory (DRAM) devices of the LRDIMM are subsequently set up by the host at boot time based on the parameters provided by the SPD, selected latency values are modified on the fly in an enhanced register phase lock loop (RPLL) device. This has the effect of compensating for the delay introduced by the LRB without violating DRAM constraints, and provides memory bus timing for an LRDIMM that is indistinguishable from that of an RDIMM.

Patent
20 Aug 2013
TL;DR: In this paper, a memory module comprises a module control device to receive command signals from the memory controller and to output module command signals and module control signals to a plurality of buffer circuits to control data paths in the buffer circuits.
Abstract: A memory module is operable in a memory system with a memory controller. The memory module comprises a module control device to receive command signals from the memory controller and to output module command signals and module control signals. The module command signals are provided to memory devices organized in groups, each group including at least one memory device, while the module control signals are provided to a plurality of buffer circuits to control data paths in the buffer circuits. The plurality of buffer circuits are associated with respective groups of memory devices and are distributed across a surface of the memory module such that each module control signal arrives at the plurality of buffer circuits at different points in time. The plurality of buffer circuits are configured to align read data signals received from the memory devices such that the read data signals are transmitted to the memory controller from the memory module substantially aligned with each other and in accordance with a read latency parameter of the memory system.

Journal ArticleDOI
TL;DR: An analytical model is presented that computes the worst-case delay, also known as Upper Bound Delay (UBD), that a memory request can suffer due to memory interferences generated by other co-running tasks, thereby ensuring the time composability property and enabling the use of multicores in integrated architectures.
Abstract: Multicore processors are an effective solution to cope with the performance requirements of real-time embedded systems due to their good performance-per-watt ratio and high performance capabilities. Unfortunately, their use in integrated architectures such as IMA or AUTOSAR is limited by the fact that multicores do not guarantee a time composable behavior for the applications: the WCET of a task depends on inter-task interferences introduced by other tasks running simultaneously. This article focuses on the off-chip memory system: the hardware shared resource with the highest impact on the WCET and hence the main impediment for the use of multicores in integrated architectures. We present an analytical model that computes the worst-case delay, also known as Upper Bound Delay (UBD), that a memory request can suffer due to memory interferences generated by other co-running tasks. By considering the UBD in the WCET analysis, the resulting WCET estimation is independent from the other tasks, hence ensuring the time composability property and enabling the use of multicores in integrated architectures. We propose a memory controller for hard real-time multicores compliant with the analytical model that implements extra hardware features to deal with refresh operations and interferences generated by co-running non hard real-time tasks.

Proceedings ArticleDOI
07 Dec 2013
TL;DR: This paper is the first to characterize the relationship between the power delivery network and the maximum supported activity in a 3D-stacked DRAM memory device and defines an IR-drop-aware scheduler that encodes a number of activity constraints.
Abstract: Many of the pins on a modern chip are used for power delivery. If fewer pins were used to supply the same current, the wires and pins used for power delivery would have to carry larger currents over longer distances. This results in an "IR-drop" problem, where some of the voltage is dropped across the long resistive wires making up the power delivery network, and the eventual circuits experience fluctuations in their supplied voltage. The same problem also manifests if the pin count is the same, but the current draw is higher. IR-drop can be especially problematic in 3D DRAM devices because (i) low cost (few pins and TSVs) is a high priority, (ii) 3D-stacking increases current draw within the package without providing proportionate room for more pins, and (iii) TSVs add to the resistance of the power delivery network. This paper is the first to characterize the relationship between the power delivery network and the maximum supported activity in a 3D-stacked DRAM memory device. The design of the power delivery network determines if some banks can handle less activity than others. It also determines the combinations of bank activities that are permissible. Both of these attributes can feed into architectural policies. For example, if some banks can handle more activities than others, the architecture benefits by placing data from high-priority threads or data from frequently accessed pages into those banks. The memory controller can also derive higher performance if it schedules requests to specific combinations of banks that do not violate the IR-drop constraint. We first define an IR-drop-aware scheduler that encodes a number of activity constraints. This scheduler, however, falls short of the performance of an unrealistic ideal PDN that imposes no scheduling constraints by 4.6×. By addressing starvation phenomena in the scheduler, the gap is reduced to only 1.47×. Finally, by adding a dynamic page placement policy, performance is within 1.2× of the unrealistic ideal PDN. We thus show that architectural policies can help mitigate the limitations imposed by a cost-constrained design.

Proceedings Article
11 Jun 2013
TL;DR: Measurements of the 3D-IC show that the targeted 12.8 GByte/s bandwidth is achieved in worst case conditions, while offering a 0.9 pJ/bit 3D I/O link power efficiency.
Abstract: 3D Integrated Circuit (3D-IC) opens architecture opportunities for improved SoC-to-memory interconnect bandwidth between dies. This paper presents the design of a two-tier 3D-IC composed of one NoC-based MPSoC and one multi-channel WideIO mobile SDRAM stacked in a face-to-back configuration. Measurements of the 3D-IC show that the targeted 12.8 GByte/s bandwidth is achieved in worst case conditions, while offering a 0.9 pJ/bit 3D I/O link power efficiency.

Patent
Sanghoon Lee, Sung-hwan Bae, Jong-Nam Baek, Hyun-seok Kim, Sung-Bin Kim
18 Jan 2013
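TL;DR: A read method for a flash memory system updates the index of a selected block in a wear-out table that indexes each block of the flash memory and, when a read retry is requested on that block, sets the start read level by referring to a read retry table corresponding to the wear-out degree in that index.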
Abstract: A read method in a flash memory system containing a flash memory and a memory controller includes updating a selected one of indexes of a selected one of blocks of the flash memory, in a wear-out table for indexing each of the blocks of the flash memory, and setting a start read level to start read retry on the selected block by referring to a read retry table corresponding to a wear-out degree included in the selected index when a current request of read retry on the selected block is received.

Patent
25 Feb 2013
TL;DR: In this article, a method for providing memory cell bias information for use in memory operations is presented, where one or more memory die are selected from a group of memory die, and one or more memory blocks are selected from the selected one or more memory die.
Abstract: Disclosed is an apparatus and method for providing memory cell bias information for use in memory operations. One or more memory die are selected from a group of memory die, and one or more memory blocks are selected from the selected one or more memory die. A group of cells are programmed within the selected memory blocks, and one or more distributions of cell program levels associated with a group of wordlines are determined. A bias value for each wordline is then generated based on comparing one or more program levels in a distribution of program levels associated with the respective wordline with predetermined programming levels. The bias values are stored in a lookup table that is configured to be accessible at runtime by a memory controller for retrieval of the bias value during a program or read operation.

Proceedings ArticleDOI
07 Nov 2013
TL;DR: In order to improve endurance of MLC ReRAM, which is much smaller than that of single-level cell (SLC) ReRAM due to the complex programming method, Dynamic Data ReMapping (DDRM) is proposed to selectively regulate memory blocks from IDM state back to complete data mapping (CDM) state.
Abstract: Phase change memory (PCM) has been widely studied as a potential DRAM alternative. The multi-level cell (MLC) can further increase the memory density and reduce the fabrication cost by storing multiple bits in a single cell. Nevertheless, large write power, high write latency, as well as reliability issues resulting from resistance drift, pose challenges for MLC PCM based memory design. In contrast, the emerging Resistive Random Access Memory (ReRAM), which has MLC properties similar to PCM, demonstrates better performance and energy efficiency compared to PCM. In addition, due to the physical switching behaviors of the ReRAM cell, the resistance drift phenomenon does not exist. In this paper, we propose a low power MLC ReRAM design. We first study the programming method of MLC ReRAM and identify that programming latency and energy are highly dependent on the data pattern written to the cell. Based on this observation, we propose incomplete data mapping (IDM), which maps an eight-level-cell into six states to prevent the time/energy consuming data patterns from appearing in the cell. Furthermore, in order to improve endurance of MLC ReRAM, which is much smaller than that of single-level cell (SLC) ReRAM due to the complex programming method, we propose Dynamic Data ReMapping (DDRM) to selectively regulate memory blocks from IDM state back to complete data mapping (CDM) state. We demonstrate that the proposed design can work effectively with existing error-correction schemes but requires much smaller space overhead. Experimental results show that IDM can reduce the energy performance by at most 15% with negligible performance overhead. By combining the DDRM with existing error-correction scheme, DDRM can improve the memory lifetime by 2.75× compared with conventional memory architectures.

Proceedings ArticleDOI
18 Mar 2013
TL;DR: This paper contributes a real-time multi-channel memory controller architecture with a new programmable Multi-Channel Interleaver unit, and a novel method for logical-to-physical address translation that enables interleaving memory requests across multiple memory channels at different granularities.
Abstract: Optimal utilization of a multi-channel memory, such as Wide IO DRAM, as shared memory in multi-processor platforms depends on the mapping of memory clients to the memory channels, the granularity at which the memory requests are interleaved in each channel, and the bandwidth and memory capacity allocated to each memory client in each channel. Firm real-time applications in such platforms impose strict requirements on shared memory bandwidth and latency, which must be guaranteed at design-time to reduce verification effort. However, there is currently no real-time memory controller for multi-channel memories, and there is no methodology to optimally configure multi-channel memories in real-time systems. This paper has four key contributions: (1) A real-time multi-channel memory controller architecture with a new programmable Multi-Channel Interleaver unit. (2) A novel method for logical-to-physical address translation that enables interleaving memory requests across multiple memory channels at different granularities. (3) An optimal algorithm based on an Integer Linear Program (ILP) formulation to map memory clients to memory channels considering their communication dependencies, and to configure the memory controller for minimum bandwidth utilization. (4) We experimentally evaluate the run-time of the algorithm and show that an optimal solution can be found within 15 minutes for realistically sized problems. We also demonstrate configuring a multi-channel Wide IO DRAM in a High-Definition (HD) video and graphics processing system to emphasize the effectiveness of our approach.
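
To make the idea of interleaving at a configurable granularity concrete, here is a minimal sketch of a generic block-interleaved address split. It is not the translation method from the paper, which the abstract does not specify; the function name `interleave` and its parameters are assumptions chosen for illustration.

```python
from typing import Tuple


def interleave(addr: int, num_channels: int, granularity: int) -> Tuple[int, int]:
    """Map a logical byte address to (channel, channel-local byte address).

    Consecutive blocks of `granularity` bytes are assigned to channels in
    round-robin order (an assumed policy, not the paper's).
    """
    if addr < 0 or num_channels <= 0 or granularity <= 0:
        raise ValueError("addr must be >= 0; num_channels and granularity > 0")
    block, offset = divmod(addr, granularity)   # which block, and where inside it
    channel = block % num_channels              # round-robin channel choice
    local_block = block // num_channels         # block index within that channel
    return channel, local_block * granularity + offset


# The same logical address lands differently as the granularity changes.
print(interleave(300, num_channels=4, granularity=64))
print(interleave(300, num_channels=2, granularity=128))
```

A finer granularity spreads a single large request over more channels, while a coarser one keeps it within fewer channels; a per-client trade-off of this kind is what a configuration step would have to settle.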

Journal ArticleDOI
TL;DR: The proposed controller is evaluated by mapping previously proposed DRAM scheduling, address mapping, refresh scheduling, and power management algorithms onto PARDIS, a programmable memory controller that can meet the performance requirements of a high-speed DDRx interface.
Abstract: Modern memory controllers employ sophisticated address mapping, command scheduling, and power management optimizations to alleviate the adverse effects of DRAM timing and resource constraints on system performance. A promising way of improving the versatility and efficiency of these controllers is to make them programmable, a proven technique that has seen wide use in other control tasks, ranging from DMA scheduling to NAND Flash and directory control. Unfortunately, the stringent latency and throughput requirements of modern DDRx devices have rendered such programmability largely impractical, confining DDRx controllers to fixed-function hardware. This article presents the instruction set architecture (ISA) and hardware implementation of PARDIS, a programmable memory controller that can meet the performance requirements of a high-speed DDRx interface. The proposed controller is evaluated by mapping previously proposed DRAM scheduling, address mapping, refresh scheduling, and power management algorithms onto PARDIS. Simulation results show that the average performance of PARDIS comes within 8% of fixed-function hardware for each of these techniques; moreover, by enabling application-specific optimizations, PARDIS improves system performance by 6 to 17% and reduces DRAM energy by 9 to 22% over four existing memory controllers.

Journal ArticleDOI
TL;DR: This paper proposes a highly energy-efficient DRAM subsystem for next-generation 3-D-integrated SoCs, consisting of an SDR/DDR 3-D-DRAM controller and an attached 3-D-DRAM cube with fine-grained access and a flexible (WIDE-IO) interface.
Abstract: Energy efficiency is the major optimization criterion for systems-on-chip (SoCs) for mobile devices (smartphones and tablets). Through-silicon via (TSV) technology enables 3-D integration of dies and the heterogeneous stacking of multiple memory or logic layers, allowing increased bandwidth and lower energy consumption of the memory interface compared to traditional approaches. In this paper, we explore the 3-D-DRAM architecture design space. The result is an optimized 2 Gb 3-D-DRAM, which shows an 83% lower energy/bit than a 2 Gb device. Furthermore, we propose a highly energy-efficient DRAM subsystem for next-generation 3-D-integrated SoCs, consisting of an SDR/DDR 3-D-DRAM controller and an attached 3-D-DRAM cube with fine-grained access and a flexible (WIDE-IO) interface. We assess the energy efficiency using a synthesizable model of the SDR/DDR 3-D-DRAM channel controller (CC) as well as functional models of the 3-D-stacked DRAM, including an accurate power estimation engine. We also investigate different DRAM families (WIDE IO SDR/DDR, LPDDR, and LPDDR2) and densities from 256 Mb to 4 Gb per channel. The implementation results of the proposed 3-D-DRAM subsystem show that energy-optimized accesses to the 3-D-DRAM enable up to 50% energy savings compared to standard accesses. To the best of our knowledge, this is the first design space exploration for 3-D-stacked DRAM considering different technologies based on real-world physical data and the first design of a 3-D-DRAM CC and 3-D-DRAM model featuring co-optimization of memory and controller architecture.

Patent
15 Mar 2013
TL;DR: A memory controller of a data storage device, which communicates with a host, is configurable to have at least two different pinout assignments for interfacing with respective different types of memory devices.
Abstract: A memory controller of a data storage device, which communicates with a host, is configurable to have at least two different pinout assignments for interfacing with respective different types of memory devices. Each pinout assignment corresponds to a specific memory interface protocol. Each memory interface port of the memory controller includes port buffer circuitry configurable for different functional signal assignments, based on the selected memory interface protocol to be used. The interface circuitry configuration for each memory interface port is selectable by setting a predetermined port or registers of the memory controller.

Proceedings ArticleDOI
23 Jun 2013
TL;DR: MGuard is a programmable guard integrated with the advanced memory buffer of FB-DIMM that continuously monitors all memory traffic and detects system integrity violations, offering a hardware drop-in solution transparent to the host CPU and memory controller.
Abstract: Increasingly, cyber attacks (e.g., kernel rootkits) target the inner rings of a computer system, and they have seriously undermined the integrity of entire computer systems. To eliminate these threats, it is imperative to develop innovative solutions running below the attack surface. This paper presents MGuard, a new innermost-ring solution for inspecting system integrity that is directly integrated with the DRAM DIMM devices. More specifically, we design a programmable guard that is integrated with the advanced memory buffer of FB-DIMM to continuously monitor all memory traffic and detect system integrity violations. Unlike existing approaches that are either snapshot-based or lack compatibility and flexibility, MGuard continuously monitors the integrity of all the outer rings, including both the OS kernel and the hypervisor of interest, with greater extensibility enabled by a programmable interface. It offers a hardware drop-in solution transparent to the host CPU and memory controller. Moreover, MGuard is isolated from the host software and hardware, leading to strong security against remote attackers. Our simulation-based experimental results show that MGuard introduces no speed overhead, and is able to detect nearly all the OS-kernel and hypervisor control data related rootkits we tested.

Journal ArticleDOI
TL;DR: This paper presents FlashPower, a detailed power model for the two most popular variants of NAND flash, namely, the single-level cell (SLC) and 2-bit Multi-Level Cell (MLC) based flash memory chips.
Abstract: Flash is the most popular solid-state memory technology used today. A range of consumer electronics products, such as cell-phones and music players, use flash memory for storage and flash memory is increasingly displacing hard disk drives as the primary storage device in laptops, desktops, and servers. There is a rich microarchitectural design space for flash memory, and there are several architectural options for incorporating flash into the memory hierarchy. Exploring this design space requires detailed insights into the power characteristics of flash memory. In this paper, we present FlashPower, a detailed power model for the two most popular variants of NAND flash, namely, the single-level cell (SLC) and 2-bit Multi-Level Cell (MLC) based flash memory chips. FlashPower is built on top of CACTI, a widely used tool in the architecture community for studying various memory organizations. FlashPower takes several parameters like the device technology, microarchitectural layout, bias voltages and workload parameters as input to estimate the power consumption of a flash chip during its various operating modes. We validate FlashPower against chip power measurements from several different manufacturers and show that our results are comparable to the actual chip measurements. We illustrate the versatility of the tool in a design space exploration of power optimal flash memory array configurations.

Journal ArticleDOI
28 Mar 2013
TL;DR: An asymmetric 6.4-Gb/s memory interface for a wide range of DIMM configurations for desktop and server applications using a fly-by quadrature forwarded clock to enable fast startup and power-mode transitions on the DRAM and per-bit timing adjustment on the controller to enable the high-speed signaling.
Abstract: The emergence of cloud computing has driven the demand for high-density, low-latency and high-speed memory interfaces. For such applications the use of multiple dual-inline memory modules (DIMMs) with multiple ranks enables time-efficient processing of high-volume data. However, the deterioration of the channel frequency response due to the presence of DIMM connectors and multiple ranks makes it challenging to perform low-power read and write (R/W) operations at high-speed. Recent works have demonstrated the use of near-ground signaling (NGS) for low-power operation and signal-integrity enhancement with the aid of transmit supply regulation. In contrast to their differential nature, this paper introduces a single-ended NGS transceiver that achieves 6.4Gb/s R/W operations with the aid of low-power equalization and in-situ reference-voltage calibration over a 3.5" total FR4 PCB routing with more than 25mm of package traces in a dual-rank DIMM memory interface system.