Author
Abraham Mathews
Bio: Abraham Mathews is an academic researcher from IBM. The author has contributed to research in topics: eDRAM & Dram. The author has an hindex of 8, co-authored 18 publications receiving 266 citations.
Topics: eDRAM, Dram, Logic gate, Clock signal, Synchronous circuit
Papers
More filters
••
IBM1
TL;DR: A 1.35 ns random access and 1.7 ns-random-cycle SOI embedded-DRAM macro has been developed for the POWER7™ high-performance microprocessor, allowing the embedded DRAM to operate reliably without constraining of the microprocessor voltage supply windows.
Abstract: A 1.35 ns random access and 1.7 ns-random-cycle SOI embedded-DRAM macro has been developed for the POWER7™ high-performance microprocessor. The macro employs a 6 transistor micro sense-amplifier architecture with extended precharge scheme to enhance the sensing margin for product quality. The detailed study shows a 67% bit-line power reduction with only 1.7% area overhead, while improving a read zero margin by more than 500ps. The array voltage window is improved by the programmable BL voltage generator, allowing the embedded DRAM to operate reliably without constraining of the microprocessor voltage supply windows. The 2.5nm gate oxide transistor cell with deep-trench capacitor is accessed by the 1.7 V wordline high voltage (VPP) with V WL low voltage (VWL), and both are generated internally within the microprocessor. This results in a 32 MB on-chip L3 on-chip-cache for 8 cores in a 567 mm POWER7™ die.
63 citations
•
IBM1
TL;DR: In this paper, a distributed charge pump system uses a delay element and frequency dividers to generate out of phase pump clock signals that drive different charge pumps, to offset peak current clock edges for each charge pump and thereby reduce overall peak power.
Abstract: A distributed charge pump system uses a delay element and frequency dividers to generate out of phase pump clock signals that drive different charge pumps, to offset peak current clock edges for each charge pump and thereby reduce overall peak power. Clock signal division and phase offset may be extended to multiple levels for further smoothing of the pump clock signal transitions. A dual frequency divider may be used which receives the clock signal and its complement, and generates two divided signals that are 90° out of phase. In an illustrative embodiment the clock generator comprises a variable-frequency clock source, and a voltage regulator senses an output voltage of the charge pumps, generates a reference voltage based on a currently selected frequency of the variable-frequency clock source, and temporarily disables the charge pumps (by turning off local pump clocks) when the output voltage is greater than the reference voltage.
40 citations
••
IBM1
TL;DR: This high performance DRAM macro is used to construct a large 32MB L3 cache on-chip, eliminating delay, area and power from the off-chip interface, simultaneously improving system performance, reducing cost, power and soft error vulnerability.
Abstract: Logic-based embedded DRAM has matured into a wide range of ASIC applications, SRAM replacements [1] and off-chip caches for microprocessors [2]. While embedded DRAM has been leveraged in supercomputers such as IBM's BlueGene/L [3], it's use has been limited to moderate performance bulk logic technologies. Although prototypes have been demonstrated [4], DRAM has yet to be embedded on a high performance microprocessor. This paper discloses an SOI DRAM macro implemented on-chip with the IBM POWER7™ high performance microprocessor [5], and introduces enhancements to the micro sense amp (µSA) architecture [6]. This high performance DRAM macro is used to construct a large 32MB L3 cache on-chip, eliminating delay, area and power from the off-chip interface, simultaneously improving system performance, reducing cost, power and soft error vulnerability. Figure 19.1.1a shows an SEM of the 45nm SOI DRAM Device and Deep Trench (DT) capacitor [7]. DT offers 25x more capacitance than planar structures and was also utilized to reduce on-chip voltage island supply noise.
39 citations
•
IBM1
TL;DR: In this article, a two-phase charging circuit, cross-coupled transistors connected to output nodes of the switched capacitors, and a pump output connected to source terminals of the transistors are described.
Abstract: A switched-capacitor charge pump comprises a two-phase charging circuit, cross-coupled transistors connected to output nodes of the switched capacitors, and a pump output connected to source terminals of the cross-coupled transistors. The charge pump has side transistors for boosting charge transfer, and gating logic of the side transistors includes level shifters which control connections to the pump output or a reference voltage. Negative and positive charge pump embodiments are provided. The charging circuit advantageously utilizes non-overlapping wide and narrow clock signals to generate multiple gating signals. The pump clock circuit preferably provides independent, programmable adjustment of the widths of the wide and narrow clock signals. An override mode can be provided using clamping circuits which shunt the pump output to the second nodes of the switched capacitors.
33 citations
••
TL;DR: A single voltage supply, 1 MB cache subsystem prototype that integrates 2 GHz embedded DRAM (eDRAM) macros with on- chip word-line voltage supply generation, a 4 Kb one-time-programmable read-only memory (OTPROM) for redundancy and repair control, and on-chip OTPROM programming voltage generation, clock generation and distribution are described.
Abstract: We describe a single voltage supply, 1 MB cache subsystem prototype that integrates 2 GHz embedded DRAM (eDRAM) macros with on-chip word-line voltage supply generation , a 4 Kb one-time-programmable read-only memory (OTPROM) for redundancy and repair control, on-chip OTPROM programming voltage generation, clock generation and distribution, array built-in self-test circuitry (ABIST), user logic and pervasive logic. The eDRAM employs a programmable pipeline, achieving 1.8 ns latency, and features concurrent refresh capability.
25 citations
Cited by
More filters
••
TL;DR: The studies based on the proposed scaling methodology show that in-plane STT-MRAM will outperform SRAM from 15 nm node, while its perpendicular counterpart requires further innovations in MTJ material in order to overcome the poor write performance scaling from 22 nm node onwards.
Abstract: This paper explores the scalability of in-plane and perpendicular MTJ based STT-MRAMs from 65 nm to 8 nm while taking into consideration realistic variability effects. We focus on the read and write performances of a STT-MRAM based cache rather than the obvious advantages such as the denser bit-cell and zero static power. An accurate MTJ macromodel capturing key MTJ properties was adopted for efficient Monte Carlo simulations. For the simulation of access devices and peripheral circuitries, ITRS projected transistor parameters were utilized and calibrated using the MASTAR tool that has been widely used in industry. 6T SRAM and STT-MRAM arrays were implemented with aggressive assist schemes to mimic industrial memory designs. A constant JC0·RA/VDD scaling scenario was used which to the first order gives the optimal balance between read and write margins of STT-MRAMs. The thermal stability factor ensuring a 10 year retention time was obtained by adjusting the free layer thickness as well as assuming improvement in the crystalline anisotropy. Our studies based on the proposed scaling methodology show that in-plane STT-MRAM will outperform SRAM from 15 nm node, while its perpendicular counterpart requires further innovations in MTJ material in order to overcome the poor write performance scaling from 22 nm node onwards.
322 citations
••
14 Oct 2017TL;DR: DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, is proposed to provide both powerful computing capability and large memory capacity/bandwidth to address the memory wall problem in traditional von Neumann architecture.
Abstract: Data movement between the processing units and the memory in traditional von Neumann architecture is creating the “memory wall” problem. To bridge the gap, two approaches, the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory) have been studied. However, the first one has strong computing capability but limited memory capacity/bandwidth, whereas the second one is the exact the opposite.To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions with the combination of the functionally complete Boolean logic operations and the proposed hierarchical internal data movement designs. We further optimize DRISA to achieve high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations.CCS CONCEPTS• Hardware → Dynamic memory; • Computer systems organization → reconfigurable computing; Neural networks;
315 citations
••
09 Mar 2015
TL;DR: DESTINY is presented, a microarchitecture-level tool for modeling 3D (and 2D) cache designs using SRAM, embedded DRAM (eDRAM), spin transfer torque RAM (STT-RAM), resistive RAM (ReRAM) and phase change RAM (PCM), and has been validated against industrial cache prototypes.
Abstract: The continuous drive for performance has pushed the researchers to explore novel memory technologies (e.g. non-volatile memory) and novel fabrication approaches (e.g. 3D stacking) in the design of caches. However, a comprehensive tool which models both conventional and emerging memory technologies for both 2D and 3D designs has been lacking. We present DESTINY, a microarchitecture-level tool for modeling 3D (and 2D) cache designs using SRAM, embedded DRAM (eDRAM), spin transfer torque RAM (STT-RAM), resistive RAM (ReRAM) and phase change RAM (PCM). DESTINY facilitates design-space exploration across several dimensions, such as optimizing for a target (e.g. latency or area) for a given memory technology, choosing the suitable memory technology or fabrication method (i.e. 2D v/s 3D) for a desired optimization target etc. DESTINY has been validated against industrial cache prototypes. We believe that DESTINY will drive architecture and system-level studies and will be useful for researchers and designers.
142 citations
••
01 Jan 2015TL;DR: An overview of the history and the current status of the various spintronic devices being pursued by the research community is provided, and how spin-based components are integrated into a computing system and the advantages that result are described.
Abstract: As the end draws near for Moore's law, the search for low-power alternatives to complementary metal–oxide–semiconductor (CMOS) technology is intensifying. Among the various post-CMOS candidates, spintronic devices have gained special attention for their potential to overcome the power and performance limitations of CMOS. In particular, all spin logic (ASL) technology, which performs Boolean operations and transfers the output in the spin domain, has been proposed for enabling new capabilities—such as high density, low device count, and nonvolatility—that were previously impossible with CMOS technology. In this paper, first we provide an overview of the history and the current status of the various spintronic devices being pursued by the research community. Then, we describe how spin-based components are integrated into a computing system and the advantages that result. We use a hypothetical spintronic-based Intel Core i7 as a test vehicle to compare the system-level power requirements of ASL- and CMOS-based systems, taking into consideration the unique demands of spin-based interconnects. We conclude with a brief analysis of current limitations and future directions of spintronic research.
135 citations
•
12 Mar 2008TL;DR: In this paper, the read circuit senses a change in a voltage of the bitline of a bitline, and applies a voltage which is different from the first voltage to the gate of the first transistor when it senses a voltage change.
Abstract: A semiconductor memory device comprises memory cells, a bitline connected to the memory cells, a read circuit including a precharge circuit, and a first transistor connected between the bitline and the read circuit, wherein a first voltage is applied to a gate of the first transistor when the precharge circuit precharges the bitline, and a second voltage which is different from the first voltage is applied to the gate of the first transistor when the read circuit senses a change in a voltage of the bitline.
119 citations