Journal ArticleDOI

A 14 nm 1.1 Mb Embedded DRAM Macro With 1 ns Access

TL;DR: A 1.1 Mb embedded DRAM macro (eDRAM) for next-generation IBM SOI processors employs 14 nm FinFET logic technology with a 0.0174 μm² deep-trench capacitor cell; a gated-feedback sense amplifier enables a high voltage gain of a power-gated inverter at mid-level input voltage.
Abstract: A 1.1 Mb embedded DRAM macro (eDRAM), for next-generation IBM SOI processors, employs 14 nm FinFET logic technology with a 0.0174 μm² deep-trench capacitor cell. A gated-feedback sense amplifier enables a high voltage gain of a power-gated inverter at mid-level input voltage, while supporting 66 cells per local bit-line. A dynamic-and-gate-thin-oxide word-line driver that tracks standard logic process variation improves the eDRAM array performance with reduced area. The 1.1 Mb macro, composed of 8 × 2 72 Kb subarrays, is organized with a center interface block architecture, allowing 1 ns access latency and 1 ns bank interleaving operation using two banks, each having a 2 ns random access cycle. 5 GHz operation has been demonstrated in a system prototype, which includes 6 instances of 1.1 Mb eDRAM macros, integrated with an array built-in self-test engine, phase-locked loop (PLL), and word-line high and word-line low voltage generators. The advantage of the 14 nm FinFET array over the 22 nm array was confirmed using direct tester control of the 1.1 Mb eDRAM macros integrated in a 16 Mb inline monitor.
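The latency figures above compose as simple arithmetic: with two banks that each have a 2 ns random-access cycle and are serviced alternately, a new access can complete every 1 ns. A minimal timing sketch of that interleaving (function and constant names are illustrative, not from the paper):

```python
# Hypothetical sketch of two-bank interleaving: each bank has a 2 ns
# random-access cycle, but alternating requests between banks sustains
# one completion per nanosecond at 1 ns access latency.
BANK_CYCLE_NS = 2.0
ACCESS_LATENCY_NS = 1.0
NUM_BANKS = 2

def interleaved_completion_times(num_accesses):
    """Issue accesses round-robin across banks; each bank may start a new
    access only BANK_CYCLE_NS after its previous one. Returns the time at
    which each access's data becomes available."""
    next_free = [0.0] * NUM_BANKS   # earliest time each bank can start
    times = []
    for i in range(num_accesses):
        bank = i % NUM_BANKS
        # the interface issues one request per nanosecond
        start = max(i * 1.0, next_free[bank])
        next_free[bank] = start + BANK_CYCLE_NS
        times.append(start + ACCESS_LATENCY_NS)
    return times

# Completions land 1 ns apart despite the 2 ns per-bank cycle:
print(interleaved_completion_times(4))  # [1.0, 2.0, 3.0, 4.0]
```

With a single bank the same request stream would stall every other nanosecond, which is exactly the gap the two-bank organization hides.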
Citations
Proceedings ArticleDOI
14 Oct 2017
TL;DR: DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, is proposed to provide both powerful computing capability and large memory capacity/bandwidth to address the memory wall problem in traditional von Neumann architecture.
Abstract: Data movement between the processing units and the memory in traditional von Neumann architecture is creating the “memory wall” problem. To bridge the gap, two approaches, the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory), have been studied. However, the first one has strong computing capability but limited memory capacity/bandwidth, whereas the second one is the exact opposite. To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions with the combination of the functionally complete Boolean logic operations and the proposed hierarchical internal data movement designs. We further optimize DRISA to achieve high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations. CCS Concepts: Hardware → Dynamic memory; Computer systems organization → Reconfigurable computing; Neural networks.
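The claim that NOR is "functionally complete" is what lets a bitline primitive reconfigure into arbitrary functions. A minimal sketch of why NOR alone suffices (helper names are illustrative; the hardware composes these in the charge domain, not in software):

```python
def nor(a, b):
    """Bitwise NOR on 1-bit values: the DRISA-style bitline primitive."""
    return 1 - (a | b)

# Every other Boolean function can be built from NOR alone:
def not_(a):
    return nor(a, a)

def or_(a, b):
    return not_(nor(a, b))

def and_(a, b):
    return nor(not_(a), not_(b))

def xor_(a, b):
    return and_(or_(a, b), not_(and_(a, b)))

# Exhaustive check over all 1-bit inputs:
for a in (0, 1):
    for b in (0, 1):
        assert or_(a, b) == (a | b)
        assert and_(a, b) == (a & b)
        assert xor_(a, b) == (a ^ b)
print("NOR compositions match OR/AND/XOR")
```

Each derived gate costs a fixed number of NOR steps, which is why the paper pairs the primitive with internal data-movement designs: step count, not expressiveness, becomes the bottleneck.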

315 citations

Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper presents the first proposal to enable scientific computing on memristive crossbars. Three techniques are explored (reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling) that together enable a fixed-point memristive accelerator to perform high-precision floating point without the exorbitant cost of naive floating-point emulation on fixed-point hardware.
Abstract: Linear algebra is ubiquitous across virtually every field of science and engineering, from climate modeling to macroeconomics. This ubiquity makes linear algebra a prime candidate for hardware acceleration, which can improve both the run time and the energy efficiency of a wide range of scientific applications. Recent work on memristive hardware accelerators shows significant potential to speed up matrix-vector multiplication (MVM), a critical linear algebra kernel at the heart of neural network inference tasks. Regrettably, the proposed hardware is constrained to a narrow range of workloads: although the eight- to 16-bit computations afforded by memristive MVM accelerators are acceptable for machine learning, they are insufficient for scientific computing where high-precision floating point is the norm. This paper presents the first proposal to enable scientific computing on memristive crossbars. Three techniques are explored---reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling---that together enable a fixed-point memristive accelerator to perform high-precision floating point without the exorbitant cost of naive floating-point emulation on fixed-point hardware. A heterogeneous collection of crossbars with varying sizes is proposed to efficiently handle sparse matrices, and an algorithm for mapping the dense subblocks of a sparse matrix to an appropriate set of crossbars is investigated. The accelerator can be combined with existing GPU-based systems to handle datasets that cannot be efficiently handled by the memristive accelerator alone. The proposed optimizations permit the memristive MVM concept to be applied to a wide range of problem domains, respectively improving the execution time and energy dissipation of sparse linear solvers by 10.3x and 10.9x over a purely GPU-based system.
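One way to picture the exponent-range-locality idea is block floating point: when nearby values share a narrow exponent range, a whole block can be scaled by one common exponent and handed to fixed-point hardware. The sketch below illustrates that general scheme, not the paper's actual design; the function names, bit widths, and rounding choices are all illustrative assumptions.

```python
import math

def to_block_fixed(xs, frac_bits=12):
    """Represent a block of floats as integers sharing one exponent,
    a block-floating-point scheme in the spirit of exploiting exponent
    range locality (names and widths are illustrative)."""
    scale = max(abs(x) for x in xs) or 1.0
    exp = math.ceil(math.log2(scale))          # shared block exponent
    ints = [round(x / 2**exp * 2**frac_bits) for x in xs]
    return ints, exp

def fixed_dot(a_ints, a_exp, b_ints, b_exp, frac_bits=12):
    """Integer multiply-accumulate (what a crossbar MVM provides),
    then a single rescale back to a float result."""
    acc = sum(ai * bi for ai, bi in zip(a_ints, b_ints))
    return acc * 2**(a_exp + b_exp) / 2**(2 * frac_bits)

a = [0.5, -1.25, 3.0]
b = [2.0, 0.75, -0.5]
ai, ae = to_block_fixed(a)
bi, be = to_block_fixed(b)
exact = sum(x * y for x, y in zip(a, b))
approx = fixed_dot(ai, ae, bi, be)
assert abs(approx - exact) < 1e-2
print(approx)
```

The per-element exponents never enter the inner loop; only one exponent add and one shift happen per block, which is where the savings over naive float emulation come from.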

54 citations


Cites methods from "A 14 nm 1.1 Mb Embedded DRAM Macro ..."

  • ...SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI7 [49] using 14nm eDRAM parameters from [50]....


Journal ArticleDOI
TL;DR: In this paper, the material and device physics, fabrication, operational principles, and commercial status of scaled 2D flash, 3D flash and emerging memory technologies are discussed, including the physics of and errors caused by total ionizing dose, displacement damage, and single event effects.
Abstract: Despite hitting major roadblocks in 2-D scaling, NAND flash continues to scale in the vertical direction and dominate the commercial nonvolatile memory market. However, several emerging nonvolatile technologies are under development by major commercial foundries or are already in small volume production, motivated by storage-class memory and embedded application drivers. These include spin-transfer torque magnetic random access memory (STT-MRAM), resistive random access memory (ReRAM), phase change random access memory (PCRAM), and conductive bridge random access memory (CBRAM). Emerging memories have improved resilience to radiation effects compared to flash, which is based on storing charge, and hence may offer an expanded selection from which radiation-tolerant system designers can choose in the future. This review discusses the material and device physics, fabrication, operational principles, and commercial status of scaled 2-D flash, 3-D flash, and emerging memory technologies. Radiation effects relevant to each of these memories are described, including the physics of and errors caused by total ionizing dose, displacement damage, and single-event effects, with an eye toward the future role of emerging technologies in radiation environments.

27 citations

Journal ArticleDOI
TL;DR: The design and implementation of an 80-kb logic-embedded non-volatile multi-time programmable memory (MTPM) with no added process complexity is described and high-temperature stress results show a projected data retention of 10 years at 125 °C.
Abstract: This paper describes the design and implementation of an 80-kb logic-embedded non-volatile multi-time programmable memory (MTPM) with no added process complexity. Charge trap transistors (CTTs) that exploit charge trapping and de-trapping behavior in the high-K dielectric of 32-/22-nm logic FETs are used as storage elements with logic-compatible programming voltages. A high-gain slew-sense amplifier (SA) is used to efficiently detect the threshold voltage difference (ΔVDIF) between the true and complement FETs in the twin cell. Design-assist techniques including multi-step programming with over-write protection and a block write algorithm are used to enhance the programming efficiency without causing a dielectric breakdown. High-temperature stress results show a projected data retention of 10 years at 125 °C with a signal loss of <30% that is margined in while programming, by employing a sense margining logic in the SA. Scalability of CTT has been established by the first demonstration of CTT-based MTPM in 14-nm bulk FinFET technology with a read cycle time of 40 ns at 0.7-V VDD.
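The twin-cell read described above reduces to taking the sign of the threshold-voltage difference between the true and complement FETs, with a guard band for the sense margin. A toy sketch of that decision (the margin value and function names are illustrative assumptions, not the paper's numbers):

```python
def sense(v_true, v_comp, margin=0.03):
    """Differential read of a twin cell: the stored bit is the sign of
    the threshold-voltage difference between the true and complement
    devices. Reads inside the margin band are flagged as insufficient
    signal (all voltage values here are illustrative)."""
    dv = v_true - v_comp
    if abs(dv) < margin:
        return None   # would fail the sense-margining check
    return 1 if dv > 0 else 0

assert sense(0.45, 0.35) == 1   # true FET shifted: reads as 1
assert sense(0.35, 0.45) == 0   # complement FET shifted: reads as 0
assert sense(0.40, 0.41) is None  # inside the margin band
print("twin-cell sense sketch ok")
```

Margining the <30% projected signal loss at program time amounts to requiring the initial ΔVDIF to clear this band with room to spare.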

15 citations


Cites background from "A 14 nm 1.1 Mb Embedded DRAM Macro ..."

  • ...Embedding dynamic-random access memory [1], [2], built with...


Proceedings ArticleDOI
29 Jul 2019
TL;DR: This work introduces a dot-product processing macro using eDRAM array and explores its capability as an in-memory computing processing element and investigated a method to maximize the retention time in conjunction with analyzing the device mismatch.
Abstract: Modern deep neural network (DNN) systems evolved under the ever-increasing demands of handling more complex and computation-heavy tasks. Traditional hardware designed for such tasks suffers from large memory size and power consumption due to extensive on/off-chip memory access. In-memory computing, one of the promising solutions to this issue, dramatically reduces memory access and improves energy efficiency by letting the memory cell function as both a data storage and a computing element. Embedded DRAM (eDRAM) is one of the potential candidates for in-memory computation: its minimal use of circuit components and low static power consumption provide design advantages, while its relatively short retention time makes eDRAM unsuitable for certain applications. This work introduces a dot-product processing macro using an eDRAM array and explores its capability as an in-memory computing processing element. The proposed architecture implements a pair of 2T eDRAM cells as a processing unit that can store and operate with ternary weights using only four transistors. In addition, we investigate a method to maximize the retention time in conjunction with analyzing the device mismatch. An input/weight bit-precision reconfigurable 4T eDRAM processing array shows an energy efficiency of 1.81 fJ/OP (including refresh energy) when it operates with binary inputs and ternary weights.
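A dot product over binary inputs and ternary weights needs no multiplier at all, which is what makes a four-transistor processing unit plausible. A software sketch of the arithmetic (names are illustrative; the macro computes this in the charge domain, not in code):

```python
def ternary_dot(inputs, weights):
    """Dot product with binary inputs in {0, 1} and ternary weights in
    {-1, 0, +1}, the operand sets the 4T eDRAM macro supports. In the
    hardware, each pair of 2T eDRAM cells stores one ternary weight."""
    assert all(x in (0, 1) for x in inputs)
    assert all(w in (-1, 0, 1) for w in weights)
    # Multiplication degenerates to selecting inputs for add or subtract.
    pos = sum(x for x, w in zip(inputs, weights) if w == 1)
    neg = sum(x for x, w in zip(inputs, weights) if w == -1)
    return pos - neg

assert ternary_dot([1, 0, 1, 1], [1, -1, 0, -1]) == 0
assert ternary_dot([1, 1, 1], [1, 1, -1]) == 1
print("ternary dot-product sketch ok")
```

Because each operation is a conditional add or subtract, the per-operation energy (the 1.81 fJ/OP figure above) is dominated by the array access and refresh rather than by arithmetic.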

14 citations


Additional excerpts

  • ...With these advantages, high-performance server processors [9,10] and deep neural network (DNN) hardware accelerators [11,12] have adopted eDRAM as their on-chip memory....


References
Proceedings ArticleDOI
07 Apr 2011
TL;DR: The microprocessor chip for the IBM zEnterprise 196 (z 196) system is a high-frequency, high-performance design that adds support for out-of-order instruction execution and increases operating frequency by almost 20% compared to the previous 65nm design, while still fitting within the same power envelope.
Abstract: The microprocessor chip for the IBM zEnterprise 196 (z196) system is a high-frequency, high-performance design that adds support for out-of-order instruction execution and increases operating frequency by almost 20% compared to the previous 65nm design, while still fitting within the same power envelope. Despite the many difficult engineering hurdles to be overcome, the design team was able to achieve a product frequency of 5.2GHz, providing a significant performance boost for the new system.

35 citations

Proceedings ArticleDOI
01 Dec 2010
TL;DR: The HKMG access transistor developed in high performance optimized technology features sub 3fA leakage and well-controlled threshold voltage sigma of 40mV and set the industry benchmark for the most efficient decoupling in any 32nm technology.
Abstract: We present the industry's smallest eDRAM cell and the densest embedded memory integrated into the highest-performance 32nm High-K Metal Gate (HKMG) SOI-based logic technology. The cell is aggressively scaled at 58% (vs. 45nm) and features the key innovation of a High-K Metal (HK/M) stack in the Deep Trench (DT) capacitor. This has enabled 25% higher capacitance and 70% lower resistance compared to a conventional SiON/Poly stack at matched leakage and reliability. The HKMG access transistor, developed in a high-performance optimized technology, features sub-3fA leakage and a well-controlled threshold voltage sigma of 40mV. The fully integrated 32Mb product prototypes demonstrate state-of-the-art performance with excellent retention and yield characteristics. The sub-1.5ns latency and 2ns cycle time have been verified with preliminary testing, whereas even better performance is expected with further characterization. In addition, the trench capacitors set the industry benchmark for the most efficient decoupling in any 32nm technology.

27 citations

Journal ArticleDOI
TL;DR: The paper describes the new methodology and the models, provides analysis of the sources of variability and their impact on power and frequency, and describes the work done to achieve correlation between the models and hardware measurements.
Abstract: The IBM POWER7+™ microprocessor is the next-generation IBM POWER® processor implemented in IBM's 32-nm silicon-on-insulator process. In addition to enhancing the chip functionality, implementing core-level and chiplet-level power gating and significantly increasing the size of the on-chip cache, the chip achieves a frequency boost of 15% to 25% compared with its predecessor at the same power. To achieve these challenging goals and deliver a serviceable power-frequency limited yield (PFLY), the IBM team made significant innovations in the post-silicon hardware-tuning methodology to counteract the inherent process variability and developed new PFLY models that account for several sources of variability in power and frequency. The paper describes the new methodology and the models, provides analysis of the sources of variability and their impact on power and frequency, and describes the work done to achieve correlation between the models and hardware measurements.

16 citations

Proceedings ArticleDOI
19 Mar 2015
TL;DR: Carrizo (CZ, Fig. 4.8.7) is AMD's next-generation mobile performance accelerated processing unit (APU), which includes four Excavator (XV) processor cores and eight Radeon™ graphics core next (GCN) cores, implemented in a 28nm HKMG planar dual-oxide FET technology featuring 3 Vts of thin-oxide devices and 12 layers of Cu-based metallization.
Abstract: Carrizo (CZ, Fig. 4.8.7) is AMD's next-generation mobile performance accelerated processing unit (APU), which includes four Excavator (XV) processor cores and eight Radeon™ graphics core next (GCN) cores, implemented in a 28nm HKMG planar dual-oxide FET technology featuring 3 Vts of thin-oxide devices and 12 layers of Cu-based metallization. This 28nm technology is a density-focused version of the 28nm technology used by Steamroller (SR) [1], featuring eight 1× metals for dense routing, one 2× and one 4× for low-RC routing, and two 16× metals for power distribution.

15 citations

Book ChapterDOI
01 Jan 2010
TL;DR: The chapter starts with a discussion of the evolution of high-performance embedded DRAMs for the previous 15 years, and looks into the principles of the embeddedDRAMs, which include technology, macro and array architectures, mode of operations, wordline and bitline architectures, and sensing schemes.
Abstract: Described are the high-performance embedded DRAMs in nano-scale technology. The chapter starts with a discussion of the evolution of high-performance embedded DRAMs over the previous 15 years. It then looks into the principles of the embedded DRAMs, which include technology, macro and array architectures, modes of operation, wordline and bitline architectures, and sensing schemes. The discussion also addresses ideas unique to the high-performance embedded DRAM such as Direct Write, Negative Wordline, Concurrent Refresh, Dataline Redundancy, BIST, and Self-Repair methodology. After covering these key technical attributes, the chapter details the IBM embedded DRAM macros, starting from ASIC to the most recent SOI embedded DRAM and the cache prototype designs for microprocessors. To conclude the chapter, research and development for future embedded DRAM with floating-body cell, gain cell, and 3-dimensional embedded DRAM approaches is explored.

10 citations
