scispace - formally typeset
Search or ask a question
Journal ArticleDOI

A 14 nm 1.1 Mb Embedded DRAM Macro With 1 ns Access

TL;DR: A 1.1 Mb embedded DRAM macro (eDRAM), for next-generation IBM SOI processors, employs 14 nm FinFET logic technology with 0.0174 μm2 deep-trench capacitor cell that enables a high voltage gain of a power-gated inverter at mid-level input voltage.
Abstract: A 1.1 Mb embedded DRAM macro (eDRAM), for next-generation IBM SOI processors, employs 14 nm FinFET logic technology with $\hbox{0.0174}~\mu\hbox{m}^{2}$ deep-trench capacitor cell. A Gated-feedback sense amplifier enables a high voltage gain of a power-gated inverter at mid-level input voltage, while supporting 66 cells per local bit-line. A dynamic-and-gate-thin-oxide word-line driver that tracks standard logic process variation improves the eDRAM array performance with reduced area. The 1.1 $~$ Mb macro composed of 8 $\times$ 2 72 Kb subarrays is organized with a center interface block architecture, allowing 1 ns access latency and 1 ns bank interleaving operation using two banks, each having 2 ns random access cycle. 5 GHz operation has been demonstrated in a system prototype, which includes 6 instances of 1.1 Mb eDRAM macros, integrated with an array-built-in-self-test engine, phase-locked loop (PLL), and word-line high and word-line low voltage generators. The advantage of the 14 nm FinFET array over the 22 nm array was confirmed using direct tester control of the 1.1 Mb eDRAM macros integrated in 16 Mb inline monitor.
Citations
More filters
Proceedings ArticleDOI
14 Oct 2017
TL;DR: DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, is proposed to provide both powerful computing capability and large memory capacity/bandwidth to address the memory wall problem in traditional von Neumann architecture.
Abstract: Data movement between the processing units and the memory in traditional von Neumann architecture is creating the “memory wall” problem. To bridge the gap, two approaches, the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory) have been studied. However, the first one has strong computing capability but limited memory capacity/bandwidth, whereas the second one is the exact the opposite.To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions with the combination of the functionally complete Boolean logic operations and the proposed hierarchical internal data movement designs. We further optimize DRISA to achieve high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations.CCS CONCEPTS• Hardware → Dynamic memory; • Computer systems organization → reconfigurable computing; Neural networks;

315 citations

Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper presents the first proposal to enable scientific computing on memristive crossbars, and three techniques are explored — reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling — that together enable a fixed- Point Memristive accelerator to perform high-precision floating point without the exorbitant cost of naïve floating-point emulation on fixed-pointers.
Abstract: Linear algebra is ubiquitous across virtually every field of science and engineering, from climate modeling to macroeconomics. This ubiquity makes linear algebra a prime candidate for hardware acceleration, which can improve both the run time and the energy efficiency of a wide range of scientific applications. Recent work on memristive hardware accelerators shows significant potential to speed up matrix-vector multiplication (MVM), a critical linear algebra kernel at the heart of neural network inference tasks. Regrettably, the proposed hardware is constrained to a narrow range of workloads: although the eight- to 16-bit computations afforded by memristive MVM accelerators are acceptable for machine learning, they are insufficient for scientific computing where high-precision floating point is the norm. This paper presents the first proposal to enable scientific computing on memristive crossbars. Three techniques are explored---reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling---that together enable a fixed-point memristive accelerator to perform high-precision floating point without the exorbitant cost of naive floating-point emulation on fixed-point hardware. A heterogeneous collection of crossbars with varying sizes is proposed to efficiently handle sparse matrices, and an algorithm for mapping the dense subblocks of a sparse matrix to an appropriate set of crossbars is investigated. The accelerator can be combined with existing GPU-based systems to handle datasets that cannot be efficiently handled by the memristive accelerator alone. The proposed optimizations permit the memristive MVM concept to be applied to a wide range of problem domains, respectively improving the execution time and energy dissipation of sparse linear solvers by 10.3x and 10.9x over a purely GPU-based system.

54 citations


Cites methods from "A 14 nm 1.1 Mb Embedded DRAM Macro ..."

  • ...SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI7 [49] using 14nm eDRAM parameters from [50]....

    [...]

Journal ArticleDOI
TL;DR: In this paper, the material and device physics, fabrication, operational principles, and commercial status of scaled 2D flash, 3D flash and emerging memory technologies are discussed, including the physics of and errors caused by total ionizing dose, displacement damage, and single event effects.
Abstract: Despite hitting major roadblocks in 2-D scaling, NAND flash continues to scale in the vertical direction and dominate the commercial nonvolatile memory market. However, several emerging nonvolatile technologies are under development by major commercial foundries or are already in small volume production, motivated by storage-class memory and embedded application drivers. These include spin-transfer torque magnetic random access memory (STT-MRAM), resistive random access memory (ReRAM), phase change random access memory (PCRAM), and conductive bridge random access memory (CBRAM). Emerging memories have improved resilience to radiation effects compared to flash, which is based on storing charge, and hence may offer an expanded selection from which radiation-tolerant system designers can choose from in the future. This review discusses the material and device physics, fabrication, operational principles, and commercial status of scaled 2-D flash, 3-D flash, and emerging memory technologies. Radiation effects relevant to each of these memories are described, including the physics of and errors caused by total ionizing dose, displacement damage, and single-event effects, with an eye toward the future role of emerging technologies in radiation environments.

27 citations

Journal ArticleDOI
TL;DR: The design and implementation of an 80-kb logic-embedded non-volatile multi-time programmable memory (MTPM) with no added process complexity is described and high-temperature stress results show a projected data retention of 10 years at 125 °C.
Abstract: This paper describes the design and implementation of an 80-kb logic-embedded non-volatile multi-time programmable memory (MTPM) with no added process complexity. Charge trap transistors (CTTs) that exploit charge trapping and de-trapping behavior in high-K dielectric of 32-/22-nm Logic FETs are used as storage elements with logic-compatible programming voltages. A high-gain slew-sense amplifier (SA) is used to efficiently detect the threshold voltage difference ( $\Delta V_{\textrm {DIF}}$ ) between the true and complement FETs in the twin cell. Design-assist techniques including multi-step programming with over-write protection and block write algorithm are used to enhance the programming efficiency without causing a dielectric breakdown. High-temperature stress results show a projected data retention of 10 years at 125 °C with a signal loss of <30% that is margined in while programming, by employing a sense margining logic in the SA. Scalability of CTT has been established by the first demonstration of CTT-based MTPM in 14-nm bulk FinFET technology with read cycle time of 40 ns at 0.7-V VDD.

15 citations


Cites background from "A 14 nm 1.1 Mb Embedded DRAM Macro ..."

  • ...Embedding dynamic-random access memory [1], [2], built with...

    [...]

Proceedings ArticleDOI
29 Jul 2019
TL;DR: This work introduces a dot-product processing macro using eDRAM array and explores its capability as an in-memory computing processing element and investigated a method to maximize the retention time in conjunction with analyzing the device mismatch.
Abstract: Modern deep neural network (DNN) systems evolved under the ever-increasing demands of handling more complex and computation-heavy tasks. Traditional hardware designed for such tasks had larger size memory and power consumption issue due to extensive on/off-chip memory access. In-memory computing, one of the promising solutions to resolve the issue, dramatically reduced memory access and improved energy efficiency by utilizing the memory cell to function as both a data storage and a computing element. Embedded DRAM (eDRAM) is one of the potential candidates for in-memory computation. Its minimal use of circuit components and low static power consumption provided design advantage while its relatively short retention time made eDRAM unsuitable for certain applications. This work introduces a dot-product processing macro using eDRAM array and explores its capability as an in-memory computing processing element. The proposed architecture implemented a pair of 2T eDRAM cells as a processing unit that can store and operate with ternary weights using only four transistors. Besides, we investigated a method to maximize the retention time in conjunction with analyzing the device mismatch. An input/weight bit-precision reconfigurable 4T eDRAM processing array shows the energy efficiency of 1.81fJ/OP (including refresh energy) when it operates with binary inputs and ternary weights.

14 citations


Additional excerpts

  • ...With these advantages, high-performance server processors [9,10] and deep neural network (DNN) hardware accelerators [11,12] have adopted eDRAM as their on-chip memory....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: For both low-power and high-performance applications, DGCMOS-FinFET offers a most promising direction for continued progress in VLSI.
Abstract: Double-gate devices will enable the continuation of CMOS scaling after conventional scaling has stalled. DGCMOS/FinFET technology offers a tactical solution to the gate dielectric barrier and a strategic path for silicon scaling to the point where only atomic fluctuations halt further progress. The conventional nature of the processes required to fabricate these structures has enabled rapid experimental progress in just a few years. Fully integrated CMOS circuits have been demonstrated in a 180 nm foundry-compatible process, and methods for mapping conventional, planar CMOS product designs to FinFET have been developed. For both low-power and high-performance applications, DGCMOS-FinFET offers a most promising direction for continued progress in VLSI.

413 citations

Journal ArticleDOI
TL;DR: Power Systems™ continue strong 7th Generation Power chip: Balanced Multi-Core design EDRAM technology SMT4 greater then 4X performance in same power envelope as previous generation.
Abstract: The Power7 is IBM's first eight-core processor, with each core capable of four-way simultaneous-multithreading operation. Its key architectural features include an advanced memory hierarchy with three levels of on-chip cache; embedded-DRAM devices used in the highest level of the cache; and a new memory interface. This balanced multicore design scales from 1 to 32 sockets in commercial and scientific environments.

259 citations


"A 14 nm 1.1 Mb Embedded DRAM Macro ..." refers methods in this paper

  • ...IBM introduced trench capacitor eDRAM into its high performance microprocessors beginning with 45nm and Power 7 [1] to provide a higher density cache without chip crossings....

    [...]

Proceedings ArticleDOI
C-H. Lin1, Brian J. Greene1, Shreesh Narasimha1, J. Cai1, A. Bryant1, Carl J. Radens1, Vijay Narayanan1, Barry Linder1, Herbert L. Ho1, A. Aiyar1, E. Alptekin1, J-J. An1, Michael V. Aquilino1, Ruqiang Bao1, V. Basker1, Nicolas Breil1, MaryJane Brodsky1, William Y. Chang1, Clevenger Leigh Anne H1, Dureseti Chidambarrao1, Cathryn Christiansen1, D. Conklin1, C. DeWan1, H. Dong1, L. Economikos1, Bernard A. Engel1, Sunfei Fang1, D. Ferrer1, A. Friedman1, Allen H. Gabor1, Fernando Guarin1, Ximeng Guan1, M. Hasanuzzaman1, J. Hong1, D. Hoyos1, Basanth Jagannathan1, S. Jain1, S.-J. Jeng1, J. Johnson1, B. Kannan1, Y. Ke1, Babar A. Khan1, Byeong Y. Kim1, Siyuranga O. Koswatta1, Amit Kumar1, T. Kwon1, Unoh Kwon1, L. Lanzerotti1, H-K Lee1, W-H. Lee1, A. Levesque1, Wai-kin Li1, Zhengwen Li1, Wei Liu1, S. Mahajan1, Kevin McStay1, Hasan M. Nayfeh1, W. Nicoll1, G. Northrop1, A. Ogino1, Chengwen Pei1, S. Polvino1, Ravikumar Ramachandran1, Z. Ren1, Robert R. Robison1, Saraf Iqbal Rashid1, Viraj Y. Sardesai1, S. Saudari1, Dominic J. Schepis1, Christopher D. Sheraw1, Shariq Siddiqui1, Liyang Song1, Kenneth J. Stein1, C. Tran1, Henry K. Utomo1, Reinaldo A. Vega1, Geng Wang1, Han Wang1, W. Wang1, X. Wang1, D. Wehelle-Gamage1, E. Woodard1, Yongan Xu1, Y. Yang1, N. Zhan1, Kai Zhao1, C. Zhu1, K. Boyd1, E. Engbrecht1, K. Henson1, E. Kaste1, Siddarth A. Krishnan1, Edward P. Maciejewski1, Huiling Shang1, Noah Zamdmer1, R. Divakaruni1, J. Rice1, Scott R. Stiffler1, Paul D. Agnello1 
01 Dec 2014
TL;DR: In this article, the authors present a fully integrated 14nm CMOS technology featuring fin-FET architecture on an SOI substrate for a diverse set of SoC applications including HP server microprocessors and LP ASICs.
Abstract: We present a fully integrated 14nm CMOS technology featuring finFET architecture on an SOI substrate for a diverse set of SoC applications including HP server microprocessors and LP ASICs. This SOI finFET architecture is integrated with a 4th generation deep trench embedded DRAM to provide an ultra-dense (0.0174um2) memory solution for industry leading ‘scale-out’ processor design. A broad range of Vts is enabled on chip through a unique dual workfunction process applied to both NFETs and PFETs. This enables simultaneous optimization of both lowVt (HP) and HiVt (LP) devices without reliance on problematic approaches like heavy doping or Lgate modulation to create Vt differentiation. The SOI finFET's excellent subthreshold behavior allows gate length scaling to the sub 20nm regime and superior low Vdd operation. This leads to a substantial (>35%) performance gain for Vdd ∼0.8V compared to the HP 22nm planar predecessor technology. At the same time, the exceptional FE/BE reliability enables high Vdd (>1.1V) operation essential to the high single thread performance for processors intended for ‘scale-up’ enterprise systems. A hierarchical BEOL with 15 levels of copper interconnect delivers both high performance wire-ability as well as effective power supply and clock distribution for very large >600mm2 SoCs.

137 citations


"A 14 nm 1.1 Mb Embedded DRAM Macro ..." refers methods in this paper

  • ...1(a), utilizes Replacement Metal Gate (RMG) SOI FinFET devices, resulting in a 33% shrink from the 22 nm bit cell [16], [17]....

    [...]

  • ...The pass transistor/access device of the cell is a 3.5 nm thick oxide, fully depleted (FD) FinFET with an undoped channel....

    [...]

  • ...This 22 nm design style has been successfully migrated into a 14 nm FinFET eDRAM [20] learning vehicle, complete with an ABIST engine, word-line charge pumps (VPP & VWL), and pad-cage interface circuitry as a system prototype....

    [...]

  • ...Hence, FD SOI FinFETs can achieve high write back current without increasing the LBL capacitance, which is a key advantage to previous eDRAM technologies employing planar pass transistors....

    [...]

  • ...By nature, SOI fully depleted FinFET devices, have low overall junction capacitance as the only junctions created in the SOI are in the horizontal plane of the device, i.e., between the body of the device and diffusion....

    [...]

Proceedings ArticleDOI
06 Mar 2014
TL;DR: This paper presents 14nm FinFET-based 128Mb 6T SRAM chips featuring low-VMIN with newly developed assist techniques, and presents peripheral-assist techniques required to overcome the bitcell challenges to high yield.
Abstract: With the explosive growth of battery-operated portable devices, the demand for low power and small size has been increasing for system-on-a-chip (SoC). The FinFET is considered as one of the most promising technologies for future low-power mobile applications because of its good scaling ability, high on-current, better SCE and subthreshold slope, and small leakage current [1]. As a key approach for low-power, supply-voltage (VDD) scaling has been widely used in SoC design. However, SRAM is the limiting factor of voltage-scaling, since all SRAM functions of read, write, and hold-stability are highly influenced by increased variations at low VDD, resulting in lower yield. In addition, the width-quantization property of FinFET device reduces the design window for transistor sizing, and increases the failure probability due to the un-optimized bitcell sizing [1]. In order to overcome the bitcell challenges to high yield, peripheral-assist techniques are required. In this paper, we present 14nm FinFET-based 128Mb 6T SRAM chips featuring low-VMIN with newly developed assist techniques.

113 citations

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A hierarchical BEOL with 15 levels of copper interconnect including self-aligned via processing delivers high performance with exceptional reliability in SOI CMOS 22nm technology.
Abstract: We present a fully-integrated SOI CMOS 22nm technology for a diverse array of high-performance applications including server microprocessors, memory controllers and ASICs. A pre-doped substrate enables scaling of this third generation of SOI deep-trench-based embedded DRAM for a dense high-performance memory hierarchy. Dual-Embedded stressor technology including SiGe and Si:C for improved carrier mobility in both PMOS and NMOS FETs is presented for the first time. A hierarchical BEOL with 15 levels of copper interconnect including self-aligned via processing delivers high performance with exceptional reliability.

92 citations

Related Papers (5)