
Journal ArticleDOI

A 14 nm 1.1 Mb Embedded DRAM Macro With 1 ns Access

01 Jan 2016-IEEE Journal of Solid-state Circuits (IEEE)-Vol. 51, Iss: 1, pp 230-239

TL;DR: A 1.1 Mb embedded DRAM macro (eDRAM) for next-generation IBM SOI processors employs 14 nm FinFET logic technology with a 0.0174 μm² deep-trench capacitor cell; a gated-feedback sense amplifier enables a high voltage gain of a power-gated inverter at mid-level input voltage.
Abstract: A 1.1 Mb embedded DRAM macro (eDRAM), for next-generation IBM SOI processors, employs 14 nm FinFET logic technology with a 0.0174 μm² deep-trench capacitor cell. A gated-feedback sense amplifier enables a high voltage gain of a power-gated inverter at mid-level input voltage, while supporting 66 cells per local bit-line. A dynamic AND-gate, thin-oxide word-line driver that tracks standard logic process variation improves the eDRAM array performance with reduced area. The 1.1 Mb macro, composed of 8 × 2 72 kb subarrays, is organized with a center interface block architecture, allowing 1 ns access latency and 1 ns bank-interleaving operation using two banks, each having a 2 ns random access cycle. 5 GHz operation has been demonstrated in a system prototype, which includes six instances of the 1.1 Mb eDRAM macro, integrated with an array built-in self-test engine, a phase-locked loop (PLL), and word-line high and word-line low voltage generators. The advantage of the 14 nm FinFET array over the 22 nm array was confirmed using direct tester control of the 1.1 Mb eDRAM macros integrated in a 16 Mb inline monitor.
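The bank-interleaving arithmetic in the abstract is worth spelling out: with two banks, each having a 2 ns random-access cycle, alternating accesses between banks sustains one access per nanosecond. A minimal Python sketch of that scheduling idea (illustrative only, not the macro's actual control logic; the function name and round-robin policy are assumptions):

```python
def interleaved_issue_times(n_accesses, bank_cycle_ns=2, n_banks=2):
    """Earliest issue times (ns) when consecutive accesses alternate
    across banks and each bank needs bank_cycle_ns between its own
    accesses. Illustrative sketch, not the macro's actual scheduler."""
    ready = [0] * n_banks            # when each bank can next be used
    slot = bank_cycle_ns // n_banks  # ideal spacing between issues
    t, times = 0, []
    for i in range(n_accesses):
        b = i % n_banks              # round-robin bank selection
        t = max(t, ready[b])         # wait until the target bank is free
        times.append(t)
        ready[b] = t + bank_cycle_ns
        t += slot
    return times
```

With two banks this yields one access per nanosecond; with a single bank the same workload is limited to one access per 2 ns cycle.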
Topics: eDRAM (62%), Sense amplifier (51%), Logic gate (50%)
Citations

Proceedings ArticleDOI
Shuangchen Li, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, et al.
14 Oct 2017-
TL;DR: DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, is proposed to provide both powerful computing capability and large memory capacity/bandwidth to address the memory wall problem in traditional von Neumann architecture.
Abstract: Data movement between the processing units and the memory in the traditional von Neumann architecture is creating the "memory wall" problem. To bridge the gap, two approaches have been studied: the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory). However, the first has strong computing capability but limited memory capacity/bandwidth, whereas the second is the exact opposite. To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions by combining the functionally complete Boolean logic operations with the proposed hierarchical internal data-movement designs. We further optimize DRISA to achieve high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data-movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations.
CCS Concepts: • Hardware → Dynamic memory; • Computer systems organization → Reconfigurable computing; Neural networks.
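The functional-completeness claim behind DRISA (arbitrary logic built from the bitline NOR) can be illustrated with a few lines of Python. This mimics the Boolean identities the architecture relies on, not its circuit implementation; the helper names are illustrative:

```python
MASK = 0xFF  # operate on 8-bit words

def nor(a, b):
    """The only primitive: bitwise NOR, as a DRISA bitline computes it."""
    return ~(a | b) & MASK

# Every other gate derived purely from NOR (functional completeness):
def not_(a):    return nor(a, a)
def or_(a, b):  return nor(nor(a, b), nor(a, b))
def and_(a, b): return nor(not_(a), not_(b))
def xor_(a, b): return nor(and_(a, b), nor(a, b))
```

Because {NOR} is functionally complete, any combinational function can be composed this way; DRISA's reconfigurability stems from choosing which such compositions to schedule on the arrays.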

179 citations


Proceedings ArticleDOI
02 Jun 2018-
TL;DR: This paper presents the first proposal to enable scientific computing on memristive crossbars. Three techniques are explored (reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling) that together enable a fixed-point memristive accelerator to perform high-precision floating point without the exorbitant cost of naive floating-point emulation on fixed-point hardware.
Abstract: Linear algebra is ubiquitous across virtually every field of science and engineering, from climate modeling to macroeconomics. This ubiquity makes linear algebra a prime candidate for hardware acceleration, which can improve both the run time and the energy efficiency of a wide range of scientific applications. Recent work on memristive hardware accelerators shows significant potential to speed up matrix-vector multiplication (MVM), a critical linear algebra kernel at the heart of neural network inference tasks. Regrettably, the proposed hardware is constrained to a narrow range of workloads: although the eight- to 16-bit computations afforded by memristive MVM accelerators are acceptable for machine learning, they are insufficient for scientific computing, where high-precision floating point is the norm. This paper presents the first proposal to enable scientific computing on memristive crossbars. Three techniques are explored (reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling) that together enable a fixed-point memristive accelerator to perform high-precision floating point without the exorbitant cost of naive floating-point emulation on fixed-point hardware. A heterogeneous collection of crossbars with varying sizes is proposed to efficiently handle sparse matrices, and an algorithm for mapping the dense subblocks of a sparse matrix to an appropriate set of crossbars is investigated. The accelerator can be combined with existing GPU-based systems to handle datasets that cannot be efficiently handled by the memristive accelerator alone. The proposed optimizations permit the memristive MVM concept to be applied to a wide range of problem domains, improving the execution time and energy dissipation of sparse linear solvers by 10.3× and 10.9×, respectively, over a purely GPU-based system.
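The fixed-point foundation these accelerators build on can be sketched in NumPy: a wide fixed-point matrix is split into low-precision bit slices (as if each slice lived on a separate crossbar), and the partial products are recombined digitally with shifts. This illustrates the generic bit-sliced MVM idea only, not the paper's exponent-locality or early-termination techniques; the function name and slice widths are assumptions:

```python
import numpy as np

def bitsliced_mvm(A, x, cell_bits=4, n_slices=4):
    """Matrix-vector multiply where the unsigned matrix is split into
    cell_bits-wide slices (one per emulated crossbar) and the partial
    products are shift-and-added. Handles elements up to
    cell_bits * n_slices bits; illustrative sketch only."""
    A = np.asarray(A, dtype=np.int64)
    x = np.asarray(x, dtype=np.int64)
    acc = np.zeros(A.shape[0], dtype=np.int64)
    for s in range(n_slices):
        lo = (A >> (s * cell_bits)) & ((1 << cell_bits) - 1)  # one "crossbar"
        acc += (lo @ x) << (s * cell_bits)                    # shift-and-add
    return acc
```

Floating-point emulation on such hardware then amounts to aligning exponents and running many fixed-point MVMs of this form, which is exactly the overhead the paper's three techniques target.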

35 citations


Cites methods from "A 14 nm 1.1 Mb Embedded DRAM Macro ..."

  • ...SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI7 [49] using 14nm eDRAM parameters from [50]....



Journal ArticleDOI
TL;DR: 3D DRAMs including DDR3, wide I/O mobile DRAM, and more recently, the hybrid-memory cube (HMC) and high-bandwidth memory (HBM) targeted for high-performance computing systems are reviewed.
Abstract: This paper describes orthogonal scaling of dynamic-random-access-memories (DRAMs) using through-silicon-vias (TSVs). We review 3D DRAMs including DDR3, wide I/O mobile DRAM (WIDE I/O), and more recently, the hybrid-memory cube (HMC) and high-bandwidth memory (HBM) targeted for high-performance computing systems. We then cover embedded 3D DRAM for high-performance cache memories, reviewing an early cache prototype employing face-to-face 3D stacking which confirmed negligible performance and retention degradation using 32 nm server and ASIC embedded DRAM macros. A second cache system prototype based on POWER7 was developed to confirm feasibility of stacking $\mu {\rm P}$ and high density cache memory, with $> 2~{\rm GHz}$ operation. For test and assembly, a micro-electro-mechanical-system (MEMS) probe-card with an integrated active silicon chip, realized a 50 $\mu{\rm m}$ pitch micro-probing at-speed-active-test for known-good-die (KGD) sorting. Finally, oxide wafer bonding with Cu TSV demonstrated wafer-scale 3D integration, with TSV diameters as small as 1 $\mu{\rm m}$ . The paper concludes with comments on the challenges for future 3D DRAMs.

11 citations


Cites methods from "A 14 nm 1.1 Mb Embedded DRAM Macro ..."

  • ...Embedded DRAM (eDRAM) [6]–[8], employing a one-transistor and one-capacitor cell, provides 2....



Journal ArticleDOI
TL;DR: The design and implementation of an 80-kb logic-embedded non-volatile multi-time programmable memory (MTPM) with no added process complexity is described; high-temperature stress results show a projected data retention of 10 years at 125 °C.
Abstract: This paper describes the design and implementation of an 80-kb logic-embedded non-volatile multi-time programmable memory (MTPM) with no added process complexity. Charge trap transistors (CTTs) that exploit charge trapping and de-trapping behavior in high-K dielectric of 32-/22-nm Logic FETs are used as storage elements with logic-compatible programming voltages. A high-gain slew-sense amplifier (SA) is used to efficiently detect the threshold voltage difference ( $\Delta V_{\textrm {DIF}}$ ) between the true and complement FETs in the twin cell. Design-assist techniques including multi-step programming with over-write protection and block write algorithm are used to enhance the programming efficiency without causing a dielectric breakdown. High-temperature stress results show a projected data retention of 10 years at 125 °C with a signal loss of <30% that is margined in while programming, by employing a sense margining logic in the SA. Scalability of CTT has been established by the first demonstration of CTT-based MTPM in 14-nm bulk FinFET technology with read cycle time of 40 ns at 0.7-V VDD.

10 citations


Cites background from "A 14 nm 1.1 Mb Embedded DRAM Macro ..."

  • ...Embedding dynamic-random access memory [1], [2], built with...



Proceedings ArticleDOI
Taegeun Yoo, Hyunjoon Kim, Qian Chen, Tony Tae-Hyoung Kim, et al.
29 Jul 2019-
TL;DR: This work introduces a dot-product processing macro using an eDRAM array, explores its capability as an in-memory computing processing element, and investigates a method to maximize the retention time in conjunction with an analysis of device mismatch.
Abstract: Modern deep neural network (DNN) systems have evolved under ever-increasing demands to handle more complex and computation-heavy tasks. Traditional hardware designed for such tasks suffers from large memory size and power consumption due to extensive on-/off-chip memory access. In-memory computing, one of the promising solutions to this issue, dramatically reduces memory access and improves energy efficiency by utilizing the memory cell as both a data-storage and a computing element. Embedded DRAM (eDRAM) is one of the potential candidates for in-memory computation: its minimal use of circuit components and low static power consumption provide design advantages, while its relatively short retention time makes it unsuitable for certain applications. This work introduces a dot-product processing macro using an eDRAM array and explores its capability as an in-memory computing processing element. The proposed architecture implements a pair of 2T eDRAM cells as a processing unit that can store and operate on ternary weights using only four transistors. In addition, we investigate a method to maximize the retention time in conjunction with an analysis of device mismatch. The input/weight bit-precision reconfigurable 4T eDRAM processing array achieves an energy efficiency of 1.81 fJ/OP (including refresh energy) when operating with binary inputs and ternary weights.
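The binary-input, ternary-weight dot product described above can be mimicked in a few lines of Python. The cell-pair encoding below ((1,0) for +1, (0,1) for −1, (0,0) for 0) and the difference of two AND-and-accumulate sums are illustrative assumptions about how a 2T-cell pair could realize a ternary weight, not the paper's exact array design:

```python
def encode_ternary(weights):
    """Split ternary weights {-1, 0, +1} across a pair of binary cells:
    +1 -> (1, 0), -1 -> (0, 1), 0 -> (0, 0). Illustrative encoding."""
    pos = [1 if w == +1 else 0 for w in weights]
    neg = [1 if w == -1 else 0 for w in weights]
    return pos, neg

def dot_binary_ternary(inputs, weights):
    """Dot product of binary inputs with ternary weights, computed as the
    difference of two binary AND-and-accumulate sums (one per cell column)."""
    pos, neg = encode_ternary(weights)
    s_pos = sum(x & p for x, p in zip(inputs, pos))
    s_neg = sum(x & n for x, n in zip(inputs, neg))
    return s_pos - s_neg
```

Since each sum only ANDs and counts bits, it maps naturally onto charge accumulation along a bitline, with the subtraction handled by differential sensing.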

9 citations


Additional excerpts

  • ...With these advantages, high-performance server processors [9,10] and deep neural network (DNN) hardware accelerators [11,12] have adopted eDRAM as their on-chip memory....



References

Journal ArticleDOI
E. J. Nowak, Ingo Aller, Thomas Ludwig, Keunwoo Kim, et al.
TL;DR: For both low-power and high-performance applications, DGCMOS-FinFET offers a most promising direction for continued progress in VLSI.
Abstract: Double-gate devices will enable the continuation of CMOS scaling after conventional scaling has stalled. DGCMOS/FinFET technology offers a tactical solution to the gate dielectric barrier and a strategic path for silicon scaling to the point where only atomic fluctuations halt further progress. The conventional nature of the processes required to fabricate these structures has enabled rapid experimental progress in just a few years. Fully integrated CMOS circuits have been demonstrated in a 180 nm foundry-compatible process, and methods for mapping conventional, planar CMOS product designs to FinFET have been developed. For both low-power and high-performance applications, DGCMOS-FinFET offers a most promising direction for continued progress in VLSI.

397 citations


Journal ArticleDOI
01 Mar 2010-IEEE Micro
TL;DR: The Power7, IBM's seventh-generation Power chip, features a balanced multi-core design with eDRAM technology and four-way simultaneous multithreading (SMT4), delivering more than 4× the performance of the previous generation in the same power envelope.
Abstract: The Power7 is IBM's first eight-core processor, with each core capable of four-way simultaneous-multithreading operation. Its key architectural features include an advanced memory hierarchy with three levels of on-chip cache; embedded-DRAM devices used in the highest level of the cache; and a new memory interface. This balanced multicore design scales from 1 to 32 sockets in commercial and scientific environments.

257 citations


"A 14 nm 1.1 Mb Embedded DRAM Macro ..." refers methods in this paper

  • ...IBM introduced trench capacitor eDRAM into its high performance microprocessors beginning with 45nm and Power 7 [1] to provide a higher density cache without chip crossings....



Proceedings ArticleDOI
C.-H. Lin, Brian J. Greene, Shreesh Narasimha, J. Cai, et al.
01 Dec 2014-
Abstract: We present a fully integrated 14 nm CMOS technology featuring a FinFET architecture on an SOI substrate for a diverse set of SoC applications, including HP server microprocessors and LP ASICs. This SOI FinFET architecture is integrated with a 4th-generation deep-trench embedded DRAM to provide an ultra-dense (0.0174 μm²) memory solution for industry-leading 'scale-out' processor design. A broad range of Vts is enabled on chip through a unique dual-workfunction process applied to both NFETs and PFETs. This enables simultaneous optimization of both low-Vt (HP) and high-Vt (LP) devices without reliance on problematic approaches like heavy doping or Lgate modulation to create Vt differentiation. The SOI FinFET's excellent subthreshold behavior allows gate-length scaling to the sub-20 nm regime and superior low-Vdd operation. This leads to a substantial (>35%) performance gain at Vdd ≈ 0.8 V compared to the HP 22 nm planar predecessor technology. At the same time, the exceptional FE/BE reliability enables high-Vdd (>1.1 V) operation essential to the high single-thread performance for processors intended for 'scale-up' enterprise systems. A hierarchical BEOL with 15 levels of copper interconnect delivers both high-performance wireability as well as effective power-supply and clock distribution for very large (>600 mm²) SoCs.

116 citations


"A 14 nm 1.1 Mb Embedded DRAM Macro ..." refers methods in this paper

  • ...1(a), utilizes Replacement Metal Gate (RMG) SOI FinFET devices, resulting in a 33% shrink from the 22 nm bit cell [16], [17]....


  • ...The pass transistor/access device of the cell is a 3.5 nm thick oxide, fully depleted (FD) FinFET with an undoped channel....


  • ...This 22 nm design style has been successfully migrated into a 14 nm FinFET eDRAM [20] learning vehicle, complete with an ABIST engine, word-line charge pumps (VPP & VWL), and pad-cage interface circuitry as a system prototype....


  • ...Hence, FD SOI FinFETs can achieve high write back current without increasing the LBL capacitance, which is a key advantage to previous eDRAM technologies employing planar pass transistors....


  • ...By nature, SOI fully depleted FinFET devices, have low overall junction capacitance as the only junctions created in the SOI are in the horizontal plane of the device, i.e., between the body of the device and diffusion....



Proceedings ArticleDOI
Taejoong Song, Woojin Rim, Jong-Hoon Jung, Giyong Yang, et al.
06 Mar 2014-
TL;DR: This paper presents 14 nm FinFET-based 128 Mb 6T SRAM chips featuring low VMIN, using newly developed peripheral-assist techniques to overcome the bitcell challenges to high yield.
Abstract: With the explosive growth of battery-operated portable devices, the demand for low power and small size has been increasing for system-on-a-chip (SoC) designs. The FinFET is considered one of the most promising technologies for future low-power mobile applications because of its good scaling ability, high on-current, better short-channel effects (SCE) and subthreshold slope, and small leakage current [1]. As a key approach for low power, supply-voltage (VDD) scaling has been widely used in SoC design. However, SRAM is the limiting factor in voltage scaling, since all SRAM functions (read, write, and hold stability) are highly influenced by increased variations at low VDD, resulting in lower yield. In addition, the width-quantization property of the FinFET device reduces the design window for transistor sizing and increases the failure probability due to un-optimized bitcell sizing [1]. In order to overcome the bitcell challenges to high yield, peripheral-assist techniques are required. In this paper, we present 14 nm FinFET-based 128 Mb 6T SRAM chips featuring low VMIN with newly developed assist techniques.

111 citations


Proceedings ArticleDOI
Shreesh Narasimha1, Paul Chang1, Claude Ortolland1, David M. Fried1  +51 moreInstitutions (1)
01 Dec 2012-
TL;DR: A hierarchical BEOL with 15 levels of copper interconnect including self-aligned via processing delivers high performance with exceptional reliability in SOI CMOS 22nm technology.
Abstract: We present a fully-integrated SOI CMOS 22nm technology for a diverse array of high-performance applications including server microprocessors, memory controllers and ASICs. A pre-doped substrate enables scaling of this third generation of SOI deep-trench-based embedded DRAM for a dense high-performance memory hierarchy. Dual-Embedded stressor technology including SiGe and Si:C for improved carrier mobility in both PMOS and NMOS FETs is presented for the first time. A hierarchical BEOL with 15 levels of copper interconnect including self-aligned via processing delivers high performance with exceptional reliability.

90 citations


Performance Metrics

No. of citations received by the paper in previous years:

Year  Citations
2021  2
2019  4
2018  7
2017  2
2016  1