Journal ArticleDOI

A 1 Gb 2 GHz 128 GB/s bandwidth embedded DRAM in 22 nm tri-gate CMOS technology

TL;DR: An embedded DRAM (eDRAM) integrated into 22 nm CMOS logic technology using tri-gate high-k metal-gate transistors and MIM capacitors is described; multi-chip packaging with a Haswell-family Iris Pro™ die provides up to 75% performance improvement in silicon across a wide range of workloads.
Abstract: An embedded DRAM (eDRAM) integrated into 22 nm CMOS logic technology using tri-gate high-k metal-gate transistors and a MIM capacitor is described. A 1 Gb eDRAM die is designed, which includes fully integrated programmable charge pumps to over- and underdrive wordlines with output voltage regulation. The die area is 77 mm² and provides 64 GB/s Read and 64 GB/s Write at 1.05 V. 100 μs retention time is achieved at 95°C using the worst-case memory array stress patterns. The 1 Gb eDRAM die is multi-chip-packaged with a Haswell-family Iris Pro™ die to achieve a high-end graphics part, which provides up to 75% performance improvement in silicon across a wide range of workloads.
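
As a quick sanity check on the headline numbers, the sketch below works backward from the stated clocks (2 GHz array, 4 GHz on-package IO) and the 64 GB/s per-direction bandwidth to the datapath widths they imply; the widths are inferred here, not quoted from the paper.

```python
# Back-of-the-envelope check of the headline bandwidth figures.
# Clocks and per-direction bandwidths come from the paper; the
# implied bus widths are inferred here, NOT stated in the abstract.

ARRAY_CLOCK_HZ = 2e9   # 2 GHz eDRAM array clock (from the paper)
OPIO_CLOCK_HZ  = 4e9   # 4 GHz on-package IO clock (from the paper)
TARGET_BW      = 64e9  # 64 GB/s per direction (Read or Write)

# Width each interface would need to sustain 64 GB/s per direction.
opio_width_bits  = TARGET_BW / OPIO_CLOCK_HZ * 8   # -> 128 bits
array_width_bits = TARGET_BW / ARRAY_CLOCK_HZ * 8  # -> 256 bits

total_bw = 2 * TARGET_BW  # Read + Write streams combined -> 128 GB/s

print(f"OPIO width:  {opio_width_bits:.0f} bits/transfer")
print(f"Array width: {array_width_bits:.0f} bits/cycle")
print(f"Aggregate bandwidth: {total_bw/1e9:.0f} GB/s")
```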
Citations
Journal ArticleDOI
TL;DR: Trends in the design of devices and circuits for on-chip nonvolatile memory using memristive devices, as well as the challenges faced by researchers in its further development, are examined.
Abstract: Memristive devices have shown considerable promise for on-chip nonvolatile memory and computing circuits in energy-efficient systems. However, this technology is limited with regard to speed, power, VDDmin, and yield due to process variation in transistors and memristive devices, as well as the issue of read disturbance. This paper examines trends in the design of devices and circuits for on-chip nonvolatile memory using memristive devices, as well as the challenges faced by researchers in its further development. Several silicon-verified examples of circuitry are reviewed in this paper, including those aimed at high-speed, area-efficient, and low-voltage applications.

51 citations

Proceedings ArticleDOI
23 May 2016
TL;DR: In this paper, the authors propose threshold voltage-defined switches that camouflage a logic gate both logically and physically to resist reverse engineering (RE) and IP piracy; the proposed gate can function robustly as NAND, AND, NOR, OR, XOR, or XNOR using threshold-defined switches.
Abstract: The semiconductor supply chain is increasingly exposed to a variety of security attacks, such as Trojan insertion, cloning, counterfeiting, reverse engineering (RE), piracy of Intellectual Property (IP) or Integrated Circuits (IC), and side-channel analysis, due to the involvement of untrusted parties. In this paper, we propose threshold voltage-defined switches that camouflage the logic gate both logically and physically to resist RE and IP piracy. The proposed gate can function robustly as NAND, AND, NOR, OR, XOR, and XNOR using threshold-defined switches. We also propose a flavor of camouflaged gate that offers reduced functionality (NAND, NOR, and NOT) at much lower overhead. The camouflaged design operates at nominal voltage and obeys conventional reliability limits. A small fraction of gates can be camouflaged to make the RE effort extremely high. Simulation results indicate 46–53% area, 59–68% delay, and 52–76% power overhead when 5–15% of gates are identified and camouflaged using the proposed gate. A significantly higher RE effort is achieved when the proposed gate is employed in the netlist using controllability-, observability-, and Hamming-distance-sensitivity-based gate selection metrics.
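
To make the camouflaging idea concrete, here is a toy functional model (not the authors' transistor-level design): each camouflaged cell hides one of the six functions behind a threshold-voltage assignment, so an attacker who cannot resolve threshold voltages faces a search space exponential in the number of camouflaged gates. The key and gate counts below are hypothetical.

```python
# Toy model of logic camouflaging, illustrating why camouflaging a small
# fraction of gates makes reverse engineering (RE) expensive. The set of
# candidate functions matches the paper; the hidden "key" stands in for
# the threshold-voltage assignment.

FUNCTIONS = {
    "NAND": lambda a, b: 1 - (a & b),
    "AND":  lambda a, b: a & b,
    "NOR":  lambda a, b: 1 - (a | b),
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "XNOR": lambda a, b: 1 - (a ^ b),
}

def camouflaged_gate(key, a, b):
    """All six variants look physically identical; the hidden
    threshold-voltage assignment (the key) selects the function."""
    return FUNCTIONS[key](a, b)

def re_search_space(num_camouflaged):
    """Candidate netlists an attacker must consider: 6 options per cell."""
    return len(FUNCTIONS) ** num_camouflaged

for k in (10, 50, 100):
    print(f"{k} camouflaged gates -> {re_search_space(k):.3e} candidate netlists")
```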

40 citations


Cites methods from "A 1 Gb 2 GHz 128 GB/s bandwidth emb..."

  • ...VT modulation is a well-known technique used extensively [7,8] in semiconductor industry for trade-off between power, performance and...


Journal ArticleDOI
TL;DR: The proposed topology for CPU/GPU HSA improves application performance by 29 percent and reduces latency by 50 percent, while reducing energy consumption by 64.5 percent and area by 17.39 percent compared to a baseline mesh.
Abstract: Heterogeneous System Architectures (HSA) that integrate cores of different architectures (CPU, GPU, etc.) on a single chip are gaining significance for many classes of applications that need high performance. Networks-on-Chip (NoCs) in HSA are monopolized by high-volume GPU traffic, penalizing CPU application performance. In addition, building efficient interfaces between systems of different specifications while achieving optimal performance is a demanding task. Homogeneous NoCs, widely used for many-core systems, fall short of meeting these communication requirements. To achieve a high-performance interconnect for HSA, we propose the HyWin topology using mm-wave wireless links. The proposed topology implements sandboxed heterogeneous sub-networks, each designed to match the needs of one processing subsystem, which are then interconnected at a second level using a wireless network. The sandboxed sub-networks avoid conflicts between network requirements while providing optimal performance for their respective subsystems. The long-range wireless links provide a low-latency, low-energy inter-subsystem network that gives easy access to memory controllers and lower-level caches across the entire system. By implementing the proposed topology for a CPU/GPU HSA, we show that it improves application performance by 29 percent and reduces latency by 50 percent, while reducing energy consumption by 64.5 percent and area by 17.39 percent compared to a baseline mesh.
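
The sketch below is a rough structural illustration of the two-level idea, with made-up node counts and hub placement: each subsystem gets its own sandboxed wired sub-network, and one wireless hub per subsystem forms the second-level mm-wave network.

```python
# Illustrative two-level HyWin-style topology: sandboxed wired
# sub-networks per subsystem, bridged by a fully connected set of
# wireless hubs. Node counts and the chain sub-topology are
# illustrative stand-ins, not the paper's parameters.

from itertools import combinations

def chain_edges(nodes):
    """Wire a subsystem's nodes as a simple chain (stand-in for the
    subsystem-specific wired sub-network)."""
    return [(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]

cpu = [f"cpu{i}" for i in range(4)]   # CPU subsystem (illustrative)
gpu = [f"gpu{i}" for i in range(8)]   # GPU subsystem (illustrative)
mem = [f"mc{i}" for i in range(2)]    # memory controllers (illustrative)

edges = chain_edges(cpu) + chain_edges(gpu) + chain_edges(mem)

# Second level: one wireless hub per subsystem, fully connected by
# mm-wave links, so any subsystem reaches the memory controllers in
# a small, bounded number of hops.
hubs = ["wi_cpu", "wi_gpu", "wi_mem"]
edges += [("wi_cpu", "cpu0"), ("wi_gpu", "gpu0"), ("wi_mem", "mc0")]
wireless = list(combinations(hubs, 2))
edges += wireless

print(f"{len(edges)} links total, {len(wireless)} of them wireless")
```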

37 citations


Cites methods from "A 1 Gb 2 GHz 128 GB/s bandwidth emb..."

  • ...Emerging processor designs like [28] implement an off-chip embedded DRAM on the package for higher performance....


Journal ArticleDOI
TL;DR: This paper proposes a fully backward compatible extension to the standard HMC called the smart memory cube, and designs a high-bandwidth, low-latency, Advanced eXtensible Interface-4.0-compatible logic base (LoB) interconnect to serve the huge bandwidth demanded by the HMC's serial links and to provide extra bandwidth to a generic processor-in-memory (PIM) device embedded in the LoB.
Abstract: The hybrid memory cube (HMC) promises to improve bandwidth, power consumption, and density for next-generation main memory systems. In addition, 3-D integration gives a second shot at revisiting near-memory computation to fill the gap between processors and memories. In this paper, we study the infrastructure required inside the HMC to support near-memory computation in a modular and flexible fashion. We propose a fully backward compatible extension to the standard HMC called the smart memory cube, and design a high-bandwidth, low-latency, Advanced eXtensible Interface-4.0-compatible logic base (LoB) interconnect to serve the huge bandwidth demanded by the HMC's serial links and to provide extra bandwidth to a generic processor-in-memory (PIM) device embedded in the LoB. This interconnect features a novel address scrambling mechanism that reduces vault/bank conflicts and operates robustly even in the presence of pathological traffic patterns. Our cycle-accurate simulation results demonstrate that this interconnect can easily meet the demands of the latest HMC specifications (up to 205 GB/s read bandwidth with 4 serial links and 32 memory vaults for injected random traffic). It is further shown that the default addressing scheme of the HMC (low interleaving) is not reliable enough and performs poorly in the presence of specific traffic patterns from real applications, whereas the proposed scrambling mechanism remains robust even in those cases. The interference between the PIM traffic and the main links is shown to be negligible when the number of PIM ports is limited to 2, requesting up to 64 GB/s without pushing the system into saturation. Finally, logic synthesis with Synopsys Design Compiler confirms that our interconnect is implementable and effective in terms of power, area, and timing (power consumption less than 5 mW up to 1 GHz and area less than 0.4 mm²).
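
The snippet below illustrates why address scrambling helps, using a generic XOR-folding scheme: the 32-vault count matches the simulated configuration above, but the folding itself is an illustrative stand-in, not the paper's exact mechanism.

```python
# Minimal sketch of XOR-based address scrambling of the kind used to
# spread pathological strides across DRAM vaults. Low-order
# interleaving maps a power-of-two stride onto a single vault; folding
# higher address bits into the vault index with XOR breaks that pattern.

NUM_VAULTS = 32  # matches the HMC configuration simulated in the paper

def low_interleave(addr):
    """HMC-style low interleaving (the baseline the paper finds fragile)."""
    return (addr >> 5) % NUM_VAULTS  # skip 32B block offset, take vault bits

def scrambled(addr):
    """Fold higher bit groups into the vault index with XOR."""
    blk = addr >> 5
    v = blk & (NUM_VAULTS - 1)
    v ^= (blk >> 5) & (NUM_VAULTS - 1)    # fold in the next bit group
    v ^= (blk >> 10) & (NUM_VAULTS - 1)   # ...and one more
    return v

# A stride of NUM_VAULTS * 32 bytes hits one vault under low
# interleaving but spreads across vaults once scrambled.
stride = NUM_VAULTS * 32
addrs = [i * stride for i in range(64)]
print("plain:    ", len({low_interleave(a) for a in addrs}), "distinct vaults")
print("scrambled:", len({scrambled(a) for a in addrs}), "distinct vaults")
```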

31 citations


Cites background from "A 1 Gb 2 GHz 128 GB/s bandwidth emb..."

  • ...1It is worth mentioning that the integration of DRAM in logic process has recently become successful with acceptable cost margins [5]; nevertheless, the scalability and the flexibility of 3-D stacks in placing multigigabyte DRAMs close to logic make them more interesting candidates for near memory computation....


Proceedings ArticleDOI
30 May 2020
TL;DR: This paper proposes a comprehensive design spanning the device, circuit, architecture, and algorithm levels to build an ultra-low-power architecture for SNN and ANN inference, using spintronics-based magnetic tunnel junction devices that have been shown to function as both neuro-synaptic crossbars and thresholding neurons and can operate at ultra-low voltage and current levels.
Abstract: Brain-inspired cognitive computing has so far followed two major approaches: one uses multi-layered artificial neural networks (ANNs) to perform pattern-recognition-related tasks, whereas the other uses spiking neural networks (SNNs) to emulate biological neurons in an attempt to be as efficient and fault-tolerant as the brain. While there has been considerable progress in the former area due to a combination of effective training algorithms and acceleration platforms, the latter is still in its infancy due to the lack of both. SNNs have a distinct advantage over their ANN counterparts in that they are capable of operating in an event-driven manner, thus consuming very low power. Several recent efforts have proposed various SNN hardware design alternatives; however, these designs still incur considerable energy overheads. In this context, this paper proposes a comprehensive design spanning the device, circuit, architecture, and algorithm levels to build an ultra-low-power architecture for SNN and ANN inference. For this, we use spintronics-based magnetic tunnel junction (MTJ) devices that have been shown to function as both neuro-synaptic crossbars and thresholding neurons and can operate at ultra-low voltage and current levels. Using this MTJ-based neuron model and synaptic connections, we design a low-power chip that has the flexibility to be deployed for inference of SNNs, ANNs, and SNN-ANN hybrid networks, a distinct advantage compared to prior works. We demonstrate the competitive performance and energy efficiency of the SNNs as well as hybrid models on a suite of workloads. Our evaluations show that the proposed design, NEBULA, is up to 7.9x more energy-efficient than a state-of-the-art design, ISAAC, in the ANN mode. In the SNN mode, our design is about 45x more energy-efficient than a contemporary SNN architecture, INXS. Power comparison between NEBULA ANN and SNN modes indicates that the latter is at least 6.25x more power-efficient for the observed benchmarks.
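
The toy model below mirrors the dual-mode operation at a purely functional level: a crossbar dot product feeds either a continuous activation (ANN mode) or an integrate-and-fire threshold neuron (SNN mode). Plain Python stands in for the MTJ crossbar and thresholding devices, and all sizes and weights are illustrative.

```python
# Functional sketch of the two inference modes described above.
import random

def crossbar_dot(weights, inputs):
    """Analog crossbar MAC: each output column sums the weighted
    contributions of its row inputs."""
    return [sum(w * x for w, x in zip(col, inputs)) for col in weights]

def ann_neuron(psum):
    """ANN mode: ReLU-style continuous activation."""
    return max(0.0, psum)

class SpikingNeuron:
    """SNN mode: integrate incoming current and fire when the membrane
    potential crosses the threshold, then reset (integrate-and-fire)."""
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.v = 0.0
    def step(self, current):
        self.v += current
        if self.v >= self.threshold:
            self.v = 0.0
            return 1  # spike
        return 0

# 4 inputs -> 2 outputs with random illustrative weights.
weights = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
x = [1.0, 0.0, 1.0, 1.0]

print("ANN outputs:", [ann_neuron(p) for p in crossbar_dot(weights, x)])

# SNN mode: feed the same partial sum over 8 timesteps to one neuron.
neuron = SpikingNeuron(threshold=0.6)
psum = crossbar_dot(weights, x)[0]
print("SNN spike train (output 0):", [neuron.step(psum) for _ in range(8)])
```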

28 citations


Cites background from "A 1 Gb 2 GHz 128 GB/s bandwidth emb..."

  • ...Neural Core (NC) Component Param Spec Power Area (mm²) eDRAM [25] size 32 KB 9....


References
Proceedings ArticleDOI
03 Apr 2012
TL;DR: A high-performance, voltage-scalable 162Mb SRAM array is developed in a 22nm tri-gate bulk technology featuring 3rd-generation high-k metal-gate transistors and 5th-generation strained silicon, with write- and read-assist features that address process variation and fin quantization at 22nm.
Abstract: Future product applications demand increasing performance with reduced power consumption, which motivates the pursuit of high performance at reduced operating voltages. Random and systematic device variations pose significant challenges to SRAM VMIN and low-voltage performance as technology scaling follows Moore's law to the 22nm node. A high-performance, voltage-scalable 162Mb SRAM array is developed in a 22nm tri-gate bulk technology featuring 3rd-generation high-k metal-gate transistors and 5th-generation strained silicon. Tri-gate technology reduces short-channel effects (SCE) and improves subthreshold slope to provide 37% improved device performance at 0.7V. Continuous device width sizing in planar technology is replaced by combining parallel silicon fins to multiply drive current. Process-circuit co-optimization of transient voltage collapse write assist (TVC-WA) and wordline underdrive read assist (WLUD-RA) features addresses process variation and fin quantization at 22nm and enables a 175mV reduction in the supply voltage required for 2GHz SRAM operation. Figure 13.1.1 shows an SEM top-down view of a 0.092 μm² high-density 6T SRAM bitcell (HDC) and a 0.108 μm² low-voltage 6T SRAM cell (LVC) after gate and diffusion processing. Computational OPC/RET techniques extend the capabilities of 193nm immersion lithography to allow a 1.85× increase in array density relative to 32nm designs [1].
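
As a rough consistency check, the raw bit-array area implied by the two quoted bitcell sizes can be computed directly; this ignores all periphery (decoders, sense amplifiers, assist circuits), so it is only a lower bound on the macro footprint, and the binary interpretation of "Mb" is an assumption here.

```python
# Lower-bound cell-array area for the 162Mb macro from the quoted
# bitcell sizes. Periphery is deliberately ignored; Mb is taken as
# 2^20 bits (an assumption, not stated in the abstract).

MBITS   = 162 * 2**20   # 162 Mb capacity
HDC_UM2 = 0.092         # high-density 6T cell area (μm², from the abstract)
LVC_UM2 = 0.108         # low-voltage 6T cell area (μm², from the abstract)

for name, cell_um2 in (("HDC", HDC_UM2), ("LVC", LVC_UM2)):
    area_mm2 = MBITS * cell_um2 / 1e6  # 1 mm² = 1e6 μm²
    print(f"{name}: {area_mm2:.1f} mm² of raw cell array")
```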

177 citations


"A 1 Gb 2 GHz 128 GB/s bandwidth emb..." refers background or methods in this paper

  • ...0.029 μm², less than one-third of the high-density 6T-SRAM bitcell offered in the same 22 nm technology [8], enabling design of high-density memory....


  • ...An SRAM in the same 22 nm technology [8] is compared to eDRAM in Table I....


Journal ArticleDOI
TL;DR: Performance enhancements include a 102 GB/sec L4 eDRAM cache, hardware support for transactional synchronization, and new FMA instructions that double FP operations per clock.
Abstract: We describe the 4th Generation Intel® Core™ processor family (codenamed “Haswell”) implemented on Intel® 22 nm technology and intended to support form factors from desktops to fan-less Ultrabooks™. Performance enhancements include a 102 GB/sec L4 eDRAM cache, hardware support for transactional synchronization, and new FMA instructions that double FP operations per clock. Power improvements include Fully-Integrated Voltage Regulators (~50% battery life extension), new low-power states (95% standby power savings), optimized MCP I/O system (1.0-1.22 pJ/b), and improved DDR I/O circuits (40% active and 100x idle power savings). Other improvements include full-platform optimization via integrated display I/O interfaces.
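
The arithmetic behind "FMA instructions that double FP operations per clock" is easy to reproduce: a fused multiply-add counts as two floating-point operations issued by one instruction. The per-core port count below is the commonly cited Haswell figure and is an assumption here, not taken from the abstract.

```python
# Why FMA doubles peak FP operations per clock (FP32 shown).

SIMD_BITS   = 256                     # AVX2 vector width (from the abstract)
FP32_BITS   = 32
LANES       = SIMD_BITS // FP32_BITS  # 8 FP32 lanes per vector
FMA_PORTS   = 2                       # assumed: two FMA ports per core
OPS_PER_FMA = 2                       # multiply + add fused in one instruction

flops_per_core_per_clock = FMA_PORTS * LANES * OPS_PER_FMA
print(f"Peak FP32 FLOPs/core/clock with FMA: {flops_per_core_per_clock}")   # 32

# Without FMA the same ports issue a multiply OR an add per lane,
# halving the peak, hence "double FP operations per clock".
print(f"Peak without FMA:                    {flops_per_core_per_clock // 2}")
```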

107 citations


"A 1 Gb 2 GHz 128 GB/s bandwidth emb..." refers methods in this paper

  • ...13 shows a 4th generation Core™ processor, where the CPU is connected to the 1 Gb eDRAM die through on-package IO (OPIO) [9]....


  • ...A 1 Gb eDRAM array, as well as charge pumps and OPIO are well characterized in silicon, at both wafer and package levels....


  • ...The die also has embedded fuse arrays, test access port (TAP) for low-frequency DFx, high-frequency on-package IO (OPIO), programmable built-in self-test (PBIST) for wafer-level high-frequency array testing, and digital thermal sensors (DTS) to trigger if the die exceeds thermal limits....



  • ...This way, at 2 GHz array clock and 4 GHz OPIO clock, 64 GB/s Read and 64 GB/s Write bandwidth can be supported....


Proceedings ArticleDOI
06 Mar 2014
TL;DR: The primary goals for the Haswell program are platform integration and low power to enable smaller form factors; it also adds an Intel AVX2 instruction set that supports floating-point multiply-add (FMA) and 256b SIMD integer, achieving 2× the number of floating-point and integer operations over its predecessor.
Abstract: The 4th Generation Intel® Core™ processor, codenamed Haswell, is a family of products implemented on Intel 22nm Tri-gate process technology [1]. The primary goals for the Haswell program are platform integration and low power to enable smaller form factors. Haswell incorporates several building blocks, including: platform controller hubs (PCHs), memory, CPU, graphics and media processing engines, thus creating a portfolio of product segments from fan-less Ultrabooks™ to high-performance desktop, as shown in Fig. 5.9.1. It also integrates a number of new technologies: a fully integrated voltage regulator (VR) consolidating 5 platform VRs down to 1, on-die eDRAM cache for improved graphics performance, lower-power states, optimized IO interfaces, an Intel AVX2 instruction set that supports floating-point multiply-add (FMA), and 256b SIMD integer achieving 2× the number of floating-point and integer operations over its predecessor. The 22nm process is optimized for Haswell and includes 11 metal layers (2 additional metal layers vs. Ivy Bridge [2]), high-density metal-insulator-metal (MIM) capacitors, and is tuned for different leakage/speed targets based on the market segment. For example, in some low-power products, the process is optimized to reduce leakage by 75% at Vmin, while paying only 12% intrinsic device degradation at the high-voltage corner.

95 citations

Proceedings ArticleDOI
06 Mar 2014
TL;DR: The next-generation enterprise Xeon® server processor has 15 dual-threaded 64b Ivybridge cores and 37.5MB shared L3 cache; CMOS muxes embedded in the ring bus are programmably operable in a 2- or 3-column configuration.
Abstract: The next-generation enterprise Xeon® server processor has 15 dual-threaded 64b Ivybridge cores [1] and 37.5MB shared L3 cache. The system interface includes two on-chip memory controllers, each with two memory channels and supports multiple system topologies. The processor has 4.31B transistors in a high-κ metal-gate tri-gate 22nm CMOS technology with 9 metal layers [2]. The design supports a wide array of product offerings with thermal design power ranging from 40 to 150W and frequencies ranging from 1.4 to 3.8GHz. Fig. 5.4.1(a) shows the processor block diagram. The floorplan (Fig. 5.4.1(b)) is driven by the ring bus routability and latency, as well as the chop requirements to smaller core counts. The cores and associated L3 cache are organized in columns of five, with the ring bus segment embedded. The fully populated die has 15-cores in three columns. The 10-core chop removes the rightmost 3rd column and its dedicated top and bottom IOs. CMOS muxes embedded in the ring bus are programmably operable in a 2-or-3-columns configuration. The 6-core chop removes the 2nd and 4th rows from the 10-core die.

45 citations


"A 1 Gb 2 GHz 128 GB/s bandwidth emb..." refers methods in this paper

  • ...1 shows SRAM Last Level Cache (LLC) area in Intel Ivytown Xeon® processor in 22 nm technology [1]....


Proceedings ArticleDOI
18 Mar 2010
TL;DR: This high-performance DRAM macro is used to construct a large 32MB L3 cache on-chip, eliminating delay, area, and power from the off-chip interface, simultaneously improving system performance while reducing cost, power, and soft-error vulnerability.
Abstract: Logic-based embedded DRAM has matured into a wide range of ASIC applications, SRAM replacements [1], and off-chip caches for microprocessors [2]. While embedded DRAM has been leveraged in supercomputers such as IBM's BlueGene/L [3], its use has been limited to moderate-performance bulk logic technologies. Although prototypes have been demonstrated [4], DRAM has yet to be embedded on a high-performance microprocessor. This paper discloses an SOI DRAM macro implemented on-chip with the IBM POWER7™ high-performance microprocessor [5], and introduces enhancements to the micro sense amp (µSA) architecture [6]. This high-performance DRAM macro is used to construct a large 32MB L3 cache on-chip, eliminating delay, area, and power from the off-chip interface, simultaneously improving system performance while reducing cost, power, and soft-error vulnerability. Figure 19.1.1a shows an SEM of the 45nm SOI DRAM device and deep trench (DT) capacitor [7]. DT offers 25x more capacitance than planar structures and was also utilized to reduce on-chip voltage island supply noise.

39 citations
