Showing papers in "IEEE Journal of Solid-state Circuits in 2020"

Open Access

Journal Article•DOI•

C3SRAM: An In-Memory-Computing SRAM Macro Based on Robust Capacitive Coupling Computing Mechanism

Zhewei Jiang¹, Shihui Yin², Jae-sun Seo², Mingoo Seok¹•Institutions (2)

Columbia University¹, Arizona State University²

18 May 2020-IEEE Journal of Solid-state Circuits

TL;DR: The macro is an SRAM module with the circuits embedded in bitcells and peripherals to perform hardware acceleration for neural networks with binarized weights and activations and utilizes analog-mixed-signal capacitive-coupling computing to evaluate the main computations of binary neural networks, binary-multiply-and-accumulate operations.

Abstract: This article presents C3SRAM, an in-memory-computing SRAM macro. The macro is an SRAM module with the circuits embedded in bitcells and peripherals to perform hardware acceleration for neural networks with binarized weights and activations. The macro utilizes analog-mixed-signal (AMS) capacitive-coupling computing to evaluate the main computations of binary neural networks, binary-multiply-and-accumulate operations. Without the need to access the stored weights by individual row, the macro asserts all its rows simultaneously and forms an analog voltage at the read bitline node through capacitive voltage division. With one analog-to-digital converter (ADC) per column, the macro realizes fully parallel vector–matrix multiplication in a single cycle. The network type that the macro supports and the computing mechanism it utilizes are determined by the robustness and error tolerance necessary in AMS computing. The C3SRAM macro is prototyped in a 65-nm CMOS. It demonstrates an energy efficiency of 672 TOPS/W and a speed of 1638 GOPS (20.2 TOPS/mm²), achieving 3975× better energy–delay product than the conventional digital baseline performing the same operation. The macro achieves 98.3% accuracy for MNIST and 85.5% for CIFAR-10, which is among the best in-memory computing works in terms of energy efficiency and inference accuracy tradeoff.

144 citations
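
The capacitive voltage division described in the C3SRAM abstract can be illustrated with a deliberately idealized model. The sketch below assumes identical coupling capacitors, no parasitics, and a bitline that settles to a fraction of the supply equal to the fraction of rows whose binary product is +1. The function names and the `vdd` parameter are hypothetical and are not taken from the paper.

```python
def bmac_bitline_voltage(weights, activations, vdd=1.0):
    """Idealized bitline voltage and bMAC sum for one column.

    weights, activations: sequences of +1/-1, one entry per row.
    Assumption: equal capacitors and no parasitics, so the bitline
    voltage is vdd * (rows with product +1) / (number of rows).
    """
    products = [w * x for w, x in zip(weights, activations)]  # XNOR in +1/-1 form
    n_high = sum(1 for p in products if p == 1)
    v_bitline = vdd * n_high / len(products)
    return v_bitline, sum(products)


def voltage_to_bmac(v_bitline, n_rows, vdd=1.0):
    """Invert the ideal divider: sum = 2 * n_high - n_rows."""
    return round(2 * n_rows * v_bitline / vdd - n_rows)


v, s = bmac_bitline_voltage([1, -1, 1, 1], [1, 1, -1, 1])
print(v, s, voltage_to_bmac(v, 4))
```

In this model an ideal per-column ADC only needs to resolve the divider ratio to recover the accumulated sum; a real macro has to contend with mismatch, parasitics, and finite ADC resolution.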

Journal Article•DOI•

XNOR-SRAM: In-Memory Computing SRAM Macro for Binary/Ternary Deep Neural Networks

Shihui Yin¹, Zhewei Jiang², Jae-sun Seo¹, Mingoo Seok²•Institutions (2)

Arizona State University¹, Columbia University²

14 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: XNOR-SRAM is a mixed-signal in-memory computing (IMC) SRAM macro that computes ternary-XNOR-and-accumulate (XAC) operations in binary/ternary deep neural networks (DNNs) without row-by-row data access and offers one of the best tradeoffs between energy efficiency and DNN accuracy.

Abstract: We present XNOR-SRAM, a mixed-signal in-memory computing (IMC) SRAM macro that computes ternary-XNOR-and-accumulate (XAC) operations in binary/ternary deep neural networks (DNNs) without row-by-row data access. The XNOR-SRAM bitcell embeds circuits for ternary XNOR operations, which are accumulated on the read bitline (RBL) by simultaneously turning on all 256 rows, essentially forming a resistive voltage divider. The analog RBL voltage is digitized with a column-multiplexed 11-level flash analog-to-digital converter (ADC) at the XNOR-SRAM periphery. XNOR-SRAM is prototyped in a 65-nm CMOS and achieves an energy efficiency of 403 TOPS/W for ternary-XAC operations with 88.8% test accuracy for the CIFAR-10 data set at 0.6-V supply. This marks 33× better energy efficiency and 300× better energy–delay product than conventional digital hardware and also represents one of the best tradeoffs between energy efficiency and DNN accuracy.

130 citations
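
As a rough illustration of the ternary XAC operation and the 11-level digitization mentioned in the XNOR-SRAM abstract, the sketch below accumulates products of binary weights and ternary activations over a column and maps the sum to one of 11 codes. The uniform level spacing is an assumption made here for simplicity; the abstract does not state how the ADC levels are placed, and the function names are hypothetical.

```python
def ternary_xac(weights, activations):
    """Sum of products of binary weights (+1/-1) and ternary activations (-1/0/+1)."""
    return sum(w * a for w, a in zip(weights, activations))


def quantize_uniform(total, n_rows=256, levels=11):
    """Map a sum in [-n_rows, n_rows] to a code in 0..levels-1.

    Assumption: uniformly spaced levels, used only for illustration.
    """
    step = 2 * n_rows / (levels - 1)
    code = round((total + n_rows) / step)
    return max(0, min(levels - 1, code))  # clamp to the valid code range


total = ternary_xac([1, -1, -1], [1, -1, 0])
print(total, quantize_uniform(total))
```

Because a 256-row sum can take 513 distinct values while the converter resolves only 11, the digitization is lossy by design; the abstract reports the resulting accuracy on CIFAR-10.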

Journal Article•DOI•

A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing

Hongyang Jia¹, Hossein Valavi¹, Yinqi Tang¹, Jintao Zhang², Naveen Verma¹•Institutions (2)

Princeton University¹, IBM²

29 Apr 2020-IEEE Journal of Solid-state Circuits

TL;DR: This paper presents a programmable in-memory-computing processor, demonstrated in a 65nm CMOS technology, and takes the approach of tight coupling with an embedded CPU, through accelerator interfaces enabling integration in the standard processor memory space.

Abstract: In-memory computing (IMC) addresses the cost of accessing data from memory in a manner that introduces a tradeoff between energy/throughput and computation signal-to-noise ratio (SNR). However, low SNR posed a primary restriction to integrating IMC in larger, heterogeneous architectures required for practical workloads due to the challenges with creating robust abstractions necessary for the hardware and software stack. This work exploits recent progress in high-SNR IMC to achieve a programmable heterogeneous microprocessor architecture implemented in 65-nm CMOS and corresponding interfaces to the software that enables mapping of application workloads. The architecture consists of a 590-Kb IMC accelerator, configurable digital near-memory-computing (NMC) accelerator, RISC-V CPU, and other peripherals. To enable programmability, microarchitectural design of the IMC accelerator provides the integration in the standard processor memory space, area- and energy-efficient analog-to-digital conversion for interfacing to NMC, bit-scalable computation (1–8 b), and input-vector sparsity-proportional energy consumption. The IMC accelerator demonstrates excellent matching between computed outputs and idealized software-modeled outputs, at 1b-TOPS/W of 192|400 and 1b-TOPS/mm² of 0.60|0.24 for MAC hardware, at $V_{DD}$ of 1.2|0.85 V, both of which scale directly with the bit precision of the input vector and matrix elements. Software libraries developed for application mapping are used to demonstrate CIFAR-10 image classification with a ten-layer CNN, achieving accuracy, throughput, and energy of 89.3%|92.4%, 176|23 images/s, and 5.31|105.2 μJ/image, for 1|4 b quantization levels.

121 citations

Journal Article•DOI•

A Twin-8T SRAM Computation-in-Memory Unit-Macro for Multibit CNN-Based AI Edge Processors

Xin Si¹, Rui Liu², Shimeng Yu³, Ren-Shuo Liu⁴, Chih-Cheng Hsieh⁴, Kea-Tiong Tang⁴, Qiang Li¹, Meng-Fan Chang⁴, Jia-Jing Chen⁴, Yung-Ning Tu⁴, Wei-Hsing Huang⁴, Jing-Hong Wang⁴, Yen-Cheng Chiu⁴, Wei-Chen Wei⁴, Ssu-Yen Wu⁴, Xiaoyu Sun³•Institutions (4)

University of Electronic Science and Technology of China¹, Arizona State University², Georgia Institute of Technology³, National Tsing Hua University⁴

01 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: A static random access memory (SRAM) CIM unit-macro is presented that uses compact-rule compatible twin-8T cells for weighted CIM MAC operations to reduce area overhead and vulnerability to process variation, and an even–odd dual-channel (EODC) input mapping scheme to extend input bandwidth.

Abstract: Computation-in-memory (CIM) is a promising candidate to improve the energy efficiency of multiply-and-accumulate (MAC) operations of artificial intelligence (AI) chips. This work presents a static random access memory (SRAM) CIM unit-macro using: 1) compact-rule compatible twin-8T (T8T) cells for weighted CIM MAC operations to reduce area overhead and vulnerability to process variation; 2) an even–odd dual-channel (EODC) input mapping scheme to extend input bandwidth; 3) a two’s complement weight mapping (C2WM) scheme to enable MAC operations using positive and negative weights within a cell array in order to reduce area overhead and computational latency; and 4) a configurable global–local reference voltage generation (CGLRVG) scheme for kernels of various sizes and bit precision. A 64×60 b T8T unit-macro with 1-, 2-, 4-b inputs, 1-, 2-, 5-b weights, and up to 7-b MAC-value (MACV) outputs was fabricated as a test chip using a foundry 55-nm process. The proposed SRAM-CIM unit-macro achieved access times of 5 ns and energy efficiency of 37.5–45.36 TOPS/W under 5-b MACV output.

120 citations

Journal Article•DOI•

A 39-GHz 64-Element Phased-Array Transceiver With Built-In Phase and Amplitude Calibrations for Large-Array 5G NR in 65-nm CMOS

Yun Wang¹, Rui Wu¹, Jian Pang¹, Dongwon You¹, Ashbir Aviat Fadila¹, Rattanan Saengchan¹, Xi Fu¹, Daiki Matsumoto¹, Takeshi Nakamura¹, Ryo Kubozoe¹, Masaru Kawabuchi¹, Bangan Liu¹, Haosheng Zhang¹, Junjun Qiu¹, Hanli Liu¹, Naoki Oshima², Keiichi Motoi², Shinichi Hori², Kazuaki Kunihiro², Tomoya Kaneko², Atsushi Shirane¹, Kenichi Okada¹•Institutions (2)

Tokyo Institute of Technology¹, NEC²

01 Apr 2020-IEEE Journal of Solid-state Circuits

TL;DR: This article presents the first 39-GHz phased-array transceiver (TRX) chipset for fifth-generation new radio (5G NR), consisting of 4 sub-array TRX elements with local-oscillator (LO) phase-shifting architecture and built-in calibration on phase and amplitude.

Abstract: This article presents the first 39-GHz phased-array transceiver (TRX) chipset for fifth-generation new radio (5G NR). The proposed transceiver chipset consists of 4 sub-array TRX elements with local-oscillator (LO) phase-shifting architecture and built-in calibration on phase and amplitude. The calibration scheme is proposed to alleviate phase and amplitude mismatch between each sub-array TRX element, especially for a large-array transceiver system in the base station (BS). Based on LO phase-shifting architecture, the transceiver has a 0.04-dB maximum gain variation over the 360° full tuning range, allowing constant-gain characteristic during phase calibration. A phase-to-digital converter (PDC) and a high-resolution phase-detection mechanism are proposed for highly accurate phase calibration. The built-in calibration has a measured accuracy of 0.08° rms phase error and 0.01-dB rms amplitude error. Moreover, a pseudo-single-balanced mixer is proposed for LO-feedthrough (LOFT) cancellation and sub-array TRX LO-to-LO isolation. The transceiver is fabricated in standard 65-nm CMOS technology with flip-chip packaging. In a 1-m OTA measurement, the 8TX–8RX phased-array transceiver module supports 5G NR 400-MHz 256-QAM OFDMA modulation with −30.0-dB EVM. The 64-element transceiver has a maximum EIRP of 53 dBm. The four-element chip consumes a power of 1.5 W in the TX mode and 0.5 W in the RX mode.

118 citations

Journal Article•DOI•

A 28-GHz CMOS Phased-Array Beamformer Utilizing Neutralized Bi-Directional Technique Supporting Dual-Polarized MIMO for 5G NR

Jian Pang¹, Zheng Li¹, Ryo Kubozoe¹, Xueting Luo¹, Rui Wu¹, Yun Wang¹, Dongwon You¹, Ashbir Aviat Fadila¹, Rattanan Saengchan¹, Takeshi Nakamura¹, Joshua Alvin¹, Daiki Matsumoto¹, Bangan Liu¹, Aravind Tharayil Narayanan¹, Junjun Qiu¹, Hanli Liu¹, Zheng Sun¹, Hongye Huang¹, Korkut Kaan Tokgoz¹, Keiichi Motoi², Naoki Oshima², Shinichi Hori², Kazuaki Kunihiro², Tomoya Kaneko², Atsushi Shirane¹, Kenichi Okada¹•Institutions (2)

Tokyo Institute of Technology¹, NEC²

27 May 2020-IEEE Journal of Solid-state Circuits

TL;DR: A neutralized bi-directional technique is introduced in this work to reduce the chip area significantly, so that compact and low-cost 5G millimeter-wave MIMO systems could be realized.

Abstract: This article presents a low-cost and area-efficient 28-GHz CMOS phased-array beamformer chip for 5G millimeter-wave dual-polarized multiple-in-multiple-out (MIMO) (DP-MIMO) systems. A neutralized bi-directional technique is introduced in this work to reduce the chip area significantly. With the proposed technique, exactly the same circuit chain is shared between the transmitter and receiver. To further minimize the area, an active bi-directional vector-summing phase shifter is also introduced. Area-efficient and high-resolution active phase shifting could be realized in both TX and RX modes. In measurement, the achieved saturated output power for the TX-mode beamformer is 15.1 dBm. The RX-mode noise figure is 4.2 dB at 28 GHz. To evaluate the over-the-air performance, 16 H+16 V sub-array modules are implemented in this work. Each of the sub-array modules consists of four 4 H+4 V chips. Two sub-array modules in this work are capable of scanning the beam from −50° to +50°. A saturated EIRP of 45.6 dBm is realized by 32 TX-mode beamformers. Within 1-m distance, a maximum SC-mode data rate of 15 Gb/s and the 5G new radio downlink packets transmission in 256-QAM could be supported by the module. A 2×2 DP-MIMO communication is also demonstrated with two 5G new radio 64-QAM uplink streams. Thanks to the proposed area-efficient bi-directional technique, the required core area for a single element-beamformer is only 0.58 mm². Compact and low-cost 5G millimeter-wave MIMO systems could be realized.

113 citations

Journal Article•DOI•

A 28-nm Compute SRAM With Bit-Serial Logic/Arithmetic Operations for Programmable In-Memory Vector Computing

Jingcheng Wang¹, Xiaowei Wang¹, Charles Eckert¹, Arun Subramaniyan¹, Reetuparna Das¹, David Blaauw¹, Dennis Sylvester¹•Institutions (1)

University of Michigan¹

01 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: A general-purpose hybrid in-/near-memory compute SRAM (CRAM) that combines an 8T transposable bit cell with vector-based, bit-serial in-memory arithmetic to accommodate a wide range of bit-widths, as well as a complete set of operation types, including integer and floating-point addition, multiplication, and division.

Abstract: This article proposes a general-purpose hybrid in-/near-memory compute SRAM (CRAM) that combines an 8T transposable bit cell with vector-based, bit-serial in-memory arithmetic to accommodate a wide range of bit-widths, from single to 32 or 64 bits, as well as a complete set of operation types, including integer and floating-point addition, multiplication, and division. This approach provides the flexibility and programmability necessary for evolving software algorithms ranging from neural networks to graph and signal processing. The proposed design was implemented in a small Internet of Things (IoT) processor in the 28-nm CMOS consisting of a Cortex-M0 CPU and 8 CRAM banks of 16 kB each (128 kB total). The system achieves 475-MHz operation at 1.1 V and, with all CRAMs active, produces 30 GOPS or 1.4 GFLOPS on 32-bit operands. It achieves an energy efficiency of 0.56 TOPS/W for 8-bit multiplication and 5.27 TOPS/W for 8-bit addition at 0.6 V and 114 MHz.

103 citations
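
The vector-based, bit-serial arithmetic described in the compute SRAM abstract can be sketched in software. In the hypothetical model below, a vector is stored transposed, so that row i holds bit i of every element, and addition proceeds one bit position per step with a per-element carry. This is a generic illustration of bit-serial addition under those assumptions, not the paper's circuit or instruction set, and all function names are invented for the example.

```python
def to_bit_rows(values, width):
    """Transpose integers into bit rows: row i holds bit i of every element."""
    return [[(v >> i) & 1 for v in values] for i in range(width)]


def from_bit_rows(rows):
    """Rebuild integers from bit rows (row 0 is the least significant bit)."""
    n = len(rows[0])
    return [sum(rows[i][j] << i for i in range(len(rows))) for j in range(n)]


def bit_serial_add(a_rows, b_rows):
    """Add two transposed vectors, one bit position per step, LSB first."""
    carry = [0] * len(a_rows[0])
    out = []
    for a_bits, b_bits in zip(a_rows, b_rows):
        # Full adder applied to every element in parallel.
        out.append([a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carry)])
        carry = [(a & b) | (c & (a ^ b)) for a, b, c in zip(a_bits, b_bits, carry)]
    out.append(carry)  # final carry-out becomes an extra most-significant row
    return out


a, b = [3, 5, 255], [1, 10, 1]
print(from_bit_rows(bit_serial_add(to_bit_rows(a, 8), to_bit_rows(b, 8))))
```

The number of steps grows with the operand bit-width rather than with the vector length, which is why this style of computation can accommodate a wide range of bit-widths while operating on many elements at once.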

Journal Article•DOI•

An Energy-Efficient Comparator With Dynamic Floating Inverter Amplifier

Xiyuan Tang¹, Linxiao Shen¹, Begum Kasap², Xiangxing Yang¹, Wei Shi¹, Abhishek Mukherjee¹, David Z. Pan¹, Nan Sun¹•Institutions (2)

University of Texas at Austin¹, University of California, Davis²

01 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: An energy-efficient comparator design that achieves the highest reported comparator energy efficiency to the best of the authors' knowledge and greatly reduces the influence of the process corner and the input common-mode voltage on the comparator performance, including noise, offset, and delay.

Abstract: This article presents an energy-efficient comparator design. The pre-amplifier adopts an inverter-based input pair powered by a floating reservoir capacitor; it realizes both current reuse and dynamic bias, thereby significantly boosting $g_m/I_D$ and reducing noise. Moreover, it greatly reduces the influence of the process corner and the input common-mode voltage on the comparator performance, including noise, offset, and delay. A prototype comparator in 180 nm achieves 46-μV input-referred noise while consuming only 1 pJ per comparison under a 1.2-V supply. This represents a more than sevenfold energy-efficiency boost compared with a strong-arm (SA) latch. It achieves the highest reported comparator energy efficiency to the best of our knowledge.

99 citations

Journal Article•DOI•

300-GHz-Band 120-Gb/s Wireless Front-End Based on InP-HEMT PAs and Mixers

Hiroshi Hamada¹, Takuya Tsutsumi¹, Hideaki Matsuzaki¹, Takuya Fujimura², Ibrahim Abdo², Atsushi Shirane², Kenichi Okada², Go Itami¹, Ho-Jin Song¹, Hiroki Sugiyama¹, Hideyuki Nosaka¹•Institutions (2)

Nippon Telegraph and Telephone¹, Tokyo Institute of Technology²

13 Jul 2020-IEEE Journal of Solid-state Circuits

TL;DR: A 300-GHz-band 120-Gb/s wireless transceiver (TRX) front end using the in-house InP-based high-electron-mobility-transistor (InP-HEMT) technology for beyond-5G is developed.

Abstract: We developed a 300-GHz-band 120-Gb/s wireless transceiver (TRX) front end using our in-house InP-based high-electron-mobility-transistor (InP-HEMT) technology for beyond-5G. The TRX is composed of the RF power amplifiers (PAs), mixers, and local oscillation (LO) PAs, which are all packaged in individual waveguide (WG) modules by using a ridge coupler for low-loss WG-to-IC transition. RF PAs are designed using the low-impedance inter-stage-matching technique to reduce the inter-stage matching loss of the amplifier stages, and the back-side DC line (BDCL) technique is used to simplify the layout and to improve the gain of the PAs. The fabricated RF PAs show a high output 1-dB compression point of more than 6 dBm from 278 to 302 GHz. The mixers are used for both up- and down-conversion in the transmitter and receiver. These mixers are designed to have high conversion gain (CG) over the wideband even after packaging by enhancing the isolation between the RF and IF ports. The measured CG of the mixer module is −15 dB, and the 3-dB IF-bandwidth is 32 GHz. The LO PAs are also designed using the BDCL technique so that they can supply the required LO power to the mixers. The TRX with these InP building blocks enables the data transmission of a 120 Gb/s 16QAM signal over a link distance of 9.8 m.

87 citations

Journal Article•DOI•

Embedded 1-Mb ReRAM-Based Computing-in-Memory Macro With Multibit Input and Weight for CNN-Based AI Edge Processors

Cheng-Xin Xue¹, Ting-Wei Chang¹, Tung-Cheng Chang¹, Hui-Yao Kao¹, Yen-Cheng Chiu¹, Chun-Ying Lee¹, Ya-Chin King¹, Chrong Jung Lin¹, Ren-Shuo Liu¹, Chih-Cheng Hsieh¹, Kea-Tiong Tang¹, Wei-Hao Chen¹, Meng-Fan Chang¹, Je-Syu Liu¹, Jiafang Li¹, Wei-Yu Lin¹, Wei-En Lin¹, Jing-Hong Wang¹, Wei-Chen Wei¹, Tsung-Yuan Huang¹•Institutions (1)

National Tsing Hua University¹

01 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: This article proposes a serial-input non-weighted product (SINWP) structure; a down-scaling weighted current translator and positive–negative current-subtractor scheme; a current-aware bitline clamper scheme; and a triple-margin small-offset current-mode sense amplifier (TMCSA).

Abstract: Computing-in-memory (CIM) based on embedded nonvolatile memory is a promising candidate for energy-efficient multiply-and-accumulate (MAC) operations in artificial intelligence (AI) edge devices. However, circuit design for NVM-based CIM (nvCIM) imposes a number of challenges, including an area-latency-energy tradeoff for multibit MAC operations, pattern-dependent degradation in signal margin, and small read margin. To overcome these challenges, this article proposes the following: 1) a serial-input non-weighted product (SINWP) structure; 2) a down-scaling weighted current translator (DSWCT) and positive–negative current-subtractor (PN-ISUB); 3) a current-aware bitline clamper (CABLC) scheme; and 4) a triple-margin small-offset current-mode sense amplifier (TMCSA). A 55-nm 1-Mb ReRAM-CIM macro was fabricated to demonstrate the MAC operation of 2-b-input, 3-b-weight with 4-b-out. This nvCIM macro achieved $T_{\text{MAC}} = 14.6$ ns at 4-b-out with peak energy efficiency of 53.17 TOPS/W.

76 citations

Journal Article•DOI•

Tianjic: A Unified and Scalable Chip Bridging Spike-Based and Continuous Neural Computation

Lei Deng¹, Guanrui Wang¹, Guoqi Li¹, Shuangchen Li², Ling Liang², Maohua Zhu², Yujie Wu¹, Z. Yang¹, Zhe Zou¹, Jing Pei¹, Zhenzhi Wu, Xing Hu², Yufei Ding², Wei He, Yuan Xie², Luping Shi¹•Institutions (2)

Tsinghua University¹, University of California, Santa Barbara²

13 Feb 2020-IEEE Journal of Solid-state Circuits

TL;DR: A unified model description framework and a unified processing architecture (Tianjic), which covers the full stack from software to hardware, and a compatible routing infrastructure that enables homogeneous and heterogeneous scalability on a decentralized many-core network.

Abstract: Toward the long-standing dream of artificial intelligence, two successful solution paths have been paved: 1) neuromorphic computing and 2) deep learning. Recently, they tend to interact for simultaneously achieving biological plausibility and powerful accuracy. However, models from these two domains have to run on distinct substrates, i.e., neuromorphic platforms and deep learning accelerators, respectively. This architectural incompatibility greatly compromises the modeling flexibility and hinders promising interdisciplinary research. To address this issue, we build a unified model description framework and a unified processing architecture (Tianjic), which covers the full stack from software to hardware. By implementing a set of integration and transformation operations, Tianjic is able to support spiking neural networks, biological dynamic neural networks, multilayered perceptron, convolutional neural networks, recurrent neural networks, and so on. A compatible routing infrastructure enables homogeneous and heterogeneous scalability on a decentralized many-core network. Several optimization methods are incorporated, such as resource and data sharing, near-memory processing, compute/access skipping, and intra-/inter-core pipeline, to improve performance and efficiency. We further design streaming mapping schemes for efficient network deployment with a flexible tradeoff between execution throughput and resource overhead. A 28-nm prototype chip is fabricated with >610-GB/s internal memory bandwidth. A variety of benchmarks are evaluated and compared with GPUs and several existing specialized platforms. In summary, the fully unfolded mapping can achieve significantly higher throughput and power efficiency; the semi-folded mapping can save 30× resources while still presenting comparable performance on average. Finally, two hybrid-paradigm examples, a multimodal unmanned bicycle and a hybrid neural network, are demonstrated to show the potential of our unified architecture. This article paves a new way to explore neural computing.

Journal Article•DOI•

Vocell: A 65-nm Speech-Triggered Wake-Up SoC for 10-μW Keyword Spotting and Speaker Verification

J. S. P. Giraldo¹, Steven Lauwereins, Komail Badami, Marian Verhelst¹•Institutions (1)

Katholieke Universiteit Leuven¹

03 Feb 2020-IEEE Journal of Solid-state Circuits

TL;DR: This article presents a complete mixed-signal system-on-chip, capable of directly interfacing to an analog microphone and performing keyword spotting (KWS) and speaker verification (SV), without any need for further external accesses.

Abstract: The use of speech-triggered wake-up interfaces has grown significantly in the last few years for use in ubiquitous and mobile devices. Since these interfaces must always be active, power consumption is one of their primary design metrics. This article presents a complete mixed-signal system-on-chip, capable of directly interfacing to an analog microphone and performing keyword spotting (KWS) and speaker verification (SV), without any need for further external accesses. Through the use of: 1) an integrated single-chip digital-friendly design; 2) hardware-aware algorithmic optimization; and 3) memory- and power-optimized accelerators, ultra-low power is achieved while maintaining high accuracy for speech recognition tasks. The 65-nm implementation achieves 18.3-μW worst case power consumption or 10.6-μW power for typical real-time scenarios, 10× below state of the art (SoA).

Journal Article•DOI•

A 12-b 18-GS/s RF Sampling ADC With an Integrated Wideband Track-and-Hold Amplifier and Background Calibration

Ahmed Mohamed Abdelatty Ali¹, Huseyin Dinc¹, Paritosh Bhoraskar¹, Scott Gregory Bardsley¹, Christopher Daniel Dillon¹, Matthew D. McShea¹, Joel Prabhakar Periathambi¹, Scott Puckett¹•Institutions (1)

Analog Devices¹

30 Sep 2020-IEEE Journal of Solid-state Circuits

TL;DR: A 12-b 18-GS/s analog-to-digital converter (ADC) implemented in 16-nm FinFET process achieves 80% higher sample rate and 2.4× higher input bandwidth than the fastest state-of-the-art with similar performance, and incorporates a THA that supports a 3.3× higher non-interleaved sample rate.

Abstract: We discuss a 12-b 18-GS/s analog-to-digital converter (ADC) implemented in 16-nm FinFET process. The ADC is composed of an integrated high-speed track-and-hold amplifier (THA) driving up to eight interleaved pipeline ADCs that employ open-loop inter-stage amplifiers. Up to 10 GS/s, the THA operates at the full sampling rate using a non-interleaved single sample network, thereby eliminating the interleaving sampling time and bandwidth mismatch. Above 10 GS/s, the THA is programmed to use two ping-ponged, or an optional (2 + 1) randomized, sample networks to spread the residual post-calibration interleaving spurs in the noise floor. The THA enables an input bandwidth of 18 GHz and employs dither injection and optional pseudorandom chopping. In the pipeline stages, dither-based background calibration detects and corrects gain, settling, memory, and kick-back errors. New dither-based background calibration algorithms are employed to detect and correct the arbitrary non-linearity in the form of integral non-linearity (INL) breaks and harmonic distortion up to the fifth order in the THA and in the references, DACs, and inter-stage open-loop amplifiers of the pipeline ADCs. Moreover, new dither-based background calibration is implemented to detect and correct the chopping non-idealities, memory errors, interleaving mismatches, and order-dependent randomization errors. Compared to the fastest state-of-the-art with similar performance, this ADC achieves 80% higher sample rate and 2.4× higher input bandwidth, and incorporates a THA that supports a 3.3× higher non-interleaved sample rate.

Journal Article•DOI•

A Scalable Cryo-CMOS Controller for the Wideband Frequency-Multiplexed Control of Spin Qubits and Transmons

Jeroen P. G. van Dijk¹, Bishnu Patra¹, Sushil Subramanian², Xiao Xue¹, Nodar Samkharadze, Andrea Corna¹, Charles Jeon², Farhana Sheikh², Esdras Juarez-Hernandez², Brando Perez Esparza², Huzaifa Rampurawala², Brent Carlton², Surej Ravikumar², Carlos Nieva², Sungwon Kim², Hyung-Jin Lee², Amir Sammak, Giordano Scappucci¹, Menno Veldhorst¹, Lieven M. K. Vandersypen¹, Edoardo Charbon³, Stefano Pellerano², Masoud Babaie¹, Fabio Sebastiano¹•Institutions (3)

Delft University of Technology¹, Intel², Kavli Institute of Nanoscience³

29 Sep 2020-IEEE Journal of Solid-state Circuits

TL;DR: A cryo-CMOS microwave signal generator for frequency-multiplexed control of 4×32 qubits is presented; by exploiting the on-chip 4096-instruction memory, the capability to translate quantum algorithms to microwave signals has been demonstrated by coherently controlling a spin qubit at both 14 and 18 GHz.

Abstract: Building a large-scale quantum computer requires the co-optimization of both the quantum bits (qubits) and their control electronics. By operating the CMOS control circuits at cryogenic temperatures (cryo-CMOS), and hence in close proximity to the cryogenic solid-state qubits, a compact quantum-computing system can be achieved, thus promising scalability to the large number of qubits required in a practical application. This work presents a cryo-CMOS microwave signal generator for frequency-multiplexed control of 4×32 qubits (32 qubits per RF output). A digitally intensive architecture offering full programmability of phase, amplitude, and frequency of the output microwave pulses and a wideband RF front end operating from 2 to 20 GHz allow targeting both spin qubits and transmons. The controller comprises a qubit-phase-tracking direct digital synthesis (DDS) back end for coherent qubit control and a single-sideband (SSB) RF front end optimized for minimum leakage between the qubit channels. Fabricated in Intel 22-nm FinFET technology, it achieves a 48-dB SNR and 45-dB spurious-free dynamic range (SFDR) in a 1-GHz data bandwidth when operating at 3 K, thus enabling high-fidelity qubit control. By exploiting the on-chip 4096-instruction memory, the capability to translate quantum algorithms to microwave signals has been demonstrated by coherently controlling a spin qubit at both 14 and 18 GHz.

Journal Article•DOI•

A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm

Brian Zimmer¹, Rangharajan Venkatesan¹, Yakun Sophia Shao², Jason Clemons¹, Matthew Fojtik¹, Nan Jiang¹, Ben Keller¹, Alicia Klinefelter¹, Nathaniel Pinckney¹, Priyanka Raina³, Stephen G. Tell¹, Yanqing Zhang¹, William J. Dally¹, Joel Emer¹, C. Thomas Gray¹, Stephen W. Keckler¹, Brucek Khailany¹•Institutions (3)

Nvidia¹, University of California, Berkeley², Stanford University³

14 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: A scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS) enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains.

Abstract: Custom accelerators improve the energy efficiency, area efficiency, and performance of deep neural network (DNN) inference. This article presents a scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic chip are optimal for specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. Communication energy is minimized with large on-chip distributed weight storage and a hierarchical network-on-chip and network-on-package, and inference energy is minimized through extensive data reuse. The 16-nm prototype achieves 1.29-TOPS/mm² area efficiency, 0.11 pJ/op (9.5 TOPS/W) energy efficiency, 4.01-TOPS peak performance for a one-chip system, and 127.8 peak TOPS and 1903 images/s ResNet-50 batch-1 inference for a 36-chip system.

Journal Article•DOI•

A 20-GHz 1.9-mW LNA Using $g_m$-Boost and Current-Reuse Techniques in 65-nm CMOS for Satellite Communications

Jiajun Zhang¹, Dixian Zhao¹, Xiaohu You¹•Institutions (1)

Southeast University¹

01 Jun 2020-IEEE Journal of Solid-state Circuits

TL;DR: A 20-GHz low-power low-noise amplifier (LNA) in 65-nm CMOS is presented, together with an elaborate analysis of the current-reused CG–CS LNA using a transformer-based $g_m$-boost technique and a transformer-based MCR.

Abstract: A 20-GHz low-power low-noise amplifier (LNA) in 65-nm CMOS is presented. The LNA cascades a single-ended $g_m$-boosted common-gate (CG) stage and a differential neutralized common-source (CS) stage. A current-reuse technique is employed to save power with little deterioration in gain and noise figure (NF). The transformer-based $g_m$-boost technique in the CG stage and the neutralization technique in the CS stage further enhance the RF performance. An inter-stage magnetically coupled resonator (MCR) extends the bandwidth. An elaborate analysis of the current-reused CG–CS LNA using a transformer-based $g_m$-boost technique and transformer-based MCR is presented. Fabricated in 65-nm CMOS technology, the LNA achieves a measured power gain of 14.9 dB at 21 GHz with a −3-dB bandwidth of 4.8 GHz. The lowest NF is 3.3 dB at 19.5 GHz and is below 4 dB from 17 to 21 GHz. The LNA consumes 1.9 mW from a 1-V supply, with a chip area of 600 μm × 700 μm.

Journal Article•DOI•

Design and Analysis of Enhanced Mixer-First Receivers Achieving 40-dB/decade RF Selectivity

Sashank Krishnamurthy¹, Ali M. Niknejad¹•Institutions (1)

University of California, Berkeley¹

01 May 2020-IEEE Journal of Solid-state Circuits

TL;DR: A “second-order” passive mixer-first receiver is proposed to improve channel selectivity, linearity, and noise figure (NF) in the presence of out-of-band blockers, by presenting an impedance that rolls off at 40 dB/decade as the load to an N-path filter.

Abstract: A “second-order” passive mixer-first receiver is proposed to improve channel selectivity, linearity, and noise figure (NF) in the presence of out-of-band blockers, by presenting an impedance that rolls off at 40 dB/decade as the load to an N-path filter. The synthesis of this impedance is described in a step-by-step manner starting from the required impedance transfer function to its actual circuit realization. Various tradeoffs and limitations of the architecture are described in detail, and layout-related techniques are also provided. Two integrated circuit prototypes were fabricated in 28-nm bulk CMOS as proof of concept for this circuit, including a low-power version. The receiver, capable of broadband operation from 0.2 to 2 GHz, achieves an out-of-band IIP3 of +33 dBm and a blocker P1dB of +12 dBm. Additionally, it achieves an NF of 4.4 dB with less than 2-dB degradation in NF for a 0-dBm blocker.

Journal Article•DOI•

A 24.5–43.5-GHz Ultra-Compact CMOS Receiver Front End With Calibration-Free Instantaneous Full-Band Image Rejection for Multiband 5G Massive MIMO

Min-Yu Huang¹, Taiyun Chi, Sensen Li¹, Tzu-Yuan Huang¹, Hua Wang¹•Institutions (1)

Georgia Institute of Technology¹

01 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: The first CMOS RX front end that covers 24.5–43.5-GHz mm-Wave 5G bands and supports instantaneous full-band IR with no calibration, switching/tuning elements, or external controls is presented, enabling future wideband low-latency 5G MIMOs.

Abstract: This article presents an extremely broadband 24.5–43.5 GHz receiver (RX) achieving 32–56-dB instantaneous full-band image rejection (IR), which supports multiple major mm-Wave 5G bands at 24.5/28/37/39/43 GHz. A compact transformer-based I/Q network (0.14 mm2) is proposed to generate high-precision LO I/Q signals at millimeter-wave (mm-Wave) and provide built-in load impedance up-transformation for passive voltage amplification, boosting the LO swing for a higher RX conversion gain (CG). The high-quality differential I/Q generation is measured with phase/amplitude variation less than ±1.8°/±0.15 dB over an instantaneous wide bandwidth of 25–50 GHz without any calibration or switching/tunable elements. The RX is measured with a peak 35.2-dB CG and 18-dB gain tuning to accommodate complex EM environments. The RX modulation tests successfully demonstrate receiving 18-Gb/s 64-QAM and 14.4-Gb/s 256-QAM signals. In addition, the RX is tested with concurrent injection of a desired signal and an image, while the image uses the same wideband modulation scheme and data rate as the desired signal. The RX successfully rejects the wideband images and receives the desired signals of 12-Gb/s 64-QAM with −27.6-dB EVM and 8-Gb/s 256-QAM with −33.47-dB EVM. To the best of our knowledge, this article presents the first CMOS RX front end that covers 24.5–43.5-GHz mm-Wave 5G bands and supports instantaneous full-band IR with no calibration, switching/tuning elements, or external controls, enabling future wideband low-latency 5G MIMOs.

Journal Article•DOI•

STICKER: An Energy-Efficient Multi-Sparsity Compatible Accelerator for Convolutional Neural Networks in 65-nm CMOS

Zhe Yuan¹, Yongpan Liu¹, Jinshan Yue¹, Yixiong Yang¹, Jingyu Wang¹, Xiaoyu Feng¹, Jian Zhao², Xueqing Li¹, Huazhong Yang¹•Institutions (2)

Tsinghua University¹, Shanghai Jiao Tong University²

01 Feb 2020-IEEE Journal of Solid-state Circuits

TL;DR: Three new features are proposed in this article to support wide sparsity distribution efficiently and include a multi-sparsity-compatible set-associative convolution processing element (PE) array, designed to efficiently carry out convolution operations under different sparsity modes.

Abstract: STICKER is an energy-efficient convolutional neural network (NN) processor. It improves energy efficiency mainly by making full use of sparsity. Network sparsity can potentially lower storage and computation requirements. However, the sparsity distribution of both activations and weights ranges from 2% to 99% in different layers or models. Therefore, good support for the sparsity distribution is key to improving energy efficiency. Three new features are proposed in this article to support a wide sparsity distribution efficiently. First, multi-sparsity control and data flow are implemented for finer sparsity granularity support. The control can automatically switch the processor among nine sparsity modes for higher energy efficiency. Second, a multi-mode hierarchical data memory, which can be reconfigured for networks with different sparsity modes, is designed for higher storage efficiency. Third, a multi-sparsity-compatible set-associative convolution processing element (PE) array is designed to efficiently carry out convolution operations under different sparsity modes, especially when both activations and weights are sparse. STICKER was implemented in a 65-nm CMOS technology. With its wide-range sparsity-supported capacity, the peak energy efficiency reaches 62.1 TOPS/W when sparsity ratios of both activations and weights are 5%. In a completely pruned AlexNet model, STICKER achieves 2.82 TOPS/W energy efficiency, 1.8× higher than that of the state-of-the-art processors.

Journal Article•DOI•

A 19.5-GHz 28-nm Class-C CMOS VCO, With a Reasonably Rigorous Result on $1/f$ Noise Upconversion Caused by Short-Channel Effects

Alessandro Franceschin¹, Pietro Andreani², Fabio Padovan³, Matteo Bassi³, Andrea Bevilacqua¹•Institutions (3)

University of Padua¹, Lund University², Infineon Technologies³

13 May 2020-IEEE Journal of Solid-state Circuits

TL;DR: The design is complemented by a theoretical investigation of noise upconversion caused by short-channel effects in the cross-coupled transistors, obtaining the first instance of a closed-form phase noise expression in the $1/f^{3}$ region.

Abstract: Class-C operation is leveraged to implement a $K$-band CMOS voltage-controlled oscillator (VCO) where the upconversion of $1/f$ current noise from the cross-coupled transistors in the oscillator core is robustly contained at a very low level. Implemented in a bulk 28-nm CMOS technology, the 12%-tuning-range VCO shows a phase noise as low as −112 dBc/Hz at 1-MHz offset (−86 dBc/Hz at 100-kHz offset) from a 19.5-GHz carrier while consuming 20.7 mW, achieving a figure of merit (FoM) of −185 dBc/Hz. The design is complemented by a theoretical investigation of $1/f$ noise upconversion caused by short-channel effects in the cross-coupled transistors, obtaining the first instance of a closed-form phase noise expression in the $1/f^{3}$ region.
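
The quoted FoM can be cross-checked against the widely used oscillator figure of merit, FoM = L(Δf) − 20·log10(f0/Δf) + 10·log10(P / 1 mW). The sketch below assumes this is the definition behind the abstract's number (the abstract does not state it); the function name is illustrative.

```python
import math


def vco_fom_dbc_hz(phase_noise_dbc_hz: float, carrier_hz: float,
                   offset_hz: float, power_w: float) -> float:
    """Common oscillator FoM (assumed definition, not quoted from the paper)."""
    return (phase_noise_dbc_hz
            - 20 * math.log10(carrier_hz / offset_hz)  # normalize to the carrier
            + 10 * math.log10(power_w / 1e-3))         # normalize to 1 mW


# Values taken from the abstract: -112 dBc/Hz at 1 MHz, 19.5 GHz, 20.7 mW.
fom = vco_fom_dbc_hz(-112.0, 19.5e9, 1e6, 20.7e-3)
print(f"FoM = {fom:.1f} dBc/Hz")
```

With the abstract's values this evaluates to about −184.6 dBc/Hz, consistent with the rounded −185 dBc/Hz.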

Journal Article•DOI•

A CMOS 76–81-GHz 2-TX 3-RX FMCW Radar Transceiver Based on Mixed-Mode PLL Chirp Generator

Taikun Ma¹, Wei Deng¹, Zipeng Chen¹, Jianxi Wu¹, Wei Zheng, Shufu Wang, Nan Qi², Yibo Liu¹, Baoyong Chi¹•Institutions (2)

Tsinghua University¹, Chinese Academy of Sciences²

01 Feb 2020-IEEE Journal of Solid-state Circuits

TL;DR: A fully integrated 76–81-GHz frequency-modulated, continuous-wave (FMCW) radar transceiver (TRX) in a 65-nm CMOS is presented and real-time experimental results show that the distance and the angular resolution of the MIMO radar achieved are 5 cm and 9°.

Abstract: A fully integrated 76–81-GHz frequency-modulated, continuous-wave (FMCW) radar transceiver (TRX) in a 65-nm CMOS is presented. Two transmitters (TXs) and three receivers (RXs) are integrated for multiple-input multiple-output (MIMO) processing. A 38.5-GHz mixed-mode phase-locked loop (PLL) with reconfigurable loop bandwidth and a frequency doubling scheme are employed to generate the reconfigurable FMCW chirp waveforms. The coarse-to-fine-segmented current DAC is utilized to support sawtooth FMCW chirps with fast frequency ramping-down capability, and the delay lock loop (DLL)-based delay time calibration is used to improve the linearity of the embedded 2-D Vernier time-to-digital converter (TDC). Passive voltage-mode down-conversion is utilized to improve the RX linearity against TX leakage and short-range interference. A bottom-switching Gilbert-type modulator in the TX is proposed to realize the bi-phase modulation, and the magnetically coupled resonator technique is used to effectively expand the link bandwidth. The measurement results show that the FMCW TRX could generate reconfigurable chirps with the bandwidth from 250 MHz to 4 GHz and the period from 30 μs to 10 ms. The root-mean-square (rms) frequency error is 110 kHz for a sawtooth chirp with 4-GHz bandwidth and 300-μs period. The TX maximum output power is 13.4 dBm and is adjustable within 3 dB by reconfiguring its low dropout regulator (LDO) voltage. The RX achieves a 15.3-dB noise figure at 600-kHz IF and a −8.5-dBm RF input-referred P1dB. The overall power consumption is 921 mW, with two TXs and three RXs powered ON. Based on the proposed TRX chip, prototype hardware and a data process algorithm are developed. Real-time experimental results show that the distance and the angular resolution of the MIMO radar achieved are 5 cm and 9°, respectively.
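
For context on the chirp parameters, the textbook FMCW range-resolution bound is ΔR = c / (2B) for a chirp of bandwidth B. The sketch below applies this standard relation to the bandwidths quoted in the abstract; it is an idealized bound and is not the paper's own derivation. The measured 5-cm system resolution also depends on processing choices that the abstract does not detail. Function names are illustrative.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def range_resolution_m(chirp_bandwidth_hz: float) -> float:
    """Ideal FMCW range resolution, delta_R = c / (2 * B)."""
    return SPEED_OF_LIGHT_M_PER_S / (2.0 * chirp_bandwidth_hz)


def chirp_slope_hz_per_s(chirp_bandwidth_hz: float, period_s: float) -> float:
    """Frequency ramp rate of a linear sawtooth chirp."""
    return chirp_bandwidth_hz / period_s


widest = range_resolution_m(4e9)       # 4-GHz chirp from the abstract
narrowest = range_resolution_m(250e6)  # 250-MHz chirp from the abstract
slope = chirp_slope_hz_per_s(4e9, 300e-6)
print(f"{widest * 100:.2f} cm, {narrowest * 100:.1f} cm, {slope:.3e} Hz/s")
```

Under this idealization a 4-GHz chirp gives a bound of about 3.75 cm, so the reported 5 cm sits above the theoretical limit, as expected for a measured result.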

Journal Article•DOI•

A 373-F² 0.21%-Native-BER EE SRAM Physically Unclonable Function With 2-D Power-Gated Bit Cells and $V_{\text{SS}}$ Bias-Based Dark-Bit Detection

Kunyang Liu¹, Yue Min¹, Xuan Yang¹, Hanfeng Sun¹, Hirofumi Shinohara¹•Institutions (1)

Waseda University¹

13 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: This article presents a highly stable SRAM-based physically unclonable function (PUF) using enhancement–enhancement (EE)-structure bit cells for native stability improvement, together with a dark-bit detection technique based on a lightweight integrated $V_{\text{SS}}$-bias generator that eliminated all unstable bits in the accelerated aging test.

Abstract: This article presents a highly stable SRAM-based physically unclonable function (PUF) using enhancement–enhancement (EE)-structure bit cells for native stability improvement. The PUF bit cells are 2-D power-gated and are normally in the OFF state, which largely reduces power and is beneficial to attack tolerance. In addition, a dark-bit detection technique based on a lightweight integrated $V_{\text{SS}}$-bias generator is implemented in order to screen out potentially unstable bit cells (dark bits) induced by supply voltage/temperature (VT) variations and other factors. Measured native bit error rate (BER) of prototype chips fabricated in 130-nm standard CMOS is 0.21% at 0.8 V and 23 °C, which is 14× better compared with the conventional SRAM-based PUF. After masking the detected dark bits, no bit error (3339 bits × 500 evaluations) appeared at the worst VT corner across 0.8 to 1.4 V and −40 °C to 120 °C. This technique also eliminated all unstable bits in the accelerated aging test. Both the data before and after dark-bit masking have passed all applicable NIST SP 800-22 randomness tests. The measured operational energy at 0.8 V is 128 fJ/bit and the standby power is 0.44 pW/bit, thanks to the 2-D power-gating scheme. The nMOS-only bit cell is highly compact, with a normalized bit cell area of 373 F².

Journal Article•DOI•

A 12.08-TOPS/W All-Digital Time-Domain CNN Engine Using Bi-Directional Memory Delay Lines for Energy Efficient Edge Computing

Aseem Sayal¹, S. S. Teja Nibhanupudi¹, Shirin Fathima¹, Jaydeep P. Kulkarni¹•Institutions (1)

University of Texas at Austin¹

01 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: An energy-efficient convolutional neural network (CNN) engine performs multiply-and-accumulate (MAC) operations in the time domain, employing a novel bi-directional memory delay line (MDL) unit to perform signed accumulation of input and weight products.

Abstract: In this article, we demonstrate an energy-efficient convolutional neural network (CNN) engine by performing multiply-and-accumulate (MAC) operations in the time domain. The multi-bit inputs are compactly represented as a single pulse-width-encoded input. This translates into reduced switching capacitance ($C_{\mathrm{DYN}}$), compared to baseline digital implementation, and can enable low-power neural network computing in an edge device. The time-domain CNN engine employs a novel bi-directional memory delay line (MDL) unit to perform signed accumulation of input and weight products. The proposed MDL design leverages standard digital circuits and does not require any capacitors and complex analog-to-digital converters (ADCs) to realize the convolution operation, thereby enabling easy scaling across the process technology nodes. Four speed-up modes and a configurable MDL length are supported to address throughput versus accuracy trade-off of the time-domain computing approach. Delay calibration units have been accommodated to mitigate the process-variation-induced delay mismatch among concurrently operating MDL units. The proposed time-domain MDL design implements a LeNet-5 CNN engine in a commercial 40-nm CMOS process, achieving an energy efficiency of 12.08 TOPS/W and a throughput of 0.365 GOPS at 537 mV in the 16× speed-up mode. 40-nm CMOS test-chip measurements over 100 MNIST images show 97% classification accuracy. Simulation results over the entire 10 000 MNIST validation dataset images, taking into account the circuit non-ideal effects of the MDL-based time-domain approach, show a classification accuracy of 98.42%. The test-chip is operational down to the near-threshold voltage (up to 375 mV) while maintaining the classification accuracy over 90% in the 1× speed-up mode. Furthermore, two methods of scaling MDLs to multi-bit weights are proposed. Simulation results for 1000-class AlexNet over 50 000 ImageNet validation dataset images show classification accuracy loss within 1% when compared with software implementation. The proposed MDL-based time-domain approach performing 1-bit/8-bit weight and 8-bit input MAC operations, when compared with the corresponding baseline digital implementations, shows 2.09×–2.32× higher energy efficiency and 2.22×–3.45× smaller area.

Journal Article•DOI•

High-Value Tunable Pseudo-Resistors Design

Emanuele Guglielmi¹, Fabio Toso¹, Francesco Zanetto¹, Giuseppe Sciortino¹, Alireza Mesri¹, Marco Sampietro¹, Giorgio Ferrari¹•Institutions (1)

Polytechnic University of Milan¹

25 Feb 2020-IEEE Journal of Solid-state Circuits

TL;DR: An optimized pseudo-resistor architecture, made in standard 0.35-μm CMOS technology, is presented to bias a low-noise transimpedance amplifier for high-sensitivity applications in the frequency range 100 kHz–10 MHz.

Abstract: Pseudo-resistor circuits are used to mimic large-value resistors and owe their success to the reduction of occupied area with respect to physical devices of equal value. This article presents an optimized pseudo-resistor architecture, made in standard 0.35-μm CMOS technology, to bias a low-noise transimpedance amplifier for high-sensitivity applications in the frequency range 100 kHz–10 MHz. The architecture was selected after a critical review of the different topologies to implement high-value resistances with MOSFET transistors, considering their performance in terms of linearity of response, symmetric dynamic range, frequency behavior, and simplicity of realization. The resulting circuit occupies an area of 0.017 mm² and features a tunable resistance from 20 MΩ to 20 GΩ, dynamic offset reduction due to a more than linear $I$–$V$ curve, and a high-frequency noise well below that of a physical resistor of equal value. This latter aspect highlights the larger perspective of pseudo-resistors as building blocks in very low-noise applications, in addition to the advantage in occupied area they provide.

Journal Article•DOI•

A 4-Kb 1-to-8-bit Configurable 6T SRAM-Based Computation-in-Memory Unit-Macro for CNN-Based AI Edge Processors

Yen-Cheng Chiu¹, Zhixiao Zhang¹, Jia-Jing Chen¹, Xin Si¹, Ruhui Liu¹, Yung-Ning Tu¹, Jian-Wei Su², Wei-Hsing Huang¹, Jing-Hong Wang¹, Wei-Chen Wei¹, Je-Min Hung¹, Shyh-Shyuan Sheu², Sih-Han Li², Chih-I Wu², Ren-Shuo Liu¹, Chih-Cheng Hsieh¹, Kea-Tiong Tang¹, Meng-Fan Chang¹•Institutions (2)

National Tsing Hua University¹, Industrial Technology Research Institute²

14 Jul 2020-IEEE Journal of Solid-state Circuits

TL;DR: This work presents a 1-to-8-bit configurable SRAM CIM unit-macro using a hybrid structure combining 6T-SRAM based in-memory binary product-sum operations with digital near-memory-computing multibit PS accumulation to increase read accuracy and reduce area overhead.

Abstract: Previous SRAM-based computing-in-memory (SRAM-CIM) macros suffer small read margins for high-precision operations, large cell array area overhead, and limited compatibility with many input and weight configurations. This work presents a 1-to-8-bit configurable SRAM CIM unit-macro using: 1) a hybrid structure combining 6T-SRAM based in-memory binary product-sum (PS) operations with digital near-memory-computing multibit PS accumulation to increase read accuracy and reduce area overhead; 2) column-based place-value-grouped weight mapping and a serial-bit input (SBIN) mapping scheme to facilitate reconfiguration and increase array efficiency under various input and weight configurations; 3) a self-reference multilevel reader (SRMLR) to reduce read-out energy and achieve a sensing margin 2× that of the mid-point reference scheme; and 4) an input-aware bitline voltage compensation scheme to ensure successful read operations across various input-weight patterns. A 4-Kb configurable 6T-SRAM CIM unit-macro was fabricated using a 55-nm CMOS process with foundry 6T-SRAM cells. The resulting macro achieved access times of 3.5 ns per cycle (pipeline) and energy efficiency of 0.6–40.2 TOPS/W under binary to 8-b input/8-b weight precision.

Journal Article•DOI•

A 1.7-dB Minimum NF, 22–32-GHz Low-Noise Feedback Amplifier With Multistage Noise Matching in 22-nm FD-SOI CMOS

Bolun Cui¹, John R. Long¹•Institutions (1)

University of Waterloo¹

30 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: A low-noise feedback amplifier with interstage noise matching is implemented in 22-nm fully depleted silicon-on-insulator (SOI)-CMOS technology with continuous dc power control on the fly using modulation of FET backgates.

Abstract: A low-noise feedback amplifier (LNA) with interstage noise matching is implemented in 22-nm fully depleted silicon-on-insulator (SOI)-CMOS technology. Minimum noise figure (NF) is 1.7 dB centered at 28 GHz, and NF remains below 1.98±0.25 dB across a 10-GHz range. Peak gain of the two-stage LNA is 21.5 dB at 22 GHz, and the bandwidth (BW) for $|S_{21}|$ is 19–36 GHz. Input and output return losses are better than 10 dB across an effective LNA BW of 22–32 GHz. The third-order input intercept is −13.4 dBm at peak gain when dissipating 17.3 mW. Continuous dc power control on the fly is implemented using modulation of FET backgates. When dc power consumption is reduced 5.6 mW, NF increases by less than 0.5 dB, peak gain decreases by 3.6 dB, and input return loss remains better than 10 dB with no change in effective BW.

Journal Article•DOI•

A 1.6-GS/s 12.2-mW Seven-/Eight-Way Split Time-Interleaved SAR ADC Achieving 54.2-dB SNDR With Digital Background Timing Mismatch Calibration

Mingqiang Guo¹, Jiaji Mao¹, Sai-Weng Sin¹, Hegong Wei², Rui P. Martins¹•Institutions (2)

University of Macau¹, University of Texas at Austin²

01 Mar 2020-IEEE Journal of Solid-state Circuits

TL;DR: This article presents a split time-interleaved (TI) successive-approximation register (SAR) analog-to-digital converter (ADC) with digital background timing-skew mismatch calibration, which divides a TI-SAR ADC into two split parts with the same overall sampling rate but different numbers of TI channels.

Abstract: This article presents a split time-interleaved (TI) successive-approximation register (SAR) analog-to-digital converter (ADC) with digital background timing-skew mismatch calibration. It divides a TI-SAR ADC into two split parts with the same overall sampling rate but different numbers of TI channels. Benefitting from the proposed split TI topology, the timing-skew calibration converges quickly without any extra analog circuits. The input impedance of the overall TI-ADC remains unchanged, which is essential for the preceding driving stage in a high-speed application. We designed a prototype seven-/eight-way split TI-ADC implemented in 28-nm CMOS. After a digital background timing-skew calibration, it reaches a 54.2-dB signal-to-noise-and-distortion ratio (SNDR) and 67.1-dB spurious free dynamic range (SFDR) with a near-Nyquist-rate input signal and a 2.5-GHz effective resolution bandwidth (ERBW). Furthermore, the power consumption of the ADC core (mismatch calibration off-chip) is 12.2 mW running at 1.6 GS/s, leading to a Walden figure-of-merit (FOM) of 18.2 fJ/conv.-step and a Schreier FOM of 162.4 dB.
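
Both figures of merit can be reproduced from the quoted power, sampling rate, and SNDR. The sketch below assumes the standard definitions, Walden FOM = P / (2^ENOB · fs) with ENOB = (SNDR − 1.76) / 6.02, and Schreier FOM = SNDR + 10·log10(BW / P) with BW taken as the Nyquist bandwidth fs/2. The abstract does not spell these out, so they are assumptions, and the function names are illustrative.

```python
import math


def enob_bits(sndr_db: float) -> float:
    """Effective number of bits from SNDR (standard sine-wave relation)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom_j_per_step(power_w: float, fs_hz: float, sndr_db: float) -> float:
    """Walden FOM in joules per conversion step."""
    return power_w / (2 ** enob_bits(sndr_db) * fs_hz)


def schreier_fom_db(sndr_db: float, bandwidth_hz: float, power_w: float) -> float:
    """SNDR-based Schreier FOM in dB."""
    return sndr_db + 10 * math.log10(bandwidth_hz / power_w)


# Values from the abstract: 12.2 mW, 1.6 GS/s, 54.2-dB SNDR.
walden = walden_fom_j_per_step(12.2e-3, 1.6e9, 54.2)
schreier = schreier_fom_db(54.2, 1.6e9 / 2, 12.2e-3)  # BW = fs/2 is an assumption
print(f"Walden: {walden * 1e15:.1f} fJ/conv.-step, Schreier: {schreier:.1f} dB")
```

With these assumptions the results round to 18.2 fJ/conv.-step and 162.4 dB, matching the values in the abstract.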

Journal Article•DOI•

A 13.5-ENOB, 107-μW Noise-Shaping SAR ADC With PVT-Robust Closed-Loop Dynamic Amplifier

Xiyuan Tang¹, Xiangxing Yang¹, Wenda Zhao¹, Chen-Kai Hsu¹, Jiaxin Liu², Linxiao Shen¹, Abhishek Mukherjee¹, Wei Shi¹, Shaolan Li³, David Z. Pan¹, Nan Sun¹•Institutions (3)

University of Texas at Austin¹, Tsinghua University², Georgia Institute of Technology³

09 Sep 2020-IEEE Journal of Solid-state Circuits

TL;DR: This article presents a second-order noise-shaping (NS) successive approximation register (SAR) analog-to-digital converter (ADC) with a process, voltage, and temperature (PVT)-robust closed-loop dynamic amplifier, enabling the first fully dynamic NS-SAR ADC that realizes sharp noise transfer function (NTF) while not requiring any gain calibration.

Abstract: This article presents a second-order noise-shaping (NS) successive approximation register (SAR) analog-to-digital converter (ADC) with a process, voltage, and temperature (PVT)-robust closed-loop dynamic amplifier. The proposed closed-loop dynamic amplifier combines the merits of closed-loop architecture and dynamic operation, realizing robustness, high accuracy, and high energy-efficiency simultaneously. It is embedded in the loop filter of an NS SAR design, enabling the first fully dynamic NS-SAR ADC that realizes sharp noise transfer function (NTF) while not requiring any gain calibration. Fabricated in 40-nm CMOS technology, the prototype ADC achieves an SNDR of 83.8 dB over a bandwidth of 625 kHz while consuming only 107 μW. It results in an SNDR-based Schreier figure-of-merit (FoM) of 181.5 dB.
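
The SNDR-based Schreier FoM follows directly from the three quoted numbers via FoM = SNDR + 10·log10(BW / P). A minimal sketch, with an illustrative function name:

```python
import math


def schreier_fom_db(sndr_db: float, bandwidth_hz: float, power_w: float) -> float:
    """SNDR-based Schreier figure of merit in dB."""
    return sndr_db + 10 * math.log10(bandwidth_hz / power_w)


# Values from the abstract: 83.8-dB SNDR, 625-kHz bandwidth, 107 uW.
ns_sar_fom = schreier_fom_db(83.8, 625e3, 107e-6)
print(f"Schreier FoM = {ns_sar_fom:.1f} dB")
```

This evaluates to about 181.5 dB, in agreement with the abstract.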

Journal Article•DOI•

A 243-mW 1.25–56-Gb/s Continuous Range PAM-4 42.5-dB IL ADC/DAC-Based Transceiver in 7-nm FinFET

Matteo Pisati, Alberto Minuti, Giacomino Bollati, Fabio Giunco, Roberto Giampiero Massolini, Giovanni Cesura, Fernando De Bernardinis, Paolo Pascale, Claudio Nani, N. Ghittori, Enrico Pozzati, Marco Sosio, Marco Garampazzi, Antonio Milani

01 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: This article presents a compact analog-to-digital converter (ADC)/digital-to-analog converter (DAC) digital signal processing (DSP)-based long reach (LR) transceiver in 7-nm FinFET technology that operates seamlessly from 3.5–56 Gb/s in pulse-amplitude modulation (PAM-4) and consumes only 243 mW at 56 Gb/s.

Abstract: This article presents a compact analog-to-digital converter (ADC)/digital-to-analog converter (DAC) digital signal processing (DSP)-based long reach (LR) transceiver in 7-nm FinFET technology that operates seamlessly from 3.5–56 Gb/s in pulse-amplitude modulation (PAM-4) [from 1.25 to 28 Gb/s in non-return to zero (NRZ) mode] and consumes only 243 mW at 56 Gb/s. The receiver (RX) front end consists of a two-stage continuous-time linear equalizer (CTLE), a 40-way time-interleaved (TI) successive approximation register (SAR)-ADC, a DSP equalizer containing a 17-tap feed-forward equalizer (FFE) working concurrently with a one-tap speculative decision feedback equalizer (DFE) and a reflection canceling FFE, which implements four individually roaming taps. Clock recovery is achieved on a dedicated low latency path consisting of a five-tap FFE, slicer, time error detector (TED), and loop filter driving a dedicated LC digital-controlled oscillator (DCO). The transmit section consists of a variety of pattern generators, a five-tap finite impulse response (FIR) section, and a terminated DAC as an analog transmitter. When working on a 42.5-dB-LR channel at 56 Gb/s PAM-4, the transceiver consumes 243 mW from the 0.9-V (analog) and 0.75-V (digital) supplies, corresponding to an efficiency of 4.3 pJ/b.
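
The quoted link efficiency is simply power divided by data rate. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def energy_per_bit_pj(power_w: float, data_rate_bps: float) -> float:
    """Link energy efficiency in pJ/b: power divided by bit rate."""
    return power_w / data_rate_bps * 1e12


# Values from the abstract: 243 mW at 56 Gb/s.
link_efficiency = energy_per_bit_pj(243e-3, 56e9)
print(f"{link_efficiency:.2f} pJ/b")
```

The result is about 4.34 pJ/b, which rounds to the 4.3 pJ/b stated in the abstract.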

Journal Article•DOI•

An mm-Wave Synthesizer With Robust Locking Reference-Sampling PLL and Wide-Range Injection-Locked VCO

Dongyi Liao¹, Yucai Zhang², Fa Foster Dai², Zhenqi Chen, Yanjie Wang•Institutions (2)

Qualcomm¹, Auburn University²

10 Jan 2020-IEEE Journal of Solid-state Circuits

TL;DR: Using a two-stage scheme allows separately dealing with the low phase noise (PN) frequency synthesis in the first stage and the mm-wave frequency multiplication in the second stage, achieving the best overall power efficiency.

Abstract: In this article, a two-stage millimeter (mm)-wave frequency synthesizer with low in-band noise and robust locking reference-sampling techniques is presented. Using a two-stage scheme allows separately dealing with the low phase noise (PN) frequency synthesis in the first stage and the mm-wave frequency multiplication in the second stage, achieving the best overall power efficiency. In the first stage, a voltage-domain reference-sampling phase detector (RSPD)-locked loop (RSPLL) is adopted to achieve both low PN and robust locking without an additional frequency locking loop. A reference reshaping buffer is implemented to improve the phase detector gain and in-band PN. The reference rising/falling time is programmable to achieve optimal RSPLL performance even under external disturbances. The second stage employs an injection-locked voltage-controlled oscillator (ILVCO) for 4× frequency multiplication. A low-power digital frequency tracking loop (FTL) detecting actual frequency errors is implemented in order to achieve a wide operation range for the ILVCO while using a high-$Q$ tank with low power. The prototype synthesizer was fabricated in a 45-nm partially depleted silicon on insulator (PDSOI) CMOS technology. The first-stage 9-GHz RSPLL achieves 144-fs integrated jitter with 7.2-mW power consumption, achieving a figure of merit (FoM) of −248 dB, and the overall mm-wave synthesizer achieves 251-fs integrated jitter with 20.6-mW power consumption at 35.84 GHz, achieving an FoM of −238.9 dB.
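
The two FoM values are consistent with the commonly used jitter-power figure of merit for PLLs, FoM = 10·log10[(σ_t / 1 s)² · (P / 1 mW)]. The sketch below assumes that definition (the abstract does not state it); the function name is illustrative.

```python
import math


def pll_jitter_fom_db(rms_jitter_s: float, power_w: float) -> float:
    """Jitter-power FoM (assumed definition): 10*log10(sigma^2 * P / 1 mW)."""
    return 20 * math.log10(rms_jitter_s) + 10 * math.log10(power_w / 1e-3)


# Values from the abstract.
first_stage = pll_jitter_fom_db(144e-15, 7.2e-3)   # 9-GHz RSPLL
overall = pll_jitter_fom_db(251e-15, 20.6e-3)      # synthesizer at 35.84 GHz
print(f"first stage: {first_stage:.1f} dB, overall: {overall:.1f} dB")
```

Under this assumption the first stage evaluates to about −248.3 dB and the overall synthesizer to about −238.9 dB, in line with the quoted −248 dB and −238.9 dB.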
