Showing papers in "IEEE Journal of Solid-state Circuits in 2021"

Journal Article•DOI•

A 7-nm Compute-in-Memory SRAM Macro Supporting Multi-Bit Input, Weight and Output and Achieving 351 TOPS/W and 372.4 GOPS

Mahmut E. Sinangil¹, Burak Erbagci¹, Rawan Naous¹, Kerem Akarvardar¹, Dar Sun¹, Win-San Khwa¹, Hung-jen Liao¹, Yih Wang¹, Jonathan Chang¹•Institutions (1)

TSMC¹

01 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: This work presents a compute-in-memory (CIM) macro built around a standard two-port compiler macro using foundry 8T bit-cell in 7-nm FinFET technology and achieves energy efficiency of 351 TOPS/W and throughput of 372.4 GOPS.

Abstract: In this work, we present a compute-in-memory (CIM) macro built around a standard two-port compiler macro using a foundry 8T bit-cell in 7-nm FinFET technology. The proposed design supports 1024 4-b $\times$ 4-b multiply-and-accumulate (MAC) computations simultaneously. The 4-bit input is represented by the number of read word-line (RWL) pulses, while the 4-bit weight is realized by charge sharing among binary-weighted computation caps. Each unit of computation cap is formed by the inherent cap of the sense amplifier (SA) inside the 4-bit Flash ADC, which saves area and minimizes kick-back effect. Access time is 5.5 ns with a 0.8-V power supply at room temperature. The proposed design achieves energy efficiency of 351 TOPS/W and throughput of 372.4 GOPS. Implications of our design from neural network implementation and accuracy perspectives are also discussed.
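
As a rough consistency check on the figures quoted in this abstract, the sketch below recomputes the throughput from the 1024 parallel MACs and the 5.5-ns access time, then derives the power implied by the quoted efficiency. It assumes that each MAC counts as two operations (one multiply, one add) and that one access completes one batch of MACs; neither convention is stated in the abstract, and the function names are illustrative only.

```python
# Back-of-envelope check; the counting conventions are assumptions,
# not taken from the paper.
PARALLEL_MACS = 1024            # from the abstract
OPS_PER_MAC = 2                 # assumption: multiply + accumulate = 2 ops
ACCESS_TIME_S = 5.5e-9          # from the abstract
EFFICIENCY_OPS_PER_W = 351e12   # 351 TOPS/W, from the abstract


def throughput_ops_per_s(macs, ops_per_mac, access_time_s):
    """Operations per second if one batch of MACs finishes per access."""
    return macs * ops_per_mac / access_time_s


def implied_power_w(throughput, efficiency):
    """Power implied by dividing throughput by energy efficiency."""
    return throughput / efficiency


throughput = throughput_ops_per_s(PARALLEL_MACS, OPS_PER_MAC, ACCESS_TIME_S)
power = implied_power_w(throughput, EFFICIENCY_OPS_PER_W)

print(f"throughput: {throughput / 1e9:.1f} GOPS")
print(f"implied power: {power * 1e3:.2f} mW")
```

Under these assumptions the computed throughput agrees with the quoted 372.4 GOPS, and the implied macro power is on the order of 1 mW.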

73 citations

Journal Article•DOI•

Colonnade: A Reconfigurable SRAM-Based Digital Bit-Serial Compute-In-Memory Macro for Processing Neural Networks

Hyunjoon Kim¹, Taegeun Yoo², Tony Tae-Hyoung Kim¹, Bongjin Kim³•Institutions (3)

Nanyang Technological University¹, Samsung², University of California, Santa Barbara³

09 Mar 2021-IEEE Journal of Solid-state Circuits

TL;DR: Based on the benefits of digital CIM, reconfigurability, and bit-serial computing architecture, the Colonnade can achieve both high performance and energy efficiency for processing neural networks.

Abstract: This article (Colonnade) presents a fully digital bit-serial compute-in-memory (CIM) macro. The digital CIM macro is designed for processing neural networks with reconfigurable 1–16 bit input and weight precisions based on bit-serial computing architecture and a novel all-digital bitcell structure. A column of bitcells forms a column MAC and is used for computing a multiply-and-accumulate (MAC) operation. The column MACs placed in a row work as a single neuron and compute a dot-product, which is an essential building block of neural network accelerators. Several key features differentiate the proposed Colonnade architecture from the existing analog and digital implementations. First, its full-digital circuit implementation is free from process variation, noise susceptibility, and data-conversion overhead that are prevalent in prior analog CIM macros. A bitwise MAC operation in a bitcell is performed in the digital domain using a custom-designed XNOR gate and a full-adder. Second, the proposed CIM macro is fully reconfigurable in both weight and input precision from 1 to 16 bit. So far, most of the analog macros were used for processing quantized neural networks with very low input/weight precisions, mainly due to a memory density issue. Recent digital accelerators have implemented reconfigurable precisions, but they are inferior in energy efficiency due to significant off-chip memory access. We present a regular digital bitcell array that is readily reconfigured to a 1–16 bit weight-stationary bit-serial CIM macro. The macro computes parallel dot-product operations between the weights stored in memory and inputs that are serialized from LSB to MSB. Finally, the bit-serial computing scheme significantly reduces the area overhead while sacrificing latency due to bit-by-bit operation cycles.
Based on the benefits of digital CIM, reconfigurability, and bit-serial computing architecture, the Colonnade can achieve both high performance and energy efficiency (i.e., both benefits of prior analog and digital accelerators) for processing neural networks. A test-chip with $128 \times 128$ SRAM-based bitcells for digital bit-serial computing is implemented using 65-nm technology and tested with 1–16 bit weight/input precisions. The measured energy efficiency is 117.3 TOPS/W at 1 bit and 2.06 TOPS/W at 16 bit.
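
The bit-serial, weight-stationary scheme described above can be illustrated with a minimal software sketch. It is not the Colonnade circuit: it assumes unsigned integers, keeps each weight as a whole number rather than spreading it across bitcells, and uses ordinary multiplication by a 0/1 input bit in place of the custom XNOR gate and full-adder. The function name `bit_serial_dot` is hypothetical.

```python
def bit_serial_dot(weights, inputs, input_bits):
    """Dot product of unsigned ints, feeding inputs one bit per cycle, LSB first."""
    acc = 0
    for bit in range(input_bits):            # one cycle per input bit
        # Bit slice of every input for this cycle (each entry is 0 or 1).
        bit_slice = [(x >> bit) & 1 for x in inputs]
        # Each stored weight is gated by its input bit; the row is summed.
        partial = sum(w * b for w, b in zip(weights, bit_slice))
        # Scale the partial sum by the significance of this input bit.
        acc += partial << bit
    return acc


weights = [3, 5, 7]   # stay fixed in memory (weight-stationary)
inputs = [2, 4, 6]    # serialized bit by bit
result = bit_serial_dot(weights, inputs, input_bits=3)
print(result)  # equals 3*2 + 5*4 + 7*6
```

The loop runs once per input bit, which reflects the latency cost the abstract mentions: higher input precision means more cycles for the same dot product.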

58 citations

Journal Article•DOI•

STATICA: A 512-Spin 0.25M-Weight Annealing Processor With an All-Spin-Updates-at-Once Architecture for Combinatorial Optimization With Complete Spin–Spin Interactions

Kasho Yamamoto¹, Kazushi Kawamura², Kota Ando², Normann Mertig¹, Takashi Takemoto¹, Masanao Yamaoka¹, Hiroshi Teramoto³, Akira Sakai³, Shinya Takamaeda-Yamazaki⁴, Masato Motomura²•Institutions (4)

Hitachi¹, Tokyo Institute of Technology², Hokkaido University³, University of Tokyo⁴

01 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: STochAsTIc Cellular automata Annealer (STATICA) is a high-performance annealing processor for solving combinatorial optimization problems represented by fully connected graphs; it updates multiple states of fully connected spins simultaneously by introducing different dynamics called stochastic cellular automata annealing.

Abstract: This article presents a high-performance annealing processor named STochAsTIc Cellular automata Annealer (STATICA) for solving combinatorial optimization problems represented by fully connected graphs. Supporting fully connected graphs is strongly required for dealing with realistic optimization problems. Unlike previous annealing processors that follow Glauber dynamics, our proposed annealer can update multiple states of fully connected spins simultaneously by introducing different dynamics called stochastic cellular automata annealing. It allows us to utilize the pipeline-level and memory-bank-level parallelization in addition to the PE-level parallelization originally adopted in the previous annealers. The STATICA prototype chip, which supports a 512-spin fully connected graph, has been fabricated in 65-nm CMOS technology and realized as a 3 mm $\times$ 4 mm chip. Using the fabricated 512-spin chip and numerical projections for a 2048-spin chip, we have conducted experiments to reveal the annealing performance of STATICA and examined how to control its annealing process efficiently.

51 citations

Journal Article•DOI•

A 3-D-Integrated Silicon Photonic Microring-Based 112-Gb/s PAM-4 Transmitter With Nonlinear Equalization and Thermal Control

Hao Li¹, Ganesh Balamurugan¹, Taehwan Kim¹, Meer Sakib¹, Ranjeet Kumar¹, Haisheng Rong¹, James E. Jaussi¹, Bryan K. Casper¹•Institutions (1)

Intel¹

01 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: A 3-D-integrated 112-Gb/s pulse amplitude modulation (PAM)-4 optical transmitter (OTX) using silicon photonic MRM, on-chip laser, and co-packaged 28-nm CMOS driver to address static and dynamic MRM nonlinearities is presented.

Abstract: Microring modulators (MRMs) with CMOS electronics enable compact low power transmitter solutions for 400G Ethernet and co-packaged optical transceivers. In this article, we present a 3-D-integrated 112-Gb/s pulse amplitude modulation (PAM)-4 optical transmitter (OTX) using silicon photonic MRM, on-chip laser, and co-packaged 28-nm CMOS driver. The 3-$V_{\mathrm{pp}}$ driver includes a lookup table (LUT)-based PAM-4 nonlinear equalizer to address static and dynamic MRM nonlinearities. An integrated thermal control method that is insensitive to input power fluctuations is proposed to compensate for the temperature sensitivity of MRMs. PAM-4 measurement results of our OTX at 112 Gb/s show that transmitter dispersion eye closure quaternary (TDECQ) < 1.5 dB is achieved from 28 °C to 55 °C with 7.4-pJ/bit energy efficiency including on-chip laser.

50 citations

Journal Article•DOI•

A 112-Gb/s PAM-4 Long-Reach Wireline Transceiver Using a 36-Way Time-Interleaved SAR ADC and Inverter-Based RX Analog Front-End in 7-nm FinFET

Jay Im¹, Kevin Zheng¹, Chuen-huei Adam Chou¹, Lei Zhou¹, Jae Wook Kim¹, Stanley Chen¹, Yipeng Wang¹, H.-W. Hung¹, Kee Hian Tan¹, Winson Lin¹, Arianne Roldan¹, Declan Carey¹, Ilias Chlis¹, Ronan Casey¹, Adebabay M. Bekele¹, Ying Cao¹, David Mahashin¹, H. Ahn¹, Hongtao Zhang¹, Yohan Frans¹, Kun-Yung Ken Chang¹•Institutions (1)

Xilinx¹

01 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: A 36-way time-interleaved 56-GS/s 7-bit ADC is designed to realize a 112-Gb/s pulse-amplitude modulation (PAM-4) transceiver in 7-nm FinFET CMOS; the transceiver achieves a PRBS-31 PAM-4 bit error rate below 1E-8 over a channel with 37.5-dB loss at 28 GHz while dissipating 602 mW per channel, excluding DSP.

Abstract: A 36-way time-interleaved 56-GS/s 7-bit ADC is designed to realize a 112-Gb/s pulse-amplitude modulation (PAM-4) transceiver in 7-nm FinFET CMOS. The receiver analog front-end stages and the ADC track-and-hold (T/H) buffers are implemented using inverter-based Gm/inverse-Gm-load cells. A distributed inductor peaking network and multi-phase clock calibration are implemented in the quarter-rate transmitter. The transceiver achieves <1E-8 pseudorandom binary sequence (PRBS)-31 PAM-4 bit error rate (BER) over a channel with 37.5-dB loss at 28 GHz while dissipating 602 mW per channel, excluding DSP.
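
The rate figures in this abstract fit together arithmetically, as the short sketch below shows. It assumes ideal time interleaving, with the aggregate sample rate split evenly across the 36 sub-ADCs; the variable names are illustrative.

```python
# Figures from the abstract.
LINE_RATE_BPS = 112e9     # 112 Gb/s
BITS_PER_SYMBOL = 2       # PAM-4 carries 2 bits per symbol
ADC_RATE_SPS = 56e9       # 56 GS/s aggregate sample rate
INTERLEAVE_WAYS = 36

symbol_rate = LINE_RATE_BPS / BITS_PER_SYMBOL        # baud
samples_per_symbol = ADC_RATE_SPS / symbol_rate      # baud-rate sampling if 1
nyquist_hz = symbol_rate / 2                         # half the symbol rate
# Assumption: each sub-ADC takes an equal share of the aggregate rate.
sub_adc_rate = ADC_RATE_SPS / INTERLEAVE_WAYS

print(f"symbol rate: {symbol_rate / 1e9:.0f} GBaud")
print(f"samples per symbol: {samples_per_symbol:.0f}")
print(f"Nyquist frequency: {nyquist_hz / 1e9:.0f} GHz")
print(f"per-slice rate: {sub_adc_rate / 1e9:.2f} GS/s")
```

The Nyquist frequency of 28 GHz is the frequency at which the abstract quotes the 37.5-dB channel loss, and each sub-ADC only needs to run at roughly 1.56 GS/s under the even-split assumption.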

50 citations

Journal Article•DOI•

MANA: A Monolithic Adiabatic iNtegration Architecture Microprocessor Using 1.4-zJ/op Unshunted Superconductor Josephson Junction Devices

Christopher L. Ayala¹, Tomoyuki Tanaka¹, Ro Saito¹, Mai Nozoe¹, Naoki Takeuchi¹, Nobuyuki Yoshikawa¹•Institutions (1)

Yokohama National University¹

01 Apr 2021-IEEE Journal of Solid-state Circuits

TL;DR: In this paper, the first successful demonstration of an adiabatic microprocessor based on unshunted Josephson junction (JJ) devices manufactured using a Nb/AlOx/Nb superconductor IC fabrication process was conducted.

Abstract: We conducted the first successful demonstration of an adiabatic microprocessor based on unshunted Josephson junction (JJ) devices manufactured using a Nb/AlOx/Nb superconductor IC fabrication process. It is a hybrid of RISC and dataflow architectures operating on 4-b data words. We demonstrate register file R/W access, ALU execution, hardware stalling, and program branching performed at 100 kHz under the cryogenic temperature of 4.2 K. We also successfully demonstrated a high-speed breakout chip of the microprocessor execution units up to 2.5 GHz. We use a logic primitive called the adiabatic quantum-flux-parametron (AQFP), which has a switching energy of 1.4 zJ per JJ when driven by a four-phase 5-GHz sinusoidal ac-clock at 4.2 K. These demonstrations show that AQFP logic is capable of both processing and memory operations and that we have a path toward practical adiabatic computing operating at high-clock rates while dissipating very little energy.

48 citations

Journal Article•DOI•

A Local Computing Cell and 6T SRAM-Based Computing-in-Memory Macro With 8-b MAC Operation for Edge AI Chips

Xin Si¹, Yung-Ning Tu², Wei-Hsing Huang², Jian-Wei Su², Pei-Jung Lu², Jing-Hong Wang², Ta-Wei Liu², Ssu-Yen Wu², Ruhui Liu², Yen-Chi Chou², Yen-Lin Chung², William Shih², Chung-Chuan Lo², Ren-Shuo Liu², Chih-Cheng Hsieh², Kea-Tiong Tang², Nan-Chun Lien, Wei-Chiang Shih, Yajuan He³, Qiang Li³, Meng-Fan Chang²•Institutions (3)

Southeast University¹, National Tsing Hua University², University of Electronic Science and Technology of China³

27 Apr 2021-IEEE Journal of Solid-state Circuits

TL;DR: A 6T SRAM-based CIM (SRAM-CIM) macro capable of weight-bitwise MAC (WbwMAC) operations to expand the sensing margin and improve the readout accuracy for high-precision MAC operations is presented.

Abstract: This article presents a computing-in-memory (CIM) structure aimed at improving the energy efficiency of edge devices running multi-bit multiply-and-accumulate (MAC) operations. The proposed scheme includes a 6T SRAM-based CIM (SRAM-CIM) macro capable of: 1) weight-bitwise MAC (WbwMAC) operations to expand the sensing margin and improve the readout accuracy for high-precision MAC operations; 2) a compact 6T local computing cell to perform multiplication with suppressed sensitivity to process variation; 3) an algorithm-adaptive low MAC-aware readout scheme to improve energy efficiency; 4) a bitline header selection scheme to enlarge signal margin; and 5) a small-offset margin-enhanced sense amplifier for robust read operations against process variation. A fabricated 28-nm 64-kb SRAM-CIM macro achieved access times of 4.1–8.4 ns with energy efficiency of 11.5–68.4 TOPS/W, while performing MAC operations with 4- or 8-b input and weight precision.

48 citations

Journal Article•DOI•

A 220-to-320-GHz FMCW Radar in 65-nm CMOS Using a Frequency-Comb Architecture

Xiang Yi¹, Cheng Wang¹, Xibi Chen¹, Jinchen Wang¹, Jesus Grajal¹, Ruonan Han¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Feb 2021-IEEE Journal of Solid-state Circuits

TL;DR: This work marks the first CMOS demonstration of THz radar and achieves record bandwidth and ranging resolution among all radar front-end chips.

Abstract: This article presents a CMOS-based, ultra-broadband frequency-modulated continuous-wave (FMCW) radar using a terahertz (THz) frequency-comb architecture. The high-parallelism spectral sensing provided by this architecture significantly reduces the bandwidth requirement for the THz front-end circuitry and ensures that the peak output power and sensitivity are maintained across the entire band of operation. The speed and linearity of frequency chirping are also improved by the comb system. An antenna-sharing scheme based on a square-mixer-first architecture is used, which not only leads to compact size but also facilitates the stitching of the multichannel radar IF data. To avoid the usage of a high-cost silicon lens in the on-chip broadband radiation, a multi-resonance substrate-integrated-waveguide (SIW) antenna structure is introduced, which provides 15% fractional bandwidth for impedance matching. As a proof of concept, a five-tone radar prototype that seamlessly scans the entire 220-to-320-GHz band is demonstrated. In the measurement, the multi-channel-aggregated equivalent-isotropically radiated power (EIRP) is 0.6 dBm and is further boosted to ~20 dBm with a TPX (polymethylpentene) lens. The measured minimum single-sideband noise figure (SSB NF) of the receiver, including the antenna loss and baseband amplifier, is 22.8 dB. Due to the comb architecture, the EIRP and NF values fluctuate by only 8.8 and 14.6 dB, respectively, across the 100-GHz bandwidth. The chip has a die size of 5 mm$^{2}$ and consumes 840 mW of dc power. This work marks the first CMOS demonstration of THz radar and achieves record bandwidth and ranging resolution among all radar front-end chips.

48 citations

Journal Article•DOI•

Vega: A Ten-Core SoC for IoT Endnodes With DNN Acceleration and Cognitive Wake-Up From MRAM-Based State-Retentive Sleep Mode

Davide Rossi¹, Francesco Conti¹, Manuel Eggimann², Alfio Di Mauro², Giuseppe Tagliavini¹, Stefan Mach², Marco Guermandi¹, Antonio Pullini, Igor Loi, Jie Chen¹, Eric Flamand, Luca Benini¹•Institutions (2)

University of Bologna¹, ETH Zurich²

06 Oct 2021-IEEE Journal of Solid-state Circuits

TL;DR: Vega is an IoT endnode system on chip (SoC) capable of scaling from a 1.7-μW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance.

Abstract: The Internet-of-Things (IoT) requires endnodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT endnode system on chip (SoC) capable of scaling from a 1.7-μW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance on NSAAs, including mobile deep neural network (DNN) inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile magnetoresistive random access memory (MRAM). To meet the performance and flexibility requirements of NSAAs, the SoC features ten RISC-V cores: one core for SoC and IO management and a nine-core cluster supporting multi-precision single instruction multiple data (SIMD) integer and floating-point (FP) computation. Vega achieves the state-of-the-art (SoA)-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On FP computation, it achieves the SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine learning (ML) accelerators boost energy efficiency in cognitive sleep and active states.

46 citations

Journal Article•DOI•

A Probabilistic Compute Fabric Based on Coupled Ring Oscillators for Solving Combinatorial Optimization Problems

Ibrahim Ahmed¹, Po-Wei Chiu¹, William Moy¹, Chris H. Kim¹•Institutions (1)

University of Minnesota¹

12 Mar 2021-IEEE Journal of Solid-state Circuits

TL;DR: Experimental results show that ROSCs are a potential candidate for a dedicated hardware accelerator aiming to solve a wide range of COPs and that the integrated CMOS-based Ising computer can find the solution to NP-hard problems with an accuracy of 82%–100%.

Abstract: Nondeterministic polynomial time hard (NP-hard) combinatorial optimization problems (COPs) are intractable to solve using a traditional computer as the time to find a solution increases very rapidly with the number of variables. An efficient alternative computing method uses coupled spin networks to solve COPs. This work presents a first-of-its-kind coupled ring oscillator (ROSC)-based scalable probabilistic Ising computer to solve NP-hard COPs. An integrated coupled oscillator network was designed with 560 ROSCs that mimic a coupled spin network. Each ROSC can be coupled to any of its neighbors using a programmable back-to-back (B2B) inverter-based coupling mechanism. The ROSC-based spins and B2B inverter-based coupling were optimized to work under a wide range of system noise as well as voltage and temperature variations. One thousand randomly generated max-cut problems were mapped and solved in the hardware. The integrated Ising computer produced satisfactory solutions of max-cut problems when compared with commercial software running on a CPU. Experiments show that the integrated CMOS-based Ising computer can find the solution to NP-hard problems with an accuracy of 82%–100%. In addition, the repeated measurements of the same problem showed that the Ising computer can traverse through several local minima to find high-quality solutions under various voltage and temperature variation conditions. The experimental results show that ROSCs are a potential candidate for a dedicated hardware accelerator aiming to solve a wide range of COPs.
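
To make the link between max-cut and a spin network concrete, the sketch below scores a spin assignment both as a cut and as an Ising energy. It assumes the sign convention H = Σ w_ij·s_i·s_j with spins of ±1, which is one common choice and is not taken from the paper. It does not model oscillator phases or coupling circuits, and the exhaustive solver is only a small-graph reference.

```python
from itertools import product


def cut_value(edges, spins):
    """Total weight of edges whose endpoints carry opposite spins."""
    return sum(w for i, j, w in edges if spins[i] != spins[j])


def ising_energy(edges, spins):
    """H = sum over edges of w_ij * s_i * s_j (assumed sign convention)."""
    return sum(w * spins[i] * spins[j] for i, j, w in edges)


def brute_force_max_cut(n, edges):
    """Exhaustive reference solver; only practical for small n."""
    best = max(product((-1, 1), repeat=n), key=lambda s: cut_value(edges, s))
    return cut_value(edges, best), best


# Hypothetical 4-node ring with unit weights.
ring = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
spins = (1, -1, 1, -1)
total_weight = sum(w for _, _, w in ring)

cut = cut_value(ring, spins)
energy = ising_energy(ring, spins)
best_cut, _ = brute_force_max_cut(4, ring)
print(cut, energy, best_cut)
```

With this convention the cut equals (total weight − H) / 2, so minimizing the energy maximizes the cut. That equivalence is what lets a network settling toward low energy act as a max-cut solver.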

44 citations

Journal Article•DOI•

Evolver: A Deep Learning Processor With On-Device Quantization–Voltage–Frequency Tuning

Fengbin Tu¹, Weiwei Wu¹, Yang Wang¹, Hongjiang Chen¹, Feng Xiong¹, Man Shi¹, Ning Li¹, Jinyi Deng¹, Tianbao Chen, Leibo Liu¹, Shaojun Wei¹, Yuan Xie², Shouyi Yin¹•Institutions (2)

Tsinghua University¹, University of California, Santa Barbara²

01 Feb 2021-IEEE Journal of Solid-state Circuits

TL;DR: Evolver is the first deep learning processor that utilizes on-device QVF tuning to achieve both customized and optimal DNN deployment, and introduces bidirectional speculation and runtime reconfiguration techniques into the architecture.

Abstract: When deploying deep neural networks (DNNs) onto deep learning processors, we usually exploit mixed-precision quantization and voltage–frequency scaling to make tradeoffs among accuracy, latency, and energy. Conventional methods usually determine the quantization–voltage–frequency (QVF) policy before DNNs are deployed onto local devices. However, it is difficult for such methods to make optimal customizations for local user scenarios. In this article, we solve the problem by enabling on-device QVF tuning with a new deep learning processor architecture, Evolver. Evolver has a QVF tuning mode to deploy DNNs with local customizations before normal execution. In this mode, Evolver uses reinforcement learning to search for the optimal QVF policy based on direct hardware feedback from the chip itself. After that, Evolver runs the newly quantized DNN inference under the searched voltage and frequency. To improve the performance and energy efficiency of both training and inference, we introduce bidirectional speculation and runtime reconfiguration techniques into the architecture. To the best of our knowledge, Evolver is the first deep learning processor that utilizes on-device QVF tuning to achieve both customized and optimal DNN deployment.

Journal Article•DOI•

Direct TOF Scanning LiDAR Sensor With Two-Step Multievent Histogramming TDC and Embedded Interference Filter

Hyeongseok Seo¹, Heesun Yoon, Dong-Kyu Kim, Jungwoo Kim², Seong-Jin Kim³, Jung-Hoon Chun¹, Jaehyuk Choi¹•Institutions (3)

Sungkyunkwan University¹, Samsung², Ulsan National Institute of Science and Technology³

14 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: In this paper, a 36-channel scanning light detection and ranging (LiDAR) sensor with an on-chip single-photon avalanche diode array is presented, which has an area-efficient 11-bit in situ histogramming time-to-digital converter with a $3000 \times 78\,\mu\text{m}^{2}$ per channel area based on a mixed-signal accumulator.

Abstract: This article presents a 36-channel scanning light detection and ranging (LiDAR) sensor with an on-chip single-photon avalanche diode array. The sensor has an area-efficient 11-bit in situ histogramming time-to-digital converter with a $3000 \times 78\,\mu\text{m}^{2}$ per channel area based on a mixed-signal accumulator, though it is incorporated with histogramming and filtering capabilities. Furthermore, owing to its embedded interference (IF) filter, the sensor can perform reliable direct time-of-flight measurements even with IF from 32 different LiDAR sensors. The LiDAR system also has a beam scanner that comprises dual laser diodes for IF elimination and a hybrid mirror such that high-resolution images with a resolution of $2200 \times 36$ can be acquired with a wide field-of-view of $120^{\circ} \times 8^{\circ}$.
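
The general idea behind histogramming in direct time-of-flight ranging can be sketched in a few lines: accumulate TDC codes over many laser shots, take the most populated bin as the echo, and convert its time to distance with d = c·t/2. The sketch builds a plain full histogram and does not reproduce the two-step multievent scheme or the interference filter of the paper. The TDC bin width and the sample codes are hypothetical, chosen only for illustration.

```python
from collections import Counter

SPEED_OF_LIGHT_M_S = 299_792_458.0
TDC_BITS = 11          # from the abstract
TDC_LSB_S = 100e-12    # hypothetical bin width, not given in the abstract


def peak_bin(tdc_codes):
    """Return the TDC code that occurs most often across laser shots."""
    histogram = Counter(tdc_codes)
    code, _count = histogram.most_common(1)[0]
    return code


def code_to_distance_m(code, lsb_s=TDC_LSB_S):
    """Round-trip time to one-way distance: d = c * t / 2."""
    return SPEED_OF_LIGHT_M_S * code * lsb_s / 2


# Hypothetical codes: repeated echoes at code 400 plus uncorrelated events.
codes = [400, 37, 400, 1711, 400, 401, 400, 902]
echo_code = peak_bin(codes)
distance_m = code_to_distance_m(echo_code)
print(echo_code, round(distance_m, 3))
```

Uncorrelated events from ambient light or other sensors spread across bins, while true echoes pile up in one bin, which is why the peak is a usable estimate.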

Journal Article•DOI•

IntAct: A 96-Core Processor With Six Chiplets 3D-Stacked on an Active Interposer With Distributed Interconnects and Integrated Power Management

Pascal Vivet¹, Eric Guthmuller¹, Yvain Thonnart¹, Gael Pillonnet¹, Cesar Fuguet¹, Ivan Miro-Panades¹, Guillaume Moritz¹, J. Durupt¹, Christian Bernard, Didier Varreau, Julian Pontes, Sebastien Thuries¹, David Coriat¹, Michel Harrand¹, Denis Dutoit¹, Didier Lattard¹, Lucile Arnaud¹, Jean Charbonnier¹, P. Coudrain¹, Arnaud Garnier¹, Frédéric Berger¹, Alain Gueugnot¹, Alain Greiner², Quentin L. Meunier², Alexis Farcy³, Alexandre Arriordaz⁴, Severine Cheramy¹, Fabien Clermidy¹•Institutions (4)

University of Grenoble¹, Sorbonne², STMicroelectronics³, Siemens⁴

01 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: The IntAct circuit prototype integrates six chiplets in 28-nm FDSOI technology, 3D-stacked onto an active interposer in a 65-nm process, offering a total of 96 computing cores.

Abstract: In the context of high-performance computing, the integration of more computing capabilities with generic cores or dedicated accelerators for artificial intelligence (AI) application is raising more and more challenges. Due to the increasing costs of advanced nodes and the difficulties of shrinking analog and circuit input output signals (IOs), alternative architecture solutions to single die are becoming mainstream. Chiplet-based systems using 3D technologies enable modular and scalable architecture and technology partitioning. Nevertheless, there are still limitations due to chiplet integration on passive interposers, whether silicon or organic. In this article, we present the first CMOS active interposer, integrating: 1) power management without any external components; 2) distributed interconnects enabling any chiplet-to-chiplet communication; and 3) system infrastructure, design-for-test, and circuit IOs. The IntAct circuit prototype integrates six chiplets in FDSOI 28-nm technology, which are 3D-stacked onto this active interposer in 65-nm process, offering a total of 96 computing cores. Full scalability of the computing system is achieved using an innovative scalable cache-coherent memory hierarchy, enabled by distributed network-on-chips, with 3-Tbit/s/mm$^{2}$ high bandwidth 3D-plug interfaces using 20-$\mu\text{m}$ pitch micro-bumps, 0.6-ns/mm low latency asynchronous interconnects, while the six chiplets are locally power-supplied with 156-mW/mm$^{2}$ at 82%-peak-efficiency dc–dc converters through the active interposer. Thermal dissipation is studied, showing the feasibility of such an approach.

Journal Article•DOI•

A CMOS Dual-Polarized Phased-Array Beamformer Utilizing Cross-Polarization Leakage Cancellation for 5G MIMO Systems

Jian Pang¹, Zheng Li¹, Xueting Luo¹, Joshua Alvin¹, Rattanan Saengchan¹, Ashbir Aviat Fadila¹, Kiyoshi Yanagisawa¹, Yi Zhang¹, Zixin Chen¹, Zhongliang Huang¹, Xiaofan Gu¹, Rui Wu¹, Yun Wang¹, Dongwon You¹, Bangan Liu¹, Zheng Sun¹, Yuncheng Zhang¹, Hongye Huang¹, Naoki Oshima², Keiichi Motoi², Shinichi Hori², Kazuaki Kunihiro², Tomoya Kaneko², Atsushi Shirane¹, Kenichi Okada¹•Institutions (2)

Tokyo Institute of Technology¹, NEC²

06 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: In this paper, a power-efficient and low-cost CMOS 28-GHz phased-array beamformer supporting 5G dual-polarized MIMO (DP-MIMO) operation is introduced.

Abstract: This article introduces a power-efficient and low-cost CMOS 28-GHz phased-array beamformer supporting fifth-generation (5G) dual-polarized multiple-in-multiple-out (MIMO) (DP-MIMO) operation. To improve the cross-polarization (cross-pol.) isolation degraded by the antennas and propagation, a power-efficient analog-assisted cross-pol. leakage cancellation technique is implemented. After the high-accuracy cancellation, more than 41.3-dB cross-pol. isolation is maintained along with the transmitter array to the receiver array. The element-beamformer in this work adopts the compact neutralized bi-directional architecture featuring a minimized manufacturing cost. The proposed beamformer achieves 22% per path TX-mode efficiency and a 4.9-dB RX-mode noise figure. The required on-chip area for the beamformer is only 0.48 mm$^{2}$. In over-the-air measurement, a 64-element dual-polarized phased-array module achieves 52.2-dBm saturated effective isotropic radiated power (EIRP). The 5G standard-compliant OFDMA-mode modulated signals of up to 256-QAM could be supported by the 64-element modules. With the help of the cross-pol. leakage cancellation technique, the proposed array module realizes improved DP-MIMO EVMs even under severe polarization coupling and rotation conditions. The measured DP-MIMO EVMs are 3.4% in both 64-QAM and 256-QAM. The consumed power per beamformer path is 186 mW in the TX mode and 88 mW in the RX mode.

Journal Article•DOI•

A 28-nm-CMOS Based 145-GHz FMCW Radar: System, Circuits, and Characterization

Akshay Visweswaran¹, Kristof Vaesen¹, Miguel Glassee¹, Anirudh Kankuppe¹, Siddhartha Sinha¹, Claude Desset¹, Thomas Gielen¹, Andre Bourdoux¹, Piet Wambacq¹•Institutions (1)

Katholieke Universiteit Leuven¹

06 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: Extensive characterization results showcase state-of-the-art performance of the TRXs, while the code-domain multiple-input and multiple-output (MIMO) radars built with them demonstrate vital-sign and gesture detections.

Abstract: This article presents frequency-modulated-continuous-wave (FMCW) radars developed for the detection of vital signs and gestures using two generations of 145-GHz transceivers (TRXs) integrated in 28-nm bulk CMOS. The performance and limitations of high-frequency radars are quantified with a system-level study, and the design and performance of individual circuit blocks are presented in detail. A 145-GHz center frequency and radar operation over an RF bandwidth of 10 GHz yield a displacement responsivity of $2\pi$ rad/mm and a windowed range resolution of 30 mm, respectively. Radar operation over a 0.1–7 m range is enabled by an effective-isotropic radiated power of 11.5 dBm and a noise figure of 8 dB. The ICs feature frequency multiplication by 9 in the transmit and receive paths, sub-arrayed dipole antennas, and neutralization of TX–RX leakage via delay control. A single TRX dissipates 500 mW from a 0.9-/1.8-V drive. The use of fast chirps (5–30 $\mu\text{s}$) mitigates the effect of $1/f$ noise at the intermediate frequency (IF). Extensive characterization results showcase state-of-the-art performance of the TRXs, while the code-domain multiple-input and multiple-output (MIMO) radars ($1 \times 4$ and $4 \times 4$) built with them demonstrate vital-sign and gesture detections.
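
The two system-level figures in this abstract can be related to textbook FMCW expressions, sketched below: the unwindowed range resolution c/(2B) and the round-trip phase sensitivity 4π/λ. These are standard radar relations, not formulas quoted from the paper. Attributing the gap between the computed and quoted resolution to windowing is an assumption based on the abstract's word "windowed".

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0
CENTER_FREQ_HZ = 145e9    # from the abstract
BANDWIDTH_HZ = 10e9       # from the abstract


def range_resolution_m(bandwidth_hz):
    """Unwindowed FMCW range resolution: c / (2 * B)."""
    return SPEED_OF_LIGHT_M_S / (2 * bandwidth_hz)


def phase_responsivity_rad_per_m(freq_hz):
    """Round-trip phase change per unit target displacement: 4 * pi / lambda."""
    wavelength_m = SPEED_OF_LIGHT_M_S / freq_hz
    return 4 * math.pi / wavelength_m


resolution_mm = range_resolution_m(BANDWIDTH_HZ) * 1e3
responsivity_rad_per_mm = phase_responsivity_rad_per_m(CENTER_FREQ_HZ) / 1e3

print(f"unwindowed range resolution: {resolution_mm:.1f} mm")
print(f"phase responsivity: {responsivity_rad_per_mm:.2f} rad/mm")
```

The unwindowed resolution comes out near 15 mm, half the quoted windowed value of 30 mm, and the responsivity of about 6.08 rad/mm is close to the quoted $2\pi$ rad/mm.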

Journal Article•DOI•

CAP-RAM: A Charge-Domain In-Memory Computing 6T-SRAM for Accurate and Precision-Programmable CNN Inference

Zhiyu Chen¹, Zhanghao Yu¹, Qing Jin², Yan He¹, Jingyu Wang¹, Sheng Lin², Dai Li¹, Yanzhi Wang², Kaiyuan Yang¹•Institutions (2)

Rice University¹, Northeastern University²

26 May 2021-IEEE Journal of Solid-state Circuits

TL;DR: In this paper, a compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference.

Abstract: A compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference. It leverages a novel charge-domain multiply-and-accumulate (MAC) mechanism and circuitry to achieve superior linearity under process variations compared to conventional IMC designs. The adopted semi-parallel architecture efficiently stores filters from multiple CNN layers by sharing eight standard 6T SRAM cells with one charge-domain MAC circuit. Moreover, up to six levels of bit-width of weights with two encoding schemes and eight levels of input activations are supported. A 7-bit charge-injection SAR (ciSAR) analog-to-digital converter (ADC) getting rid of sample and hold (S&H) and input/reference buffers further improves the overall energy efficiency and throughput. A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM. A single $512\times 128$ macro stores a complete pruned and quantized CNN model to achieve 98.8% inference accuracy on the MNIST data set and 89.0% on the CIFAR-10 data set, with a 573.4-giga operations per second (GOPS) peak throughput and a 49.4-tera operations per second (TOPS)/W energy efficiency.

Journal Article•DOI•

Monostatic and Bistatic G-Band BiCMOS Radar Transceivers With On-Chip Antennas and Tunable TX-to-RX Leakage Cancellation

Maciej Kucharski¹, Wael A. Ahmad¹, Herman Jalli Ng², Dietmar Kissinger³•Institutions (3)

Leibniz Institute for Neurobiology¹, Karlsruhe University of Applied Sciences², University of Ulm³

01 Mar 2021-IEEE Journal of Solid-state Circuits

TL;DR: In this article, the authors present monostatic and bistatic radar transceivers incorporating on-chip antennas for short-range high-precision applications.

Abstract: This article presents $G$-band monostatic and bistatic radar transceivers (TRX) incorporating on-chip antennas for short-range high-precision applications. The circuits were fabricated using a silicon–germanium (SiGe) BiCMOS technology offering heterojunction bipolar transistors (HBTs) with $f_{\mathrm{T}}/f_{\mathrm{MAX}}$ of 300/500 GHz. The monostatic TRX implements a tunable leakage canceller (LC) for enhanced transmitter (TX)-to-receiver (RX) leakage compensation and hence improved detectability of weakly reflecting near targets. A standalone monostatic TRX characterized at on-wafer level achieves 4-dBm maximum output power ($P_{\mathrm{TX}}$) and 19-dB peak conversion gain ($G_{\mathrm{RX}}$) with 3-dB bandwidths of 18 and 17 GHz for the TX and the RX, respectively. The bistatic version reaches $P_{\mathrm{TX}}$ of 13 dBm and $G_{\mathrm{RX}}$ of 24 dB, expanding the 3-dB bandwidths to 32 and 34 GHz for the TX and RX, respectively. A double-folded dipole antenna providing 5-dBi gain at 170 GHz was implemented using localized backside etching (LBE) and integrated with the transceivers. A frequency-modulated continuous-wave (FMCW) radar demonstrator incorporating an external phase-locked loop (PLL) was built to evaluate both TRXs and the tunable leakage cancellation feature available in the monostatic variant. The maximum equivalent isotropic radiated power (EIRP), including on-chip antennas, is 8 and 18 dBm for the monostatic and bistatic TRX, respectively. The radars support sweep bandwidth up to 20 GHz, reaching 2.1 cm spatial resolution. For a target at 1 m distance, the measured ranging precision is $105~\mu \text{m}$ and $13~\mu \text{m}$ for the monostatic and bistatic TRX, respectively. Activation of leakage cancellation effectively suppresses close-in noise and extends the minimum detectable range remarkably.

Journal Article•DOI•

A Cryogenic Broadband Sub-1-dB NF CMOS Low Noise Amplifier for Quantum Applications

Yatao Peng¹, Andrea Ruffino¹, Edoardo Charbon¹•Institutions (1)

École Polytechnique Fédérale de Lausanne¹

26 Apr 2021-IEEE Journal of Solid-state Circuits

TL;DR: In this paper, a cryogenic broadband low noise amplifier (LNA) for quantum applications based on a standard 40-nm CMOS technology is reported; its specifications are derived from the readout of semiconductor quantum bits at 4.2 K, whose quantum information signals are characterized as phase-modulated signals.

Abstract: A cryogenic broadband low noise amplifier (LNA) for quantum applications based on a standard 40-nm CMOS technology is reported. The LNA specifications are derived from the readout of semiconductor quantum bits at 4.2 K, whose quantum information signals are characterized as phase-modulated signals. To achieve broadband input matching impedance and low noise figure, the gate-to-drain capacitance of the input transistor is exploited. The goal is to involve a resistive and capacitive load into the input impedance match of a common-source stage with source inductive degeneration. The capacitive load is created by an LC parallel tank whose resonant frequency is lower than the operating frequency. The achieved non-constant in-band equivalent capacitance is proven to be beneficial to input impedance matching. The resistive part of the load is provided by the transconductance of the cascode stage implicitly. An inductor is added to the gate of the cascode transistor to suppress its noise, and a transformer-based resonator with two resonant frequencies serves as the load of the first stage, thus extending the operating bandwidth. Design considerations for the cryogenic temperature operation of the LNA are proposed and analyzed. The LNA achieves a measured gain ( $S_{21}$ ) of 35 ± 0.5 dB, return loss > 12 dB, and NF of 0.75–1.3 dB across the band (4.1–7.9 GHz), with 51.1-mW power consumption at room temperature, while it shows a measured gain of 42 ± 3.3 dB and NF of 0.23–0.65 dB with 39-mW power consumption at 4.2 K between 4.6 and 8 GHz. To the best of our knowledge, this is the first report of a cryogenic LNA based on a bulk CMOS process working above 4 GHz showing sub-1-dB NF both at room and cryogenic temperatures.

Journal Article•DOI•

HNPU: An Adaptive DNN Training Processor Utilizing Stochastic Dynamic Fixed-Point and Active Bit-Precision Searching

Donghyeon Han¹, Dongseok Im¹, Gwangtae Park¹, Youngwoo Kim¹, Seokchan Song¹, Juhyoung Lee¹, Hoi-Jun Yoo¹•Institutions (1)

KAIST¹

23 Mar 2021-IEEE Journal of Solid-state Circuits

TL;DR: The HNPU supports a stochastic dynamic fixed-point representation and a layer-wise adaptive precision searching unit for low-bit-precision training, and utilizes slice-level reconfigurability and sparsity to maximize its efficiency in both DNN inference and training.

Abstract: This article presents HNPU, an energy-efficient deep neural network (DNN) training processor that adopts algorithm-hardware co-design. The HNPU supports a stochastic dynamic fixed-point representation and a layer-wise adaptive precision searching unit for low-bit-precision training. It additionally utilizes slice-level reconfigurability and sparsity to maximize its efficiency in both DNN inference and training. An adaptive-bandwidth reconfigurable accumulation network enables reconfigurable DNN allocation and maintains high core utilization even under various bit-precision conditions. Fabricated in a 28-nm process, the HNPU achieves at least $5.9\times $ higher energy efficiency and $2.5\times $ higher area efficiency in actual DNN training compared with the previous state-of-the-art on-chip learning processors.

Journal Article•DOI•

A 0.5-V Hybrid SRAM Physically Unclonable Function Using Hot Carrier Injection Burn-In for Stability Reinforcement

Kunyang Liu¹, Xinpeng Chen¹, Hongliang Pu¹, Hirofumi Shinohara¹•Institutions (1)

Waseda University¹

01 Jul 2021-IEEE Journal of Solid-state Circuits

TL;DR: The introduced hybrid SRAM PUF is compatible with hot carrier injection (HCI) burn-in stabilization, which can reinforce PUF stability to ~100% without the requirements of bitcell redundancy, visible oxide damage, additional fabrication processes, helper data storage, or error-correcting code (ECC) circuits.

Abstract: This article introduces an SRAM-based physically unclonable function (PUF) that employs hybrid-mode operations in the enhancement–enhancement (EE) SRAM mode and CMOS SRAM mode to achieve both high native stability and low power. A data latching scheme based on the hybrid structure enables operations under low supply voltage ( ${V}_{\text {DD}}$ ). Furthermore, the proposed hybrid SRAM PUF is compatible with hot carrier injection (HCI) burn-in stabilization, which can reinforce PUF stability to ~100% without the requirements of bitcell redundancy, visible oxide damage, additional fabrication processes, helper data storage, or error-correcting code (ECC) circuits. The proposed PUF is fabricated in 130-nm standard CMOS, and the experimental results show that it achieves 0.29% native bit error rate (BER) at the nominal condition of 0.6 V/25 °C. The operating ${V}_{\text {DD}}$ scales down to 0.5 V, with a core energy efficiency of 2.07 fJ/b. After HCI burn-in, no bit errors are found across all ${V}_{\text {DD}}$ /temperature (VT) corners from 0.5 to 0.7 V and from −40 °C to 120 °C (5120 bits $\times $ 500 evaluations tested at each condition). Long-term reliability is verified by using an accelerated aging test equivalent to approximately 21 years of operation, where the reinforced PUF shows no bit errors even at the worst VT corner of 0.5 V/120 °C during the test. The introduced hybrid SRAM PUF also passes all applicable NIST SP 800-22 randomness tests. It has a compact bitcell with an area of 497 F².

Journal Article•DOI•

A Broadband Linear Ultra-Compact mm-Wave Power Amplifier With Distributed-Balun Output Network: Analysis and Design

Fei Wang¹, Hua Wang¹•Institutions (1)

Georgia Institute of Technology¹

22 Jun 2021-IEEE Journal of Solid-state Circuits

TL;DR: In this article, a broadband power amplifier (PA) with a distributed-balun output network that provides the PA optimum load impedance over a wide bandwidth is presented.

Abstract: This article presents a broadband power amplifier (PA) with a distributed-balun output network that provides the PA optimum load impedance over a wide bandwidth. The proposed output network comprises two coupled-line sections and absorbs the device output capacitance. It employs a scalable coupled-line modeling approach that captures both the magnetic (inductive) and electric (capacitive) couplings between windings with fewer parameters and supports a rapid design process. Closed-form design solutions, design space limitations, bandwidth limits, and design tradeoffs are derived and analyzed comprehensively. Its extension to differential output and common-mode response is also discussed in detail. As a proof of concept, a prototype PA is implemented for multiband fifth-generation (5G) applications in 45-nm SOI CMOS. With no bias retuning or network reconfiguration, the PA consistently achieves >19.1 dBm $P_{\mathrm {sat}}$ , >37.3% peak power-added efficiency (PAE), 17.8–19.6 dBm $P_{\mathrm {1dB}}$ , and 36.6%–44.3% PAE $_{P\mathrm {1dB}}$ over 24–40 GHz, verifying the truly wideband large-signal matching. The PA demonstrates 5G new radio (NR) frequency range 2 (FR2) modulation signals over 24–42 GHz, covering n257/n258/n260 5G bands. For 5G NR FR2 800-MHz 2-CC 64-QAM signals (11.78-dB PAPR), the PA achieves 11.3-dBm/16.6% average $P_{\mathrm {out}}$ /PAE with −25.1-dB rms EVM at 28 GHz and 10.2-dBm/13.6% average $P_{\mathrm {out}}$ /PAE with −25.1-dB rms EVM at 37 GHz.

Journal Article•DOI•

EM and Power SCA-Resilient AES-256 Through >350× Current-Domain Signature Attenuation and Local Lower Metal Routing

Debayan Das¹, Josef Danial¹, Anupam Golder², Nirmoy Modak¹, Shovan Maity¹, Baibhab Chatterjee¹, Dong-Hyun Seo¹, Muya Chang², Avinash L. Varna³, Harish K. Krishnamurthy³, Sanu Mathew³, Santosh Ghosh³, Arijit Raychowdhury², Shreyas Sen¹•Institutions (3)

Purdue University¹, Georgia Institute of Technology², Intel³

01 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: This work embraces lower level metal routing of the CDSA embedding the crypto-IP so that the signature becomes highly suppressed before it passes through the higher metal layers (which radiates significantly) to connect to the external pin.

Abstract: Mathematically secure cryptographic algorithms, when implemented on a physical substrate, leak critical “side-channel” information, leading to power and electromagnetic (EM) analysis attacks. Circuit-level protections involve switched capacitor, buck converter, or series low-dropout (LDO) regulator-based implementations, each of which suffers from significant power, area, or performance tradeoffs and has only achieved a minimum traces to disclosure (MTD) of $10M$ to date. Utilizing an in-depth white-box model, this work, for the first time, focuses on signature suppression in the current domain, which provides an $Attenuation^{2}$ enhancement in MTD, leading to orders of magnitude improvement in both power and EM side-channel analysis (SCA) immunities. Using a combination of current-domain “signature attenuation” (CDSA) along with local lower level metal routing, the critical correlated information in the crypto current is significantly suppressed before it reaches the supply pin. In particular, to prevent the EM leakage from its source (metal layers carrying the correlated crypto current acting as antennas), this work embraces lower level metal routing of the CDSA embedding the crypto-IP so that the signature becomes highly suppressed before it passes through the higher metal layers (which radiate significantly) to connect to the external pin. The 65-nm CMOS test chip contains both protected and unprotected parallel AES-256 implementations, running at a clock frequency of 50 MHz. Test vector leakage assessment (TVLA) on the protected CDSA-AES, demonstrated with on-chip measurements for the first time, shows that the higher level metal layers leak significantly more compared with the lower level metal routing. Correlational power and EM analysis (CPA/CEMA) attacks on the unprotected implementation were able to extract the secret key within $8k$ and $12k$ traces, respectively, while the protected CDSA-AES could not be broken even after $1B$ encryptions for both power and EM SCA, evaluated both in the time and frequency domains, showing an improvement of $100\times $ over the prior state-of-the-art countermeasures with comparable power and area overheads.
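The quadratic dependence of MTD on attenuation stated in this abstract can be sanity-checked with quick arithmetic. The Python sketch below is illustrative only: `projected_mtd` is a hypothetical helper, and the 350× attenuation is taken from the paper title, where it is given as a lower bound.

```python
def projected_mtd(unprotected_mtd, attenuation):
    """Scale an unprotected MTD by attenuation squared (the stated Attenuation^2 relation)."""
    return unprotected_mtd * attenuation ** 2


if __name__ == "__main__":
    # Unprotected CPA and CEMA results from the abstract: 8k and 12k traces.
    for label, traces in (("CPA", 8_000), ("CEMA", 12_000)):
        print(label, projected_mtd(traces, 350))
```

Under these assumptions, the 8k-trace CPA result scales to about 0.98 billion traces and the 12k-trace CEMA result to about 1.47 billion, the same order of magnitude as the 1B encryptions the protected design is reported to withstand.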

Journal Article•DOI•

A 3.5-mV Input Single-Inductor Self-Starting Boost Converter With Loss-Aware MPPT for Efficient Autonomous Body-Heat Energy Harvesting

Soumya Bose¹, Tejasvi Anand¹, Matthew L. Johnston¹•Institutions (1)

Oregon State University¹

01 Jun 2021-IEEE Journal of Solid-state Circuits

TL;DR: In order to extract maximum energy from a thermoelectric generator at small temperature gradients, a loss-aware maximum power point tracking (MPPT) scheme was developed, which enables the harvester to achieve high end-to-end efficiency at low input voltages.

Abstract: A single-inductor self-starting boost converter is presented, which is suitable for thermoelectric energy harvesting from human body heat. In order to extract maximum energy from a thermoelectric generator (TEG) at small temperature gradients, a loss-aware maximum power point tracking (MPPT) scheme was developed, which enables the harvester to achieve high end-to-end efficiency at low input voltages. The boost converter is implemented in a 0.18- $\mu \text{m}$ CMOS technology and is more than 75% efficient for a matched input voltage range of 15–100 mV, with a peak efficiency of 82%. Enhanced power extraction enables the converter to sustain operation at an input voltage as low as 3.5 mV. In addition, the boost converter self-starts with a minimum TEG voltage of 50 mV leveraging a dual-path architecture without using additional off-chip components.

Journal Article•DOI•

A 1.16-V 5.8-to-13.5-ppm/°C Curvature-Compensated CMOS Bandgap Reference Circuit With a Shared Offset-Cancellation Method for Internal Amplifiers

Keng Chen¹, Luca Petruzzi¹, Ronald Hulfachor¹, Marvin Onabajo²•Institutions (2)

Infineon Technologies¹, Northeastern University²

01 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: An accurate current-mode bandgap reference circuit design with a novel shared offset compensation scheme for its internal amplifiers, which conserves die area and power by removing the need for each amplifier to have its own active auxiliary offset-cancellation circuit.

Abstract: This article introduces an accurate current-mode bandgap reference circuit design with a novel shared offset compensation scheme for its internal amplifiers. This bandgap circuit has been designed to operate over a very wide temperature range from −40 °C to 150 °C. Its output voltage is 1.16 V with a 3.3-V supply voltage. A multi-section curvature compensation method alleviates the error from the nonlinear temperature dependence of the bipolar junction transistor’s base–emitter voltage. The bandgap reference circuit contains two operational amplifiers that are utilized to generate proportional-to-absolute-temperature (PTAT) and complementary-to-absolute-temperature (CTAT) current sources. With the implementation of the described shared offset-cancellation methodology, the simulated output inaccuracy introduced by the amplifier is kept to a 5 $\sigma $ offset within ±4.6 $\mu \text{V}$ , while die area and power consumption are conserved because no amplifier needs its own active auxiliary offset-cancellation circuit. Designed and fabricated in a 130-nm CMOS process technology, the bandgap reference has a measured output voltage shift of less than 1 mV over a −40 °C to 150 °C temperature range and an overall variation of ±8.2 mV across seven measured samples without trimming.

Journal Article•DOI•

A High-Power Broadband Multi-Primary DAT-Based Doherty Power Amplifier for mm-Wave 5G Applications

Fei Wang¹, Hua Wang¹•Institutions (1)

Georgia Institute of Technology¹

04 May 2021-IEEE Journal of Solid-state Circuits

TL;DR: In this paper, the authors propose a fully integrated high-power broadband linear Doherty PA with multi-primary distributed-active-transformer (DAT) power combining.

Abstract: Silicon-based millimeter-wave (mm-Wave) power amplifiers (PAs) with high power and high peak/back-off efficiency are highly desired to efficiently amplify multi-Gb/s 5G NR signals. This article presents a fully integrated high-power broadband linear Doherty PA with multi-primary distributed-active-transformer (DAT) power combining. We introduce a transformer-based impedance inverter for active load modulation and a multi-primary DAT structure for hybrid series and parallel power combining. Based on this, we propose a transformer-based Doherty combiner with more design freedom and a multi-primary DAT-based Doherty PA for simultaneous active load modulation and low-loss power combining. The EM simulation results demonstrate that the proposed DAT-based Doherty output network achieves very symmetric and balanced load impedances among all the main and auxiliary PA ports. As a proof of concept, a 24–30-GHz prototype PA is implemented in a 0.13- $\mu \text{m}$ SiGe BiCMOS process. The PA achieves 30.4% PAEmax, 28.3-dBm $P_{\mathrm {sat}}$ , 30.2% PAE at 26.8-dBm $P_{\mathrm {1\,dB}}$ , and 21.2% PAE at 6-dB back-off from $P_{\mathrm {sat}}$ at 28 GHz. Modulation measurement with single-carrier 64-QAM signals and 5G NR FR2 orthogonal frequency-division multiplexing (OFDM) signals has been demonstrated. For a 200-MHz 1-CC 5G NR FR2 64-QAM signal, the PA achieves 18.1-dBm Pavg and 13.8% PAEavg with −25.1-dB rms EVM at 28 GHz.

Journal Article•DOI•

A 0.5-V Real-Time Computational CMOS Image Sensor With Programmable Kernel for Feature Extraction

Tzu-Hsiang Hsu¹, Yi-Ren Chen¹, Ren-Shuo Liu¹, Chung-Chuan Lo¹, Kea-Tiong Tang¹, Meng-Fan Chang¹, Chih-Cheng Hsieh¹•Institutions (1)

National Tsing Hua University¹

01 May 2021-IEEE Journal of Solid-state Circuits

TL;DR: The C²IS prototype sensor is used as a real-time edge-feature-detection front-end camera and paired with a simplified convolutional neural network (CNN) architecture to demonstrate hand gesture recognition.

Abstract: With the growing demand for artificial intelligence (AI) Internet-of-Things (IoT) devices, smart vision sensors with energy-efficient computing capability are required. This article presents a low-power and low-voltage dual-mode 0.5-V computational CMOS image sensor (C²IS) with array-parallel computing capability for feature extraction using convolution. In the feature extraction mode, by applying the pulsewidth modulation (PWM) pixel and switch-current integration (SCI) circuit, the in-sensor eight-directional matrix-parallel multiply–accumulate (MAC) operation is realized. Furthermore, the analog-domain convolution-on-readout (COR) operation, the programmable $3\times3$ kernel with ±3-bit weights, and the tunable-resolution column-parallel analog-to-digital converter (ADC) (1–8 bit) are implemented to achieve real-time feature extraction without using additional memory or sacrificing frame rate. In the image capturing mode, the sensor provides linear-response 8-bit raw image data. The C²IS prototype has been fabricated in the TSMC 0.18- $\mu \text{m}$ standard process technology and verified to deliver raw and feature images at 480 frames/s with a power consumption of 77/ $117~\mu \text{W}$ and a resultant FoM of 9.8/14.8 pJ/pixel/frame, respectively. The prototype sensor is used as a real-time edge-feature-detection front-end camera and paired with a simplified convolutional neural network (CNN) architecture to demonstrate hand gesture recognition. The prototype system achieves more than 95% validation accuracy.

Journal Article•DOI•

A 90.2% Peak Efficiency Multi-Input Single-Inductor Multi-Output Energy Harvesting Interface With Double-Conversion Rejection Technique and Buck-Based Dual-Conversion Mode

Hyun Jin Kim¹, Junyoung Maeng¹, Inho Park¹, Jeon Jinwoo¹, Dongju Lim¹, Chulwoo Kim¹•Institutions (1)

Korea University¹

01 Mar 2021-IEEE Journal of Solid-state Circuits

TL;DR: This article presents a multi-input single-inductor multi-output energy-harvesting interface that extracts power from three independent sources and regulates three output voltages and achieves a peak end-to-end efficiency of 90.2% and a maximum output power of 24 mW, indicating improvements of approximately 7.52% and 1.85 times, respectively, compared with those of conventional buck–boost converters.

Abstract: This article presents a multi-input single-inductor multi-output energy-harvesting interface that extracts power from three independent sources and regulates three output voltages. The converter employs the proposed double-conversion rejection technique to reduce the double-converted power by up to 81.8% under the light-load condition and operates in various power conversion modes, including the proposed buck-based dual-conversion mode, to improve the power conversion efficiency and maximum load power. The proposed adaptive peak inductor current controller determines the inductor charging period, and the proposed digitally controlled zero-current detector detects the optimum zero-current point according to the operating mode. The proposed converter achieves a peak end-to-end efficiency of 90.2% and a maximum output power of 24 mW, indicating the improvements of approximately 7.52% and 1.85 times, respectively, compared with those of conventional buck–boost converters.

Journal Article•DOI•

A 510-nW Wake-Up Keyword-Spotting Chip Using Serial-FFT-Based MFCC and Binarized Depthwise Separable CNN in 28-nm CMOS

Weiwei Shan¹, Minhao Yang², Tao Wang¹, Yicheng Lu¹, Hao Cai¹, Lixuan Zhu¹, Jiaming Xu¹, Chengjun Wu¹, Longxing Shi¹, Jun Yang¹•Institutions (2)

Southeast University¹, École Polytechnique Fédérale de Lausanne²

01 Jan 2021-IEEE Journal of Solid-state Circuits

TL;DR: A sub-$\mu \text{W}$ always-ON keyword-spotting chip for audio wake-up systems is proposed, which is mainly composed of a neural network and a feature extraction circuit.

Abstract: We propose a sub-$\mu \text{W}$ always-ON keyword spotting ($\mu $KWS) chip for audio wake-up systems. It is mainly composed of a neural network (NN) and a feature extraction (FE) circuit. To significantly reduce the memory footprint and computational load, four techniques are used to achieve ultra-low-power consumption: 1) a serial-FFT-based Mel-frequency cepstrum coefficient circuit is designed for FE, instead of the common parallel FFT; 2) a small-sized binarized depthwise separable convolutional NN (DSCNN) is designed as the classifier; 3) a framewise incremental computation technique is devised in contrast to the conventional whole-word processing; and 4) reduced computation allows a low system clock frequency, which enables near-threshold voltage operation, and low-leakage memory blocks are designed to minimize the leakage power. Implemented in 28-nm CMOS technology, this $\mu $KWS consumes $0.51~\mu \text{W}$ at a 40-kHz frequency and a 0.41-V supply, with an area of 0.23 mm². Using the Google speech command data set, 97.3% accuracy is reached for a one-word KWS task and 94.6% for a two-word task.

Journal Article•DOI•

High-Scalability CMOS Quantum Magnetometer With Spin-State Excitation and Detection of Diamond Color Centers

Mohamed I. Ibrahim¹, Christopher Foy¹, Dirk Englund¹, Ruonan Han¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Mar 2021-IEEE Journal of Solid-state Circuits

TL;DR: In this article, a CMOS quantum vector-field magnetometer using nitrogen-vacancy (NV) centers in diamond is presented, which miniaturizes conventional quantum sensing platforms by integrating key components for spin control and readout.

Abstract: Magnetometers based on quantum mechanical processes enable high sensitivity and long-term stability without the need for re-calibration, but their integration into fieldable devices remains challenging. This article presents a CMOS quantum vector-field magnetometer that miniaturizes the conventional quantum sensing platforms using nitrogen-vacancy (NV) centers in diamond. By integrating key components for spin control and readout, the chip performs magnetometry through optically detected magnetic resonance (ODMR) in a diamond slab attached to a custom CMOS chip. The ODMR control is highly uniform across the NV centers in the diamond, which is enabled by a CMOS-generated ~2.87 GHz magnetic field with $\times $ 80 $\mu \text{m}^{2}$ diamond slab. NV fluorescence is measured by CMOS-integrated photodetectors. This ON-chip measurement is enabled by efficient rejection of the green pump light from the red fluorescence through a CMOS-integrated spectral filter based on a combination of spectrally dependent plasmonic losses and diffractive filtering in the CMOS back-end-of-line (BEOL). This filter achieves a measured ~25 dB of green light rejection. We measure a sensitivity of 245 nT/Hz$^{1/2}$, marking a 130 $\times $ improvement over a previous CMOS-NV sensor prototype, largely thanks to the better spectral filtering and homogeneous microwave generation over a larger area.

Journal Article•DOI•

SNAP: An Efficient Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference

Jie-Fang Zhang¹, Ching-En Lee¹, Chester Liu¹, Yakun Sophia Shao², Stephen W. Keckler², Zhengya Zhang¹•Institutions (2)

University of Michigan¹, Nvidia²

01 Feb 2021-IEEE Journal of Solid-state Circuits

TL;DR: SNAP as discussed by the authors uses parallel associative search to discover valid weight (W) and input activation (IA) pairs from compressed, unstructured, sparse W and IA data arrays, which allows SNAP to maintain a 75% average compute utilization.

Abstract: Recent developments in deep neural network (DNN) pruning introduce data sparsity to enable deep learning applications to run more efficiently on resource- and energy-constrained hardware platforms. However, these sparse models require specialized hardware structures to fully exploit the sparsity for storage, latency, and efficiency improvements. In this work, we present the sparse neural acceleration processor (SNAP) to exploit unstructured sparsity in DNNs. SNAP uses parallel associative search to discover valid weight (W) and input activation (IA) pairs from compressed, unstructured, sparse W and IA data arrays. The associative search allows SNAP to maintain a 75% average compute utilization. SNAP follows a channel-first dataflow and uses a two-level partial sum (psum) reduction dataflow to eliminate access contention at the output buffer and cut the psum writeback traffic by 22 $\times $ compared with state-of-the-art DNN accelerator designs. SNAP’s psum reduction dataflow can be configured in two modes to support general convolution (CONV) layers, pointwise CONV, and fully connected layers. A prototype SNAP chip is implemented in a 16-nm CMOS technology. The 2.3-mm² test chip is measured to achieve a peak effectual efficiency of 21.55 TOPS/W (16 b) at 0.55 V and 260 MHz for CONV layers with 10% weight and activation densities. Operating on a pruned ResNet-50 network, the test chip achieves a peak throughput of 90.98 frames/s at 0.80 V and 480 MHz, dissipating 348 mW.
