Showing papers in "IEEE Journal of Solid-state Circuits in 2021"


Journal ArticleDOI
TL;DR: This work presents a compute-in-memory (CIM) macro built around a standard two-port compiler macro using foundry 8T bit-cell in 7-nm FinFET technology and achieves energy efficiency of 351 TOPS/W and throughput of 372.4 GOPS.
Abstract: In this work, we present a compute-in-memory (CIM) macro built around a standard two-port compiler macro using foundry 8T bit-cell in 7-nm FinFET technology. The proposed design supports 1024 4 b $\times $ 4 b multiply-and-accumulate (MAC) computations simultaneously. The 4-bit input is represented by the number of read word-line (RWL) pulses, while the 4-bit weight is realized by charge sharing among binary-weighted computation caps. Each unit of computation cap is formed by the inherent cap of the sense amplifier (SA) inside the 4-bit Flash ADC, which saves area and minimizes kick-back effect. Access time is 5.5 ns with 0.8-V power supply at room temperature. The proposed design achieves energy efficiency of 351 TOPS/W and throughput of 372.4 GOPS. Implications of our design from neural network implementation and accuracy perspectives are also discussed.

73 citations
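
The headline figures in the abstract above can be cross-checked with simple arithmetic. The sketch below assumes the common convention that one MAC counts as two operations (one multiply and one add) and that all 1024 MACs complete within a single 5.5-ns access. Neither assumption is stated in the abstract, and the function names are illustrative.

```python
def peak_throughput_ops(num_macs, access_time_s, ops_per_mac=2):
    """Operations per second if num_macs MACs finish in one access time.

    ops_per_mac=2 is an assumed convention (multiply + add), not from the abstract.
    """
    return num_macs * ops_per_mac / access_time_s


def energy_per_op_j(tops_per_watt):
    """Energy per operation in joules implied by an efficiency in TOPS/W."""
    return 1.0 / (tops_per_watt * 1e12)


throughput = peak_throughput_ops(num_macs=1024, access_time_s=5.5e-9)
energy = energy_per_op_j(tops_per_watt=351)

print(f"{throughput / 1e9:.1f} GOPS")   # 372.4 GOPS
print(f"{energy * 1e15:.2f} fJ/op")     # 2.85 fJ/op
```

Under these assumptions the computed throughput agrees with the reported 372.4 GOPS.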


Journal ArticleDOI
TL;DR: Based on the benefits of digital CIM, reconfigurability, and bit-serial computing architecture, the Colonnade can achieve both high performance and energy efficiency for processing neural networks.
Abstract: This article (Colonnade) presents a fully digital bit-serial compute-in-memory (CIM) macro. The digital CIM macro is designed for processing neural networks with reconfigurable 1–16 bit input and weight precisions based on bit-serial computing architecture and a novel all-digital bitcell structure. A column of bitcells forms a column MAC and is used for computing a multiply-and-accumulate (MAC) operation. The column MACs placed in a row work as a single neuron and compute a dot-product, which is an essential building block of neural network accelerators. Several key features differentiate the proposed Colonnade architecture from the existing analog and digital implementations. First, its full-digital circuit implementation is free from process variation, noise susceptibility, and data-conversion overhead that are prevalent in prior analog CIM macros. A bitwise MAC operation in a bitcell is performed in the digital domain using a custom-designed XNOR gate and a full-adder. Second, the proposed CIM macro is fully reconfigurable in both weight and input precision from 1 to 16 bit. So far, most of the analog macros were used for processing quantized neural networks with very low input/weight precisions, mainly due to a memory density issue. Recent digital accelerators have implemented reconfigurable precisions, but they are inferior in energy efficiency due to significant off-chip memory access. We present a regular digital bitcell array that is readily reconfigured to a 1–16 bit weight-stationary bit-serial CIM macro. The macro computes parallel dot-product operations between the weights stored in memory and inputs that are serialized from LSB to MSB. Finally, the bit-serial computing scheme significantly reduces the area overhead while sacrificing latency due to bit-by-bit operation cycles.
Based on the benefits of digital CIM, reconfigurability, and bit-serial computing architecture, the Colonnade can achieve both high performance and energy efficiency (i.e., both benefits of prior analog and digital accelerators) for processing neural networks. A test-chip with $128 \times 128$ SRAM-based bitcells for digital bit-serial computing is implemented using 65-nm technology and tested with 1–16 bit weight/input precisions. The measured energy efficiency is 117.3 TOPS/W at 1 bit and 2.06 TOPS/W at 16 bit.

58 citations
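
The weight-stationary, LSB-first bit-serial scheme described in the abstract above can be illustrated in software. This is a minimal sketch for unsigned integers only, using a plain AND-style partial product per input bit. It is not the Colonnade circuit (which uses an XNOR gate and a full adder per bitcell), and the function name and argument layout are assumptions made for illustration.

```python
def bit_serial_dot(weights, inputs, input_bits):
    """Dot product of unsigned integers with inputs fed one bit per cycle, LSB first.

    weights stay fixed (weight-stationary); each cycle consumes one bit-plane
    of the inputs and adds the shifted partial sum to the accumulator.
    """
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    if any(x < 0 or x >= (1 << input_bits) for x in inputs):
        raise ValueError("inputs must fit in input_bits unsigned bits")

    acc = 0
    for cycle in range(input_bits):
        # Bit-plane for this cycle: bit number `cycle` of every input.
        plane = [(x >> cycle) & 1 for x in inputs]
        # A 1-bit input times a multi-bit weight is either 0 or the weight.
        partial = sum(w for w, bit in zip(weights, plane) if bit)
        # Weight the partial sum by the significance of the current input bit.
        acc += partial << cycle
    return acc


weights = [3, 5, 7]
inputs = [2, 4, 6]
print(bit_serial_dot(weights, inputs, input_bits=3))  # 68
```

The loop runs once per input bit, which mirrors the latency cost of bit-by-bit operation that the abstract mentions.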


Journal ArticleDOI
TL;DR: A high-performance annealing processor named STochAsTIc Cellular automata Annealer (STATICA) is presented for solving combinatorial optimization problems represented by fully connected graphs; it can update multiple states of fully connected spins simultaneously by introducing different dynamics called stochastic cellular automata annealing.
Abstract: This article presents a high-performance annealing processor named STochAsTIc Cellular automata Annealer (STATICA) for solving combinatorial optimization problems represented by fully connected graphs. Supporting fully connected graphs is strongly required for dealing with realistic optimization problems. Unlike previous annealing processors that follow Glauber dynamics, our proposed annealer can update multiple states of fully connected spins simultaneously by introducing different dynamics called stochastic cellular automata annealing. It allows us to utilize the pipeline-level and memory-bank-level parallelization in addition to the PE-level parallelization originally adopted in the previous annealers. The STATICA prototype chip, which supports a 512-spin fully connected graph, has been fabricated in 65-nm CMOS technology and realized as a 3 mm $\times$ 4 mm chip. Using the fabricated 512-spin chip and numerical projections for a 2048-spin chip, we have conducted experiments to reveal the annealing performance of STATICA and examined how to control its annealing process efficiently.

51 citations


Journal ArticleDOI
TL;DR: A 3-D-integrated 112-Gb/s pulse amplitude modulation (PAM)-4 optical transmitter (OTX) using silicon photonic MRM, on-chip laser, and co-packaged 28-nm CMOS driver to address static and dynamic MRM nonlinearities is presented.
Abstract: Microring modulators (MRMs) with CMOS electronics enable compact low power transmitter solutions for 400G Ethernet and co-packaged optical transceivers. In this article, we present a 3-D-integrated 112-Gb/s pulse amplitude modulation (PAM)-4 optical transmitter (OTX) using silicon photonic MRM, on-chip laser, and co-packaged 28-nm CMOS driver. The 3-$V_{\mathrm{pp}}$ driver includes a lookup table (LUT)-based PAM-4 nonlinear equalizer to address static and dynamic MRM nonlinearities. An integrated thermal control method that is insensitive to input power fluctuations is proposed to compensate for the temperature sensitivity of MRMs. PAM-4 measurement results of our OTX at 112 Gb/s show that transmitter dispersion eye closure quaternary (TDECQ) < 1.5 dB is achieved from 28 °C to 55 °C with 7.4-pJ/bit energy efficiency including on-chip laser.

50 citations


Journal ArticleDOI
TL;DR: A 36-way time-interleaved 56-GS/s 7-bit ADC is designed to realize 112-Gb/s pulse-amplitude modulation (PAM-4) transceiver in a 7-nm FinFET CMOS, achieved over a channel with 37.5-dB loss at 28 GHz while dissipating 602 mW per channel, excluding DSP.
Abstract: A 36-way time-interleaved 56-GS/s 7-bit ADC is designed to realize a 112-Gb/s pulse-amplitude modulation (PAM-4) transceiver in a 7-nm FinFET CMOS. The receiver analog front-end stages and the ADC track-and-hold (T/H) buffers are implemented using inverter-based Gm/inverse-Gm-load cells. A distributed inductor peaking network and multi-phase clock calibration are implemented in the quarter-rate transmitter. The transceiver achieves <1E-8 pseudorandom binary sequence (PRBS)-31 PAM-4 bit error rate (BER) over a channel with 37.5-dB loss at 28 GHz while dissipating 602 mW per channel, excluding DSP.

50 citations


Journal ArticleDOI
TL;DR: In this paper, the first successful demonstration of an adiabatic microprocessor based on unshunted Josephson junction (JJ) devices manufactured using a Nb/AlOx/Nb superconductor IC fabrication process was conducted.
Abstract: We conducted the first successful demonstration of an adiabatic microprocessor based on unshunted Josephson junction (JJ) devices manufactured using a Nb/AlOx/Nb superconductor IC fabrication process. It is a hybrid of RISC and dataflow architectures operating on 4-b data words. We demonstrate register file R/W access, ALU execution, hardware stalling, and program branching performed at 100 kHz under the cryogenic temperature of 4.2 K. We also successfully demonstrated a high-speed breakout chip of the microprocessor execution units up to 2.5 GHz. We use a logic primitive called the adiabatic quantum-flux-parametron (AQFP), which has a switching energy of 1.4 zJ per JJ when driven by a four-phase 5-GHz sinusoidal ac-clock at 4.2 K. These demonstrations show that AQFP logic is capable of both processing and memory operations and that we have a path toward practical adiabatic computing operating at high-clock rates while dissipating very little energy.

48 citations


Journal ArticleDOI
TL;DR: A 6T SRAM-based CIM (SRAM-CIM) macro capable of weight-bitwise MAC (WbwMAC) operations to expand the sensing margin and improve the readout accuracy for high-precision MAC operations is presented.
Abstract: This article presents a computing-in-memory (CIM) structure aimed at improving the energy efficiency of edge devices running multi-bit multiply-and-accumulate (MAC) operations. The proposed scheme includes a 6T SRAM-based CIM (SRAM-CIM) macro capable of: 1) weight-bitwise MAC (WbwMAC) operations to expand the sensing margin and improve the readout accuracy for high-precision MAC operations; 2) a compact 6T local computing cell to perform multiplication with suppressed sensitivity to process variation; 3) an algorithm-adaptive low MAC-aware readout scheme to improve energy efficiency; 4) a bitline header selection scheme to enlarge signal margin; and 5) a small-offset margin-enhanced sense amplifier for robust read operations against process variation. A fabricated 28-nm 64-kb SRAM-CIM macro achieved access times of 4.1–8.4 ns with energy efficiency of 11.5–68.4 TOPS/W, while performing MAC operations with 4- or 8-b input and weight precision.

48 citations


Journal ArticleDOI
TL;DR: This work marks the first CMOS demonstration of THz radar and achieves record bandwidth and ranging resolution among all radar front-end chips.
Abstract: This article presents a CMOS-based, ultra-broadband frequency-modulated continuous-wave (FMCW) radar using a terahertz (THz) frequency-comb architecture. The high-parallelism spectral sensing provided by this architecture significantly reduces the bandwidth requirement for the THz front-end circuitry and ensures that the peak output power and sensitivity are maintained across the entire band of operation. The speed and linearity of frequency chirping are also improved by the comb system. An antenna-sharing scheme based on a square-mixer-first architecture is used, which not only leads to compact size but also facilitates the stitching of the multichannel radar IF data. To avoid the use of a high-cost silicon lens in the on-chip broadband radiation, a multi-resonance substrate-integrated-waveguide (SIW) antenna structure is introduced, which provides 15% fractional bandwidth for impedance matching. As a proof of concept, a five-tone radar prototype that seamlessly scans the entire 220-to-320-GHz band is demonstrated. In the measurement, the multi-channel-aggregated equivalent-isotropically radiated power (EIRP) is 0.6 dBm and is further boosted to ~20 dBm with a TPX (polymethylpentene) lens. The measured minimum single-sideband noise figure (SSB NF) of the receiver, including the antenna loss and baseband amplifier, is 22.8 dB. Due to the comb architecture, the EIRP and NF values fluctuate by only 8.8 and 14.6 dB, respectively, across the 100-GHz bandwidth. The chip has a die size of 5 mm$^2$ and consumes 840 mW of dc power. This work marks the first CMOS demonstration of THz radar and achieves record bandwidth and ranging resolution among all radar front-end chips.

48 citations


Journal ArticleDOI
TL;DR: Vega is an IoT endnode system on chip (SoC) capable of scaling from a 1.7-μW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance.
Abstract: The Internet-of-Things (IoT) requires endnodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT endnode system on chip (SoC) capable of scaling from a 1.7-μW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance on NSAAs, including mobile deep neural network (DNN) inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile magnetoresistive random access memory (MRAM). To meet the performance and flexibility requirements of NSAAs, the SoC features ten RISC-V cores: one core for SoC and IO management and a nine-core cluster supporting multi-precision single instruction multiple data (SIMD) integer and floating-point (FP) computation. Vega achieves the state-of-the-art (SoA)-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On FP computation, it achieves the SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine learning (ML) accelerators boost energy efficiency in cognitive sleep and active states.

46 citations


Journal ArticleDOI
TL;DR: Experimental results show that ROSCs are a potential candidate for a dedicated hardware accelerator aiming to solve a wide range of COPs and that the integrated CMOS-based Ising computer can find the solution to NP-hard problems with an accuracy of 82%–100%.
Abstract: Nondeterministic polynomial time hard (NP-hard) combinatorial optimization problems (COPs) are intractable to solve using a traditional computer as the time to find a solution increases very rapidly with the number of variables. An efficient alternative computing method uses coupled spin networks to solve COP. This work presents a first-of-its-kind coupled ring oscillator (ROSC)-based scalable probabilistic Ising computer to solve NP-hard COPs. An integrated coupled oscillator network was designed with 560 ROSCs that mimic a coupled spin network. Each ROSC can be coupled to any of its neighbors using programmable back-to-back (B2B) inverter-based coupling mechanism. The ROSC-based spins and B2B inverter-based coupling were optimized to work under a wide range of system noise as well as voltage and temperature variations. Randomly generated 1000 max-cut problems were mapped and solved in the hardware. The integrated Ising computer produced satisfactory solutions of max-cut problems when compared with commercial software running on a CPU. Experiments show that the integrated CMOS-based Ising computer can find the solution to NP-hard problems with an accuracy of 82%–100%. In addition, the repeated measurements of the same problem showed that the Ising computer can traverse through several local minima to find high-quality solutions under various voltage and temperature variation conditions. The experimental results show that ROSCs are a potential candidate for a dedicated hardware accelerator aiming to solve a wide range of COPs.

44 citations
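
The mapping between max-cut and a coupled spin network mentioned in the abstract above can be made concrete. The sketch below assumes an unweighted graph, spins in {-1, +1}, and an energy defined as the sum of spin products over edges, so that a lower energy means a larger cut. The exhaustive search is only a software reference for tiny graphs and is unrelated to how the ROSC hardware finds solutions; all names are illustrative.

```python
from itertools import product


def cut_size(edges, spins):
    """Number of edges whose endpoints carry opposite spins."""
    return sum(1 for i, j in edges if spins[i] != spins[j])


def ising_energy(edges, spins):
    """Energy H = sum of s_i * s_j over edges (unit antiferromagnetic coupling)."""
    return sum(spins[i] * spins[j] for i, j in edges)


def brute_force_max_cut(num_nodes, edges):
    """Try every spin assignment; practical only for small num_nodes."""
    best_spins, best_cut = None, -1
    for spins in product((-1, 1), repeat=num_nodes):
        current = cut_size(edges, spins)
        if current > best_cut:
            best_spins, best_cut = spins, current
    return best_spins, best_cut


# 4-node ring: alternating spins cut every edge.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
spins, best = brute_force_max_cut(4, ring)
print(best)                       # 4
print(ising_energy(ring, spins))  # -4
```

Each cut edge contributes -1 to the energy and each uncut edge contributes +1, so for a graph with m edges the cut size equals (m - H) / 2. Minimizing the energy therefore maximizes the cut.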


Journal ArticleDOI
TL;DR: Evolver is the first deep learning processor that utilizes on-device QVF tuning to achieve both customized and optimal DNN deployment, and introduces bidirectional speculation and runtime reconfiguration techniques into the architecture.
Abstract: When deploying deep neural networks (DNNs) onto deep learning processors, we usually exploit mixed-precision quantization and voltage–frequency scaling to make tradeoffs among accuracy, latency, and energy. Conventional methods usually determine the quantization–voltage–frequency (QVF) policy before DNNs are deployed onto local devices. However, it is difficult for them to make optimal customizations for local user scenarios. In this article, we solve the problem by enabling on-device QVF tuning with a new deep learning processor architecture, Evolver. Evolver has a QVF tuning mode to deploy DNNs with local customizations before normal execution. In this mode, Evolver uses reinforcement learning to search for the optimal QVF policy based on direct hardware feedback from the chip itself. After that, Evolver runs the newly quantized DNN inference under the searched voltage and frequency. To improve the performance and energy efficiency of both training and inference, we introduce bidirectional speculation and runtime reconfiguration techniques into the architecture. To the best of our knowledge, Evolver is the first deep learning processor that utilizes on-device QVF tuning to achieve both customized and optimal DNN deployment.

Journal ArticleDOI
TL;DR: In this paper, a 36-channel scanning light detection and ranging (LiDAR) sensor with an on-chip single-photon avalanche diode array is presented, which has an area-efficient 11-bit in situ histogramming time-to-digital converter with a $3000 \times 78 \,\,\mu \text {m}^{2}$ per channel area based on a mixed-signal accumulator.
Abstract: This article presents a 36-channel scanning light detection and ranging (LiDAR) sensor with an on-chip single-photon avalanche diode array. The sensor has an area-efficient 11-bit in situ histogramming time-to-digital converter with a $3000 \times 78\,\mu\text{m}^{2}$ per channel area based on a mixed-signal accumulator, even though it incorporates histogramming and filtering capabilities. Furthermore, owing to its embedded interference (IF) filter, the sensor can perform reliable direct time-of-flight measurements even with IF from 32 different LiDAR sensors. The LiDAR system also has a beam scanner that comprises dual laser diodes for IF elimination and a hybrid mirror such that high-resolution images with a resolution of $2200 \times 36$ can be acquired with a wide field-of-view of $120^{\circ} \times 8^{\circ}$.

Journal ArticleDOI
TL;DR: The IntAct circuit prototype integrates six chiplets in FDSOI 28-nm technology, 3D-stacked onto an active interposer in a 65-nm process, offering a total of 96 computing cores.
Abstract: In the context of high-performance computing, the integration of more computing capabilities with generic cores or dedicated accelerators for artificial intelligence (AI) application is raising more and more challenges. Due to the increasing costs of advanced nodes and the difficulties of shrinking analog and circuit input output signals (IOs), alternative architecture solutions to single die are becoming mainstream. Chiplet-based systems using 3D technologies enable modular and scalable architecture and technology partitioning. Nevertheless, there are still limitations due to chiplet integration on passive interposers (silicon or organic). In this article we present the first CMOS active interposer, integrating: 1) power management without any external components; 2) distributed interconnects enabling any chiplet-to-chiplet communication; and 3) system infrastructure, design-for-test, and circuit IOs. The IntAct circuit prototype integrates six chiplets in FDSOI 28-nm technology, which are 3D-stacked onto this active interposer in 65-nm process, offering a total of 96 computing cores. Full scalability of the computing system is achieved using an innovative scalable cache-coherent memory hierarchy, enabled by distributed network-on-chips, with 3-Tbit/s/mm$^2$ high bandwidth 3D-plug interfaces using 20-$\mu\text{m}$ pitch micro-bumps, 0.6-ns/mm low latency asynchronous interconnects, while the six chiplets are locally power-supplied with 156-mW/mm$^2$ at 82%-peak-efficiency dc–dc converters through the active interposer. Thermal dissipation is studied showing the feasibility of such approach.

Journal ArticleDOI
TL;DR: In this paper, a power-efficient and low-cost CMOS 28-GHz phased-array beamformer supporting 5G dual-polarized MIMO (DP-MIMO) operation is introduced.
Abstract: This article introduces a power-efficient and low-cost CMOS 28-GHz phased-array beamformer supporting fifth-generation (5G) dual-polarized multiple-in-multiple-out (MIMO) (DP-MIMO) operation. To improve the cross-polarization (cross-pol.) isolation degraded by the antennas and propagation, a power-efficient analog-assisted cross-pol. leakage cancellation technique is implemented. After the high-accuracy cancellation, more than 41.3-dB cross-pol. isolation is maintained along with the transmitter array to the receiver array. The element-beamformer in this work adopts the compact neutralized bi-directional architecture featuring a minimized manufacturing cost. The proposed beamformer achieves 22% per path TX-mode efficiency and a 4.9-dB RX-mode noise figure. The required on-chip area for the beamformer is only 0.48 mm$^2$. In over-the-air measurement, a 64-element dual-polarized phased-array module achieves 52.2-dBm saturated effective isotropic radiated power (EIRP). The 5G standard-compliant OFDMA-mode modulated signals of up to 256-QAM could be supported by the 64-element modules. With the help of the cross-pol. leakage cancellation technique, the proposed array module realizes improved DP-MIMO EVMs even under severe polarization coupling and rotation conditions. The measured DP-MIMO EVMs are 3.4% in both 64-QAM and 256-QAM. The consumed power per beamformer path is 186 mW in the TX mode and 88 mW in the RX mode.

Journal ArticleDOI
TL;DR: Extensive characterization results showcase state-of-the-art performance of the TRXs, while the code-domain multiple-input and multiple-output (MIMO) radars built with them demonstrate vital-sign and gesture detections.
Abstract: This article presents frequency-modulated-continuous-wave (FMCW) radars developed for the detection of vital signs and gestures using two generations of 145-GHz transceivers (TRXs) integrated in 28-nm bulk CMOS. The performance and limitations of high-frequency radars are quantified with a system-level study, and the design and performance of individual circuit blocks are presented in detail. A 145-GHz center frequency and radar operation over an RF bandwidth of 10 GHz yield a displacement responsivity of $2\pi$ rad/mm and a windowed range resolution of 30 mm, respectively. Radar operation over a 0.1–7 m range is enabled by an effective-isotropic radiated power of 11.5 dBm and a noise figure of 8 dB. The ICs feature frequency multiplication by 9 in the transmit and receive paths, sub-arrayed dipole antennas, and neutralization of TX–RX leakage via delay control. A single TRX dissipates 500 mW from a 0.9-/1.8-V drive. The use of fast chirps (5–30 $\mu\text{s}$) mitigates the effect of $1/f$ noise at the intermediate frequency (IF). Extensive characterization results showcase state-of-the-art performance of the TRXs, while the code-domain multiple-input and multiple-output (MIMO) radars ($1 \times 4$ and $4 \times 4$) built with them demonstrate vital-sign and gesture detections.
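
The two system-level numbers quoted in the abstract above follow from standard FMCW relations. The sketch below uses the textbook expressions, range resolution c / (2B) and round-trip phase responsivity 4πf/c, which are well-known radar formulas rather than anything taken from the paper. The function names are illustrative.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def range_resolution_m(bandwidth_hz):
    """Unwindowed FMCW range resolution, c / (2 * B), in meters."""
    return SPEED_OF_LIGHT_M_S / (2.0 * bandwidth_hz)


def phase_responsivity_rad_per_m(center_freq_hz):
    """Round-trip phase change per unit target displacement, 4 * pi * f / c."""
    return 4.0 * math.pi * center_freq_hz / SPEED_OF_LIGHT_M_S


resolution_mm = range_resolution_m(10e9) * 1e3
responsivity_rad_per_mm = phase_responsivity_rad_per_m(145e9) / 1e3

print(f"{resolution_mm:.1f} mm")                 # 15.0 mm
print(f"{responsivity_rad_per_mm:.2f} rad/mm")   # 6.08 rad/mm
```

The responsivity of about 6.08 rad/mm is close to the quoted $2\pi$ rad/mm (about 6.28). The unwindowed resolution of about 15 mm is half the quoted windowed figure of 30 mm, which would be consistent with a window that roughly doubles the main-lobe width; the abstract does not state which window was used.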

Journal ArticleDOI
TL;DR: In this paper, a compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference.
Abstract: A compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference. It leverages a novel charge-domain multiply-and-accumulate (MAC) mechanism and circuitry to achieve superior linearity under process variations compared to conventional IMC designs. The adopted semi-parallel architecture efficiently stores filters from multiple CNN layers by sharing eight standard 6T SRAM cells with one charge-domain MAC circuit. Moreover, up to six levels of bit-width of weights with two encoding schemes and eight levels of input activations are supported. A 7-bit charge-injection SAR (ciSAR) analog-to-digital converter (ADC) getting rid of sample and hold (S&H) and input/reference buffers further improves the overall energy efficiency and throughput. A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM. A single $512\times 128$ macro stores a complete pruned and quantized CNN model to achieve 98.8% inference accuracy on the MNIST data set and 89.0% on the CIFAR-10 data set, with a 573.4-giga operations per second (GOPS) peak throughput and a 49.4-tera operations per second (TOPS)/W energy efficiency.

Journal ArticleDOI
TL;DR: In this article, the authors present $G$-band monostatic and bistatic radar transceivers incorporating on-chip antennas for short-range high-precision applications.
Abstract: This article presents $G$-band monostatic and bistatic radar transceivers (TRX) incorporating on-chip antennas for short-range high-precision applications. The circuits were fabricated using a silicon–germanium (SiGe) BiCMOS technology offering heterojunction bipolar transistors (HBTs) with $f_{\mathrm{T}}/f_{\mathrm{MAX}}$ of 300/500 GHz. The monostatic TRX implements a tunable leakage canceller (LC) for enhanced transmitter (TX)-to-receiver (RX) leakage compensation and hence improved detectability of weakly reflecting near targets. A standalone monostatic TRX characterized at on-wafer level achieves 4-dBm maximum output power ($P_{\mathrm{TX}}$) and 19-dB peak conversion gain ($G_{\mathrm{RX}}$) with 3-dB bandwidths of 18 and 17 GHz for the TX and the RX, respectively. The bistatic version reaches $P_{\mathrm{TX}}$ of 13 dBm and $G_{\mathrm{RX}}$ of 24 dB, expanding the 3-dB bandwidths to 32 and 34 GHz for the TX and RX, respectively. A double-folded dipole antenna providing 5-dBi gain at 170 GHz was implemented using localized backside etching (LBE) and integrated with the transceivers. A frequency-modulated continuous-wave (FMCW) radar demonstrator incorporating an external phase-locked loop (PLL) was built to evaluate both TRXs and the tunable leakage cancellation feature available in the monostatic variant. The maximum equivalent isotropic radiated power (EIRP), including on-chip antennas, is 8 and 18 dBm for the monostatic and bistatic TRX, respectively. The radars support sweep bandwidth up to 20 GHz, reaching 2.1 cm spatial resolution. For a target at 1 m distance, the measured ranging precision is $105~\mu\text{m}$ and $13~\mu\text{m}$ for the monostatic and bistatic TRX, respectively. Activation of leakage cancellation effectively suppresses close-in noise and extends the minimum detectable range remarkably.

Journal ArticleDOI
TL;DR: In this paper, a cryogenic broadband low noise amplifier (LNA) for quantum applications based on a standard 40-nm CMOS technology is reported; its specifications are derived from the readout of semiconductor quantum bits at 4.2 K, whose quantum information signals are characterized as phase-modulated signals.
Abstract: A cryogenic broadband low noise amplifier (LNA) for quantum applications based on a standard 40-nm CMOS technology is reported. The LNA specifications are derived from the readout of semiconductor quantum bits at 4.2 K, whose quantum information signals are characterized as phase-modulated signals. To achieve broadband input matching impedance and low noise figure, the gate-to-drain capacitance of the input transistor is exploited. The goal is to involve a resistive and capacitive load into the input impedance match of a common-source stage with source inductive degeneration. The capacitive load is created by an LC parallel tank whose resonant frequency is lower than the operating frequency. The achieved non-constant in-band equivalent capacitance is proven to be beneficial to input impedance matching. The resistive part of the load is provided by the transconductance of the cascode stage implicitly. An inductor is added to the gate of the cascode transistor to suppress its noise, and a transformer-based resonator with two resonant frequencies serves as the load of the first stage, thus extending the operating bandwidth. Design considerations for the cryogenic temperature operation of the LNA are proposed and analyzed. The LNA achieves a measured gain ($S_{21}$) of 35 ± 0.5 dB, return loss > 12 dB, and NF of 0.75–1.3 dB across the band (4.1–7.9 GHz), with 51.1-mW power consumption at room temperature, while it shows a measured gain of 42 ± 3.3 dB, and NF of 0.23–0.65 dB with 39-mW power consumption at 4.2 K between 4.6 and 8 GHz. To the best of our knowledge, this is the first report of a cryogenic LNA based on a bulk CMOS process working above 4 GHz showing sub-1-dB NF both at room and cryogenic temperatures.

Journal ArticleDOI
Donghyeon Han, Dongseok Im, Gwangtae Park, Youngwoo Kim, Seokchan Song, Juhyoung Lee, Hoi-Jun Yoo
TL;DR: The HNPU supports stochastic dynamic fixed-point representation and layer-wise adaptive precision searching unit for low-bit-precision training and utilizes slice-level reconfigurability and sparsity to maximize its efficiency both in DNN inference and training.
Abstract: This article presents HNPU, which is an energy-efficient deep neural network (DNN) training processor by adopting algorithm-hardware co-design. The HNPU supports stochastic dynamic fixed-point representation and layer-wise adaptive precision searching unit for low-bit-precision training. It additionally utilizes slice-level reconfigurability and sparsity to maximize its efficiency both in DNN inference and training. Adaptive bandwidth reconfigurable accumulation network enables reconfigurable DNN allocation and maintains its high core utilization even in various bit-precision conditions. Fabricated in a 28-nm process, the HNPU accomplished at least $5.9\times $ higher energy efficiency and $2.5\times $ higher area efficiency in actual DNN training compared with the previous state-of-the-art on-chip learning processors.

Journal ArticleDOI
TL;DR: The introduced hybrid SRAM PUF is compatible with hot carrier injection (HCI) burn-in stabilization, which can reinforce PUF stability to ~100% without the requirements of bitcell redundancy, visible oxide damages, additional fabrication processes, helper data storage, or error- correcting code (ECC) circuits.
Abstract: This article introduces an SRAM-based physically unclonable function (PUF) that employs hybrid-mode operations in the enhancement–enhancement (EE) SRAM mode and CMOS SRAM mode to achieve both high native stability and low power. A data latching scheme based on the hybrid structure enables operations under low supply voltage ($V_{\text{DD}}$). Furthermore, the proposed hybrid SRAM PUF is compatible with hot carrier injection (HCI) burn-in stabilization, which can reinforce PUF stability to ~100% without the requirements of bitcell redundancy, visible oxide damages, additional fabrication processes, helper data storage, or error-correcting code (ECC) circuits. The proposed PUF is fabricated in 130-nm standard CMOS, and the experimental results show that it achieves 0.29% native bit error rate (BER) at the nominal condition of 0.6 V/25 °C. The operating $V_{\text{DD}}$ scales down to 0.5 V, with a core energy efficiency of 2.07 fJ/b. After HCI burn-in, no bit errors are found across all $V_{\text{DD}}$/temperature (VT) corners from 0.5 to 0.7 V and from −40 °C to 120 °C (5120 bits $\times$ 500 evaluations tested at each condition). Long-term reliability is verified by using an accelerated aging test equivalent to approximately 21 years of operation, where the reinforced PUF shows no bit errors even at the worst VT corner of 0.5 V/120 °C during the test. The introduced hybrid SRAM PUF also passes all applicable NIST SP 800-22 randomness tests. It has a compact bitcell with an area of 497 $F^2$.

Journal ArticleDOI
TL;DR: In this article, a broadband power amplifier (PA) with a distributed-balun output network that provides the PA optimum load impedance over a wide bandwidth is presented.
Abstract: This article presents a broadband power amplifier (PA) with a distributed-balun output network that provides the PA optimum load impedance over a wide bandwidth. The proposed output network comprises two coupled-line sections and absorbs the device output capacitance. It employs a scalable coupled-line modeling approach that captures both the magnetic (inductive) and electric (capacitive) couplings between windings with fewer parameters and supports a rapid design process. Closed-form design solutions, design space limitations, bandwidth limits, and design tradeoffs are derived and analyzed comprehensively. Its extension to differential output and common-mode response is also discussed in detail. As a proof of concept, a prototype PA is implemented for multiband fifth-generation (5G) applications in 45-nm SOI CMOS. With no biasing retuning or network reconfiguration, the PA consistently achieves >19.1 dBm $P_{\mathrm{sat}}$, >37.3% peak power-added efficiency (PAE), 17.8–19.6 dBm $P_{\mathrm{1dB}}$, and 36.6%–44.3% $\mathrm{PAE}_{P\mathrm{1dB}}$ over 24–40 GHz, verifying the truly wideband large-signal matching. The PA demonstrates 5G new radio (NR) frequency range 2 (FR2) modulation signals over 24–42 GHz, covering n257/n258/n260 5G bands. For 5G NR FR2 800-MHz 2-CC 64-QAM signals (11.78-dB PAPR), the PA achieves 11.3-dBm/16.6% average $P_{\mathrm{out}}$/PAE with −25.1-dB rms EVM at 28 GHz and 10.2-dBm/13.6% average $P_{\mathrm{out}}$/PAE with −25.1-dB rms EVM at 37 GHz.

Journal ArticleDOI
TL;DR: This work embraces lower level metal routing of the CDSA embedding the crypto-IP so that the signature becomes highly suppressed before it passes through the higher metal layers (which radiate significantly) to connect to the external pin.
Abstract: Mathematically secure cryptographic algorithms, when implemented on a physical substrate, leak critical “side-channel” information, leading to power and electromagnetic (EM) analysis attacks. Circuit-level protections involve switched capacitor, buck converter, or series low-dropout (LDO) regulator-based implementations, each of which suffers from significant power, area, or performance tradeoffs and has only achieved a minimum traces to disclosure (MTD) of $10M$ to date. Utilizing an in-depth white-box model, this work, for the first time, focuses on signature suppression in the current domain, which provides an $Attenuation^{2}$ enhancement in MTD, leading to orders of magnitude improvement in both power and EM side-channel analysis (SCA) immunities. Using a combination of current-domain “signature attenuation” (CDSA) along with local lower level metal routing, the critical correlated information in the crypto current is significantly suppressed before it reaches the supply pin. In particular, to prevent the EM leakage from its source (metal layers carrying the correlated crypto current acting as antennas), this work embraces lower level metal routing of the CDSA embedding the crypto-IP so that the signature becomes highly suppressed before it passes through the higher metal layers (which radiate significantly) to connect to the external pin. The 65-nm CMOS test chip contains both protected and unprotected parallel AES-256 implementations, running at a clock frequency of 50 MHz. Test vector leakage assessment (TVLA) on the protected CDSA-AES, demonstrated with on-chip measurements for the first time, shows that the higher level metal layers leak significantly more compared with the lower level metal routing. Correlational power and EM analysis (CPA/CEMA) attacks on the unprotected implementation were able to extract the secret key within $8k$ and $12k$ traces, respectively, while the protected CDSA-AES could not be broken even after $1B$ encryptions for both power and EM SCA, evaluated both in the time and frequency domains, showing an improvement of $100\times $ over the prior state-of-the-art countermeasures with comparable power and area overheads.
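The abstract states that MTD improves with the square of the signature attenuation. As a rough illustration only, the sketch below applies that relation to the trace counts quoted above to estimate the smallest attenuation factor consistent with them. It assumes the square law holds exactly and treats the $1B$ figure as a lower bound on the protected MTD; the function names and the resulting factors are not reported by the authors.

```python
import math


def mtd_gain(attenuation: float) -> float:
    """MTD improvement factor under the Attenuation^2 relation."""
    return attenuation ** 2


def min_attenuation(mtd_unprotected: float, mtd_protected_lower_bound: float) -> float:
    """Smallest attenuation consistent with the two MTD figures (assumed square law)."""
    return math.sqrt(mtd_protected_lower_bound / mtd_unprotected)


# Trace counts from the abstract: 8k (CPA) and 12k (CEMA) unprotected, >1B protected.
cpa_attenuation = min_attenuation(8e3, 1e9)
cema_attenuation = min_attenuation(12e3, 1e9)

print(f"implied minimum attenuation, power: {cpa_attenuation:.1f}x")
print(f"implied minimum attenuation, EM:    {cema_attenuation:.1f}x")
```

Because the protected design was never broken, these values are only floors: the true attenuation could be larger.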

Journal ArticleDOI
TL;DR: In order to extract maximum energy from a thermoelectric generator at small temperature gradients, a loss-aware maximum power point tracking (MPPT) scheme was developed, which enables the harvester to achieve high end-to-end efficiency at low input voltages.
Abstract: A single-inductor self-starting boost converter is presented, which is suitable for thermoelectric energy harvesting from human body heat. In order to extract maximum energy from a thermoelectric generator (TEG) at small temperature gradients, a loss-aware maximum power point tracking (MPPT) scheme was developed, which enables the harvester to achieve high end-to-end efficiency at low input voltages. The boost converter is implemented in a 0.18- $\mu \text{m}$ CMOS technology and is more than 75% efficient for a matched input voltage range of 15–100 mV, with a peak efficiency of 82%. Enhanced power extraction enables the converter to sustain operation at an input voltage as low as 3.5 mV. In addition, the boost converter self-starts with a minimum TEG voltage of 50 mV leveraging a dual-path architecture without using additional off-chip components.
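For context on the "matched input voltage" mentioned above, a TEG is commonly modeled as an open-circuit voltage in series with an internal resistance, in which case the extracted power peaks when the converter input sits at half the open-circuit voltage. The sketch below sweeps the input voltage to show this. The source values are hypothetical, and the article's loss-aware MPPT scheme, which also accounts for converter losses, is not reproduced here.

```python
def teg_input_power(v_oc: float, r_teg: float, v_in: float) -> float:
    """Power delivered by a TEG modeled as v_oc in series with r_teg."""
    current = (v_oc - v_in) / r_teg
    return v_in * current


V_OC = 0.1    # hypothetical open-circuit voltage in volts
R_TEG = 5.0   # hypothetical internal resistance in ohms

# Sweep the converter input voltage from 0 to V_OC in 1% steps.
candidates = [V_OC * k / 100 for k in range(101)]
best_v_in = max(candidates, key=lambda v: teg_input_power(V_OC, R_TEG, v))
best_power = teg_input_power(V_OC, R_TEG, best_v_in)

print(f"best input voltage: {best_v_in * 1e3:.1f} mV")
print(f"extracted power:    {best_power * 1e6:.1f} uW")
```

With these assumed values the optimum lands at 50 mV, that is, half of the 100-mV open-circuit voltage, matching the closed-form result $V_{\text{oc}}^{2}/(4R)$.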

Journal ArticleDOI
TL;DR: An accurate current-mode bandgap reference circuit design with a novel shared offset compensation scheme for its internal amplifiers, which conserves die area and power by avoiding a dedicated active auxiliary offset-cancellation circuit for each amplifier.
Abstract: This article introduces an accurate current-mode bandgap reference circuit design with a novel shared offset compensation scheme for its internal amplifiers. This bandgap circuit has been designed to operate over a very wide temperature range from −40 °C to 150 °C. Its output voltage is 1.16 V with a 3.3-V supply voltage. A multi-section curvature compensation method alleviates the error from the bipolar junction transistor’s base–emitter nonlinear voltage dependence on temperature. The bandgap reference circuit contains two operational amplifiers that are utilized to generate proportional-to-absolute-temperature (PTAT) and complementary-to-absolute-temperature (CTAT) current sources. With the implementation of the described shared offset-cancellation methodology, the simulated output inaccuracy introduced by the amplifier is kept to a 5 $\sigma $ offset within ±4.6 $\mu \text{V}$ while conserving die area and power by avoiding a dedicated active auxiliary offset-cancellation circuit for each amplifier. Designed and fabricated in a 130-nm CMOS process technology, the bandgap reference has a measured output voltage shift of less than 1 mV over a −40 °C to 150 °C temperature range and an overall variation of ±8.2 mV across seven measured samples without trimming.
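The measured drift can be restated as a temperature coefficient using the common box method (total output shift divided by nominal voltage and temperature span). This is a back-of-the-envelope sketch from the abstract's figures, not a value reported by the authors; since the shift is given as "less than 1 mV", the result is an upper bound.

```python
def box_tc_ppm_per_c(delta_v: float, v_nominal: float, t_min: float, t_max: float) -> float:
    """Box-method temperature coefficient in ppm/°C."""
    return delta_v / (v_nominal * (t_max - t_min)) * 1e6


# Abstract figures: <1 mV shift, 1.16 V output, -40 °C to 150 °C.
tc_upper_bound = box_tc_ppm_per_c(1e-3, 1.16, -40.0, 150.0)

print(f"temperature coefficient < {tc_upper_bound:.2f} ppm/°C")
```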

Journal ArticleDOI
TL;DR: This article presents a fully integrated high-power broadband linear Doherty PA with multi-primary distributed-active-transformer (DAT) power combining.
Abstract: Silicon-based millimeter-wave (mm-Wave) power amplifiers (PAs) with high power and high peak/back-off efficiency are highly desired to efficiently amplify multi-Gb/s 5G NR signals. This article presents a fully integrated high-power broadband linear Doherty PA with multi-primary distributed-active-transformer (DAT) power combining. We introduce a transformer-based impedance inverter for active load modulation and a multi-primary DAT structure for hybrid series and parallel power combining. Based on this, we propose a transformer-based Doherty combiner with more design freedom and a multi-primary DAT-based Doherty PA for simultaneous active load modulation and low-loss power combining. The EM simulation results demonstrate that the proposed DAT-based Doherty output network achieves very symmetric and balanced load impedances among all the main and auxiliary PA ports. As a proof of concept, a 24–30-GHz prototype PA is implemented in a 0.13- $\mu \text{m}$ SiGe BiCMOS process. The PA achieves 30.4% PAEmax, 28.3-dBm $P_{\mathrm {sat}}$ , 30.2% PAE at 26.8-dBm $P_{\mathrm {1\,dB}}$ , and 21.2% PAE at 6-dB back-off from $P_{\mathrm {sat}}$ at 28 GHz. Modulation measurement with single-carrier 64-QAM signals and 5G NR FR2 orthogonal frequency-division multiplexing (OFDM) signals has been demonstrated. For a 200-MHz 1-CC 5G NR FR2 64-QAM signal, the PA achieves 18.1-dBm Pavg and 13.8% PAEavg with −25.1-dB rms EVM at 28 GHz.

Journal ArticleDOI
TL;DR: The C2IS prototype sensor is used as a real-time edge feature detection front-end camera and paired with a simplified convolutional neural network (CNN) architecture to demonstrate hand gesture recognition.
Abstract: With the growing demand for artificial intelligence (AI) Internet-of-Things (IoT) devices, smart vision sensors with energy-efficient computing capability are required. This article presents a low-power and low-voltage dual mode 0.5-V computational CMOS image sensor (C2IS) with array-parallel computing capability for feature extraction using convolution. In the feature extraction mode, by applying the pulsewidth modulation (PWM) pixel and switch-current integration (SCI) circuit, the in-sensor eight-directional matrix-parallel multiply–accumulate (MAC) operation is realized. Furthermore, the analog-domain convolution-on-readout (COR) operation, the programmable $3\times3$ kernel with ±3-bit weights, and the tunable-resolution column-parallel analog-to-digital converter (ADC) (1–8 bit) are implemented to achieve the real-time feature extraction without using additional memory and sacrificing frame rate. In the image capturing mode, the sensor provides the linear-response 8-bit raw image data. The C2IS prototype has been fabricated in the TSMC 0.18- $\mu \text{m}$ standard process technology and verified to demonstrate the raw and feature images at 480 frames/s with a power consumption of 77/ $117~\mu \text{W}$ and the resultant FoM of 9.8/14.8 pJ/pixel/frame, respectively. The prototype sensor is used as a real-time edge feature detection front-end camera and paired with a simplified convolutional neural network (CNN) architecture to demonstrate hand gesture recognition. The prototype system achieves more than 95% validation accuracy.
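The quoted power, frame rate, and FoM can be cross-checked against one another. The sketch below assumes the FoM is defined as power divided by (pixel count × frame rate), which is the usual reading of pJ/pixel/frame, and solves for the implied pixel count. The abstract does not state the array resolution, so the result is an inference, not a reported specification.

```python
def implied_pixel_count(power_w: float, frames_per_s: float, fom_j: float) -> float:
    """Pixel count implied by FoM = power / (pixels * frame rate)."""
    return power_w / (frames_per_s * fom_j)


# Abstract figures: 77 uW / 117 uW at 480 frames/s, FoM 9.8 / 14.8 pJ/pixel/frame.
raw_mode_pixels = implied_pixel_count(77e-6, 480.0, 9.8e-12)
feature_mode_pixels = implied_pixel_count(117e-6, 480.0, 14.8e-12)

print(f"raw mode:     about {raw_mode_pixels:.0f} pixels")
print(f"feature mode: about {feature_mode_pixels:.0f} pixels")
```

Both modes point to roughly 16 thousand pixels, which suggests the two FoM figures were computed over the same array. The small gap between the two estimates is consistent with rounding in the published values.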

Journal ArticleDOI
Hyun Jin Kim, Junyoung Maeng, Inho Park, Jeon Jinwoo, Dongju Lim, Chulwoo Kim
TL;DR: This article presents a multi-input single-inductor multi-output energy-harvesting interface that extracts power from three independent sources and regulates three output voltages and achieves a peak end-to-end efficiency of 90.2% and a maximum output power of 24 mW, indicating improvements of approximately 7.52% and 1.85 times, respectively, compared with those of conventional buck–boost converters.
Abstract: This article presents a multi-input single-inductor multi-output energy-harvesting interface that extracts power from three independent sources and regulates three output voltages. The converter employs the proposed double-conversion rejection technique to reduce the double-converted power by up to 81.8% under the light-load condition and operates in various power conversion modes, including the proposed buck-based dual-conversion mode, to improve the power conversion efficiency and maximum load power. The proposed adaptive peak inductor current controller determines the inductor charging period, and the proposed digitally controlled zero-current detector detects the optimum zero-current point according to the operating mode. The proposed converter achieves a peak end-to-end efficiency of 90.2% and a maximum output power of 24 mW, indicating the improvements of approximately 7.52% and 1.85 times, respectively, compared with those of conventional buck–boost converters.

Journal ArticleDOI
TL;DR: A sub- $\mu \text{W}$ always-ON keyword spotting ( $\mu $ KWS) chip for audio wake-up systems is proposed, which is mainly composed of a neural network and a feature extraction circuit.
Abstract: We propose a sub- $\mu \text{W}$ always-ON keyword spotting ( $\mu $ KWS) chip for audio wake-up systems. It is mainly composed of a neural network (NN) and a feature extraction (FE) circuit. To significantly reduce the memory footprint and computational load, four techniques are used to achieve ultra-low-power consumption: 1) a serial-FFT-based Mel-frequency cepstrum coefficient circuit is designed for FE, instead of the common parallel FFT; 2) a small-sized binarized depthwise separable convolutional NN (DSCNN) is designed as the classifier; 3) a framewise incremental computation technique is devised in contrast to the conventional whole-word processing; and 4) reduced computation allows a low system clock frequency, which enables near-threshold voltage operation, and low leakage memory blocks are designed to minimize the leakage power. Implemented in 28-nm CMOS technology, this $\mu $ KWS consumes $0.51~\mu \text{W}$ at a 40-kHz frequency and a 0.41-V supply, with an area of 0.23 mm$^{2}$. Using the Google speech command data set, 97.3% accuracy is reached for a one-word KWS task and 94.6% for a two-word task.
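One reason a depthwise separable CNN suits a tight compute budget is its lower multiply-accumulate count compared with a standard convolution of the same shape. The sketch below compares the two for a single layer. The layer dimensions are hypothetical and are not taken from the chip's network; only the general MAC-count formulas are standard.

```python
def standard_conv_macs(h: int, w: int, k: int, c_in: int, c_out: int) -> int:
    """MACs for a standard k x k convolution producing an h x w output map."""
    return h * w * k * k * c_in * c_out


def depthwise_separable_macs(h: int, w: int, k: int, c_in: int, c_out: int) -> int:
    """MACs for a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = h * w * k * k * c_in
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 10 x 10 output map, 3 x 3 kernel, 64 input and output channels.
standard = standard_conv_macs(10, 10, 3, 64, 64)
separable = depthwise_separable_macs(10, 10, 3, 64, 64)
reduction = standard / separable

print(f"standard:  {standard} MACs")
print(f"separable: {separable} MACs")
print(f"reduction: {reduction:.2f}x")
```

For this assumed layer the reduction is close to 8×; binarizing the weights and activations, as the abstract describes, lowers the cost of each remaining operation further.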

Journal ArticleDOI
TL;DR: In this article, a CMOS quantum vector-field magnetometer using nitrogen-vacancy (NV) centers in diamond was presented, which achieved high sensitivity and long-term stability without the need for recalibration.
Abstract: Magnetometers based on quantum mechanical processes enable high sensitivity and long-term stability without the need for re-calibration, but their integration into fieldable devices remains challenging. This article presents a CMOS quantum vector-field magnetometer that miniaturizes the conventional quantum sensing platforms using nitrogen-vacancy (NV) centers in diamond. By integrating key components for spin control and readout, the chip performs magnetometry via optically detected magnetic resonance (ODMR) through a diamond slab attached to a custom CMOS chip. The ODMR control is highly uniform across the NV centers in the diamond, which is enabled by a CMOS-generated ~2.87 GHz magnetic field with $\times $ 80 $\mu \text{m}^{2}$ diamond slab. NV fluorescence is measured by CMOS-integrated photodetectors. This ON-chip measurement is enabled by efficient rejection of the green pump light from the red fluorescence through a CMOS-integrated spectral filter based on a combination of spectrally dependent plasmonic losses and diffractive filtering in the CMOS back-end-of-line (BEOL). This filter achieves a measured ~25 dB of green light rejection. We measure a sensitivity of 245 nT/Hz$^{1/2}$, marking a 130 $\times $ improvement over a previous CMOS-NV sensor prototype, largely thanks to the better spectral filtering and homogeneous microwave generation over a larger area.

Journal ArticleDOI
TL;DR: SNAP as discussed by the authors uses parallel associative search to discover valid weight (W) and input activation (IA) pairs from compressed, unstructured, sparse W and IA data arrays, which allows SNAP to maintain a 75% average compute utilization.
Abstract: Recent developments in deep neural network (DNN) pruning introduce data sparsity to enable deep learning applications to run more efficiently on resource- and energy-constrained hardware platforms. However, these sparse models require specialized hardware structures to exploit the sparsity for storage, latency, and efficiency improvements to the full extent. In this work, we present the sparse neural acceleration processor (SNAP) to exploit unstructured sparsity in DNNs. SNAP uses parallel associative search to discover valid weight (W) and input activation (IA) pairs from compressed, unstructured, sparse W and IA data arrays. The associative search allows SNAP to maintain a 75% average compute utilization. SNAP follows a channel-first dataflow and uses a two-level partial sum (psum) reduction dataflow to eliminate access contention at the output buffer and cut the psum writeback traffic by 22 $\times $ compared with state-of-the-art DNN accelerator designs. SNAP’s psum reduction dataflow can be configured in two modes to support general convolution (CONV) layers, pointwise CONV, and fully connected layers. A prototype SNAP chip is implemented in a 16-nm CMOS technology. The 2.3-mm$^{2}$ test chip is measured to achieve a peak effectual efficiency of 21.55 TOPS/W (16 b) at 0.55 V and 260 MHz for CONV layers with 10% weight and activation densities. Operating on a pruned ResNet-50 network, the test chip achieves a peak throughput of 90.98 frames/s at 0.80 V and 480 MHz, dissipating 348 mW.
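The idea of pairing nonzero weights with nonzero input activations can be illustrated in software as an index match between two compressed arrays. The sketch below is only a functional analogy: it uses a dictionary lookup where the chip uses parallel associative search in hardware, and the data layout (separate index and value lists per operand) and function names are assumptions, not SNAP's actual format.

```python
def match_pairs(w_idx, w_val, ia_idx, ia_val):
    """Return (weight, activation) pairs whose channel indices coincide.

    Each operand is stored compressed: only nonzero values are kept,
    alongside the channel index each value came from.
    """
    ia_lookup = dict(zip(ia_idx, ia_val))
    return [(w, ia_lookup[i]) for i, w in zip(w_idx, w_val) if i in ia_lookup]


def sparse_dot(w_idx, w_val, ia_idx, ia_val):
    """Dot product that multiplies only the matched nonzero pairs."""
    return sum(w * a for w, a in match_pairs(w_idx, w_val, ia_idx, ia_val))


# Hypothetical compressed operands: nonzero weights at channels 0, 3, 5, 7
# and nonzero activations at channels 1, 3, 7, 8.
pairs = match_pairs([0, 3, 5, 7], [2, 1, 4, 3], [1, 3, 7, 8], [5, 6, 2, 1])
result = sparse_dot([0, 3, 5, 7], [2, 1, 4, 3], [1, 3, 7, 8], [5, 6, 2, 1])

print(pairs)
print(result)
```

Only channels 3 and 7 are nonzero in both operands, so two multiplications replace the nine a dense dot product over channels 0 to 8 would perform.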