scispace - formally typeset

Showing papers in "IEEE Journal of Solid-state Circuits in 2019"


Journal ArticleDOI
Jinmook Lee1, Changhyeon Kim1, Sanghoon Kang1, Dongjoo Shin1, Sangyeob Kim1, Hoi-Jun Yoo1 
TL;DR: An energy-efficient deep neural network (DNN) accelerator, unified neural processing unit (UNPU), is proposed for mobile deep learning applications and is the first DNN accelerator ASIC that can support fully variable weight bit precision from 1 to 16 bit.
Abstract: An energy-efficient deep neural network (DNN) accelerator, unified neural processing unit (UNPU), is proposed for mobile deep learning applications. The UNPU supports both convolutional layers (CLs) and recurrent or fully connected layers (FCLs), covering versatile workload combinations across mobile deep learning applications. In addition, the UNPU is the first DNN accelerator ASIC that can support fully variable weight bit precision from 1 to 16 bit, which enables it to operate at the accuracy-energy optimal point. Moreover, the lookup table (LUT)-based bit-serial processing element (LBPE) in the UNPU reduces energy consumption relative to the conventional fixed-point multiply-and-accumulate (MAC) array by 23.1%, 27.2%, 41%, and 53.6% for the 16-, 8-, 4-, and 1-bit weight precision, respectively. Besides the energy efficiency improvement, the unified DNN core architecture of the UNPU improves the peak performance for CL by 1.15 $\times$ compared to the previous work. This lets the UNPU operate at a lower voltage and frequency for a given DNN, increasing energy efficiency. The UNPU is implemented in 65-nm CMOS technology and occupies a $4 \times 4$ mm2 die area. It operates from a 0.63- to 1.1-V supply voltage with a maximum frequency of 200 MHz. The UNPU has a peak performance of 345.6 GOPS for 16-bit weight precision and 7372 GOPS for 1-bit weight precision. Its wide operating range yields a power efficiency of 3.08 TOPS/W for 16-bit weight precision and 50.6 TOPS/W for 1-bit weight precision. The functionality of the UNPU is successfully demonstrated on the verification system using ImageNet deep CNN (VGG-16).

225 citations


Journal ArticleDOI
TL;DR: An energy-efficient static random access memory (SRAM) with embedded dot-product computation capability, for binary-weight convolutional neural networks, using a 10T bit-cell-based SRAM array to store the 1-b filter weights.
Abstract: This paper presents an energy-efficient static random access memory (SRAM) with embedded dot-product computation capability, for binary-weight convolutional neural networks. A 10T bit-cell-based SRAM array is used to store the 1-b filter weights. The array implements dot-product as a weighted average of the bitline voltages, which are proportional to the digital input values. Local integrating analog-to-digital converters compute the digital convolution outputs, corresponding to each filter. We have successfully demonstrated functionality (>98% accuracy) with the 10 000 test images in the MNIST hand-written digit recognition data set, using 6-b inputs/outputs. Compared to conventional full-digital implementations using small bitwidths, we achieve similar or better energy efficiency, by reducing data transfer, due to the highly parallel in-memory analog computations.

220 citations


Journal ArticleDOI
TL;DR: This paper addresses data movement via an in-memory-computing accelerator that employs charged-domain mixed-signal operation for enhancing compute SNR and, thus, scalability in large-scale matrix-vector multiplications.
Abstract: Large-scale matrix-vector multiplications, which dominate in deep neural networks (DNNs), are limited by data movement in modern VLSI technologies. This paper addresses data movement via an in-memory-computing accelerator that employs charged-domain mixed-signal operation for enhancing compute SNR and, thus, scalability. The architecture supports analog/binary input activation (IA)/weight first layer (FL) and binary/binary IA/weight hidden layers (HLs), with batch normalization and input–output (IO) (buffering) circuitry to enable cascading, if desired, for realizing different DNN layers. The architecture is arranged as $8\times 8=64$ in-memory-computing neuron tiles, supporting up to 512, $3\times 3\times 512$ -input HL neurons and 64, $3\times 3\times 3$ -input FL neurons, configurable via tile-level clock gating. In-memory computing is achieved using an 8T bit cell with overlaying metal-oxide-metal (MOM) capacitor, yielding a structure having $1.8\times $ the area of a standard 6T bit cell. Implemented in 65-nm CMOS, the design achieves HLs/FL energy efficiency of 866/1.25 TOPS/W and throughput of 18876/43.2 GOPS (1498/3.43 GOPS/mm2), when implementing convolution layers; and 658/0.95 TOPS/W, 9438/10.47 GOPS (749/0.83 GOPS/mm2), when implementing convolution followed by batch normalization layers. Several large-scale neural networks are demonstrated, showing performance on standard benchmarks (MNIST, CIFAR-10, and SVHN) equivalent to ideal digital computing.

183 citations


Journal ArticleDOI
TL;DR: A flash LiDAR using direct time of flight (dTOF) and based on Ocelot is demonstrated, achieving depth imaging at short distances with a frame rate of 30 frames/s, employing an ultra-low power laser.
Abstract: A $252 \times 144$ single-photon avalanche diode (SPAD) pixel sensor, called Ocelot, is reported for light detection and ranging (LiDAR). The sensor, fabricated in the 180-nm CMOS technology, features 1728 12-bit time-to-digital converters (TDCs) with 48.8-ps resolution (LSB). Each 126 pixels in a half-column are connected to six TDCs through a collision detection bus, which enables effective sharing of resources, and consequently a fill factor of 28% with a pixel pitch of 28.5 $\mu \text{m}$ . The column-parallel TDCs, based on dual-clock architecture, exhibit a DNL of +0.48/−0.48 LSB and an INL of +0.89/−1.67 LSB; they are dynamically reallocated in a scalable daisy chain approach that enables a maximum of five photon detections per illumination cycle per half-column. The sensor can operate in time-correlated single-photon counting (TCSPC) and single-photon counting (SPC) modes, while peak detection (PD) and partial histogramming (PH) are included in the operation of the sensor. The PD and PH modes are enabled by the first implementation of integrated histogramming for a full array via 3.32-Mb SRAM-based PH readout (PHR) scheme providing a 14.9-to-1 compression. Telemetry measurements up to 50 m achieve an accuracy of 8.8 cm and worst-case precision of 1.4 mm ( $\sigma $ ). A flash LiDAR using direct time of flight (dTOF) and based on Ocelot is demonstrated, achieving depth imaging at short distances with a frame rate of 30 frames/s, employing an ultra-low power laser with an average power of 2 mW and peak power of 0.5 W.

149 citations
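The TDC count quoted for Ocelot follows from the sharing scheme described in its abstract. The sketch below reproduces that arithmetic; it assumes the 252-pixel dimension is the column length (so each column splits into two 126-pixel half-columns), and the function and constant names are illustrative, not taken from the paper.

```python
# Illustrative check of the Ocelot TDC count; all names are hypothetical.
COLUMN_LENGTH = 252          # pixels along one column (assumed orientation)
NUM_COLUMNS = 144            # number of columns (assumed orientation)
PIXELS_PER_HALF_COLUMN = 126
TDCS_PER_HALF_COLUMN = 6


def total_tdcs(column_length, num_columns, pixels_per_half, tdcs_per_half):
    """Count TDCs when every half-column shares a fixed group of TDCs."""
    halves_per_column = column_length // pixels_per_half
    return num_columns * halves_per_column * tdcs_per_half


print(total_tdcs(COLUMN_LENGTH, NUM_COLUMNS,
                 PIXELS_PER_HALF_COLUMN, TDCS_PER_HALF_COLUMN))  # 1728
```

Under that assumed orientation the result matches the 1728 TDCs stated in the abstract.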


Journal ArticleDOI
TL;DR: A single-chip CMOS transceiver capable of wireless data rates up to 80 Gb/s using part of frequencies covered by IEEE Std 802.15.3d is presented.
Abstract: A single-chip CMOS transceiver (TRX) capable of wireless data rates up to 80 Gb/s using part of frequencies (252–279 GHz) covered by IEEE Std 802.15.3d is presented. The TRX chip operates in either transmitter (TX) or receiver (RX) mode at frequencies comparable to ${f_{\mathrm {max}}}$ of the NMOSFET. The TX part adopts mixer-last architecture with four-way power combining using a ring circuit called a double-rat-race. The RX part adopts fundamental-mixer-first direct-conversion architecture. In the RX mode, the TX serves as an LO multiplier chain, which conventionally accounted for a significant part of the RX die area. The double-rat-race, improved over its original design, integrates the TX and RX and also rejects unwanted harmonics generated by the frequency-doubler-based upconversion mixer. Low-loss, low-characteristic-impedance transmission lines are used extensively to combat losses. The TRX was fabricated using a 40-nm CMOS process. The saturated output power of the TX is −1.6 dBm at 265.68 GHz. The mean single-sideband noise figure (SSB NF) of the RX is 22.9 dB. The TX mode and the RX mode consume dc power of 890 and 897 mW, respectively. A wireless data rate of 80 Gb/s between a pair of TRX chips is demonstrated with 16QAM over a distance of 3 cm.

144 citations


Journal ArticleDOI
TL;DR: The proposed transceiver is based on the local-oscillator (LO) phase-shifting architecture, and it achieves quasi-continuous phase tuning with less than 0.2-dB radio frequency (RF) gain variation and 0.3° phase error.
Abstract: This paper presents a 28-GHz CMOS four-element phased-array transceiver chip for the fifth-generation mobile network (5G) new radio (NR). The proposed transceiver is based on the local-oscillator (LO) phase-shifting architecture, and it achieves quasi-continuous phase tuning with less than 0.2-dB radio frequency (RF) gain variation and 0.3° phase error. This enables accurate beam control with a suppressed sidelobe level during beam steering. At 28 GHz, a single-element transmitter-mode output ${{\mathrm {P}}_{\mathrm {1\,dB}}}$ of 15.7 dBm and a receiver-mode noise figure (NF) of 4.1 dB are achieved. The eight-element transceiver modules developed in this work are capable of scanning the beam from −50° to +50° with less than −9-dB sidelobe level. A saturated equivalent isotropic radiated power (EIRP) of 39.8 dBm is achieved at 0° scan. In a 5-m over-the-air measurement, the proposed module demonstrates the first 512 quadrature amplitude modulation (QAM) constellation in the 28-GHz band. A data stream of 6.4 Gb/s in 256-QAM can be supported within a beam angle of ±50°. The achieved maximum data rate is 15 Gb/s in 64-QAM. The proposed transceiver chip consumes 1.2 W/chip in transmitter mode and 0.59 W/chip in receiver mode.

144 citations


Journal ArticleDOI
TL;DR: This paper presents a two-chip solution low-power scalable OPA with a nonuniform sparse aperture, providing radiation pattern adjustment and feed distribution feasibility in a CMOS compatible silicon photonics process.
Abstract: Integrated optical phased arrays (OPAs) capable of adaptive beamforming and beam steering enable a wide range of applications. For many of these applications, a large scale 2-D OPA with full phase control for each radiating element is essential to achieve a functional low-cost solution. However, the scalability of such OPAs has been hampered by the optical feed distribution difficulties in a planar photonics process, as well as the high power consumption associated with having a large number of phase control units. In this paper, we present a two-chip solution low-power scalable OPA with a nonuniform sparse aperture, providing radiation pattern adjustment and feed distribution feasibility in a CMOS compatible silicon photonics process. The demonstrated OPA with a 128-element aperture achieves the highest reported grating-lobe-free field-of-view (FOV)-to-beamwidth ratio of 16°/0.8°, which is equivalent to a 484-element uniform array. This translates to at least 400 resolvable spots, 30 times more than the state-of-the-art 2-D OPAs. Moreover, by utilizing compact phase shifters in a row–column power delivery grid, we reduce the number of required drivers from 144 to 37. A high-swing pulsewidth modulation (PWM) driving circuit featuring breakdown voltage multipliers and soft turn-on activation significantly reduces the power consumption of the system. The electronic driver chip and the integrated photonic chip are fabricated on a 65-nm CMOS process and a thick silicon-on-insulator (SOI) silicon photonics process, occupying 1.7 mm2 and 2.08 mm2 of active area, respectively.

144 citations
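The "at least 400 resolvable spots" figure in the OPA abstract above can be reproduced from the quoted field-of-view-to-beamwidth ratio. This is a rough sketch that assumes the same 16°/0.8° ratio applies along both steering axes and that the equivalent uniform array is square; the function name is hypothetical.

```python
import math


def resolvable_spots(fov_deg, beamwidth_deg):
    """Spots in a 2-D scan, assuming equal FOV and beamwidth on both axes."""
    spots_per_axis = fov_deg / beamwidth_deg
    return spots_per_axis ** 2


print(round(resolvable_spots(16.0, 0.8)))  # 400

# A 484-element uniform array corresponds to a 22 x 22 grid if it is square.
print(math.isqrt(484))  # 22
```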


Journal ArticleDOI
Gregory K. Chen1, Raghavan Kumar1, H. Ekin Sumbul1, Phil Knag1, Ram Krishnamurthy1 
TL;DR: A reconfigurable 4096-neuron, 1M-synapse chip in 10-nm FinFET CMOS is developed to accelerate inference and learning for many classes of spiking neural networks (SNNs) with less than 2% overhead for storing connections.
Abstract: A reconfigurable 4096-neuron, 1M-synapse chip in 10-nm FinFET CMOS is developed to accelerate inference and learning for many classes of spiking neural networks (SNNs). The SNN features digital circuits for leaky integrate and fire neuron models, on-chip spike-timing-dependent plasticity (STDP) learning, and high-fan-out multicast spike communication. Structured fine-grained weight sparsity reduces synapse memory by up to 16 $\times $ with less than 2% overhead for storing connections. Approximate computing co-optimizes the dropping flow control and benefits from algorithmic noise to process spatiotemporal spike patterns with up to 9.4 $\times $ lower energy. The SNN achieves a peak throughput of 25.2 GSOP/s at 0.9 V, peak energy efficiency of 3.8 pJ/SOP at 525 mV, and 2.3- $\mu \text{W}$ /neuron operation at 450 mV. On-chip unsupervised STDP trains a spiking restricted Boltzmann machine to de-noise Modified National Institute of Standards and Technology (MNIST) digits and to reconstruct natural scene images with RMSE of 0.036. Near-threshold operation, in conjunction with temporal and spatial sparsity, reduces energy by $17.4\times $ to 1.0- $\mu \text{J}$ /classification in a $236 \times 20$ feed-forward network that is trained to classify MNIST digits using supervised STDP. A binary-activation multilayer perceptron with 50% sparse weights is trained offline with error backpropagation to classify MNIST digits with 97.9% accuracy at 1.7- $\mu \text{J}$ /classification.

137 citations


Journal ArticleDOI
TL;DR: A new transformer-based on-chip Doherty power combiner is introduced that can reduce the impedance transformation ratio (ITR) in power back-off (PBO) and, thus, improve the bandwidth and power-combining efficiency.
Abstract: This paper presents the first 28-/37-/39-GHz linear Doherty power amplifier (PA) in silicon for broadband fifth-generation (5G) applications. We introduce a new transformer-based on-chip Doherty power combiner that can reduce the impedance transformation ratio (ITR) in power back-off (PBO) and, thus, improve the bandwidth and power-combining efficiency. We also devise a “driver-PA co-design” method that creates power-dependent uneven feeding in the Doherty PA and enhances the Doherty operation without any hardware overhead or bandwidth compromise. For the proof of concept, we implement a 28-/37-/39-GHz PA fully integrated in a standard 130-nm SiGe BiCMOS process, which occupies 1.8 mm$^{2}$. The PA achieves a 52% −3-dB small-signal $S_{21}$ bandwidth and a 40% −1-dB large-signal saturated output power ($P_{\mathrm{sat}}$) bandwidth. At 28/37/39 GHz, the PA achieves +16.8/+17.1/+17-dBm $P_{\mathrm{sat}}$, +15.2/+15.5/+15.4-dBm $P_{\mathrm{1\,dB}}$, and superior 1.72/1.92/1.62 times efficiency enhancement over class-B operation at 5.9-/6-/6.7-dB PBO. Moreover, the PA demonstrates multi-gigabit-per-second data rates with excellent efficiency and linearity for 64-quadrature amplitude modulation (64-QAM) in three millimeter-wave (mm-wave) 5G bands. This PA advances the state of the art for Doherty, wideband, and 5G silicon PAs in mm-wave bands. It supports drop-in upgrade for current PAs in existing mm-wave systems and opens doors to compact system solutions for future multiband 5G massive multiple-input multiple-output (MIMO) and phased-array platforms.

124 citations


Journal ArticleDOI
TL;DR: A 256 $\times $ 256 single-photon avalanche diode (SPAD) sensor integrated into a 3-D-stacked 90-nm 1P4M/40-nm 1P8M process is reported for flash light detection and ranging (LIDAR) or high-speed direct time-of-flight (ToF) 3-D imaging.
Abstract: A 256 $\times $ 256 single-photon avalanche diode (SPAD) sensor integrated into a 3-D-stacked 90-nm 1P4M/40-nm 1P8M process is reported for flash light detection and ranging (LIDAR) or high-speed direct time-of-flight (ToF) 3-D imaging. The sensor bottom tier is composed of a 64 $\times $ 64 matrix of 36.72- $\mu \text{m}$ pitch modular photon processing units which operate from shared $4\,\,\times $ 4 SPADs at 9.18- $\mu \text{m}$ pitch and 51% fill-factor. A 16 $\times $ 14 bit counter array integrates photon counts or events to compress data to 31.4 Mb/s at 30-frame/s readout over 8 I/O operating at 100 MHz. The pixel-parallel multi-event time-to-digital converter (TDC) approach employs a programmable internal or external clock for 0.56–560-ns time bin resolution. In conjunction with a per-pixel correlator, the power is reduced to less than 100 mW in practical daylight ranging scenarios. Examples of ranging and high-speed 3-D ToF applications are given.

123 citations
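The array geometry in the 256 × 256 SPAD sensor abstract above is self-consistent, as a short check shows. The constants are copied from the abstract; the variable names are illustrative only.

```python
# Geometry check for the stacked SPAD sensor; names are hypothetical.
UNITS_PER_SIDE = 64        # photon processing units along one side
SPADS_PER_UNIT_SIDE = 4    # each unit serves a 4 x 4 SPAD group
SPAD_PITCH_UM = 9.18       # SPAD pitch in micrometres

spads_per_side = UNITS_PER_SIDE * SPADS_PER_UNIT_SIDE
unit_pitch_um = SPADS_PER_UNIT_SIDE * SPAD_PITCH_UM

print(spads_per_side)           # 256
print(round(unit_pitch_um, 2))  # 36.72
```

Both values agree with the 256 × 256 array size and the 36.72-µm unit pitch stated in the abstract.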


Journal ArticleDOI
TL;DR: This paper describes the design and implementation of a scalable multi-element phased-array system, with built-in self-alignment and self-test, based on an RFIC transceiver chipset manufactured in the TowerJazz 0.18- $\mu \text{m}$ SiGe BiCMOS technology.
Abstract: This paper describes the design and implementation of a scalable $W$ -band phased-array system, with built-in self-alignment and self-test, based on an RFIC transceiver chipset manufactured in the TowerJazz 0.18- $\mu \text{m}$ SiGe BiCMOS technology with $f_{T}/f_{\text {MAX}}$ of 240/270 GHz. The RFIC integrates 24 phase-shifter elements (16TX/8RX or 8TX/16RX) as well as direct up- and down-converters, phase-locked loop with prime-ratio frequency multiplier, analog baseband, beam lookup memory, and diagnostic circuits for performance monitoring. Two organic printed circuit board (PCB) interposers with integrated antenna sub-arrays are designed and co-assembled with the RFIC chipsets to produce a scalable phased-array tile. Tiles are phase-aligned to one another through a daisy-chained local oscillator (LO) synchronization signal. Statistical analysis of the effects of LO misalignment between tiles on beam patterns is presented. Sixteen tiles are combined onto a carrier PCB to create a 384-element (256TX/128RX) phased-array system. A maximum saturated effective isotropic radiated power (EIRP) of 60 dBm (1 kW) is measured at boresight for the 256 transmit elements. Wireless links operating at 90.7 GHz using a 16-QAM constellation at a reduced EIRP of 52 dBm produced data rates beyond 10 Gb/s for an equivalent link distance in excess of 250 m.
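The abstract equates 60 dBm with 1 kW. That equivalence follows from the standard definition of dBm as decibels relative to 1 mW, which the sketch below applies to both EIRP figures; the helper name is hypothetical.

```python
def dbm_to_watts(power_dbm):
    """Convert a power level in dBm (dB relative to 1 mW) to watts."""
    return 10 ** ((power_dbm - 30) / 10)


print(dbm_to_watts(60))            # 1000.0
print(f"{dbm_to_watts(52):.1f}")   # 158.5
```

The reduced 52-dBm EIRP used for the wireless link therefore corresponds to roughly 158 W of equivalent radiated power.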

Journal ArticleDOI
TL;DR: This paper presents Navion, an energy-efficient accelerator for visual-inertial odometry (VIO) that enables autonomous navigation of miniaturized robots, and virtual reality (VR)/augmented reality (AR) on portable devices and is the first fully integrated VIO system in an application-specific integrated circuit (ASIC).
Abstract: This paper presents Navion, an energy-efficient accelerator for visual-inertial odometry (VIO) that enables autonomous navigation of miniaturized robots (e.g., nano drones), and virtual reality (VR)/augmented reality (AR) on portable devices. The chip uses inertial measurements and mono/stereo images to estimate the drone’s trajectory and a 3-D map of the environment. This estimate is obtained by running a state-of-the-art VIO algorithm based on non-linear factor graph optimization, which requires large irregularly structured memories and heterogeneous computation flow. To reduce the energy consumption and footprint, the entire VIO system is fully integrated on-chip to eliminate costly off-chip processing and storage. This paper uses compression and exploits both structured and unstructured sparsity to reduce on-chip memory size by 4.1 $\times $ . Parallelism is used under tight area constraints to increase throughput by 43%. The chip is fabricated in 65-nm CMOS and can process $752\times 480$ stereo images from the EuRoC data set in real time at 20 frames per second (fps) consuming only an average power of 2 mW. At its peak performance, Navion can process stereo images at up to 171 fps and inertial measurements at up to 52 kHz, while consuming an average of 24 mW. The chip is configurable to maximize accuracy, throughput, and energy-efficiency tradeoffs and to adapt to different environments. To the best of our knowledge, this is the first fully integrated VIO system in an application-specific integrated circuit (ASIC).
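The two operating points quoted for Navion can be compared on a per-frame basis by dividing average power by frame rate. This is only a back-of-the-envelope sketch: it attributes all of the average power to image frames, although the chip also processes inertial measurements, and the function name is hypothetical.

```python
def energy_per_frame_uj(avg_power_w, frames_per_second):
    """Average energy per frame in microjoules (power divided by frame rate)."""
    return avg_power_w / frames_per_second * 1e6


print(f"{energy_per_frame_uj(2e-3, 20):.1f}")    # 100.0
print(f"{energy_per_frame_uj(24e-3, 171):.1f}")  # 140.4
```

Under that simplification, the real-time mode spends about 100 µJ per stereo frame and the peak-throughput mode about 140 µJ per frame.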

Journal ArticleDOI
TL;DR: The capabilities of the proposed SoC are demonstrated on a wide set of near-sensor processing kernels showing that Mr.Wolf can deliver performance up to 16.4 GOp/s with energy efficiency up to 274 MOp/s/mW on real-life applications, paving the way for always-on data analytics on high-bandwidth sensors at the edge of the Internet of Things.
Abstract: This paper presents Mr.Wolf, a parallel ultra-low power (PULP) system on chip (SoC) featuring a hierarchical architecture with a small (12 kgates) microcontroller (MCU) class RISC-V core augmented with an autonomous IO subsystem for efficient data transfer from a wide set of peripherals. The small core can offload compute-intensive kernels to an eight-core floating-point-capable processing engine available on demand. The proposed SoC, implemented in a 40-nm LP CMOS technology, features a 108- $\mu \text{W}$ fully retentive memory (512 kB). The IO subsystem is capable of transferring up to 1.6 Gbit/s from external devices to the memory in less than 2.5 mW. The eight-core compute cluster achieves a peak performance of 850 million 32-bit integer multiply-and-accumulate operations per second (MMAC/s) and 500 million 32-bit floating-point multiply-and-accumulate operations per second (MFMAC/s), equivalent to 1 GFlop/s, with an energy efficiency up to 15 MMAC/s/mW and 9 MFMAC/s/mW. These building blocks are supported by aggressive on-chip power conversion and management, enabling energy-proportional heterogeneous computing for always-on IoT end nodes improving performance by several orders of magnitude with respect to traditional single-core MCUs within a power envelope of 153 mW. We demonstrated the capabilities of the proposed SoC on a wide set of near-sensor processing kernels showing that Mr.Wolf can deliver performance up to 16.4 GOp/s with energy efficiency up to 274 MOp/s/mW on real-life applications, paving the way for always-on data analytics on high-bandwidth sensors at the edge of the Internet of Things.

Journal ArticleDOI
TL;DR: A prototype cryogenic CMOS quantum controller designed in a 28-nm bulk CMOS process and optimized to implement a 16-word (4-bit) XY gate instruction set for controlling transmon qubits is presented.
Abstract: Implementation of an error-corrected quantum computer is believed to require a quantum processor with a million or more physical qubits, and, in order to run such a processor, a quantum control system of similar scale will be required. Such a controller will need to be integrated within the cryogenic system and in close proximity with the quantum processor in order to make such a system practical. Here, we present a prototype cryogenic CMOS quantum controller designed in a 28-nm bulk CMOS process and optimized to implement a 16-word (4-bit) XY gate instruction set for controlling transmon qubits. After introducing the transmon qubit, including a discussion of how it is controlled, design considerations are discussed, with an emphasis on error rates and scalability. The circuit design is then discussed. Cryogenic performance of the underlying technology is presented, and the results of several quantum control experiments carried out using the integrated controller are described. This article ends with a comparison to the state of the art and a discussion of further research to be carried out. It has been shown that the quantum control IC achieves promising performance while dissipating less than 2 mW of total ac and dc power and requiring a digital data stream of less than 500 Mb/s.

Journal ArticleDOI
TL;DR: This article introduces a single-chip OPA realized through wafer-scale 3-D integration of silicon photonics and CMOS, and achieves wide-range 2-D steering over 18.5° × 16° with a 0.15° × 0.25° beamwidth while consuming 20 mW/element average power.
Abstract: With the growing demand for automotive LiDAR and the maturation of silicon photonics platforms, optical phased arrays (OPAs) have emerged as a key technology for solid-state optical beam-steering. In order to meet realistic automotive specifications with OPAs, >500 antenna elements should work reliably under tight power and cost budgets. Existing multi-chip solutions necessitate expensive packaging and assembly to achieve high interconnect density. Even with 2-D monolithic integration, high-voltage drivers to deliver sufficient power to resistive phase shifters typically result in significant overhead in die area and limited power efficiency. In this article, we introduce a single-chip OPA realized through wafer-scale 3-D integration of silicon photonics and CMOS. Flexible and ultra-dense connections with through-oxide vias (TOVs) in our platform resolve the I/O density issue. Moreover, low-voltage L-shaped phase shifters and compact, efficient switch-mode drivers, connected vertically using TOVs, remove wiring/placement overhead and achieve a large active array aperture within a compact die. Our OPA prototype achieves wide-range 2-D steering over 18.5 $^\circ \times $ 16° by leveraging wavelength tuning and phase control, and array scaling up to 125 elements with a large aperture size of $0.5\,\mathrm {mm}\times 0.5\,\mathrm {mm}$ and 0.15 $^\circ \times $ 0.25° beamwidth while consuming 20 mW/element average power. Since our system supports per-element independent phase control, increased sensitivity to process variations in L-shaped shifters is fully compensated by a simple calibration process.

Journal ArticleDOI
TL;DR: The PLL employs digital-to-time converter (DTC)-based sampling PLL architecture, high linearity DTC design techniques, background DTC gain calibration, and reference clock duty cycle correction (DCC) to improve the integrated phase noise (IPN) and fractional spur.
Abstract: An analog fractional- $N$ sampling phase-locked loop (PLL) is presented. It achieves 75-fs rms jitter, integrated from 10 kHz to 10 MHz, and a −249.7-dB figure of merit (FoM) in the fractional- $N$ mode with a 52-MHz reference clock. The measured fractional spur is less than −64 dBc across the 5.5–7.3-GHz output frequency band. The PLL employs digital-to-time converter (DTC)-based sampling PLL architecture, high linearity DTC design techniques, background DTC gain calibration, and reference clock duty cycle correction (DCC) to improve the integrated phase noise (IPN) and fractional spur. This design meets the performance requirement of the 5G cellular 64-quadrature amplitude modulation (QAM) standard in the 28-/39-GHz band, supporting $2 \times 2$ multiple-input multiple-output (MIMO). The PLL, implemented in a 28-nm CMOS process, is integrated in a 5G millimeter-wave cellular transceiver. It consumes 18.9 mW and occupies 0.45 mm2.
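The quoted FoM is consistent with the jitter and power numbers in the abstract if one assumes the jitter-power figure of merit commonly used for PLLs, FoM = 20·log10(σ / 1 s) + 10·log10(P / 1 mW). The abstract does not state which definition it uses, so this formula and the function name are assumptions.

```python
import math


def jitter_power_fom_db(rms_jitter_s, power_w):
    """Jitter-power FoM in dB (assumed definition, see lead-in)."""
    jitter_term = 20 * math.log10(rms_jitter_s / 1.0)   # jitter relative to 1 s
    power_term = 10 * math.log10(power_w / 1e-3)        # power relative to 1 mW
    return jitter_term + power_term


print(f"{jitter_power_fom_db(75e-15, 18.9e-3):.1f}")  # -249.7
```

With 75-fs rms jitter and 18.9 mW, this definition gives about −249.7 dB, matching the reported value.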

Journal ArticleDOI
TL;DR: In this article, a 0.8-mm3 wireless, ultrasonically powered, free-floating neural recording implant is presented, which comprises only a recording integrated circuit (IC) and a single piezoceramic resonator.
Abstract: A 0.8-mm3 wireless, ultrasonically powered, free-floating neural recording implant is presented. The device comprises only a 0.25-mm2 recording integrated circuit (IC) and a single piezoceramic resonator that is used for both power harvesting and data transmission. Uplink data transmission is performed by the analog amplitude modulation of the ultrasound echo. Using a 1.78-MHz main carrier, >35 kb/s/mote equivalent uplink data rate is achieved. A technique to linearize the echo amplitude modulation is introduced, resulting in $\mu \text{W}$ , while the neural recording front end consumes $4~\mu \text{W}$ and achieves a noise floor of 5.3 $\mu V_{\text {rms}}$ in a 5-kHz bandwidth. This work improves the sub-mm recording mote depth by >2.5 $\times $ , resulting in the highest measured depth/volume ratio by $\sim 3\times $ . Orthogonal subcarrier modulation enables simultaneous operation of multiple implants, using a single-element ultrasound external transducer. Dual-mote simultaneous power-up and data transmission are demonstrated at a rate of 7 kS/s at the depth of 50 mm.

Journal ArticleDOI
TL;DR: A digital calibration scheme integrated into a column of the imager allows off-chip digital process, voltage, and temperature (PVT) compensation of every frame on the fly.
Abstract: A $192 \times 128$ pixel single photon avalanche diode (SPAD) time-resolved single photon counting (TCSPC) image sensor is implemented in STMicroelectronics 40-nm CMOS technology. The 13% fill factor, $18.4\,\,\mu \text {m} \times 9.2\,\,\mu \text{m}$ pixel contains a 33-ps resolution, 135-ns full scale, 12-bit time-to-digital converter (TDC) with 0.9-LSB differential and 5.64-LSB integral nonlinearity (DNL/INL). The sensor achieves a mean 219-ps full-width half-maximum (FWHM) impulse response function (IRF) and is operable at up to 18.6 kframes/s through 64 parallelized serial outputs. Cylindrical microlenses with a concentration factor of 3.25 increase the fill factor to 42%. The median dark count rate (DCR) is 25 Hz at 1.5-V excess bias. A digital calibration scheme integrated into a column of the imager allows off-chip digital process, voltage, and temperature (PVT) compensation of every frame on the fly. Fluorescence lifetime imaging microscopy (FLIM) results are presented.
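The TDC figures in this abstract fit together: a 12-bit converter with a 33-ps step spans close to the stated 135-ns full scale. A minimal check, with illustrative names:

```python
TDC_BITS = 12
TDC_RESOLUTION_PS = 33  # one LSB in picoseconds

# Full scale is the number of codes multiplied by the LSB size.
full_scale_ps = (2 ** TDC_BITS) * TDC_RESOLUTION_PS
full_scale_ns = full_scale_ps / 1000

print(full_scale_ps)            # 135168
print(f"{full_scale_ns:.1f}")   # 135.2
```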

Journal ArticleDOI
TL;DR: This paper presents a 60-GHz CMOS transceiver targeting the IEEE 802.11ay standard with a calibration block for local oscillator feedthrough and I/Q imbalance featuring high accuracy and low power consumption integrated with the transceiver, capable of boosting the data rate with higher order modulation schemes and wider channel-bonding bandwidths.
Abstract: This paper presents a 60-GHz CMOS transceiver targeting the IEEE 802.11ay standard. A calibration block for local oscillator feedthrough (LOFT) and I/Q imbalance featuring high accuracy and low power consumption is integrated with the transceiver. With the help of the proposed calibration, the transceiver can boost the data rate with the higher order modulation schemes and wider channel-bonding bandwidths demanded by IEEE 802.11ay. At the same time, it maintains compatibility with the existing IEEE 802.11ad standard. This paper reports a two-channel-bonding data rate of 24.64 Gb/s in 128 quadrature amplitude modulation (QAM). The corresponding TX-to-RX error vector magnitude (EVM) is −26.1 dB. Furthermore, a four-channel-bonding data rate of 42.24 Gb/s in 64 QAM is realized with a single-element transceiver. The measured maximum data rate is 50.1 Gb/s in 64 QAM, which is the highest data rate achieved in the 60-GHz band. The power consumption is only 169 mW in the transmitting mode and 139 mW in the receiving mode.
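The two channel-bonding data rates can be reproduced as raw (uncoded) bit rates: bonded channels × symbol rate × bits per symbol. The sketch assumes a symbol rate of 1.76 GBd per channel, the single-carrier rate of IEEE 802.11ad; that value is not given in the abstract, and the function name is hypothetical.

```python
import math

SYMBOL_RATE_GBD = 1.76  # assumed per-channel symbol rate (802.11ad single carrier)


def raw_data_rate_gbps(bonded_channels, qam_order):
    """Raw bit rate in Gb/s, ignoring coding and framing overhead."""
    bits_per_symbol = math.log2(qam_order)
    return bonded_channels * SYMBOL_RATE_GBD * bits_per_symbol


print(f"{raw_data_rate_gbps(2, 128):.2f}")  # 24.64
print(f"{raw_data_rate_gbps(4, 64):.2f}")   # 42.24
```

Both results match the rates quoted in the abstract, which suggests the reported figures are raw symbol-level rates.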

Journal ArticleDOI
TL;DR: A minimum mean-square error (MMSE) beam adaptation technique, introduced for the first time in an RF or hybrid beamformer, breaks the need for individual access to the beamformer inputs needed in a traditional least-mean-square (LMS) scheme and allows both main lobe and null adaptation.
Abstract: This paper presents a multi-standard fully connected hybrid beamforming (FC-HBF) receiver that can be reconfigured between two fully connected two-stream multi-input-multi-output (MIMO) modes at 28- or 37-GHz band, and an inter-band carrier-aggregation (CA) mode enabling concurrent single-stream operation at 28 and 37 GHz. A new image-reject (IR) heterodyne beamforming architecture is introduced that facilitates easy reconfiguration. Concurrent dual-band operation of the beamformer front end is achieved by using an inherently wideband Cartesian-combining-based RF-domain complex-weighting architecture, which is implemented using coupled-resonator-based dual-band gain stages, and current-mode dual-band active combiners. The downconversion stage comprises a cascade of complex-quadrature mixing stages that combines Cartesian complex-weighting operation with image rejection. A new quadrature error detection and calibration scheme is also introduced to improve the image rejection of the proposed architecture. A minimum mean-square error (MMSE) beam adaptation technique, introduced for the first time in an RF or hybrid beamformer, breaks the need for individual access to the beamformer inputs needed in a traditional least-mean-square (LMS) scheme and allows both main lobe and null adaptation. A 28-/37-GHz hybrid beamforming receiver with four antenna inputs and two baseband output streams is designed in 65-nm CMOS that demonstrates the above features. In each antenna path, the receiver achieves 33-dB (26.5 dB) peak conversion gain, 2.75-GHz (3.75 GHz) RF bandwidth, and 5.7-dB (8.5 dB) NF at 28 GHz (37 GHz). The entire receiver consumes 310 mW and occupies 2.2 mm2 core area. Extensive characterization results for single-element and several multi-element scenarios are presented including concurrent dual-band receiving (>35-dB image rejection), null-steering (>25-dB peak to null for two elements), and multi-element CA (~50-dB inter-carrier interference rejection).

Journal ArticleDOI
TL;DR: A modular, direct time-of-flight (TOF) depth sensor is introduced in which each module operates autonomously, handling data acquisition, management, and storage internally, and is periodically read out by an external access.
Abstract: This article introduces a modular, direct time-of-flight (TOF) depth sensor. Each module is digitally synthesized and features a 2 $\times $ (8 $\times $ 8) single-photon avalanche diode (SPAD) pixel array, an edge-sensitive decision tree, a shared time-to-digital converter (TDC), 21-bit per-pixel memory, and in-locus data processing. Each module operates autonomously, by internal data acquisition, management, and storage, being periodically read out by an external access. The prototype was fabricated in a TSMC 3-D-stacked 45/65-nm CMOS technology, featuring backside illumination (BSI) SPAD detectors on the top tier, and readout circuit on the bottom tier. The sensor was characterized by single-point measurements, in two different modes of resolution and range. In low-resolution mode, a maximum range of 300 m and an accuracy of 80 cm were recorded; in high-resolution mode, the maximum range and accuracy were 150 m and 7 cm, respectively. The module was also used in a flexible scanning light detection and ranging (LiDAR) system, where a 256 $\times $ 256 depth map, with millimeter precision, was obtained. A laser signature based on pulse-position modulation (PPM) is also proposed, achieving a maximum of 28-dB interference reduction.
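For orientation, a direct TOF sensor turns a measured round-trip time t into a distance d = c·t/2. The short Python sketch below only illustrates that relation; the function names are hypothetical, and nothing here reflects the sensor's actual TDC or in-pixel processing.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact, by definition of the metre


def tof_to_distance_m(round_trip_time_s: float) -> float:
    """Convert a round-trip time of flight (seconds) to a one-way distance (metres)."""
    if round_trip_time_s < 0:
        raise ValueError("time of flight must be non-negative")
    # The light travels to the target and back, hence the division by two.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


def distance_to_tof_s(distance_m: float) -> float:
    """Round-trip time (seconds) that a target at distance_m (metres) produces."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S


if __name__ == "__main__":
    # Round-trip time corresponding to the 300-m maximum range quoted in the abstract.
    print(distance_to_tof_s(300.0))
```

Under this relation, the 300-m maximum range corresponds to a round-trip time of roughly two microseconds.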

Journal ArticleDOI
TL;DR: This paper describes a unified static/dynamic entropy generator based on a 512-b common entropy source (ES) array fabricated in 14-nm tri-gate CMOS with reconfigurable and adaptive post-processing circuits implemented on Arria 10 FPGA, targeted for flexible and secure privacy preserving mutual authentication on compact trusted mote platforms at the edge of internet of things.
Abstract: This paper describes a unified static/dynamic entropy generator based on a 512-b common entropy source (ES) array fabricated in 14-nm tri-gate CMOS with reconfigurable and adaptive post-processing circuits implemented on Arria 10 FPGA, targeted for flexible and secure privacy preserving mutual authentication on compact trusted mote platforms at the edge of internet of things. Several conditioning techniques that include temporal majority voting (TMV)-assisted ES array segregation with integrated bias tracking, three-way in-line self-calibration for tolerance to process–voltage–temperature variation, tri-level hierarchical Von Neumann (VN) extraction to maximize entropy harvesting, soft-dark bit masking for improving physically unclonable function (PUF) stability, and selective stress hardening to co-optimize the ES array for static-dynamic entropy with bias aware device aging enable simultaneous PUF and true random number generator (TRNG) operation with 1.48 and 0.56 Gb/s throughput, respectively, measured at 650 mV, 70 °C. 
The all-digital design with a compact layout footprint of 2114 $\mu \text{m}^{2}$ facilitates seamless integration in area constrained system-on-chips while achieving: 1) 25% area savings over conventional separate PUF and TRNG implementations; 2) cryptographic quality TRNG stream that passes all NIST randomness tests with 0.38 average p-value; 3) $1.6\times $ higher extractor performance at $9\times $ lower area with 750-gate hierarchical VN circuit over conventional light-weight entropy extractors; 4) 0.9996/0.99997 static/dynamic Shannon entropy indicating unbiased PUF/TRNG streams; 5) ultra-low energy consumption of 2.5 and 0.46 pJ/bit measured at 650 mV, 70 °C in TRNG and PUF modes; 6) 40% higher TRNG throughput with three-way self-calibration featuring coarse-grain column swap, fine-grain incremental ES substitution, and residual entropy recycling; 7) resistance to power injection attacks as measured by 64% higher performance over un-calibrated design in the presence of 200-mV supply noise; 8) 2.8% PUF bit-error measured at 0.55–0.75 V, 25 °C–110 °C with 15-way TMV and soft dark-bit masking over a window of 100 cycles; 9) $14.8\times $ inter and intra-PUF Hamming distance separation; and 10) 56% reduction in discarded ES cells with selective stress hardening to opportunistically reinforce/nullify pre-existing bias in PUF/TRNG candidate cells. To our knowledge, this is the first reported unified PUF-TRNG implementation enabling simultaneous generation of high-entropy chip-ID and encryption keys in real time.
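As background for the Von Neumann (VN) extraction mentioned in this abstract, the sketch below shows the classic single-level VN debiasing step in Python. It is not the paper's tri-level hierarchical circuit, which the abstract does not specify in detail; the function name and the choice to emit the first bit of each unequal pair are assumptions made for illustration.

```python
from typing import Iterable, List


def von_neumann_extract(bits: Iterable[int]) -> List[int]:
    """Classic single-level Von Neumann debiasing.

    Bits are consumed in non-overlapping pairs. Pairs (0, 1) and (1, 0)
    emit one output bit (here, the first bit of the pair); pairs (0, 0)
    and (1, 1) are discarded. A trailing unpaired bit is dropped.
    """
    out: List[int] = []
    it = iter(bits)
    for first in it:
        second = next(it, None)
        if second is None:  # odd-length input: ignore the last bit
            break
        if first != second:
            out.append(first)
    return out


if __name__ == "__main__":
    raw = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
    print(von_neumann_extract(raw))  # pairs: 01, 10, 00, 11, 10
```

For independent input bits with a fixed bias, the two unequal pairs are equally likely, so the output is unbiased at the cost of discarding the equal pairs. Hierarchical variants recover part of that discarded entropy.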

Journal ArticleDOI
TL;DR: A smart wearable electrocardiographic (ECG) processor is presented for secure ECG-based biometric authentication and cardiac monitoring, including arrhythmia and anomaly detection, using data-driven Lasso regression and low-precision techniques.
Abstract: Many wearable devices employ sensors for physiological signals (e.g., electrocardiogram or ECG) to continuously monitor personal health (e.g., cardiac monitoring). Considering private medical data storage, secure access to such wearable devices becomes a crucial necessity. Exploiting the ECG sensors present on wearable devices, we investigate the possibility of using ECG as the individually unique source for device authentication. In particular, we propose to use ECG features toward both cardiac monitoring and neural-network-based biometric authentication. For such complex functionalities to be seamlessly integrated in wearable devices, an accurate algorithm must be implemented with ultralow power and a small form factor. In this paper, a smart ECG processor is presented for ECG-based authentication as well as cardiac monitoring. Data-driven Lasso regression and low-precision techniques are developed to compress neural networks for feature extraction by 24.4 $\times $ . The 65-nm testchip consumes 1.06 $\mu \text{W}$ at 0.55 V for real-time ECG authentication. For authentication, equal error rates of 1.70%/2.18%/2.48% (best/average/worst) are achieved on the in-house 645-subject database. For cardiac monitoring, 93.13% arrhythmia detection sensitivity and 89.78% specificity are achieved for 42 subjects in the MIT-BIH arrhythmia database.

Journal ArticleDOI
TL;DR: This paper presents a low-power and scaling-friendly noise-shaping (NS) SAR ADC that uses passive switches and capacitors to perform residue integration and realizes the path gains via transistor size ratios inside a multi-path dynamic comparator.
Abstract: This paper presents a low-power and scaling-friendly noise-shaping (NS) SAR ADC. Instead of using operational transconductance amplifiers that are power hungry and scaling unfriendly, the proposed architecture uses passive switches and capacitors to perform residue integration and realizes the path gains via transistor size ratios inside a multi-path dynamic comparator. The overall architecture is simple and robust. Since the noise transfer function is set by component ratios, it is insensitive to process, voltage, and temperature (PVT) variations. Besides the proposed architecture, this paper also presents two new circuit techniques. A tri-level voting scheme is proposed to reduce the comparator noise. It outperforms the majority voting technique by exploiting more information in the comparator output statistics and providing an extra decision level. A dynamic multi-phase clock generator is also proposed to guarantee non-overlapping and support an arbitrary number of phases. A prototype 9-bit NS-SAR ADC is fabricated in a 40-nm CMOS process. It consumes $143~\mu \text{W}$ at 1.1 V while operating at 8.4 MS/s. Taking advantage of the second-order NS, it achieves a peak SNDR of 78.4 dB over a bandwidth of 262 kHz at the oversampling ratio of 16, leading to an SNDR-based Schreier figure of merit (FoM) of 171 dB.
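The figures quoted in this abstract can be cross-checked with the standard SNDR-based Schreier figure of merit, FoM = SNDR + 10·log10(BW / P), and the oversampling ratio OSR = fs / (2·BW). The Python sketch below applies those two formulas to the numbers from the abstract; the function names are illustrative, not taken from the paper.

```python
import math


def schreier_fom_db(sndr_db: float, bandwidth_hz: float, power_w: float) -> float:
    """SNDR-based Schreier figure of merit in dB."""
    return sndr_db + 10.0 * math.log10(bandwidth_hz / power_w)


def oversampling_ratio(sample_rate_hz: float, bandwidth_hz: float) -> float:
    """Ratio of the sampling rate to the Nyquist rate of the signal band."""
    return sample_rate_hz / (2.0 * bandwidth_hz)


if __name__ == "__main__":
    # Values as reported in the abstract.
    fom = schreier_fom_db(sndr_db=78.4, bandwidth_hz=262e3, power_w=143e-6)
    osr = oversampling_ratio(sample_rate_hz=8.4e6, bandwidth_hz=262e3)
    print(f"FoM = {fom:.1f} dB, OSR = {osr:.1f}")
```

Both results agree with the reported 171-dB FoM and the oversampling ratio of 16 to within rounding.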

Journal ArticleDOI
TL;DR: This paper presents an ultra-low-power voice activity detector (VAD) that uses analog signal processing for acoustic feature extraction (AFE) directly on the microphone output, approximate event-driven analog-to-digital conversion (ED-ADC), and digital deep neural network (DNN) for speech/non-speech classification.
Abstract: This paper presents an ultra-low-power voice activity detector (VAD). It uses analog signal processing for acoustic feature extraction (AFE) directly on the microphone output, approximate event-driven analog-to-digital conversion (ED-ADC), and digital deep neural network (DNN) for speech/non-speech classification. New circuits, including the low-noise amplifier, bandpass filter, and full-wave rectifier, contribute to the more than 9 $\times $ normalized power/channel reduction in the feature extraction front-end compared to the best prior art. The digital DNN is a three-hidden-layer binarized multilayer perceptron (MLP) with a 2-neuron output layer and a 48-neuron input layer that receives parallel event streams from the ED-ADCs. To obtain the DNN weights via off-line training, a customized front-end model written in Python is constructed to accelerate feature generation in software emulation, and the model parameters are extracted from Spectre simulations. The chip, fabricated in 0.18- $\mu \text{m}$ CMOS, has a core area of 1.66 $\times $ 1.52 mm2 and consumes 1 $\mu \text{W}$ . The classification measurements using the 1-hour 10-dB signal-to-noise ratio audio with restaurant background noise show a mean speech/non-speech hit rate of 84.4%/85.4% with a 1.88%/4.65% 1- $\sigma $ variation across ten dies that are all loaded with the same weights.

Journal ArticleDOI
TL;DR: A novel reconfigurable RF energy harvesting system with an integrated hill-climbing maximum power point tracking (MPPT) function is proposed for a wide input power range from −22 to 4 dBm, together with a high-accuracy conceptual linear model for analyzing the rectifier efficiency in MPPT operations.
Abstract: To overcome the low efficiency and limited working range of existing RF energy harvesting (EH) systems for wireless Internet-of-Things (IoT) sensors, a novel reconfigurable system is proposed with an integrated hill-climbing maximum power point tracking (MPPT) function for a wide input power range from −22 to 4 dBm. A conceptual linear model with high accuracy is also proposed to analyze the rectifier efficiency for MPPT operations. The rectifier with off-chip matching is designed with a patch antenna for the 915-MHz industrial, scientific, and medical (ISM) band. To further improve the end-to-end efficiency, the harvested power is used to power up the circuit block in the system on a chip (SoC) directly, avoiding additional conversion loss. Our proposed reconfigurable 12-stage rectifier with matching network achieves −18.1-dBm sensitivity for 1- $\text{M}\Omega $ loading and 36% peak efficiency at 1 dBm. The proposed MPPT function can detect and determine the optimal rectifier stage for loading from 10 $\text{k}\Omega $ to 1 $\text{M}\Omega $ . The measured MPPT accuracy is over 87% from −22 to 4 dBm compared to external tuning conditions. The minimum stand-by power is 20 nW at 0.5 V and the overall MPPT power efficiency is over 72% with a peak value of 99.8% including dissipated power. Measurements also show the system can achieve self-startup and self-sustained functions with a 10- $\mu \text{F}$ external capacitor buffer.
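To illustrate what hill-climbing selection of a rectifier stage count could look like, here is a minimal Python sketch. It assumes a hypothetical `measure_power` callback that returns the harvested power for a given number of active stages; the 12-stage limit comes from the abstract, but the on-chip control logic is not described there, so this is a generic hill climb rather than the authors' implementation.

```python
from typing import Callable


def hill_climb_stage(
    measure_power: Callable[[int], float],  # hypothetical measurement hook
    start_stage: int = 1,
    min_stage: int = 1,
    max_stage: int = 12,  # the abstract describes a 12-stage rectifier
) -> int:
    """Return the stage count at a local maximum of measure_power.

    From the current stage, try one step up and then one step down; move
    to the first neighbour that improves the measured power, and stop
    when neither neighbour is better.
    """
    stage = start_stage
    best = measure_power(stage)
    while True:
        moved = False
        for candidate in (stage + 1, stage - 1):
            if min_stage <= candidate <= max_stage:
                power = measure_power(candidate)
                if power > best:
                    stage, best, moved = candidate, power, True
                    break
        if not moved:
            return stage


if __name__ == "__main__":
    # Toy unimodal power curve peaking at 7 stages (made-up data).
    print(hill_climb_stage(lambda n: -(n - 7) ** 2))
```

A hill climb of this kind finds the global optimum only when the power-versus-stage curve has a single peak. With several peaks, it can settle on a local one.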

Journal ArticleDOI
TL;DR: This paper demonstrates the improved power and electromagnetic (EM) side-channel attack (SCA) resistance of 128-bit Advanced Encryption Standard (AES) engines in 130-nm CMOS using random fast voltage dithering (RFVD) enabled by an integrated voltage regulator with bond-wire inductors and an on-chip all-digital clock modulation (ADCM) circuit.
Abstract: This paper demonstrates the improved power and electromagnetic (EM) side-channel attack (SCA) resistance of 128-bit Advanced Encryption Standard (AES) engines in 130-nm CMOS using random fast voltage dithering (RFVD) enabled by an integrated voltage regulator (IVR) with bond-wire inductors and an on-chip all-digital clock modulation (ADCM) circuit. The RFVD scheme transforms the current signatures with random variations in the AES input supply while adding random shifts in the clock edges in the presence of global and local supply noises. The measured power signatures at the supply node of the AES engines show up to 37 $\times $ reduction in peak for higher order test vector leakage assessment (TVLA) metric and up to 692 $\times $ increase in minimum traces required to disclose (MTD) the secret encryption key with correlation power analysis (CPA). Similarly, SCA on the measured EM signatures from the chip demonstrates a reduction of up to 11.3 $\times $ in TVLA peak and up to 37 $\times $ increase in correlation EM analysis (CEMA) MTD.
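TVLA is commonly carried out with Welch's t-test between two sets of traces (for example, fixed versus random plaintexts), with |t| > 4.5 as the conventional leakage threshold. The Python sketch below computes that statistic for a single sample point. It is a first-order illustration only, the function names are hypothetical, and it does not reproduce the higher-order assessment or the measurement setup used in the paper.

```python
import math
from statistics import mean, variance
from typing import Sequence

TVLA_THRESHOLD = 4.5  # conventional pass/fail bound on |t|


def welch_t(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Welch's t-statistic for two groups of samples taken at one time point."""
    var_a, var_b = variance(group_a), variance(group_b)  # sample variances
    std_err = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_a) - mean(group_b)) / std_err


def shows_leakage(group_a: Sequence[float], group_b: Sequence[float],
                  threshold: float = TVLA_THRESHOLD) -> bool:
    """True if the two groups are statistically distinguishable at this point."""
    return abs(welch_t(group_a, group_b)) > threshold


if __name__ == "__main__":
    # Made-up power samples, purely for illustration.
    fixed = [1.00, 1.10, 0.90, 1.00, 1.05, 0.95] * 5
    shifted = [x + 0.5 for x in fixed]
    print(shows_leakage(fixed, fixed), shows_leakage(fixed, shifted))
```

In a full assessment the test is repeated at every sample point of the trace, and the peak |t| value is the quantity that a countermeasure aims to reduce. Each group needs at least two samples and some variance for the statistic to be defined.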

Journal ArticleDOI
TL;DR: This paper proposes a fully integrated digital low-dropout (DLDO) regulator using a beat-frequency (BF) quantizer implemented in a 65-nm low power (LP) CMOS technology, replacing the conventional voltage quantizer by a pair of voltage-controlled oscillators and a time quantizer.
Abstract: This paper proposes a fully integrated digital low-dropout (DLDO) regulator using a beat-frequency (BF) quantizer implemented in a 65-nm low power (LP) CMOS technology. A time-based approach, replacing the conventional voltage quantizer by a pair of voltage-controlled oscillators and a time quantizer, makes the design highly digital. A D-flip-flop is utilized as a BF generator, which is used as the sampling clock for the DLDO. The variable sampling frequency in the BF DLDO can achieve fast response, LP consumption, and excellent stability at the same time. In addition, the DLDO has a built-in active voltage positioning (AVP) for lower peak-to-peak voltage deviation during a load step. The load capacitor is only 40 pF, and the total core area of the DLDO is 0.0374 mm2. A 50-mA step in load current produces a voltage droop of 108 mV, which is recovered in 1.24 $\mu \text{s}$ . It can operate over a wide input voltage range from 0.6 to 1.2 V while generating a 0.4–1.1-V output for a maximum load current of 100 mA. The peak current efficiency is 99.5% and the figure of merit (FOM) is 1.38 ps.

Journal ArticleDOI
TL;DR: A fully integrated split-electrode synchronized switch harvesting on capacitors (SSHC) rectifier is proposed, which achieves significant performance enhancement without employing any off-chip components.
Abstract: In order to efficiently extract power from piezoelectric vibration energy harvesters, various active rectifiers have been proposed in the past decade, which include synchronized switch harvesting on inductor (SSHI), synchronous electric charge extraction (SECE), and so on. Although reported active rectifiers show good performance improvements compared to full-bridge rectifiers (FBRs), large off-chip inductors are typically required and the system volume is inevitably increased as a result, counter to the requirement for system miniaturization. In this paper, a fully integrated split-electrode synchronized switch harvesting on capacitors (SSHC) rectifier is proposed, which achieves significant performance enhancement without employing any off-chip components. The proposed circuit is designed and fabricated in a 0.18- $\mu \text{m}$ CMOS process and it is co-integrated with a custom microelectromechanical systems (MEMS) piezoelectric transducer with its electrode layer equally split into four regions. The measured results show that the proposed rectifier can provide up to 8.2 $\times $ and 5.2 $\times $ boost, using on-chip and off-chip diodes, respectively, in harvested power compared to an FBR under low excitation levels and the peak rectified output power achieves 186 $\mu \text{W}$ .

Journal ArticleDOI
TL;DR: This paper presents an all-digital ring oscillator (RO)-based Bluetooth low-energy (BLE) transmitter for ultra-low-power radios in short range Internet-of-Things (IoT) applications, analyzes the phase noise limit of a BLE transmitter, and proposes an RO-based solution for power and cost savings.
Abstract: In this paper, we present an all-digital ring oscillator (RO)-based Bluetooth low-energy (BLE) transmitter (TX) for ultra-low-power radios in short range Internet-of-Things (IoT) applications. The power consumption of state-of-the-art BLE TXs has been limited by the relatively power-hungry local oscillator (LO) due to the use of LC oscillators for superior phase noise (PN) performance. This paper addresses this issue by analyzing the PN limit of a BLE TX and proposes an RO-based solution for power and cost savings. The proposed transmitter features: 1) a wideband all-digital phase-locked loop (ADPLL) featuring an $f_{\mathrm {RF}} / {4}$ RO, with an embedded 5-bit TDC; 2) a 4 $\times $ frequency edge combiner to generate the 2.4-GHz signal; and 3) a switch-capacitor digital PA optimized for high efficiency at low transmit power levels. These not only help reduce the power consumption and improve PN performance but also enhance the TX efficiency for short range applications. The TX is prototyped in 40-nm CMOS, occupies an active area of 0.0166 mm2, and consumes 486 $\mu \text{W}$ in its low-power mode, while configured as a non-connectable advertiser. The TX has been validated by wirelessly communicating beacon messages to a mobile phone.