Showing papers in "IEEE Journal of Solid-state Circuits in 2023"



Journal ArticleDOI
TL;DR: DIgital-ANAlog (DIANA) as mentioned in this paper is a heterogeneous multi-core accelerator that combines a reduced instruction set computer (RISC-V) host processor with an analog in-memory computing (AIMC) artificial intelligence (AI) accelerator and a digital reconfigurable deep neural network (DNN) accelerator in a single system-on-chip (SoC).
Abstract: DIgital-ANAlog (DIANA), a heterogeneous multi-core accelerator, combines a reduced instruction set computer - five (RISC-V) host processor with an analog in-memory computing (AIMC) artificial intelligence (AI) accelerator and a digital reconfigurable deep neural network (DNN) accelerator in a single system-on-chip (SoC) to support a wide variety of neural network (NN) workloads. AIMC cores can bring extreme computational parallelism and efficiency at the expense of accuracy and dataflow flexibility. Digital AI co-processors, on the other hand, guarantee accuracy through deterministic compute, but cannot achieve the same computational density and efficiency. DIANA exploits this fundamental tradeoff by integrating both types of cores in a shared and optimized memory system, to enable seamless execution of the workloads on the parallel cores. The system’s performance benefits further from pipelined parallel execution across both accelerator cores and enhanced AIMC spatial unrolling techniques, leading to drastically reduced execution latency and reduced memory footprints. The design has been implemented in a 22-nm technology and achieves peak efficiencies of 600 TOP/s/W for the AIMC core (I/W/O: 7/1.5/6 bit) and 14 TOP/s/W (I/W/O: 8/8/8 bit) for the digital accelerator, respectively. End-to-end performance evaluation of CIFAR-10 and ImageNet classification workloads is carried out on the chip, reporting 7.02 and 5.56 TOP/s/W, respectively, at the system level.

6 citations


Journal ArticleDOI
TL;DR: T-PIM as mentioned in this paper is a PIM accelerator for end-to-end on-device training, which supports both fully connected and convolutional layers and achieves high-speed inference.
Abstract: Recently, on-device training has become crucial for the success of edge intelligence. However, frequent data movement between computing units and memory during training has been a major problem for battery-powered edge devices. Processing-in-memory (PIM) is a novel computing paradigm that merges computing logic into memory, which can address the data movement problem with excellent power efficiency. However, previous PIM accelerators cannot support the entire training process on chip due to its computing complexity. This article presents a PIM accelerator for end-to-end on-device training (T-PIM), the first PIM realization that enables end-to-end on-device training as well as high-speed inference. Its full-custom PIM macro contains 8T-SRAM cells to perform the energy-efficient in-cell AND operation, and the bit-serial-based computation logic enables fully variable bit-precision for input data. The macro supports various data mapping methods and computational paths for both fully connected and convolutional layers, in order to handle the complex training process. An efficient tiling scheme is also proposed to enable T-PIM to compute any size of deep neural network with the implemented hardware. In addition, configurable arithmetic units in a forward propagation path make T-PIM handle power-of-two bit-precision for weight data, enabling a significant performance boost during inference. T-PIM also efficiently handles sparsity in both operands by skipping the computation of zeros in the input data and by gating-off computing units when the weight data are zero. Finally, we fabricate the T-PIM chip in 28-nm CMOS technology, occupying a die area of 5.04 mm², including five T-PIM cores. It dissipates 5.25–51.23 mW at 50–280 MHz operating frequency with 0.75–1.05-V supply voltage. We successfully demonstrate that T-PIM can run the end-to-end training of the VGG16 model on the CIFAR10 and CIFAR100 datasets, achieving 0.13–161.08- and 0.25–7.59-TOPS/W power efficiency during inference and training, respectively. The result shows that T-PIM is $2.02\times$ more energy-efficient than the state-of-the-art PIM chip that only supports backward propagation, not the whole training process. Furthermore, we conduct an architectural experiment using a cycle-level simulator based on actual measurement results, which suggests that the T-PIM architecture is scalable and its scaled-up version provides up to $203.26\times$ higher power efficiency than a comparable GPU.

5 citations
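
The bit-serial, AND-based multiply-accumulate and the zero-skipping described in the T-PIM abstract can be illustrated with a short software sketch. This is only a behavioral analogy under stated assumptions: the function name `bit_serial_dot`, the unsigned-integer operands, and the skip counter are hypothetical and are not taken from the paper, and the real macro works on SRAM cells rather than Python integers.

```python
def bit_serial_dot(inputs, weights, input_bits=8):
    """Dot product of unsigned integers, processed one input bit at a time.

    Hypothetical sketch: each partial product is an AND between a single
    input bit and the weight word, followed by a shift-and-add. Inputs are
    assumed to fit in `input_bits` bits.
    Returns (accumulated sum, number of products skipped as zero).
    """
    acc = 0
    skipped = 0
    for x, w in zip(inputs, weights):
        if x == 0 or w == 0:
            # Zero input: skip the computation. Zero weight: gate the unit off.
            skipped += 1
            continue
        for b in range(input_bits):
            x_bit = (x >> b) & 1
            # -x_bit is 0 or -1 (all ones), so this ANDs every weight bit
            # with the current input bit.
            partial = w & -x_bit
            acc += partial << b
    return acc, skipped


if __name__ == "__main__":
    print(bit_serial_dot([3, 0, 5, 7], [2, 4, 0, 1]))
```

Changing `input_bits` changes how many bit-serial steps are spent per operand, which is the sense in which input precision is variable in this sketch.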


Journal ArticleDOI
TL;DR: In this article, a 256-channel single-photon avalanche diode (SPAD) line sensor was designed for time-resolved Raman spectroscopy in 110-nm CMOS technology.
Abstract: A 256-channel single-photon avalanche diode (SPAD) line sensor was designed for time-resolved Raman spectroscopy in 110-nm CMOS technology. The line sensor consists of an $8\times256$ SPAD array and 256 parallel connected time-to-digital converters (TDCs). The adjustable temporal resolution and dynamic range of TDCs are 25.6–65 ps and 3.2–8.2 ns, respectively. The median timing skew along 256 channels is 43.7 ps, and TDC bin boundaries can be fine-tuned at the ps-level to enable precise timing skew compensation. The sensor is capable of real-time dark count measurement (two dark measurements for each excitation pulse) that gives accurate data for dark count compensation without any increment in measurement time. The maximum excitation pulse rate with real-time dark count measurement is 680 kHz. Raman spectra of six different samples were measured to prove the performance of the sensor in time-resolved Raman spectroscopy.

4 citations
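
As a rough consistency check on the TDC figures in the SPAD line-sensor abstract, dividing the dynamic range by the temporal resolution gives the approximate number of time bins. The sketch below assumes that the finest resolution (25.6 ps) pairs with the shortest range (3.2 ns) and the coarsest (65 ps) with the longest (8.2 ns); the abstract does not state this pairing, and the helper name `tdc_bins` is hypothetical.

```python
import math


def tdc_bins(dynamic_range_s, resolution_s):
    """Approximate number of TDC bins spanned by the dynamic range."""
    return dynamic_range_s / resolution_s


if __name__ == "__main__":
    for rng, res in [(3.2e-9, 25.6e-12), (8.2e-9, 65e-12)]:
        bins = tdc_bins(rng, res)
        # log2 of the bin count hints at the code width needed to index the bins.
        print(f"range {rng:.1e} s, resolution {res:.1e} s: "
              f"{bins:.1f} bins, about {math.log2(bins):.2f} bits")
```

Both settings come out near 125 bins, which would fit in a 7-bit code. That code width is an inference from the arithmetic, not a figure given in the abstract.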


Journal ArticleDOI
TL;DR: In this paper, a 192-Gb 896-GB/s 12-high stacked third-generation high-bandwidth memory (HBM3 DRAM) with low power consumption and high-reliability traits is introduced.
Abstract: This article introduces a 192-Gb 896-GB/s 12-high stacked third-generation high-bandwidth memory (HBM3 DRAM) with low power consumption and high-reliability traits. New design schemes and features, including internal low-voltage signaling, center strobe calibration, through-silicon via (TSV) auto-calibration, a symbol-correcting in-DRAM ECC, and machine-learning-based layout optimization, allow large amounts of data transfers among the vertically stacked base and core dies with limited delay mismatch or SI degradation, as well as reduced power consumption from low-voltage swings. Experimental results confirm 896-GB/s bandwidth operations at 1.0-V voltage conditions with up to 15% improved power efficiency.

3 citations
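
The headline HBM3 numbers can be broken down with simple arithmetic. The sketch below assumes that "12-high" means twelve core dies of equal capacity and that the stack uses a 1024-bit data interface; the interface width is the usual HBM3 bus width and is not stated in the abstract, so treat it as an assumption.

```python
STACK_CAPACITY_GBIT = 192   # from the abstract
STACK_HEIGHT = 12           # "12-high"; assumed to be 12 equal core dies
BANDWIDTH_GBYTE_S = 896     # from the abstract
DATA_PINS = 1024            # assumption: HBM3 data-bus width, not in the abstract


def per_die_capacity_gbit(total_gbit=STACK_CAPACITY_GBIT, dies=STACK_HEIGHT):
    """Capacity of one core die if the stack is split evenly."""
    return total_gbit / dies


def per_pin_rate_gbit_s(bandwidth_gbyte_s=BANDWIDTH_GBYTE_S, pins=DATA_PINS):
    """Data rate per pin implied by the total bandwidth."""
    return bandwidth_gbyte_s * 8 / pins


if __name__ == "__main__":
    print(per_die_capacity_gbit(), "Gb per die")
    print(per_pin_rate_gbit_s(), "Gb/s per pin")
```

Under these assumptions the stack corresponds to 16 Gb per die and 7 Gb/s per data pin.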


Journal ArticleDOI
TL;DR: In this paper, a threshold-based bioluminescence detector with a CMOS-integrated photodiode array in a 65-nm technology is presented for biochemical detection via onboard genetically engineered biosensor bacteria.
Abstract: This article presents a highly miniaturized ingestible electronic capsule for biochemical detection via onboard genetically engineered biosensor bacteria. The core integrated circuit (IC) is a threshold-based bioluminescence detector with a CMOS-integrated photodiode array in a 65-nm technology that utilizes a dual-duty-cycling front end to achieve low power consumption. The implemented IC achieved 59-nW active power consumption, 25-fA/count resolution, and a 59-fA minimum detectable signal (MDS) using a calibrated optical source. The IC was then integrated with other system components into a battery-powered wireless ingestible capsule measuring just 6.5 mm thick $\times$ 12 mm diameter. We demonstrated successful detection of low-intensity bioluminescent signals from bioengineered bacterial sensors when exposed to the intestinal inflammation biomarker tetrathionate in vitro. Together, the IC and mm-scale smart pill systems demonstrate high sensitivity with low-power multiplexed measurement capability suitable for noninvasive disease diagnosis and monitoring in the gastrointestinal (GI) tract.

3 citations


Journal ArticleDOI
TL;DR: In this article, an 8k-multiply-accumulate (MAC) neural processing unit (NPU) is presented for a 4-nm mobile system-on-chip (SoC).
Abstract: This article presents an 8k-multiply-accumulate (MAC) neural processing unit (NPU) in a 4-nm mobile system-on-chip (SoC). The unified multi-precision MACs support data formats from integer (INT)4/8/16 to floating point (FP)16 with high area and energy efficiency. When the NPU meets layers with low hardware (HW) utilization, such as depthwise convolution or shallow layers with a few input channels, it reconfigures the computational flow to enhance the utilization up to four times after getting basic tensor information, such as operation types and shapes, from a compiler. The NPU supports a dynamic operation mode to cover requirements ranging from extremely low power to low latency. The NPU achieves 4.26 tera FP operations per second (TFLOPS)/W and 11.59 tera operations per second (TOPS)/W for DeepLabV3 (FP16) and MobileNetEdgeTPU (INT8), respectively, as well as high area efficiency (1.72 TFLOPS/mm² and 3.45 TOPS/mm²).

3 citations


Journal ArticleDOI
TL;DR: In this article, a low-noise bioimpedance (BioZ) sensor interface IC for small-area dry-electrode cardio-respiratory signal monitoring is presented.
Abstract: This article describes a high-input-impedance, low-noise bioimpedance (BioZ) sensor interface IC for small-area dry-electrode cardio-respiratory signal monitoring. To facilitate high-precision BioZ sensing with high-impedance dry electrodes, the IC utilizes three key techniques as follows: 1) a bias control loop (BCL) to eliminate the excitation current mismatch, reducing the voltage fluctuation on high-impedance input; 2) a quiet-chopping current feedback instrumentation amplifier (QC-CFIA) to mitigate the input-signal-dependent noise; and 3) a full pre-charge (FPC) technique to cancel the input parasitic capacitance for impedance boosting. Manufactured in a 0.18-$\mu\text{m}$ CMOS process, the BioZ prototype IC occupies an area of 0.4 mm² while consuming 15.8- and 14.4–128.4-$\mu\text{W}$ current, respectively, from the amplifier and the excitation current generator (CG). With these proposed techniques, this IC achieves a high input impedance of 100 $\text{M}\Omega$ at 50 kHz, 0.5-$\text{m}\Omega/\sqrt{\text{Hz}}$ sensitivity at 1 Hz, and a 106-dB signal-to-noise ratio (SNR). A gel-free respiration and impedance cardiography (ICG) recording has been successfully demonstrated on the human body with four 0.45-cm² dry electrodes.

3 citations


Journal ArticleDOI
TL;DR: In this article, the authors demonstrate a fully integrated broadband four-channel phased array transceiver, capable of wireless data rates up to 200 Gb/s covering the entire $D$-band (110–170 GHz).
Abstract: This article demonstrates a fully integrated broadband four-channel phased array transceiver, capable of wireless data rates up to 200 Gb/s covering the entire $D$-band (110–170 GHz). The circuit is developed in a 130-nm SiGe BiCMOS technology, featuring $f_{\text{t}}/f_{\text{max}}$ of 300/500 GHz, and includes localized back-side etching-based on-chip patch antennas. In both transmit and receive modes, direct up- and down-conversions are performed by in-phase and quadrature mixers driven by a multiplier-by-four local oscillator chain. A bidirectional true time delay circuit, with a resolution of 0.446 ps, which is equivalent to the accuracy of a 4-bit phase shifter, provides the squint-free beam-steering capability. Beam-steering measurements show that the beam can be steered from −45° to 45° in 7° steps. The transceiver achieves a 3-dB baseband bandwidth of 30 and 27 GHz in the transmit and receive modes, respectively. A wireless link demonstration is performed by mounting two chips on printed circuit boards, one in the transmit and one in the receive mode, together with plastic lenses on both sides, at a distance of 15 cm. Hardware-in-the-loop measurements show record data rates of 180 Gb/s with EVM of 12.2% using 16-QAM and 200 Gb/s with 8.3% EVM using 32-QAM. The four-channel transceiver consumes 1.95 and 2.5 W in the receive and transmit modes, respectively, which correspond to power efficiencies of 9.75 pJ/bit in the receiver mode and 12.5 pJ/bit in the transmitter mode.

3 citations
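
Two figures in the phased-array abstract can be reproduced with short calculations: the energy per bit (power divided by data rate) and the phase step that a 0.446-ps delay represents. The sketch below evaluates the phase at 140 GHz, the center of the 110–170 GHz band; that evaluation frequency is an assumption made here, and the function names are hypothetical.

```python
def energy_per_bit_pj(power_w, rate_gbps):
    """Energy per transmitted bit in picojoules."""
    return power_w / (rate_gbps * 1e9) * 1e12


def delay_to_phase_deg(delay_s, freq_hz):
    """Phase shift in degrees produced by a time delay at a given frequency."""
    return 360.0 * freq_hz * delay_s


if __name__ == "__main__":
    print("RX pJ/bit:", round(energy_per_bit_pj(1.95, 200), 2))
    print("TX pJ/bit:", round(energy_per_bit_pj(2.5, 200), 2))
    step = delay_to_phase_deg(0.446e-12, 140e9)  # assumed band center
    print("TTD step at 140 GHz (deg):", round(step, 2))
    print("4-bit phase-shifter LSB (deg):", 360 / 2**4)
```

At the 200-Gb/s rate the two power figures give 9.75 and 12.5 pJ/bit, and the delay step comes out close to the 22.5° step of a 4-bit phase shifter.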


Journal ArticleDOI
TL;DR: In this article, a wideband watt-level digital power amplifier (DPA) with high efficiency and large dynamic range is presented in CMOS technology for wireless applications, using a wideband matching network based on a reconfigurable power-combining transformer.
Abstract: In this article, a wideband watt-level digital power amplifier (DPA) with high efficiency and large dynamic range is presented in CMOS technology for wireless applications. To achieve high output power with enhanced operation bandwidth (BW), a wideband matching network based on a reconfigurable power-combining transformer is used. Meanwhile, an $LC$ circuit is used to suppress the harmonics, which further improves the output power of the fundamental signal. In addition, the LO leakage is suppressed by the 12-bit power digital-to-analog converter (power DAC), which leads to the high dynamic range of the proposed DPA. To verify the mechanism, a 1.2–3.6-GHz watt-level 12-bit polar DPA is implemented and fabricated using a conventional 40-nm CMOS technology. With 1.1-/2.5-V supply, the fabricated DPA exhibits peak output power ($P_{\text{out}}$) of 32.67 dBm, peak drain efficiency (DE) of 45.1%, and peak power-added efficiency (PAE) of 35.5% at 2 GHz. It supports 50-MSym/s 256-QAM with average output power ($P_{\text{avg}}$) of 22.76 dBm, error vector magnitude (EVM) of −31.46 dB, and adjacent channel leakage ratio (ACLR) of −30.67 dBc, 10-MSym/s 1024-QAM with $P_{\text{avg}}$ of 25.54 dBm, EVM of −38.2 dB, and ACLR of −38.71 dBc, and 5-MSym/s 4096-QAM with $P_{\text{avg}}$ of 22.97 dBm, EVM of −43.0 dB, and ACLR of −46.32 dBc, respectively.
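
To make the "watt-level" claim concrete, the dBm figures in the DPA abstract can be converted to watts. The sketch below also derives the DC power implied by the drain efficiency, using the common definition DE = P_out / P_DC, and the back-off of the 256-QAM average power from the peak. The helper names are hypothetical, and the abstract does not state which efficiency definition the authors used.

```python
def dbm_to_watt(p_dbm):
    """Convert power in dBm to watts."""
    return 10 ** (p_dbm / 10) / 1000


def dc_power_w(p_out_w, drain_efficiency):
    """DC power implied by DE = P_out / P_DC (assumed definition)."""
    return p_out_w / drain_efficiency


def backoff_db(p_peak_dbm, p_avg_dbm):
    """Output back-off of the average power relative to the peak."""
    return p_peak_dbm - p_avg_dbm


if __name__ == "__main__":
    p_out = dbm_to_watt(32.67)
    print(f"peak output power: {p_out:.3f} W")
    print(f"implied DC power at 45.1% DE: {dc_power_w(p_out, 0.451):.2f} W")
    print(f"256-QAM back-off: {backoff_db(32.67, 22.76):.2f} dB")
```

The peak output of 32.67 dBm is roughly 1.85 W, and the 256-QAM signal runs about 9.9 dB below that peak.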

Journal ArticleDOI
TL;DR: In this article, the authors propose an nvCIM macro featuring a direct-current-free time-space-based in-memory computing (DCFTS-IMC) scheme, a wordline-based serial access computing (WSAC) scheme, an integration-based voltage-to-time converter (IVTC), and a hidden-latency time-to-MAC value conversion (HLTMC) scheme.
Abstract: Compute-in-memory macros based on non-volatile memory (nvCIM) make it possible for artificial intelligence (AI) edge devices to perform energy-efficient multiply-and-accumulate (MAC) operations by minimizing the movement of data between the processors and memory. However, nvCIM imposes tradeoffs between energy efficiency, computing latency, and readout accuracy against process variation. To overcome these challenges, this work proposes an nvCIM macro featuring: 1) a direct-current-free time-space-based in-memory computing (DCFTS-IMC) scheme; 2) a wordline-based serial access computing (WSAC) scheme; 3) an integration-based voltage-to-time converter (IVTC); and 4) a hidden-latency time-to-MAC value conversion (HLTMC) scheme. The proposed 22-nm 8-Mb resistive random access memory-CIM (ReRAM-CIM) macro was fabricated to demonstrate MAC operations with 8-b input, 8-b weight, and 19-b output. Our nvCIM macro achieved computing latency of 14.4 ns under 8-b precision with an energy efficiency of 21.6 TOPS/W.

Journal ArticleDOI
TL;DR: PIMCA as discussed by the authors is a programmable in-memory computing accelerator for low-precision (1-2 b) deep neural network (DNN) inference, which integrates 108 of such IMC static random-access memory (SRAM) macros with the custom six-stage pipeline and the custom instruction set architecture (ISA) for instruction-level programmability.
Abstract: This article presents a programmable in-memory computing accelerator (PIMCA) for low-precision (1–2 b) deep neural network (DNN) inference. The custom 10T1C bitcell in the in-memory computing (IMC) macro has four additional transistors and one capacitor to perform capacitive-coupling-based multiply and accumulation (MAC) in the analog-mixed-signal (AMS) domain. A macro containing $256 \times 128$ bitcells can simultaneously activate all the rows, and as a result, it can perform a vector-matrix multiplication (VMM) in one cycle. PIMCA integrates 108 such IMC static random-access memory (SRAM) macros with the custom six-stage pipeline and the custom instruction set architecture (ISA) for instruction-level programmability. The results of IMC macros are fed to a single-instruction-multiple-data (SIMD) processor for other computations such as partial sum accumulation, max-pooling, and activation functions. To effectively use the IMC and SIMD datapath, we customize the ISA especially by adding hardware loop support, which reduces the program size by up to 73%. The accelerator is prototyped in a 28-nm technology, and integrates a total of 3.4-Mb IMC SRAM and 1.5-Mb off-the-shelf activation SRAM, demonstrating one of the largest IMC accelerators to date. It achieves the system-level energy efficiency of 437 TOPS/W and the peak throughput of 49 TOPS at the 42-MHz clock frequency and 1-V supply for the VGG9 and the ResNet-18 on the CIFAR-10 dataset.
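
The quoted 3.4-Mb IMC capacity can be cross-checked against the macro count and macro size given in the PIMCA abstract. The sketch below assumes one stored bit per bitcell and that "Mb" is counted in binary units (2^20 bits); both are assumptions made for this check, not statements from the paper.

```python
MACRO_ROWS = 256   # from the abstract
MACRO_COLS = 128   # from the abstract
NUM_MACROS = 108   # from the abstract


def macro_bits(rows=MACRO_ROWS, cols=MACRO_COLS):
    """Bits in one macro, assuming one bit per bitcell."""
    return rows * cols


def total_capacity_mbit(macros=NUM_MACROS):
    """Total IMC capacity in binary megabits (2**20 bits), an assumed unit."""
    return macros * macro_bits() / 2**20


if __name__ == "__main__":
    print("bits per macro:", macro_bits())
    print("total IMC SRAM (Mb):", total_capacity_mbit())
```

With these assumptions the 108 macros hold 3.375 Mb, which rounds to the 3.4 Mb quoted in the abstract.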

Journal ArticleDOI
TL;DR: In this article, a $D$-band joint radar-communication CMOS transceiver featuring a dual-function mode multiplexer, a power-combining PA with high output power, a current choking high-gain mixer, a two-point modulation (TPM) frequency-modulated continuous-wave (FMCW) digital phase-locked loop (PLL), and a wideband I/Q local oscillator (LO) generator is implemented in 28-nm CMOS technology.
Abstract: A $D$-band joint radar-communication complementary metal–oxide–semiconductor (CMOS) transceiver featuring a dual-function mode multiplexer, a power-combining PA with high output power, a current choking high-gain mixer, a two-point modulation (TPM) frequency-modulated continuous-wave (FMCW) digital phase-locked loop (PLL) with a dual-core DCO, and a wideband I/Q local oscillator (LO) generator is implemented in 28-nm CMOS technology. In the radar mode, the RF front end demonstrates 46-GHz bandwidth (BW), and the on-chip PLL/LO generated FMCW chirp achieves a BW of 30 GHz and a slope of 30 GHz/$50~\mu\text{s}$. In the communication mode, the transceiver including the analog baseband realizes 20-GHz BW, and the image rejection ratio (IRR) is better than 40 dB. The measured transmitter (TX) saturated output power is 13 dBm, and the output 1-dB compression point (OP1dB) is 8.3 dBm. The measured typical PLL phase noise is −111.3 dBc/Hz at a 1-MHz offset from an 11.69-GHz carrier frequency. The TX-to-RX over-the-air (OTA) modulation–demodulation measurement with QPSK and 16-QAM signals shows the error vector magnitude (EVM) of −16.5 and −19.7 dB, respectively.
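
The chirp parameters in the radar-mode description map onto two textbook FMCW quantities: the range resolution c/(2B) and the beat frequency 2RS/c for a target at range R with chirp slope S. The sketch below applies these standard relations to the 30-GHz, 50-µs chirp; the 1-m target range is a hypothetical example value, and none of the results are figures reported in the abstract.

```python
C = 299_792_458.0  # speed of light in m/s


def range_resolution_m(chirp_bw_hz):
    """Ideal FMCW range resolution, c / (2 * B)."""
    return C / (2 * chirp_bw_hz)


def chirp_slope_hz_per_s(chirp_bw_hz, chirp_time_s):
    """Chirp slope S = B / T."""
    return chirp_bw_hz / chirp_time_s


def beat_frequency_hz(range_m, slope_hz_per_s):
    """Beat frequency for a stationary target: f_b = 2 * R * S / c."""
    return 2 * range_m * slope_hz_per_s / C


if __name__ == "__main__":
    slope = chirp_slope_hz_per_s(30e9, 50e-6)
    print(f"range resolution: {range_resolution_m(30e9) * 1e3:.2f} mm")
    print(f"chirp slope: {slope:.3e} Hz/s")
    # Hypothetical target at 1 m, used only to illustrate the formula.
    print(f"beat frequency at 1 m: {beat_frequency_hz(1.0, slope) / 1e6:.2f} MHz")
```

A 30-GHz chirp corresponds to an ideal range resolution of about 5 mm, which is why wide chirp bandwidth matters for this kind of radar.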

Journal ArticleDOI
TL;DR: In this article, the authors propose a 9T-SRAM in-memory computing (IMC)-based region proposal (RP) network for event-based binary image (EBBI) frames from an NVS.
Abstract: Neuromorphic vision sensors (NVSs) are key enablers of energy savings in Internet of Things (IoT)-based traffic monitoring and surveillance systems that exploit the temporal redundancy in video streams. However, for these scenarios, an object typically occupies a fraction of the full image frame, leading to significant spatial redundancy in the active image. Hence, there is a need for energy-efficient, dedicated hardware to detect the regions of interest (RoIs) to exploit spatial redundancy in the valid frames and reduce computations in the succeeding recognition modules. This article proposes a 9T-SRAM in-memory computing (IMC)-based region proposal (RP) network for event-based binary image (EBBI) frames from an NVS. The proposed 9T-SRAM cell enables a 1-D projection of objects on the horizontal and vertical axes of an image. An iterative and selective search (ISS) of the rising and falling edges of the 1-D projection yields the coordinates of a bounding box encapsulating an object. To demonstrate the energy savings and effectiveness of the algorithm, we fabricated the proposed architecture, the RP integrated circuit (RPIC), in a 65-nm CMOS process. Tested with the video recordings from a Dynamic and Active-pixel Vision Sensor (DAVIS), the RPIC achieves a peak throughput of 1259 ft/s at 1 Meps event rate. Moreover, the proposed RP architecture achieves a high energy efficiency of 389 TOPS/W due to in-memory operation.
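
The idea of locating an object from the rising and falling edges of 1-D projections can be sketched in software. The example below is a simplified, single-pass illustration that returns one box around all active pixels; it is not the iterative and selective search (ISS) from the paper, and the function names and the list-of-lists frame format are hypothetical.

```python
def projections(frame):
    """Column and row sums of a binary frame (list of equal-length rows)."""
    rows = [sum(r) for r in frame]
    cols = [sum(c) for c in zip(*frame)]
    return cols, rows


def edges(profile):
    """(start, end) index pairs for each run of non-zero profile entries."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v and start is None:
            start = i                      # rising edge
        elif not v and start is not None:
            runs.append((start, i - 1))    # falling edge
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


def bounding_box(frame):
    """(x_min, y_min, x_max, y_max) around all active pixels, or None."""
    cols, rows = projections(frame)
    x_runs, y_runs = edges(cols), edges(rows)
    if not x_runs or not y_runs:
        return None
    return (x_runs[0][0], y_runs[0][0], x_runs[-1][1], y_runs[-1][1])


if __name__ == "__main__":
    frame = [
        [0, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 0],
        [0, 0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    print(bounding_box(frame))
```

Separating multiple objects would require pairing the runs on both axes and re-projecting inside each candidate region, which is where an iterative search becomes necessary.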

Journal ArticleDOI
TL;DR: In this article, an adaptive total-CMRR (TCMRR) enhancing loop is implemented in parallel with a charge-pump-based common-mode suppressing loop (CMSL); it adjusts the common-mode interference (CMI) current through each contact impedance so that common-mode (CM) to differential-mode (DM) conversion is minimized.
Abstract: In this article, we present an electrocardiogram (ECG) amplifier that has a large total CMRR (TCMRR) regardless of contact impedance mismatch and a large tolerance to common-mode interference (CMI). To achieve these features, an adaptive TCMRR enhancing loop is implemented in parallel with a charge-pump-based common-mode suppressing loop (CMSL). The adaptive loop adjusts the CMI current through each contact impedance so that common-mode (CM) to differential-mode (DM) conversion is minimized. We also propose a fast settling technique for the adaptive loop so that contact impedance variation can be tracked fast enough to allow robust ECG acquisition. A prototype chip fabricated in 180-nm CMOS achieves TCMRR larger than 105 dB even when there is contact impedance mismatch of up to 30%. It also achieves tolerance to CMI of 18 V$_{\text{PP}}$ at 60 Hz and input-referred noise of 1.90 $\mu\text{V}_{\text{rms}}$ while consuming 43.3 $\mu\text{W}$.

Journal ArticleDOI
TL;DR: In this paper, a digital low dropout (DLDO) regulator with a feedforward controller and weight redistribution algorithm (WRA) for line regulation improvement is proposed, which achieves peak current efficiency of 99.99% at heavy loads.
Abstract: In this article, a digital LDO with a feedforward controller and weight redistribution algorithm (WRA) for line regulation improvement is proposed. The proposed digital low dropout (DLDO) uses a feedforward path to obtain the information of $V_{\mathrm{IN}}$ and applies WRA and a body voltage controller to adjust $I_{\mathrm{OUT}}$ to minimize the output voltage ripple $\Delta V_{\mathrm{OUT}}$. Different from conventional freeze mode, the feedforward control (FFC) with low quiescent current can keep $\Delta V_{\mathrm{OUT}} < 0.5$ mV in steady state and $\Delta V_{\mathrm{OUT}} < 4$ mV during line transient. In order for the feedback loop to rapidly wake up, the transient pump circuit is used to reduce the undershoot to less than 30 mV in the case of load change from 1 to 200 mA. Due to low quiescent current in the FFC, the DLDO achieves peak current efficiency of 99.99% at heavy loads.

Journal ArticleDOI
TL;DR: In this paper, a 15-bit self-timed incremental ADC for event-driven applications is presented, which is based on an asynchronous zoom ADC structure and a fully differential self-timed dynamic-amplifier (DA)-based integrator.
Abstract: A 15-bit self-timed incremental analog-to-digital converter (ADC) for event-driven applications is presented. It is based on an asynchronous zoom ADC structure and a fully differential self-timed dynamic-amplifier (DA)-based integrator. Dynamic element matching (DEM) and chopping techniques mitigate the mismatch and noise. Measurements show that the ADC, without an oversampled clock, achieves 93-dB SNR in a conversion time of 0.37 ms while consuming only 4.96-$\mu\text{A}$ current from a 1-V supply. This corresponds to a Schreier FoM of 177.3 dB. The 0.23-mm² chip was fabricated in a standard 55-nm CMOS process.
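
The Schreier figure of merit quoted for the incremental ADC can be reproduced from the other numbers in the abstract. The sketch below uses the usual form FoM_S = SNR + 10·log10(BW / P) and assumes the effective bandwidth is 1 / (2·T_conv), a common convention for incremental converters that the abstract does not state explicitly; the function name is hypothetical.

```python
import math


def schreier_fom_db(snr_db, conversion_time_s, supply_v, current_a):
    """Schreier FoM with bandwidth taken as 1 / (2 * T_conv) (assumed)."""
    bandwidth_hz = 1.0 / (2.0 * conversion_time_s)
    power_w = supply_v * current_a
    return snr_db + 10.0 * math.log10(bandwidth_hz / power_w)


if __name__ == "__main__":
    fom = schreier_fom_db(93.0, 0.37e-3, 1.0, 4.96e-6)
    print(f"Schreier FoM: {fom:.1f} dB")
```

Under that bandwidth assumption the result lands within about 0.1 dB of the reported 177.3 dB.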

Journal ArticleDOI
TL;DR: In this paper, a 60-MS/s 5-MHz BW noise-shaping (NS) successive-approximation-register (SAR) ADC with an integrated highly linear input buffer is presented.
Abstract: This article presents a 60-MS/s 5-MHz BW noise-shaping (NS) successive-approximation-register (SAR) analog-to-digital converter (ADC) with an integrated highly linear input buffer in a 40-nm CMOS process. A dynamic level-shifting (DLS) technique is proposed to adaptively adjust the output common-mode voltage of the integrated input buffer for different operation phases, achieving the optimal linearity while alleviating the voltage breakdown problem at the capacitive digital-to-analog converter (CDAC) top plate. The mismatch-error-shaping (MES) technique is utilized to minimize the CDAC mismatch, which also generates undesired inter-symbol-interference (ISI) errors as a side effect. An ISI-error correction (IEC) technique is proposed to mitigate this problem and further improve the linearity. Both foreground and background calibrations are adopted in the subranging ADC and the NS filter to improve the gain accuracy. This SAR ADC prototype, with an integrated input buffer driving a 4.4-pF CDAC within 2.7 ns, achieves signal-to-noise ratio (SNR)/signal-to-noise-and-distortion ratio (SNDR)/spurious-free dynamic range (SFDR)/dynamic range (DR)/total harmonic distortion (THD) of 85.4 dB/84.2 dB/97.3 dBc/86.4 dB/−92.9 dBc while consuming 8.06 mW from 2.5- and 1.1-V dual supply voltages. The Walden FoM (FoM$_{\mathrm{W}}$) and Schreier FoM (FoM$_{\mathrm{S}}$) are 60.6 fJ/conv.-step and 172.1 dB, respectively.
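
Both figures of merit for the NS-SAR ADC follow from the SNDR, bandwidth, and power in the abstract. The sketch below uses the standard definitions FoM_S = SNDR + 10·log10(BW / P) and FoM_W = P / (2·BW·2^ENOB) with ENOB = (SNDR − 1.76) / 6.02. It is assumed here that the authors used SNDR and the 5-MHz bandwidth in both formulas, and the function names are hypothetical.

```python
import math


def enob_bits(sndr_db):
    """Effective number of bits from SNDR."""
    return (sndr_db - 1.76) / 6.02


def schreier_fom_db(sndr_db, bandwidth_hz, power_w):
    """Schreier FoM in dB."""
    return sndr_db + 10.0 * math.log10(bandwidth_hz / power_w)


def walden_fom_j(sndr_db, bandwidth_hz, power_w):
    """Walden FoM in joules per conversion step."""
    return power_w / (2.0 * bandwidth_hz * 2.0 ** enob_bits(sndr_db))


if __name__ == "__main__":
    sndr, bw, p = 84.2, 5e6, 8.06e-3
    print(f"ENOB: {enob_bits(sndr):.2f} bits")
    print(f"FoM_S: {schreier_fom_db(sndr, bw, p):.1f} dB")
    print(f"FoM_W: {walden_fom_j(sndr, bw, p) * 1e15:.1f} fJ/conv.-step")
```

The Schreier value matches the reported 172.1 dB, while the Walden value from this sketch comes out a few tenths of a femtojoule above the reported 60.6 fJ/conv.-step, a gap that may come from rounding of the inputs.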

Journal ArticleDOI
TL;DR: In this article, a threshold implementation masking-based NN accelerator is proposed to secure model parameters and inputs against power and electromagnetic (EM) side-channel attacks, which reduces the area and energy overhead to 64% and 5.5
Abstract: With the recent advancements in machine learning (ML) theory, many energy-efficient neural network (NN) accelerators have been developed. However, their associated side-channel security vulnerabilities pose a major concern. There have been several proof-of-concept attacks demonstrating the extraction of their model parameters and input data. This work introduces a threshold implementation (TI) masking-based NN accelerator that secures model parameters and inputs against power and electromagnetic (EM) side-channel attacks. The 0.159-mm$^2$ demonstration in 28 nm runs at 125 MHz at 0.95 V and limits the area and energy overhead to 64% and $5.5\times$, respectively, while demonstrating security even beyond 2M traces. The accelerator also secures model parameters through encryption and the inputs against horizontal power analysis (HPA) attacks.

Journal ArticleDOI
TL;DR: In this article, a 28-GHz fully differential four-channel beamforming front-end IC with variable gain phase shifters (VGPSs) is presented, in which orthogonal phase and gain control is achieved in a single block using the dual-vector synthesis technique.
Abstract: A 28-GHz fully differential four-channel beamforming front-end IC with variable gain phase shifters (VGPSs) is presented, in which orthogonal phase and gain control is achieved in a single block using the proposed dual-vector synthesis technique. This greatly reduces chip size, power consumption, and calibration complexity. The antenna switch embedded in the front-end matching network minimizes degradation of the transmitter (TX) efficiency and receiver (RX) noise figure. A multi-mode power amplifier (PA) with a built-in linearizer provides high efficiency and linearity in all power modes, which reduces power consumption during gain control. Also, a differential four-way power divider and a single-pole double-throw (SPDT) switch are introduced in the common path, which have a small chip size and a low insertion loss. By adopting differential structures for all components in the four-channel IC, the effect of parasitic grounding inductances due to bonding and packaging is minimized. Thanks to the proposed dual-vector VGPS, an rms gain error of 0.21 dB and a peak phase variation of ±1.7° are achieved during phase control and gain control, respectively, without any calibration. The four-channel front-end IC also achieves an rms phase error of only 1.4° during simultaneous phase and gain control without any calibration. It has a TX OP1dB of 13.3 dBm and the highest linear output power of 7.2 dBm with 400-MHz 64-QAM fifth-generation (5G) new radio (NR) signals with 9.6-dB peak-to-average power ratio (PAPR). Also, this article presents a 64-element brick-type phased array antenna using four-channel core chips. It shows a high effective isotropic radiated power (EIRP) of 54.4 dBm at 28 GHz. An over-the-air link is demonstrated with a competitive data rate of 4 Gb/s using 64-QAM waveforms over all scan angles at a distance of 100 m.
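
To put the EIRP and link distance in context, the sketch below estimates the power that would reach an isotropic (0-dBi) receive antenna at 100 m. It assumes ideal free-space propagation via the Friis path-loss formula and ignores atmospheric, pointing, and implementation losses. This is an illustrative estimate under those assumptions, not the authors' measured link budget, and the function and variable names are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0

def free_space_path_loss_db(distance_m, frequency_hz):
    """Friis free-space path loss in dB between isotropic antennas."""
    return 20.0 * math.log10(
        4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT_M_S
    )

# Values quoted in the abstract.
eirp_dbm = 54.4
distance_m = 100.0
frequency_hz = 28e9

fspl_db = free_space_path_loss_db(distance_m, frequency_hz)
# Assumption: 0-dBi receive antenna, no additional losses.
rx_isotropic_dbm = eirp_dbm - fspl_db
print(f"Free-space path loss: {fspl_db:.1f} dB")
print(f"Power at an isotropic receiver: {rx_isotropic_dbm:.1f} dBm")
```

Any receive antenna gain would add directly to this figure in dB, so the estimate is a lower bound for a receiver with a directive antenna under the same free-space assumption.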

Journal ArticleDOI
TL;DR: In this paper, the authors present a recording front end for high-density CMOS neuronal probes with in situ digitization and electrode offset voltage compensation, which is based on a continuous-time (CT) two-step (TS) incremental delta-sigma ADC.
Abstract: This article presents a recording front end for high-density CMOS neuronal probes with in situ digitization and electrode offset voltage compensation. The analog front end (AFE) is based on a continuous-time (CT) two-step (TS) incremental delta–sigma (I$\Delta\Sigma$) analog-to-digital converter (ADC) with an extended counting technique and features an input offset voltage compensation of 120 mVpp. Hardware sharing in the TS quantization process allows the integration of the front end in an area of only 0.0046 mm$^2$ and, thus, directly under the recording electrodes on the shank of a probe in 180-nm CMOS. The average integrated noise is as low as 4.88 $\mu\text{V}_{\text{rms}}$, 4.46 $\mu\text{V}_{\text{rms}}$, and 2.51 $\mu\text{V}_{\text{rms}}$ in the full bandwidth of 0 Hz–10 kHz, in the frequency band of action potentials (AP, 0.3–10 kHz) and local field potentials (LFP, 0.5 Hz–1 kHz), respectively. Each recording front end consumes 8.57 $\mu\text{W}$, and transmitting the digitized data to an external host requires an additional 6.05 $\mu\text{W}$ per channel.

Journal ArticleDOI
TL;DR: In this article, a wideband watt-level digital power amplifier (DPA) with high efficiency and large dynamic range is presented in CMOS technology for wireless applications, where a wideband matching network based on a reconfigurable power-combining transformer is used.
Abstract: In this article, a wideband watt-level digital power amplifier (DPA) with high efficiency and large dynamic range is presented in CMOS technology for wireless applications. To achieve high output power with enhanced operation bandwidth (BW), the wideband matching network based on a reconfigurable power-combining transformer is used. Meanwhile, the $LC$ circuit is used to suppress the harmonics, which further improves the output power of the fundamental signal. In addition, the LO leakage is suppressed by the 12-bit power digital-to-analog converter (power DAC), which leads to high dynamic range of the proposed DPA. To verify the mechanism, a 1.2–3.6-GHz watt-level 12-bit polar DPA is implemented and fabricated using a conventional 40-nm CMOS technology. With 1.1-/2.5-V supply, the fabricated DPA exhibits peak output power ($P_{\text{out}}$) of 32.67 dBm, peak drain efficiency (DE) of 45.1%, and peak power-added efficiency (PAE) of 35.5% at 2 GHz. It supports 50-MSym/s 256-QAM with average output power ($P_{\text{avg}}$) of 22.76 dBm, error vector magnitude (EVM) of −31.46 dB, and adjacent channel leakage ratio (ACLR) of −30.67 dBc, 10-MSym/s 1024-QAM with $P_{\text{avg}}$ of 25.54 dBm, EVM of −38.2 dB, and ACLR of −38.71 dBc, and 5-MSym/s 4096-QAM with $P_{\text{avg}}$ of 22.97 dBm, EVM of −43.0 dB, and ACLR of −46.32 dBc, respectively.
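
The logarithmic figures above are easier to compare in linear units. The sketch below converts the peak output power from dBm to watts, confirming the "watt-level" description, and converts each EVM from dB to percent using the standard relations P = 1 mW · 10^(dBm/10) and EVM% = 100 · 10^(dB/20). The helper names are illustrative and not taken from the paper.

```python
def dbm_to_watts(power_dbm):
    """Convert power in dBm to watts."""
    return 1e-3 * 10.0 ** (power_dbm / 10.0)

def evm_db_to_percent(evm_db):
    """Convert EVM in dB (an amplitude ratio) to percent."""
    return 100.0 * 10.0 ** (evm_db / 20.0)

# Values quoted in the abstract.
peak_pout_w = dbm_to_watts(32.67)
evm_percent = {
    "256-QAM": evm_db_to_percent(-31.46),
    "1024-QAM": evm_db_to_percent(-38.2),
    "4096-QAM": evm_db_to_percent(-43.0),
}

print(f"Peak output power: {peak_pout_w:.2f} W")
for modulation, evm in evm_percent.items():
    print(f"{modulation}: EVM {evm:.2f} %")
```

The conversion shows why denser constellations need the lower EVM figures: each step from 256-QAM to 4096-QAM roughly halves the tolerable error magnitude.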

Journal ArticleDOI
TL;DR: In this paper, a 40.68-MHz active rectifier for high-current biomedical implants is proposed, in which cycle-based timing control improves the power conversion efficiency (PCE) and voltage conversion ratio (VCR) of the rectifier.
Abstract: This article presents a 40.68-MHz active rectifier for high-current biomedical implants. Cycle-based timing control (CBTC) is proposed to significantly extend the duration for compensating both turn-on and turn-off delays of active diodes at 40.68 MHz, thereby improving both the power conversion efficiency (PCE) and the voltage conversion ratio (VCR) of the rectifier. The supply-independent ramp, the delay-mimicking sample-and-hold (DMSH) circuit, and the low-voltage (LV)-stress startup scheme are also developed to maintain high PCE and VCR, and ensure rectifier reliability over a wide input range. Implemented in standard 0.18-$\mu\text{m}$ CMOS, the proposed fully integrated rectifier delivers a maximum output power of 207 mW and operates at 40.68 MHz to enable the use of a small-diameter receiver coil of 4 mm. This rectifier is the first to achieve full turn-on and turn-off delay compensation at 40.68 MHz, so its maximum PCE and VCR reach 86% and 96.3%, respectively. Compared with the prior art, this work not only provides more stable high PCE and VCR over a wider input range from 1.9 to 3.8 V at 40.68 MHz but also uses the smallest receiver coil for wireless power delivery.

Journal ArticleDOI
TL;DR: In this paper, the authors propose to run CNNs in a deep layer fusion mode, dubbed depth-first execution, made possible by a control flow that supports frequently switching between layers.
Abstract: Applying convolutional neural networks (CNNs) on high-resolution images leads to very large intermediate feature maps (FMs), which dominate the memory traffic. Processing in the classical layer-by-layer order requires storing the complete FMs at once when moving from one layer to the next. As the size of these FMs only realistically allows this in off-chip memory, this leads to high off-chip bandwidth, which comes at great energy costs. The DepFiN processor chip, presented in this article, overcomes this cost by running CNNs in a deep layer fusion mode, dubbed depth-first execution, made possible by a control flow that supports frequently switching between layers. To also tackle the computational cost, the computationally efficient depthwise + pointwise (DW + PW) layer pairs are explicitly supported in DepFiN by a novel accelerator core that can dynamically change its configuration to manage the low computational intensity of the depthwise layers. Benchmarking measurements show the 12-nm DepFiN chip reaching up to 20 TOPS/W peak, 8.2 TOPS/W on the MC-CNN-fast stereo-matching network excluding input-output (IO) power (at 8-bit 0.6 Vdd) and, crucially, 3.95 TOPS/W with the IO power included on the same network and an up to $18\times$ improvement realized by supporting depth-first (MC-CNN-fast at 8-bit, 0.65 V Vdd).
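
The memory argument behind depth-first execution can be illustrated with a small calculation. The sketch below compares the largest intermediate feature map that layer-by-layer execution must hold against the largest intermediate tile needed when a fused stack of layers produces one small output tile at a time. It is a generic tile-based layer-fusion model, not DepFiN's actual control flow. The network shape, tile size, stride-1 "same"-padded convolutions, and 8-bit values are all hypothetical choices, and overlap caching or recomputation between neighboring tiles is ignored.

```python
def layer_by_layer_peak_bytes(height, width, layers, bytes_per_value=1):
    """Largest full intermediate FM (output of every layer except the last)."""
    return max(
        height * width * channels * bytes_per_value
        for _, channels in layers[:-1]
    )

def depth_first_peak_bytes(tile, layers, bytes_per_value=1):
    """Largest intermediate tile needed to produce one tile x tile output patch."""
    peak = 0
    side = tile
    # Walk backward: the input of layer i is the output of layer i - 1, and a
    # k x k stride-1 convolution needs a halo of k - 1 extra rows and columns.
    for i in range(len(layers) - 1, 0, -1):
        kernel, _ = layers[i]
        side += kernel - 1
        channels = layers[i - 1][1]
        peak = max(peak, side * side * channels * bytes_per_value)
    return peak

# Hypothetical workload: four 3x3 layers with 32 output channels on a Full HD frame.
layers = [(3, 32), (3, 32), (3, 32), (3, 32)]  # (kernel size, output channels)
full_bytes = layer_by_layer_peak_bytes(1080, 1920, layers)
tile_bytes = depth_first_peak_bytes(16, layers)

print(f"Layer-by-layer peak intermediate FM: {full_bytes / 1e6:.1f} MB")
print(f"Depth-first peak intermediate tile:  {tile_bytes / 1e3:.1f} kB")
```

In this toy setting the intermediate data shrinks from tens of megabytes to a few kilobytes, which is the kind of reduction that lets intermediate FMs stay on chip instead of going through off-chip memory.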

Journal ArticleDOI
TL;DR: In this paper, a bidirectional static random access memory (SRAM) array structure comprising self-cycling eight-transistor (8T) cells is proposed to achieve full-array Boolean logic operations and read/write in two directions.
Abstract: Computing in-memory (CIM) is a promising new computing method to solve problems caused by von Neumann bottlenecks. It mitigates the need for transmitting large amounts of data between the processing and memory units, significantly decreasing the latency and energy consumption. However, writing back the calculation results for CIM can become a new bottleneck if only parallel computing is implemented. This study proposes a bidirectional static random access memory (SRAM) array structure comprising self-cycling eight-transistor (8T) cells, which can achieve full-array Boolean logic operations and read/write in two directions. The CIM results can be stored back in situ in the bit cells in a single cycle without additional memory. In addition, any data row can be copied into another row by controlling the intermediate transistor in the 8T cell. A 16-kb SRAM was implemented in 28-nm CMOS technology to verify the effectiveness of the proposed design. The throughput of the proposed CIM macro is 1851.4 GOPS. Compared with the existing CIM macros, the throughput is 3–56.6 times higher, and the energy efficiency is as high as 270.5 TOPS/W at a supply voltage of 0.66 V. When the proposed circuits were applied to advanced encryption standard (AES) algorithms, the energy efficiency increased by about 47.5%–63% compared to the von Neumann architecture.

Journal ArticleDOI
TL;DR: In this article, a switched-capacitor (SC)-parallel-inductor buck (CPL-Buck) converter with reduced inductor voltage and current is presented.
Abstract: This article presents a switched-capacitor (SC)-parallel-inductor buck (CPL-Buck) converter with reduced inductor voltage and current. The proposed CPL-Buck converter reduces the voltage stress on the power inductor with a series-connected flying capacitor in one phase and alleviates the current stress with a parallel-connected SC path in both phases. Therefore, it effectively lowers the average inductor current as well as its ripple, allowing the utilization of a small-volume inductor to deliver a large output current. In addition, to cover a wide voltage conversion ratio (VCR) range, the proposed CPL-Buck is able to operate in either a sub-1/3X mode or a sub-1/2X mode. This work, fabricated in 65-nm CMOS, occupies an area of 2.72 mm$^2$. Measurement results show that the proposed CPL-Buck obtains a peak efficiency of 92.9% and a peak current density of 0.3 A/mm$^2$ with a power inductor as small as $1.6\times 0.8\times 0.8$ mm$^3$, with an input range of 3–4.2 V, an output range of 0.6–1 V, and 1.2-A maximum output current.
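
The need for two operating modes follows from the span of voltage conversion ratios implied by the quoted input and output ranges. The sketch below computes VCR = V_out / V_in at the four range corners and reports the tightest named bound each corner falls under. It reads "sub-1/3X" as VCR below 1/3 and "sub-1/2X" as VCR below 1/2, which is an interpretation of the mode names rather than the paper's actual control law; the function name is hypothetical.

```python
def tightest_mode_bound(vcr):
    """Tightest named bound a VCR falls under (interpretation of the mode names)."""
    if vcr < 1.0 / 3.0:
        return "sub-1/3X"
    if vcr < 1.0 / 2.0:
        return "sub-1/2X"
    return "outside both bounds"

# Input and output voltage ranges quoted in the abstract.
input_range_v = (3.0, 4.2)
output_range_v = (0.6, 1.0)

corners = {
    (v_in, v_out): v_out / v_in
    for v_in in input_range_v
    for v_out in output_range_v
}
vcr_min = min(corners.values())
vcr_max = max(corners.values())

for (v_in, v_out), vcr in corners.items():
    print(f"{v_in} V -> {v_out} V: VCR = {vcr:.3f} ({tightest_mode_bound(vcr)})")
print(f"VCR span: {vcr_min:.3f} to {vcr_max:.3f}")
```

Under this reading, only the 3-V to 1-V corner reaches a VCR of 1/3, so the higher-ratio mode is what extends coverage to the top of the quoted range.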