
Showing papers in "IEEE Journal of Solid-state Circuits in 2020"


Journal ArticleDOI
TL;DR: C3SRAM is an in-memory-computing SRAM macro with circuits embedded in bitcells and peripherals to accelerate neural networks with binarized weights and activations; it uses analog-mixed-signal capacitive-coupling computing to evaluate binary multiply-and-accumulate operations, the main computation of binary neural networks.
Abstract: This article presents C3SRAM, an in-memory-computing SRAM macro. The macro is an SRAM module with the circuits embedded in bitcells and peripherals to perform hardware acceleration for neural networks with binarized weights and activations. The macro utilizes analog-mixed-signal (AMS) capacitive-coupling computing to evaluate the main computations of binary neural networks, binary-multiply-and-accumulate operations. Without the need to access the stored weights by individual row, the macro asserts all its rows simultaneously and forms an analog voltage at the read bitline node through capacitive voltage division. With one analog-to-digital converter (ADC) per column, the macro realizes fully parallel vector–matrix multiplication in a single cycle. The network type that the macro supports and the computing mechanism it utilizes are determined by the robustness and error tolerance necessary in AMS computing. The C3SRAM macro is prototyped in a 65-nm CMOS. It demonstrates an energy efficiency of 672 TOPS/W and a speed of 1638 GOPS (20.2 TOPS/mm2), achieving 3975 $\times $ better energy–delay product than the conventional digital baseline performing the same operation. The macro achieves 98.3% accuracy for MNIST and 85.5% for CIFAR-10, which is among the best in-memory computing works in terms of energy efficiency and inference accuracy tradeoff.

144 citations
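
The capacitive voltage division described in this abstract can be illustrated with an idealized model. The sketch below is not the authors' circuit. It assumes one equal unit capacitor per row, driven to VDD when the XNOR of input and weight is 1 and to ground otherwise, so the bitline settles at VDD times the fraction of matching rows. The function names, the +1/-1 encoding, and the noiseless readout are assumptions made for illustration.

```python
def bitline_voltage(inputs, weights, vdd=1.0):
    """Ideal capacitive divider: V = VDD * (rows whose XNOR is 1) / (rows)."""
    # With +1/-1 encoding, XNOR is 1 exactly when input and weight agree.
    matches = sum(1 for x, w in zip(inputs, weights) if x == w)
    return vdd * matches / len(inputs)


def binary_mac(inputs, weights):
    """Reference digital result: sum of +1/-1 products."""
    return sum(x * w for x, w in zip(inputs, weights))


def mac_from_voltage(v, n_rows, vdd=1.0):
    """Invert the ideal divider (stand-in for the ADC): MAC = 2*matches - n_rows."""
    matches = round(v / vdd * n_rows)
    return 2 * matches - n_rows


if __name__ == "__main__":
    x = [1, -1, 1, 1]
    w = [1, 1, -1, 1]
    v = bitline_voltage(x, w)
    print(v, mac_from_voltage(v, len(x)), binary_mac(x, w))
```

In this idealized picture every row contributes in the same cycle, which is why a single conversion per column yields a full dot product.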


Journal ArticleDOI
TL;DR: XNOR-SRAM is a mixed-signal in-memory computing (IMC) SRAM macro that computes ternary-XNOR-and-accumulate (XAC) operations in binary/ternary deep neural networks (DNNs) without row-by-row data access and is among the best in the tradeoff between energy efficiency and DNN accuracy.
Abstract: We present XNOR-SRAM, a mixed-signal in-memory computing (IMC) SRAM macro that computes ternary-XNOR-and-accumulate (XAC) operations in binary/ternary deep neural networks (DNNs) without row-by-row data access. The XNOR-SRAM bitcell embeds circuits for ternary XNOR operations, which are accumulated on the read bitline (RBL) by simultaneously turning on all 256 rows, essentially forming a resistive voltage divider. The analog RBL voltage is digitized with a column-multiplexed 11-level flash analog-to-digital converter (ADC) at the XNOR-SRAM periphery. XNOR-SRAM is prototyped in a 65-nm CMOS and achieves the energy efficiency of 403 TOPS/W for ternary-XAC operations with 88.8% test accuracy for the CIFAR-10 data set at 0.6-V supply. This marks $33\times $ better energy efficiency and $300\times $ better energy–delay product than conventional digital hardware and also represents among the best tradeoff in energy efficiency and DNN accuracy.

130 citations


Journal ArticleDOI
Hongyang Jia1, Hossein Valavi1, Yinqi Tang1, Jintao Zhang2, Naveen Verma1 
TL;DR: This paper presents a programmable in-memory-computing processor, demonstrated in a 65nm CMOS technology, and takes the approach of tight coupling with an embedded CPU, through accelerator interfaces enabling integration in the standard processor memory space.
Abstract: In-memory computing (IMC) addresses the cost of accessing data from memory in a manner that introduces a tradeoff between energy/throughput and computation signal-to-noise ratio (SNR). However, low SNR posed a primary restriction to integrating IMC in larger, heterogeneous architectures required for practical workloads due to the challenges with creating robust abstractions necessary for the hardware and software stack. This work exploits recent progress in high-SNR IMC to achieve a programmable heterogeneous microprocessor architecture implemented in 65-nm CMOS and corresponding interfaces to the software that enables mapping of application workloads. The architecture consists of a 590-Kb IMC accelerator, configurable digital near-memory-computing (NMC) accelerator, RISC-V CPU, and other peripherals. To enable programmability, microarchitectural design of the IMC accelerator provides the integration in the standard processor memory space, area- and energy-efficient analog-to-digital conversion for interfacing to NMC, bit-scalable computation (1–8 b), and input-vector sparsity-proportional energy consumption. The IMC accelerator demonstrates excellent matching between computed outputs and idealized software-modeled outputs, at 1b TOPS/W of 192|400 and 1b-TOPS/mm2 of 0.60|0.24 for MAC hardware, at $V_{DD}$ of 1.2|0.85 V, both of which scale directly with the bit precision of the input vector and matrix elements. Software libraries developed for application mapping are used to demonstrate CIFAR-10 image classification with a ten-layer CNN, achieving accuracy, throughput, and energy of 89.3%|92.4%, 176|23 images/s, and $5.31\mid 105.2~\mu \text{J}$ /image, for 1|4 b quantization levels.

121 citations


Journal ArticleDOI
TL;DR: A static random access memory (SRAM) CIM unit-macro using compact-rule compatible twin-8T cells for weighted CIM MAC operations to reduce area overhead and vulnerability to process variation and an even–odd dual-channel (EODC) input mapping scheme to extend input bandwidth is presented.
Abstract: Computation-in-memory (CIM) is a promising candidate to improve the energy efficiency of multiply-and-accumulate (MAC) operations of artificial intelligence (AI) chips. This work presents a static random access memory (SRAM) CIM unit-macro using: 1) compact-rule compatible twin-8T (T8T) cells for weighted CIM MAC operations to reduce area overhead and vulnerability to process variation; 2) an even–odd dual-channel (EODC) input mapping scheme to extend input bandwidth; 3) a two’s complement weight mapping (C2WM) scheme to enable MAC operations using positive and negative weights within a cell array in order to reduce area overhead and computational latency; and 4) a configurable global–local reference voltage generation (CGLRVG) scheme for kernels of various sizes and bit precision. A 64 $\times $ 60 b T8T unit-macro with 1-, 2-, 4-b inputs, 1-, 2-, 5-b weights, and up to 7-b MAC-value (MACV) outputs was fabricated as a test chip using a foundry 55-nm process. The proposed SRAM-CIM unit-macro achieved access times of 5 ns and energy efficiency of 37.5–45.36 TOPS/W under 5-b MACV output.

120 citations


Journal ArticleDOI
TL;DR: This article presents the first 39-GHz phased-array transceiver (TRX) chipset for fifth-generation new radio (5G NR), consisting of 4 sub-array TRX elements with local-oscillator (LO) phase-shifting architecture and built-in calibration on phase and amplitude.
Abstract: This article presents the first 39-GHz phased-array transceiver (TRX) chipset for fifth-generation new radio (5G NR). The proposed transceiver chipset consists of 4 sub-array TRX elements with local-oscillator (LO) phase-shifting architecture and built-in calibration on phase and amplitude. The calibration scheme is proposed to alleviate phase and amplitude mismatch between each sub-array TRX element, especially for a large-array transceiver system in the base station (BS). Based on LO phase-shifting architecture, the transceiver has a 0.04-dB maximum gain variation over the 360° full tuning range, allowing constant-gain characteristic during phase calibration. A phase-to-digital converter (PDC) and a high-resolution phase-detection mechanism are proposed for highly accurate phase calibration. The built-in calibration has a measured accuracy of 0.08° rms phase error and 0.01-dB rms amplitude error. Moreover, a pseudo-single-balanced mixer is proposed for LO-feedthrough (LOFT) cancellation and sub-array TRX LO-to-LO isolation. The transceiver is fabricated in standard 65-nm CMOS technology with flip-chip packaging. The 8TX–8RX phased-array transceiver module 1-m OTA measurement supports 5G NR 400-MHz 256-QAM OFDMA modulation with −30.0-dB EVM. The 64-element transceiver has an EIRP$_{\text{MAX}}$ of 53 dBm. The four-element chip consumes a power of 1.5 W in the TX mode and 0.5 W in the RX mode.

118 citations


Journal ArticleDOI
TL;DR: A neutralized bi-directional technique is introduced to reduce the chip area significantly, so that compact and low-cost 5G millimeter-wave MIMO systems can be realized.
Abstract: This article presents a low-cost and area-efficient 28-GHz CMOS phased-array beamformer chip for 5G millimeter-wave dual-polarized multiple-in-multiple-out (MIMO) (DP-MIMO) systems. A neutralized bi-directional technique is introduced in this work to reduce the chip area significantly. With the proposed technique, completely the same circuit chain is shared between the transmitter and receiver. To further minimize the area, an active bi-directional vector-summing phase shifter is also introduced. Area-efficient and high-resolution active phase shifting could be realized in both TX and RX modes. In measurement, the achieved saturated output power for the TX-mode beamformer is 15.1 dBm. The RX-mode noise figure is 4.2 dB at 28 GHz. To evaluate the over-the-air performance, 16 H+16 V sub-array modules are implemented in this work. Each of the sub-array modules consists of four 4 H+4 V chips. Two sub-array modules in this work are capable of scanning the beam from −50° to +50°. A saturated EIRP of 45.6 dBm is realized by 32 TX-mode beamformers. Within 1-m distance, a maximum SC-mode data rate of 15 Gb/s and the 5G new radio downlink packets transmission in 256-QAM could be supported by the module. A $2\times 2$ DP-MIMO communication is also demonstrated with two 5G new radio 64-QAM uplink streams. Thanks to the proposed area-efficient bi-directional technique, the required core area for a single element-beamformer is only 0.58 mm2. Compact and low-cost 5G millimeter-wave MIMO systems could be realized.

113 citations


Journal ArticleDOI
TL;DR: A general-purpose hybrid in-/near-memory compute SRAM (CRAM) that combines an 8T transposable bit cell with vector-based, bit-serial in-memory arithmetic to accommodate a wide range of bit-widths, as well as a complete set of operation types, including integer and floating-point addition, multiplication, and division.
Abstract: This article proposes a general-purpose hybrid in-/near-memory compute SRAM (CRAM) that combines an 8T transposable bit cell with vector-based, bit-serial in-memory arithmetic to accommodate a wide range of bit-widths, from single to 32 or 64 bits, as well as a complete set of operation types, including integer and floating-point addition, multiplication, and division. This approach provides the flexibility and programmability necessary for evolving software algorithms ranging from neural networks to graph and signal processing. The proposed design was implemented in a small Internet of Things (IoT) processor in the 28-nm CMOS consisting of a Cortex-M0 CPU and 8 CRAM banks of 16 kB each (128 kB total). The system achieves 475-MHz operation at 1.1 V and, with all CRAMs active, produces 30 GOPS or 1.4 GFLOPS on 32-bit operands. It achieves an energy efficiency of 0.56 TOPS/W for 8-bit multiplication and 5.27 TOPS/W for 8-bit addition at 0.6 V and 114 MHz.

103 citations
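
To make the idea of vector-based, bit-serial arithmetic on transposed data more concrete, here is a minimal software sketch. It is not the CRAM implementation: it assumes unsigned operands stored as bit-planes (plane i holds bit i of every vector element) and performs one full-adder step per bit position across all elements at once. All function names and the data layout are hypothetical.

```python
def to_bit_planes(values, width):
    """Transpose unsigned ints into `width` bit-planes.

    Plane i is an int whose bit j equals bit i of values[j].
    """
    planes = []
    for i in range(width):
        plane = 0
        for j, v in enumerate(values):
            plane |= ((v >> i) & 1) << j
        planes.append(plane)
    return planes


def from_bit_planes(planes, n):
    """Inverse of to_bit_planes for a vector of n elements."""
    return [sum(((p >> j) & 1) << i for i, p in enumerate(planes))
            for j in range(n)]


def bit_serial_add(a_planes, b_planes):
    """Add two transposed vectors, one bit position per step (LSB first)."""
    carry = 0
    out = []
    for a, b in zip(a_planes, b_planes):
        out.append(a ^ b ^ carry)              # sum bit for every element
        carry = (a & b) | (carry & (a ^ b))    # carry bit for every element
    out.append(carry)                          # carry-out forms an extra MSB plane
    return out


if __name__ == "__main__":
    a = [3, 200, 255, 0]
    b = [5, 100, 255, 0]
    total = bit_serial_add(to_bit_planes(a, 8), to_bit_planes(b, 8))
    print(from_bit_planes(total, len(a)))
```

The number of steps in this sketch grows with the operand bit-width rather than with the vector length, which is the property that lets a bit-serial scheme accommodate arbitrary bit-widths.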


Journal ArticleDOI
TL;DR: An energy-efficient comparator design that achieves the highest reported comparator energy efficiency to the best of the authors' knowledge and greatly reduces the influence of the process corner and the input common-mode voltage on the comparator performance, including noise, offset, and delay.
Abstract: This article presents an energy-efficient comparator design. The pre-amplifier adopts an inverter-based input pair powered by a floating reservoir capacitor; it realizes both current reuse and dynamic bias, thereby significantly boosting $g_{m}/I_{D}$ and reducing noise. Moreover, it greatly reduces the influence of the process corner and the input common-mode voltage on the comparator performance, including noise, offset, and delay. A prototype comparator in 180 nm achieves 46-$\mu \text{V}$ input-referred noise while consuming only 1 pJ per comparison under a 1.2-V supply. This represents a more than sevenfold improvement in energy efficiency compared with a strong-arm (SA) latch. It achieves the highest reported comparator energy efficiency to the best of our knowledge.

99 citations


Journal ArticleDOI
TL;DR: A 300-GHz-band 120-Gb/s wireless transceiver (TRX) front end using the in-house InP-based high-electron-mobility-transistor (InP-HEMT) technology for beyond-5G is developed.
Abstract: We developed a 300-GHz-band 120-Gb/s wireless transceiver (TRX) front end using our in-house InP-based high-electron-mobility-transistor (InP-HEMT) technology for beyond-5G. The TRX is composed of the RF power amplifiers (PAs), mixers, and local oscillation (LO) PAs which are all packaged in individual waveguide (WG) modules by using a ridge coupler for low-loss WG-to-IC transition. RF PAs are designed using the low-impedance inter-stage-matching technique to reduce the inter-stage matching loss of the amplifier stages, and the back-side DC line (BDCL) technique is used to simplify the layout and to improve the gain of the PAs. The fabricated RF PAs show a high output 1-dB compression point of more than 6 dBm from 278 to 302 GHz. The mixers are used for both up- and down-conversion in the transmitter and receiver. These mixers are designed to have high conversion gain (CG) over the wideband even after packaging by enhancing the isolation between the RF and IF ports. The measured CG of the mixer module is −15 dB, and the 3-dB IF-bandwidth is 32 GHz. The LO PAs are also designed using the BDCL technique so that they can supply the required LO power to the mixers. The TRX with these InP building blocks enables the data transmission of a 120 Gb/s 16QAM signal over a link distance of 9.8 m.

87 citations


Journal ArticleDOI
TL;DR: This article proposes a serial-input non-weighted product (SINWP) structure; a down-scaling weighted current translator and positive–negative current-subtractor scheme; a current-aware bitline clamper scheme; and a triple-margin small-offset current-mode sense amplifier (TMCSA).
Abstract: Computing-in-memory (CIM) based on embedded nonvolatile memory is a promising candidate for energy-efficient multiply-and-accumulate (MAC) operations in artificial intelligence (AI) edge devices. However, circuit design for NVM-based CIM (nvCIM) imposes a number of challenges, including an area-latency-energy tradeoff for multibit MAC operations, pattern-dependent degradation in signal margin, and small read margin. To overcome these challenges, this article proposes the following: 1) a serial-input non-weighted product (SINWP) structure; 2) a down-scaling weighted current translator (DSWCT) and positive–negative current-subtractor (PN-ISUB); 3) a current-aware bitline clamper (CABLC) scheme; and 4) a triple-margin small-offset current-mode sense amplifier (TMCSA). A 55-nm 1-Mb ReRAM-CIM macro was fabricated to demonstrate the MAC operation of 2-b-input, 3-b-weight with 4-b-out. This nvCIM macro achieved $T_{\text {MAC}}= 14.6$ ns at 4-b-out with peak energy efficiency of 53.17 TOPS/W.

76 citations


Journal ArticleDOI
TL;DR: A unified model description framework and a unified processing architecture (Tianjic), which covers the full stack from software to hardware, and a compatible routing infrastructure that enables homogeneous and heterogeneous scalability on a decentralized many-core network.
Abstract: Toward the long-standing dream of artificial intelligence, two successful solution paths have been paved: 1) neuromorphic computing and 2) deep learning. Recently, they tend to interact for simultaneously achieving biological plausibility and powerful accuracy. However, models from these two domains have to run on distinct substrates, i.e., neuromorphic platforms and deep learning accelerators, respectively. This architectural incompatibility greatly compromises the modeling flexibility and hinders promising interdisciplinary research. To address this issue, we build a unified model description framework and a unified processing architecture (Tianjic), which covers the full stack from software to hardware. By implementing a set of integration and transformation operations, Tianjic is able to support spiking neural networks, biological dynamic neural networks, multilayered perceptron, convolutional neural networks, recurrent neural networks, and so on. A compatible routing infrastructure enables homogeneous and heterogeneous scalability on a decentralized many-core network. Several optimization methods are incorporated, such as resource and data sharing, near-memory processing, compute/access skipping, and intra-/inter-core pipeline, to improve performance and efficiency. We further design streaming mapping schemes for efficient network deployment with a flexible tradeoff between execution throughput and resource overhead. A 28-nm prototype chip is fabricated with >610-GB/s internal memory bandwidth. A variety of benchmarks are evaluated and compared with GPUs and several existing specialized platforms. In summary, the fully unfolded mapping can achieve significantly higher throughput and power efficiency; the semi-folded mapping can save 30x resources while still presenting comparable performance on average. Finally, two hybrid-paradigm examples, a multimodal unmanned bicycle and a hybrid neural network, are demonstrated to show the potential of our unified architecture. This article paves a new way to explore neural computing.

Journal ArticleDOI
TL;DR: This article presents a complete mixed-signal system-on-chip, capable of directly interfacing to an analog microphone and performing keyword spotting (KWS) and speaker verification (SV), without any need for further external accesses.
Abstract: The use of speech-triggered wake-up interfaces has grown significantly in the last few years for use in ubiquitous and mobile devices. Since these interfaces must always be active, power consumption is one of their primary design metrics. This article presents a complete mixed-signal system-on-chip, capable of directly interfacing to an analog microphone and performing keyword spotting (KWS) and speaker verification (SV), without any need for further external accesses. Through the use of: 1) an integrated single-chip digital-friendly design; 2) hardware-aware algorithmic optimization; and 3) memory- and power-optimized accelerators, ultra-low power is achieved while maintaining high accuracy for speech recognition tasks. The 65-nm implementation achieves 18.3-$\mu \text{W}$ worst case power consumption or 10.6-$\mu \text{W}$ power for typical real-time scenarios, $10\times $ below state of the art (SoA).

Journal ArticleDOI
TL;DR: A 12-b 18-GS/s analog-to-digital converter (ADC) implemented in 16-nm FinFET process achieves 80% higher sample rate and 2.4 $\times $ higher input bandwidth, and incorporates a THA that supports a 3.3 $\times $ higher non-interleaved sample rate.
Abstract: We discuss a 12-b 18-GS/s analog-to-digital converter (ADC) implemented in 16-nm FinFET process. The ADC is composed of an integrated high-speed track-and-hold amplifier (THA) driving up to eight interleaved pipeline ADCs that employ open-loop inter-stage amplifiers. Up to 10 GS/s, the THA operates at the full sampling rate using a non-interleaved single sample network, thereby eliminating the interleaving sampling time and bandwidth mismatch. Above 10 GS/s, the THA is programmed to use two ping-ponged, or an optional (2 + 1) randomized, sample networks to spread the residual post-calibration interleaving spurs in the noise floor. The THA enables an input bandwidth of 18 GHz and employs dither injection and optional pseudorandom chopping. In the pipeline stages, dither-based background calibration detects and corrects gain, settling, memory, and kick-back errors. New dither-based background calibration algorithms are employed to detect and correct the arbitrary non-linearity in the form of integral non-linearity (INL) breaks and harmonic distortion up to the fifth order in the THA and in the references, DACs, and inter-stage open-loop amplifiers of the pipeline ADCs. Moreover, new dither-based background calibration is implemented to detect and correct the chopping non-idealities, memory errors, interleaving mismatches, and order-dependent randomization errors. Compared to the fastest state-of-the-art with similar performance, this ADC achieves 80% higher sample rate and 2.4 $\times $ higher input bandwidth, and incorporates a THA that supports a 3.3 $\times $ higher non-interleaved sample rate.

Journal ArticleDOI
TL;DR: A cryo-CMOS microwave signal generator for frequency-multiplexed qubit control is presented; exploiting its on-chip 4096-instruction memory, the capability to translate quantum algorithms to microwave signals is demonstrated by coherently controlling a spin qubit at both 14 and 18 GHz.
Abstract: Building a large-scale quantum computer requires the co-optimization of both the quantum bits (qubits) and their control electronics. By operating the CMOS control circuits at cryogenic temperatures (cryo-CMOS), and hence in close proximity to the cryogenic solid-state qubits, a compact quantum-computing system can be achieved, thus promising scalability to the large number of qubits required in a practical application. This work presents a cryo-CMOS microwave signal generator for frequency-multiplexed control of $4\times 32$ qubits (32 qubits per RF output). A digitally intensive architecture offering full programmability of phase, amplitude, and frequency of the output microwave pulses and a wideband RF front end operating from 2 to 20 GHz allow targeting both spin qubits and transmons. The controller comprises a qubit-phase-tracking direct digital synthesis (DDS) back end for coherent qubit control and a single-sideband (SSB) RF front end optimized for minimum leakage between the qubit channels. Fabricated in Intel 22-nm FinFET technology, it achieves a 48-dB SNR and 45-dB spurious-free dynamic range (SFDR) in a 1-GHz data bandwidth when operating at 3 K, thus enabling high-fidelity qubit control. By exploiting the on-chip 4096-instruction memory, the capability to translate quantum algorithms to microwave signals has been demonstrated by coherently controlling a spin qubit at both 14 and 18 GHz.

Journal ArticleDOI
TL;DR: A scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS) enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains.
Abstract: Custom accelerators improve the energy efficiency, area efficiency, and performance of deep neural network (DNN) inference. This article presents a scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic chip are optimal for specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. Communication energy is minimized with large on-chip distributed weight storage and a hierarchical network-on-chip and network-on-package, and inference energy is minimized through extensive data reuse. The 16-nm prototype achieves 1.29-TOPS/mm2 area efficiency, 0.11 pJ/op (9.5 TOPS/W) energy efficiency, 4.01-TOPS peak performance for a one-chip system, and 127.8 peak TOPS and 1903 images/s ResNet-50 batch-1 inference for a 36-chip system.
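
The abstract quotes energy efficiency both as energy per operation and as TOPS/W. As a quick check of how the two units relate, the sketch below uses only the definition 1 TOPS/W = 10^12 operations per joule; the helper names are hypothetical.

```python
def tops_per_watt_to_pj_per_op(tops_per_watt):
    """1 TOPS/W = 1e12 ops per joule, so energy per op in pJ is the reciprocal."""
    return 1.0 / tops_per_watt


def pj_per_op_to_tops_per_watt(pj_per_op):
    """Inverse conversion: TOPS/W is the reciprocal of pJ per operation."""
    return 1.0 / pj_per_op


if __name__ == "__main__":
    # 9.5 TOPS/W corresponds to about 0.105 pJ/op, which rounds to 0.11 pJ/op.
    print(round(tops_per_watt_to_pj_per_op(9.5), 3))
```

Converting the rounded 0.11 pJ/op back gives roughly 9.1 TOPS/W, so the two quoted figures agree only to within rounding.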

Journal ArticleDOI
TL;DR: A 20-GHz low-power low-noise amplifier (LNA) in 65-nm CMOS is presented, together with an elaborate analysis of the current-reused CG–CS LNA using a transformer-based $g_{\mathrm {m}}$-boost technique and a transformer-based MCR.
Abstract: A 20-GHz low-power low-noise amplifier (LNA) in 65-nm CMOS is presented. The LNA is cascaded with a single-ended $g_{\mathrm {m}}$ -boosted common-gate (CG) stage and a differential neutralized common-source (CS) stage. Current-reuse technique is employed to save the power consumption with little deterioration in gain and noise figure (NF). The transformer-based $g_{\mathrm {m}}$ -boost technique in the CG stage and neutralization technique in CS stage further enhances the RF performances. Inter-stage magnetically coupled resonator (MCR) extends the bandwidth. An elaborate analysis of the current-reused CG–CS LNA using a transformer-based $g_{\mathrm {m}}$ -boost technique and transformer-based MCR is proposed. Fabricated in 65-nm CMOS technology, the LNA achieves a measured power gain of 14.9 dB at 21 GHz with a −3-dB bandwidth of 4.8 GHz. The lowest NF is 3.3 dB at 19.5 GHz and is below 4 dB from 17 to 21 GHz. The LNA consumes 1.9 mW from a 1-V supply, with a chip area of 600 $\mu \text{m}\,\,\times $ 700 $\mu \text{m}$ .

Journal ArticleDOI
TL;DR: A “second-order” passive mixer-first receiver is proposed to improve channel selectivity, linearity, and noise figure (NF) in the presence of out-of-band blockers, by presenting an impedance that rolls off at 40 dB/decade as the load to an N-path filter.
Abstract: A “second-order” passive mixer-first receiver is proposed to improve channel selectivity, linearity, and noise figure (NF) in the presence of out-of-band blockers, by presenting an impedance that rolls off at 40 dB/decade as the load to an N-path filter. The synthesis of this impedance is described in a step-by-step manner starting from the required impedance transfer function to its actual circuit realization. Various tradeoffs and limitations of the architecture are described in detail, and layout-related techniques are also provided. Two integrated circuit prototypes were fabricated in 28-nm bulk CMOS as proof of concept for this circuit, including a low-power version. The receiver, capable of broadband operation from 0.2 to 2 GHz, achieves an out-of-band IIP3 of +33 dBm and a blocker P1dB of +12 dBm. Additionally, it achieves an NF of 4.4 dB with less than 2-dB degradation in NF for a 0-dBm blocker.

Journal ArticleDOI
TL;DR: The first CMOS RX front end that covers 24.5–43.5-GHz mm-Wave 5G bands and supports instantaneous full-band IR with no calibration, switching/tuning elements, or external controls is presented, enabling future wideband low-latency 5G MIMOs.
Abstract: This article presents an extremely broadband 24.5–43.5 GHz receiver (RX) achieving 32–56-dB instantaneous full-band image rejection (IR), which supports multiple major mm-Wave 5G bands at 24.5/28/37/39/43 GHz. A compact transformer-based I/Q network (0.14 mm2) is proposed to generate high-precision LO I/Q signals at millimeter-wave (mm-Wave) and provide built-in load impedance up-transformation for passive voltage amplification, boosting the LO swing for a higher RX conversion gain (CG). The high-quality differential I/Q generation is measured with phase/amplitude variation less than ±1.8°/±0.15 dB over an instantaneous wide bandwidth of 25–50 GHz without any calibration or switching/tunable elements. The RX is measured with a peak 35.2-dB CG and 18-dB gain tuning to accommodate complex EM environments. The RX modulation tests successfully demonstrate receiving 18-Gb/s 64-QAM and 14.4-Gb/s 256-QAM signals. In addition, the RX is tested with concurrent injection of a desired signal and an image, while the image uses the same wideband modulation scheme and data rate as the desired signal. The RX successfully rejects the wideband images and receives the desired signals of 12-Gb/s 64-QAM with −27.6-dB EVM and 8-Gb/s 256-QAM with −33.47-dB EVM. To the best of our knowledge, this article presents the first CMOS RX front end that covers 24.5–43.5-GHz mm-Wave 5G bands and supports instantaneous full-band IR with no calibration, switching/tuning elements, or external controls, enabling future wideband low-latency 5G MIMOs.

Journal ArticleDOI
TL;DR: Three new features are proposed in this article to support wide sparsity distribution efficiently and include a multi-sparsity-compatible set-associative convolution processing element (PE) array, designed to efficiently carry out convolution operations under different sparsity modes.
Abstract: STICKER is an energy-efficient convolutional neural network (NN) processor. It mainly improves energy efficiency by making full use of sparsity. The network sparsity can potentially lower storage and computation requirements. However, the sparsity distribution of both activations and weights ranges from 2% to 99% in different layers or models. Therefore, good support for the sparsity distribution is the key to improve the energy efficiency. Three new features are proposed in this article to support wide sparsity distribution efficiently. First, multi-sparsity control and data flow are implemented for finer sparsity granularity support. It can automatically switch the processor among nine sparsity modes for higher energy efficiency. Second, a multi-mode hierarchical data memory which can be reconfigured for networks with different sparsity modes is designed for higher storage efficiency. Third, a multi-sparsity-compatible set-associative convolution processing element (PE) array is designed to efficiently carry out convolution operations under different sparsity modes, especially when both activations and weights are sparse. STICKER was implemented in a 65-nm CMOS technology. With its wide-range sparsity-supported capacity, the peak energy efficiency reaches 62.1 TOPS/W when sparsity ratios of both activations and weights are 5%. In a completely pruned Alexnet model, STICKER achieves 2.82 TOPS/W energy efficiency 1.8 $\times $ higher than that of the state-of-the-art processors.

Journal ArticleDOI
TL;DR: The design is complemented by a theoretical investigation of $1/f$ noise upconversion caused by short-channel effects in the cross-coupled transistors, obtaining the first instance of a closed-form phase noise expression in the $1/f^{3}$ region.
Abstract: Class-C operation is leveraged to implement a $K$ -band CMOS voltage-controlled oscillator (VCO) where the upconversion of $1/f$ current noise from the cross-coupled transistors in the oscillator core is robustly contained at a very low level. Implemented in a bulk 28-nm CMOS technology, the 12%-tuning-range VCO shows a phase noise as low as −112 dBc/Hz at 1-MHz offset (−86 dBc/Hz at 100 kHz offset) from a 19.5 GHz carrier while consuming 20.7 mW, achieving a figure of merit (FoM) of −185 dBc/Hz. The design is complemented by a theoretical investigation of $1/f$ noise upconversion caused by short-channel effects in the cross-coupled transistors, obtaining the first instance of a closed-form phase noise expression in the $1/f^{3}$ region.
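
The quoted figure of merit can be reproduced from the other numbers in the abstract, assuming the commonly used oscillator definition FoM = L(Δf) − 20·log10(f0/Δf) + 10·log10(P/1 mW). The abstract does not state which definition the authors used, so the formula and the function name below are assumptions.

```python
import math


def vco_fom(phase_noise_dbc_hz, carrier_hz, offset_hz, power_mw):
    """Oscillator FoM in dBc/Hz under the commonly used definition."""
    return (phase_noise_dbc_hz
            - 20 * math.log10(carrier_hz / offset_hz)
            + 10 * math.log10(power_mw))


if __name__ == "__main__":
    # -112 dBc/Hz at 1-MHz offset, 19.5-GHz carrier, 20.7 mW
    print(round(vco_fom(-112, 19.5e9, 1e6, 20.7), 1))
```

With the 1-MHz-offset phase noise, this evaluates to about −184.6 dBc/Hz, consistent with the rounded −185 dBc/Hz reported.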

Journal ArticleDOI
TL;DR: A fully integrated 76–81-GHz frequency-modulated, continuous-wave (FMCW) radar transceiver (TRX) in a 65-nm CMOS is presented and real-time experimental results show that the distance and the angular resolution of the MIMO radar achieved are 5 cm and 9°.
Abstract: A fully integrated 76–81-GHz frequency-modulated, continuous-wave (FMCW) radar transceiver (TRX) in a 65-nm CMOS is presented. Two transmitters (TXs) and three receivers (RXs) are integrated for multiple-input multiple-output (MIMO) processing. A 38.5-GHz mixed-mode phase-locked loop (PLL) with reconfigurable loop bandwidth and a frequency doubling scheme are employed to generate the reconfigurable FMCW chirp waveforms. The coarse-to-fine-segmented current DAC is utilized to support sawtooth FMCW chirps with fast frequency ramping-down capability, and the delay lock loop (DLL)-based delay time calibration is used to improve the linearity of the embedded 2-D Vernier time-to-digital converter (TDC). Passive voltage-mode down-conversion is utilized to improve the RX linearity against TX leakage and short-range interference. A bottom-switching Gilbert-type modulator in the TX is proposed to realize the bi-phase modulation, and the magnetically coupled resonator technique is used to effectively expand the link bandwidth. The measurement results show that the FMCW TRX could generate reconfigurable chirps with the bandwidth from 250 MHz to 4 GHz and the period from 30 $\mu \text{s}$ to 10 ms. The root-mean-square (rms) frequency error is 110 kHz for a sawtooth chirp with 4-GHz bandwidth and 300- $\mu \text{s}$ period. The TX maximum output power is 13.4 dBm and is adjustable within 3 dB by reconfiguring its low dropout regulator (LDO) voltage. The RX achieves a 15.3-dB noise figure at 600-kHz IF and a −8.5-dBm RF input-referred P1dB. The overall power consumption is 921 mW, with two TXs and three RXs powered ON. Based on the proposed TRX chip, prototype hardware and a data process algorithm are developed. Real-time experimental results show that the distance and the angular resolution of the MIMO radar achieved are 5 cm and 9°, respectively.
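
For context on the chirp parameters, the textbook FMCW relations can be evaluated directly. The sketch below assumes the ideal range resolution c/(2B) and a linear ramp that spans the whole chirp period; neither assumption is stated in the abstract, and the function names are hypothetical.

```python
C = 299_792_458.0  # speed of light in m/s


def range_resolution_m(bandwidth_hz):
    """Ideal FMCW range resolution: c / (2 * B)."""
    return C / (2.0 * bandwidth_hz)


def chirp_slope_hz_per_s(bandwidth_hz, period_s):
    """Frequency ramp rate, assuming the ramp occupies the full period."""
    return bandwidth_hz / period_s


if __name__ == "__main__":
    print(range_resolution_m(4e9))            # ideal bound for a 4-GHz chirp, in metres
    print(chirp_slope_hz_per_s(4e9, 300e-6))  # slope of the 4-GHz, 300-us sawtooth
```

For a 4-GHz chirp the ideal bound is about 3.75 cm. The abstract does not say which chirp setting produced the measured 5-cm resolution, so the comparison is only indicative.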

Journal ArticleDOI
Kunyang Liu, Yue Min, Xuan Yang, Hanfeng Sun, Hirofumi Shinohara
TL;DR: This article presents a highly stable SRAM-based physically unclonable function (PUF) that uses enhancement–enhancement (EE)-structure bit cells for native stability improvement and a lightweight integrated dark-bit detection technique, which eliminated all unstable bits in the accelerated aging test.
Abstract: This article presents a highly stable SRAM-based physically unclonable function (PUF) using enhancement–enhancement (EE)-structure bit cells for native stability improvement. The PUF bit cells are 2-D power-gated and are normally in the OFF state, which largely reduces power and is beneficial to attack tolerance. In addition, a dark-bit detection technique based on a lightweight integrated $V_{\text{SS}}$-bias generator is implemented in order to screen out potentially unstable bit cells (dark bits) induced by supply voltage/temperature (VT) variations and other factors. The measured native bit error rate (BER) of prototype chips fabricated in 130-nm standard CMOS is 0.21% at 0.8 V and 23 °C, which is 14 $\times $ better than the conventional SRAM-based PUF. After masking the detected dark bits, no bit error (3339 bits $\times $ 500 evaluations) appeared at the worst VT corner across 0.8 to 1.4 V and −40 °C to 120 °C. This technique also eliminated all unstable bits in the accelerated aging test. Both the data before and after dark-bit masking have passed all applicable NIST SP 800-22 randomness tests. The measured operational energy at 0.8 V is 128 fJ/bit and the standby power is 0.44 pW/bit, thanks to the 2-D power-gating scheme. The nMOS-only bit cell is highly compact, with a normalized bit cell area of 373 F².

Journal ArticleDOI
TL;DR: An energy-efficient convolutional neural network (CNN) engine is demonstrated that performs multiply-and-accumulate (MAC) operations in the time domain, employing a novel bi-directional memory delay line (MDL) unit to perform signed accumulation of input and weight products.
Abstract: In this article, we demonstrate an energy-efficient convolutional neural network (CNN) engine by performing multiply-and-accumulate (MAC) operations in the time domain. The multi-bit inputs are compactly represented as a single pulse-width-encoded input. This translates into reduced switching capacitance ($C_{\mathrm{DYN}}$), compared to the baseline digital implementation, and can enable low-power neural network computing in an edge device. The time-domain CNN engine employs a novel bi-directional memory delay line (MDL) unit to perform signed accumulation of input and weight products. The proposed MDL design leverages standard digital circuits and does not require any capacitors or complex analog-to-digital converters (ADCs) to realize the convolution operation, thereby enabling easy scaling across process technology nodes. Four speed-up modes and a configurable MDL length are supported to address the throughput versus accuracy trade-off of the time-domain computing approach. Delay calibration units have been accommodated to mitigate the process-variation-induced delay mismatch among concurrently operating MDL units. The proposed time-domain MDL design implements a LeNet-5 CNN engine in a commercial 40-nm CMOS process, achieving an energy efficiency of 12.08 TOPS/W and a throughput of 0.365 GOPS at 537 mV in the 16 $\times $ speed-up mode. 40-nm CMOS test-chip measurements over 100 MNIST images show 97% classification accuracy. Simulation results over the entire 10 000 MNIST validation dataset images, taking into account the circuit non-ideal effects of the MDL-based time-domain approach, show a classification accuracy of 98.42%. The test-chip is operational down to the near-threshold voltage (up to 375 mV) while maintaining the classification accuracy over 90% in the 1 $\times $ speed-up mode. Furthermore, two methods of scaling MDLs to multi-bit weights are proposed. Simulation results for 1000-class AlexNet over 50 000 ImageNet validation dataset images show classification accuracy loss within 1% when compared with software implementation. The proposed MDL-based time-domain approach performing 1-bit/8-bit weight and 8-bit input MAC operations, when compared with the corresponding baseline digital implementations, shows 2.09 $\times $ to 2.32 $\times $ higher energy efficiency and 2.22 $\times $ to 3.45 $\times $ smaller area.
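
The signed accumulation described above can be pictured with a purely behavioral toy model: each input is a pulse whose width is a whole number of unit delays, and a binary weight chooses whether the stored edge moves forward or backward along the delay line while the pulse is high. This is only a sketch under those assumptions; the function name, the ±1 weight encoding, and the unbounded line length are hypothetical and are not taken from the paper's circuit.

```python
# Behavioral toy model of signed accumulation on a bi-directional delay line.
# Assumptions (not from the paper): weights are +1 or -1, each input is a
# non-negative integer pulse width in unit delays, and the line never saturates.
from typing import Sequence


def mdl_accumulate(pulse_widths: Sequence[int], weights: Sequence[int],
                   start_position: int = 0) -> int:
    """Return the final edge position after applying all input pulses."""
    position = start_position
    for width, weight in zip(pulse_widths, weights):
        direction = 1 if weight > 0 else -1
        # While the pulse is high, the edge advances one delay cell per unit time.
        for _ in range(width):
            position += direction
    return position


pulse_widths = [5, 12, 0, 7]   # hypothetical pulse-width-encoded activations
weights = [+1, -1, +1, +1]     # hypothetical binary weights
final_position = mdl_accumulate(pulse_widths, weights)

# The edge position equals the signed dot product of inputs and weights.
reference = sum(x * w for x, w in zip(pulse_widths, weights))
print(final_position, reference)
```

In this picture the MAC result is read out as a position in the line rather than as a voltage, which is why no capacitor array or ADC appears in the model.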

Journal ArticleDOI
TL;DR: An optimized pseudo-resistor architecture, made in standard 0.35-$\mu \text{m}$ CMOS technology, is presented to bias a low-noise transimpedance amplifier for high-sensitivity applications in the frequency range 100 kHz–10 MHz.
Abstract: Pseudo-resistor circuits are used to mimic large-value resistors and owe their success to the reduced area they occupy compared with physical devices of equal value. This article presents an optimized pseudo-resistor architecture, made in standard 0.35-$\mu \text{m}$ CMOS technology, to bias a low-noise transimpedance amplifier for high-sensitivity applications in the frequency range 100 kHz–10 MHz. The architecture was selected after a critical review of the different topologies for implementing high-value resistances with MOSFET transistors, considering their performance in terms of linearity of response, symmetric dynamic range, frequency behavior, and simplicity of realization. The resulting circuit occupies an area of 0.017 mm² and features a tunable resistance from $20~\text{M}\Omega$ to $20~\text{G}\Omega$, dynamic offset reduction due to a more-than-linear $I$–$V$ curve, and a high-frequency noise well below that of a physical resistor of equal value. This latter aspect highlights the larger perspective of pseudo-resistors as building blocks in very low-noise applications, in addition to the area advantage they provide.

Journal ArticleDOI
TL;DR: This work presents a 1-to-8-bit configurable SRAM CIM unit-macro using a hybrid structure combining 6T-SRAM based in-memory binary product-sum operations with digital near-memory-computing multibit PS accumulation to increase read accuracy and reduce area overhead.
Abstract: Previous SRAM-based computing-in-memory (SRAM-CIM) macros suffer small read margins for high-precision operations, large cell array area overhead, and limited compatibility with many input and weight configurations. This work presents a 1-to-8-bit configurable SRAM CIM unit-macro using: 1) a hybrid structure combining 6T-SRAM based in-memory binary product-sum (PS) operations with digital near-memory-computing multibit PS accumulation to increase read accuracy and reduce area overhead; 2) column-based place-value-grouped weight mapping and a serial-bit input (SBIN) mapping scheme to facilitate reconfiguration and increase array efficiency under various input and weight configurations; 3) a self-reference multilevel reader (SRMLR) to reduce read-out energy and achieve a sensing margin 2 $\times $ that of the mid-point reference scheme; and 4) an input-aware bitline voltage compensation scheme to ensure successful read operations across various input-weight patterns. A 4-Kb configurable 6T-SRAM CIM unit-macro was fabricated using a 55-nm CMOS process with foundry 6T-SRAM cells. The resulting macro achieved access times of 3.5 ns per cycle (pipeline) and energy efficiency of 0.6–40.2 TOPS/W under binary to 8-b input/8-b weight precision.
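
The hybrid scheme in item 1), binary product-sums combined by digital shift-and-add, can be illustrated with a small functional model. The sketch below assumes unsigned operands and ideal binary product-sums; the function names are hypothetical, and the macro's analog readout, signed-weight handling, and column mapping details are not modelled.

```python
# Functional sketch of bit-serial inputs with place-value-grouped weight bits.
# Assumptions (not from the paper): unsigned integers, ideal binary product-sums.
from typing import Sequence


def bit(value: int, position: int) -> int:
    """Extract one bit of an unsigned integer."""
    return (value >> position) & 1


def binary_product_sum(inputs: Sequence[int], weights: Sequence[int],
                       in_bit: int, w_bit: int) -> int:
    """1-bit x 1-bit product-sum over one column, as an in-memory read would give."""
    return sum(bit(x, in_bit) & bit(w, w_bit) for x, w in zip(inputs, weights))


def bit_serial_mac(inputs: Sequence[int], weights: Sequence[int],
                   in_bits: int = 8, w_bits: int = 8) -> int:
    """Combine binary product-sums with digital shift-and-add accumulation."""
    total = 0
    for i in range(in_bits):        # one input bit plane per cycle (serial-bit input)
        for j in range(w_bits):     # one column group per weight place value
            total += binary_product_sum(inputs, weights, i, j) << (i + j)
    return total


inputs = [3, 200, 17, 255]    # hypothetical 8-b activations
weights = [5, 9, 128, 255]    # hypothetical 8-b weights
result = bit_serial_mac(inputs, weights)
reference = sum(x * w for x, w in zip(inputs, weights))
print(result, reference)
```

Lowering `in_bits` or `w_bits` shortens the loops, which mirrors how a configurable-precision macro trades cycles and columns for precision.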

Journal ArticleDOI
TL;DR: A low-noise feedback amplifier with interstage noise matching is implemented in 22-nm fully depleted silicon-on-insulator (SOI)-CMOS technology with continuous dc power control on the fly using modulation of FET backgates.
Abstract: A low-noise feedback amplifier (LNA) with interstage noise matching is implemented in 22-nm fully depleted silicon-on-insulator (SOI)-CMOS technology. Minimum noise figure (NF) is 1.7 dB centered at 28 GHz, and NF remains below 1.98±0.25 dB across a 10-GHz range. Peak gain of the two-stage LNA is 21.5 dB at 22 GHz, and the bandwidth (BW) for $|{S_{21}} |$ is 19–36 GHz. Input and output return losses are better than 10 dB across an effective LNA BW of 22–32 GHz. The third-order input intercept is −13.4 dBm at peak gain when dissipating 17.3 mW. Continuous dc power control on the fly is implemented using modulation of FET backgates. When dc power consumption is reduced 5.6 mW, NF increases by less than 0.5 dB, peak gain decreases by 3.6 dB, and input return loss remains better than 10 dB with no change in effective BW.

Journal ArticleDOI
TL;DR: This article presents a split time-interleaved (TI) successive-approximation register (SAR) analog-to-digital converter (ADC) with digital background timing-skew mismatch calibration, which divides a TI-SAR ADC into two split parts with the same overall sampling rate but different numbers of TI channels.
Abstract: This article presents a split time-interleaved (TI) successive-approximation register (SAR) analog-to-digital converter (ADC) with digital background timing-skew mismatch calibration. It divides a TI-SAR ADC into two split parts with the same overall sampling rate but different numbers of TI channels. Benefitting from the proposed split TI topology, the timing-skew calibration converges quickly without any extra analog circuits. The input impedance of the overall TI-ADC remains unchanged, which is essential for the preceding driving stage in a high-speed application. We designed a prototype seven-/eight-way split TI-ADC implemented in 28-nm CMOS. After a digital background timing-skew calibration, it reaches a 54.2-dB signal-to-noise-and-distortion ratio (SNDR) and 67.1-dB spurious-free dynamic range (SFDR) with a near-Nyquist-rate input signal and a 2.5-GHz effective resolution bandwidth (ERBW). Furthermore, the power consumption of the ADC core (mismatch calibration off-chip) is 12.2 mW when running at 1.6 GS/s, leading to a Walden figure-of-merit (FOM) of 18.2 fJ/conv.-step and a Schreier FOM of 162.4 dB.
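
The two figures of merit quoted above follow from the reported power, SNDR, and sampling rate through the commonly used definitions, sketched below. The sketch assumes the Schreier FOM is evaluated over the Nyquist bandwidth (half the sampling rate); the function names are illustrative.

```python
# Sketch of the standard ADC figure-of-merit definitions, checked against the
# numbers in the abstract. Assumption: Schreier FOM uses bandwidth = fs / 2.
import math


def enob_bits(sndr_db: float) -> float:
    """Effective number of bits from SNDR."""
    return (sndr_db - 1.76) / 6.02


def walden_fom_j_per_step(power_w: float, sndr_db: float, fs_hz: float) -> float:
    """Walden FOM: energy per conversion step, P / (2^ENOB * fs)."""
    return power_w / (2.0 ** enob_bits(sndr_db) * fs_hz)


def schreier_fom_db(power_w: float, sndr_db: float, bandwidth_hz: float) -> float:
    """SNDR-based Schreier FOM: SNDR + 10*log10(BW / P)."""
    return sndr_db + 10.0 * math.log10(bandwidth_hz / power_w)


power_w, sndr_db, fs_hz = 12.2e-3, 54.2, 1.6e9
fom_walden = walden_fom_j_per_step(power_w, sndr_db, fs_hz)
fom_schreier = schreier_fom_db(power_w, sndr_db, fs_hz / 2.0)

print(f"Walden FOM:   {fom_walden * 1e15:.1f} fJ/conv.-step")
print(f"Schreier FOM: {fom_schreier:.1f} dB")
```

Both values agree with the abstract to within rounding, which supports reading the Schreier figure as a Nyquist-bandwidth number.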

Journal ArticleDOI
TL;DR: This article presents a second-order noise-shaping (NS) successive approximation register (SAR) analog-to-digital converter (ADC) with a process, voltage, and temperature (PVT)-robust closed-loop dynamic amplifier, enabling the first fully dynamic NS-SAR ADC that realizes sharp noise transfer function (NTF) while not requiring any gain calibration.
Abstract: This article presents a second-order noise-shaping (NS) successive approximation register (SAR) analog-to-digital converter (ADC) with a process, voltage, and temperature (PVT)-robust closed-loop dynamic amplifier. The proposed closed-loop dynamic amplifier combines the merits of closed-loop architecture and dynamic operation, realizing robustness, high accuracy, and high energy-efficiency simultaneously. It is embedded in the loop filter of an NS SAR design, enabling the first fully dynamic NS-SAR ADC that realizes sharp noise transfer function (NTF) while not requiring any gain calibration. Fabricated in 40-nm CMOS technology, the prototype ADC achieves an SNDR of 83.8 dB over a bandwidth of 625 kHz while consuming only $107~\mu \text{W}$ . It results in an SNDR-based Schreier figure-of-merit (FoM) of 181.5 dB.

Journal ArticleDOI
TL;DR: This article presents a compact analog-to-digital converter (ADC)/digital-to-analog converter (DAC) digital signal processing (DSP)-based long reach (LR) transceiver in 7-nm FinFET technology that operates seamlessly from 3.5 to 56 Gb/s in pulse-amplitude modulation (PAM-4) and consumes only 243 mW at 56 Gb/s.
Abstract: This article presents a compact analog-to-digital converter (ADC)/digital-to-analog converter (DAC) digital signal processing (DSP)-based long reach (LR) transceiver in 7-nm FinFET technology that operates seamlessly from 3.5 to 56 Gb/s in pulse-amplitude modulation (PAM-4) [from 1.25 to 28 Gb/s in non-return-to-zero (NRZ) mode] and consumes only 243 mW at 56 Gb/s. The receiver (RX) front end consists of a two-stage continuous-time linear equalizer (CTLE), a 40-way time-interleaved (TI) successive approximation register (SAR)-ADC, and a DSP equalizer containing a 17-tap feed-forward equalizer (FFE) working concurrently with a one-tap speculative decision feedback equalizer (DFE) and a reflection-canceling FFE, which implements four individually roaming taps. Clock recovery is achieved on a dedicated low-latency path consisting of a five-tap FFE, slicer, time error detector (TED), and loop filter driving a dedicated LC digitally controlled oscillator (DCO). The transmit section consists of a variety of pattern generators, a five-tap finite impulse response (FIR) section, and a terminated DAC as an analog transmitter. When working on a 42.5-dB LR channel at 56 Gb/s PAM-4, the transceiver consumes 243 mW from the 0.9-V (analog) and 0.75-V (digital) supplies, corresponding to an efficiency of 4.3 pJ/b.
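
The quoted efficiency is simply power divided by bit rate, and the PAM-4 and NRZ top rates are tied together by the number of bits per symbol. A minimal sketch of both relations, with illustrative function names:

```python
# Sketch of link energy-per-bit and symbol-rate arithmetic for the quoted numbers.
import math


def energy_per_bit_pj(power_w: float, bit_rate_bps: float) -> float:
    """Energy efficiency in pJ/b: power divided by data rate."""
    return power_w / bit_rate_bps * 1e12


def symbol_rate_baud(bit_rate_bps: float, pam_levels: int) -> float:
    """Symbol rate of a PAM-N link: bit rate / log2(N) bits per symbol."""
    return bit_rate_bps / math.log2(pam_levels)


efficiency = energy_per_bit_pj(243e-3, 56e9)
pam4_symbol_rate = symbol_rate_baud(56e9, 4)   # PAM-4 carries 2 bits per symbol
nrz_symbol_rate = symbol_rate_baud(28e9, 2)    # NRZ carries 1 bit per symbol

print(f"efficiency: {efficiency:.1f} pJ/b")
print(f"PAM-4 symbol rate: {pam4_symbol_rate / 1e9:.0f} GBd")
```

The 56-Gb/s PAM-4 mode and the 28-Gb/s NRZ mode therefore run at the same 28-GBd symbol rate.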

Journal ArticleDOI
Dongyi Liao, Yucai Zhang, Fa Foster Dai, Zhenqi Chen, Yanjie Wang
TL;DR: Using a two-stage scheme allows separately dealing with the low phase noise (PN) frequency synthesis in the first stage and the mm-wave frequency multiplication in the second stage, achieving the best overall power efficiency.
Abstract: In this article, a two-stage millimeter (mm)-wave frequency synthesizer with low in-band noise and robust locking reference-sampling techniques is presented. Using a two-stage scheme allows separately dealing with the low phase noise (PN) frequency synthesis in the first stage and the mm-wave frequency multiplication in the second stage, achieving the best overall power efficiency. In the first stage, a voltage domain reference-sampling phase detector (RSPD)-locked loop (RSPLL) is adopted to achieve both low PN and robust locking without additional frequency locking loop. A reference reshaping buffer is implemented to improve the phase detector gain and in-band PN. The reference rising/falling time is programmable to achieve optimal RSPLL performance even under external disturbances. The second stage employs an injection-locked voltage-controlled oscillator (ILVCO) for 4 $\times $ frequency multiplication. A low-power digital frequency tracking loop (FTL) detecting actual frequency errors is implemented in order to achieve wide operation range for the ILVCO while using a high ${Q}$ tank with low power. The prototype synthesizer was fabricated in a 45-nm partially depleted silicon on insulator (PDSOI) CMOS technology. The first stage 9-GHz RSPLL achieves 144-fs integrated jitter with 7.2-mW power consumption, achieving a figure of merit (FoM) of −248 dB and the overall mm-wave synthesizer achieves 251-fs integrated jitter with 20.6-mW power consumption at 35.84 GHz, achieving an FoM of −238.9 dB.
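
The FoM values quoted above are consistent with the widely used jitter-power figure of merit for PLLs, FoM = 10·log10[(σ/1 s)² · (P/1 mW)]. The sketch below assumes that definition (the abstract does not spell it out) and uses an illustrative function name.

```python
# Sketch of the common jitter-power PLL figure of merit, assumed here to be
# the definition behind the quoted numbers.
import math


def pll_jitter_fom_db(rms_jitter_s: float, power_w: float) -> float:
    """FoM = 10*log10((jitter / 1 s)^2 * (power / 1 mW)); more negative is better."""
    return 10.0 * math.log10(rms_jitter_s ** 2 * (power_w / 1e-3))


fom_first_stage = pll_jitter_fom_db(144e-15, 7.2e-3)    # 9-GHz RSPLL
fom_synthesizer = pll_jitter_fom_db(251e-15, 20.6e-3)   # full 35.84-GHz synthesizer

print(f"first stage: {fom_first_stage:.1f} dB")
print(f"synthesizer: {fom_synthesizer:.1f} dB")
```

Under this definition both results land within rounding of the −248 dB and −238.9 dB figures in the abstract.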