# Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2020"

••

TL;DR: In this paper, a cobweb-based redundant through-silicon-via (TSV) design is proposed with efficient hardware as well as high repair rate to repair clustered faulty TSVs (FTSVs).

Abstract: In this brief, a cobweb-based redundant through-silicon-via (TSV) design is proposed with efficient hardware as well as high repair rate to repair clustered faulty TSVs (FTSVs). The experimental simulation results demonstrate that for highly clustered faults, the repair rate of the proposed RTSV method is 48.59% and 1.75% higher than that of the ring-based and router-based RTSV methods, respectively. Furthermore, the proposed design can achieve 63.93% and 16.34% hardware reductions compared with the router-based and the ring-based design, respectively.

126 citations

••

Drexel University^{1}, Katholieke Universiteit Leuven^{2}, IMEC^{3}, University of Zurich^{4}, University of California, Irvine^{5}

TL;DR: SpiNeMap, a design methodology to map SNNs to crossbar-based neuromorphic hardware, minimizing spike latency and energy consumption, is presented, and is shown to reduce average energy consumption and spike latency by 45% and 21%, compared to the best-performing SNN mapping technique.

Abstract: Neuromorphic hardware implements biological neurons and synapses to execute a spiking neural network (SNN)-based machine learning. We present SpiNeMap, a design methodology to map SNNs to crossbar-based neuromorphic hardware, minimizing spike latency and energy consumption. SpiNeMap operates in two steps: SpiNeCluster and SpiNePlacer. SpiNeCluster is a heuristic-based clustering technique to partition an SNN into clusters of synapses, where intracluster local synapses are mapped within crossbars of the hardware and intercluster global synapses are mapped to the shared interconnect. SpiNeCluster minimizes the number of spikes on global synapses, which reduces spike congestion and improves application performance. SpiNePlacer then finds the best placement of local and global synapses on the hardware using a metaheuristic-based approach to minimize energy consumption and spike latency. We evaluate SpiNeMap using synthetic and realistic SNNs on a state-of-the-art neuromorphic hardware. We show that SpiNeMap reduces average energy consumption by 45% and spike latency by 21%, compared to the best-performing SNN mapping technique.

101 citations

••

TL;DR: This article replaces the exact multipliers in two representative NNs with approximate designs to evaluate their effect on the classification accuracy and shows that using AMs can also improve the NN accuracy by introducing noise.

Abstract: Improving the accuracy of a neural network (NN) usually requires using larger hardware that consumes more energy. However, the error tolerance of NNs and their applications allow approximate computing techniques to be applied to reduce implementation costs. Given that multiplication is the most resource-intensive and power-hungry operation in NNs, more economical approximate multipliers (AMs) can significantly reduce hardware costs. In this article, we show that using AMs can also improve the NN accuracy by introducing noise. We consider two categories of AMs: 1) deliberately designed and 2) Cartesian genetic programing (CGP)-based AMs. The exact multipliers in two representative NNs, a multilayer perceptron (MLP) and a convolutional NN (CNN), are replaced with approximate designs to evaluate their effect on the classification accuracy of the Mixed National Institute of Standards and Technology (MNIST) and Street View House Numbers (SVHN) data sets, respectively. Interestingly, up to 0.63% improvement in the classification accuracy is achieved with reductions of 71.45% and 61.55% in the energy consumption and area, respectively. Finally, the features in an AM are identified that tend to make one design outperform others with respect to NN accuracy. Those features are then used to train a predictor that indicates how well an AM is likely to work in an NN.

89 citations

••

TL;DR: A domain-specific FPGA overlay processor, named OPU, is proposed to accelerate CNN networks, which offers software-like programmability for CNN end users, as CNN algorithms are automatically compiled into executable codes, which are loaded and executed by OPU without reconfiguration of FPGA for switch or update of CNN networks.

Abstract: Field-programmable gate array (FPGA) provides rich parallel computing resources with high energy efficiency, making it ideal for deep convolutional neural network (CNN) acceleration. In recent years, automatic compilers have been developed to generate network-specific FPGA accelerators. However, with more cascading deep CNN algorithms adapted by various complicated tasks, reconfiguration of FPGA devices during runtime becomes unavoidable when network-specific accelerators are employed. Such reconfiguration can be difficult for edge devices. Moreover, network-specific accelerator means regeneration of RTL code and physical implementation whenever the network is updated. This is not easy for CNN end users. In this article, we propose a domain-specific FPGA overlay processor, named OPU to accelerate CNN networks. It offers software-like programmability for CNN end users, as CNN algorithms are automatically compiled into executable codes, which are loaded and executed by OPU without reconfiguration of FPGA for switch or update of CNN networks. Our OPU instructions have complicated functions with variable runtimes but a uniform length. The granularity of instruction is optimized to provide good performance and sufficient flexibility, while reducing complexity to develop microarchitecture and compiler. Experiments show that OPU can achieve an average of 91% runtime multiplication and accumulation unit (MAC) efficiency (RME) among nine different networks. Moreover, for VGG and YOLO networks, OPU outperforms automatically compiled network-specific accelerators in the literature. In addition, OPU shows $5.35\times $ better power efficiency compared with Titan Xp. For a real-time cascaded CNN networks scenario, OPU is $2.9\times $ faster compared with edge computing GPU Jetson Tx2, which has a similar amount of computing resources.

74 citations

••

TL;DR: A sparsewise dataflow to skip the cycles of processing multiply-and-accumulates (MACs) with zero weights and exploit data statistics to minimize energy through zeros gating to avoid unnecessary computations is proposed.

Abstract: Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in a wide range of applications. However, deeper CNN models, which are usually computation consuming, are widely required for complex artificial intelligence (AI) tasks. Though recent research progress on network compression, such as pruning, has emerged as a promising direction to mitigate computational burden, existing accelerators are still prevented from completely utilizing the benefits of leveraging sparsity due to the irregularity caused by pruning. On the other hand, field-programmable gate arrays (FPGAs) have been regarded as a promising hardware platform for CNN inference acceleration. However, most existing FPGA accelerators focus on dense CNN and cannot address the irregularity problem. In this article, we propose a sparsewise dataflow to skip the cycles of processing multiply-and-accumulates (MACs) with zero weights and exploit data statistics to minimize energy through zeros gating to avoid unnecessary computations. The proposed sparsewise dataflow leads to a low bandwidth requirement and high data sharing. Then, we design an FPGA accelerator containing a vector generator module (VGM) that can match the index between sparse weights and input activations according to the proposed dataflow. Experimental results demonstrate that our implementation can achieve 987-, 46-, and 57-imag/s performance for AlexNet, VGG-16, and ResNet-50 on Xilinx ZCU102, respectively, which provides $1.5\times $ – $6.7\times $ speedup and $2.0\times $ – $6.0\times $ energy efficiency over previous CNN FPGA accelerators.

69 citations
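As a rough illustration of the zero-skipping idea in the abstract above, the following sketch compresses a pruned weight vector to (index, value) pairs and issues a multiply-and-accumulate only when both the weight and the matched activation are nonzero. The function names and the flat-vector layout are assumptions made for illustration; they do not reflect the paper's dataflow or its vector generator module.

```python
def compress_weights(weights):
    """Keep only the nonzero weights as (index, value) pairs."""
    return [(i, w) for i, w in enumerate(weights) if w != 0]


def sparse_dot(compressed, activations):
    """Dot product that skips zero weights and gates zero activations.

    Returns (result, number_of_macs_issued).
    """
    acc = 0
    macs = 0
    for idx, w in compressed:
        a = activations[idx]      # index matching between weight and activation
        if a == 0:
            continue              # zero gating: no MAC for a zero activation
        acc += w * a
        macs += 1
    return acc, macs


weights = [0, 2, 0, 0, -1, 3]
activations = [5, 1, 7, 0, 4, 0]
result, macs = sparse_dot(compress_weights(weights), activations)
dense = sum(w * a for w, a in zip(weights, activations))
```

Here the dense loop would issue six MACs, while the sparse version issues two and produces the same result.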

••

TL;DR: Two hardware architectures optimized for accelerating the encryption and decryption operations of the Brakerski/Fan-Vercauteren (BFV) homomorphic encryption scheme with high-performance polynomial multipliers are presented.

Abstract: Fully homomorphic encryption (FHE) is a technique that allows computations on encrypted data without the need for decryption and it provides privacy in various applications such as privacy-preserving cloud computing. In this article, we present two hardware architectures optimized for accelerating the encryption and decryption operations of the Brakerski/Fan-Vercauteren (BFV) homomorphic encryption scheme with high-performance polynomial multipliers. For proof of concept, we utilize our architectures in a hardware/software codesign accelerator framework, in which encryption and decryption operations are offloaded to an FPGA device, while the rest of operations in the BFV scheme are executed in software running on an off-the-shelf desktop computer. Specifically, our accelerator framework is optimized to accelerate Simple Encrypted Arithmetic Library (SEAL), developed by the Cryptography Research Group at Microsoft Research. The hardware part of the proposed framework targets the XILINX VIRTEX-7 FPGA device, which communicates with its software part via a peripheral component interconnect express (PCIe) connection. For proof of concept, we implemented our designs targeting 1024-degree polynomials with 8-bit and 32-bit coefficients for plaintext and ciphertext, respectively. The proposed framework achieves almost $12\times $ and $7\times $ latency speedups, including I/O operations for the offloaded encryption and decryption operations, respectively, compared to their pure software implementations.

68 citations
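The polynomial multiplication at the core of BFV takes place in the ring Z_q[x]/(x^n + 1), so a product term whose degree reaches n wraps around with a sign flip. The sketch below shows this with a schoolbook O(n^2) multiplier; it is only meant to make the arithmetic concrete and is an assumption for illustration, not the high-performance multiplier architecture described in the abstract. The function name and the tiny parameters are hypothetical.

```python
def negacyclic_mul(a, b, q):
    """Multiply two polynomials modulo (x^n + 1) and modulo q.

    a and b are coefficient lists of equal length n, lowest degree first.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length")
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % q
            else:
                # x^n is congruent to -1, so the wrapped term is subtracted
                c[k - n] = (c[k - n] - ai * bj) % q
    return c


# (1 + x)^2 = 1 + 2x + x^2, and x^2 = -1 in Z_17[x]/(x^2 + 1), leaving 2x
example = negacyclic_mul([1, 1], [1, 1], 17)
```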

••

TL;DR: A simple class AB power-efficient ULV structure has been obtained, which can operate from supply voltages less than the threshold voltages of the employed MOS transistors, while offering rail-to-rail input common-mode range at the same time.

Abstract: In this article, a new solution for an ultralow-voltage (ULV) ultralow-power (ULP) operational transconductance amplifier (OTA) is presented. Thanks to the combination of a low-voltage bulk-driven nontailed differential stage with the multipath Miller zero compensation technique, a simple class AB power-efficient ULV structure has been obtained, which can operate from supply voltages less than the threshold voltages of the employed MOS transistors, while offering rail-to-rail input common-mode range at the same time. The proposed OTA was fabricated using the 180-nm CMOS process from Taiwan Semiconductor Manufacturing Company (TSMC) and can operate from $V_{\mathbf {DD}}$ ranging from 0.3 to 0.5 V. The 0.3-V version dissipates only 12.6 nW of power while showing a 64.7-dB voltage gain at 1-Hz, 2.96-kHz gain-bandwidth product, and a 4.15-V/ms average slew-rate at 30-pF load capacitance. The measured results agree well with simulations.

53 citations

••

TL;DR: A flux-controlled memristor emulator built with off-the-shelf electronic devices and based on a TiO$_{2}$ model is presented in this article, which shows a clear fingerprint of an ideal memristor.

Abstract: A flux-controlled memristor emulator built with off-the-shelf electronic devices and based on a TiO$_{2}$ model is presented in this article. The circuit proposed in this article uses the current mode approach based analog building blocks such as a second-generation current conveyor (CCII) and an operational transconductance amplifier (OTA) as an active element with few passive elements. The circuit shows a clear fingerprint of an ideal memristor. The offered emulator circuit can be made to operate in incremental and decremental modes and functions well up to 26.3 MHz. Nonvolatility, Monte Carlo sampling, and corner analysis simulations are executed to verify the robustness of the circuit. The functional verification of the presented circuit is performed using the 0.18- $\mu \text{m}$ CMOS parameter at a supply voltage of ±1.2 V. The experimental demonstration is carried out by making a prototype on a breadboard using ICs AD844AN and CA3080, which exhibits a good agreement with theoretical and simulation results. The layout of the circuit, which requires a total chip area of $75\times 70\,\,\mu \text{m}^{2}$ , is also created. Single/parallel combinations of a memristor, a high-pass filter, and a chaotic system are presented to demonstrate its application.

46 citations

••

TL;DR: This work proposes a novel physical unclonable function-based CPRNG (PUF-CPRNG), where the initial seed is secured by generating it from PUF, and includes dynamic refreshing logic to ensure that the random numbers generated are nonperiodic.

Abstract: Pseudorandom number generators (PRNGs) play a pivotal role in generating key sequences of cryptographic protocols. Among different schemes, a simple chaotic PRNG (CPRNG) exhibits the property of being extremely sensitive to the initial seed and, hence, unpredictable. However, CPRNG is vulnerable if the initial seed is compromised. In this brief, we propose a novel physical unclonable function-based CPRNG (PUF-CPRNG), where the initial seed is secured by generating it from PUF. Furthermore, the proposed PUF-CPRNG includes dynamic refreshing logic to ensure that the random numbers generated are nonperiodic. To further secure the PUF-CPRNG, the feedback values of CPRNG are fed from PUF. A hardware architecture for the proposed methodology has been designed, and the proof-of-concept implementation was carried out using a Xilinx Virtex-7 field-programmable gate array (FPGA). The proposed PUF-CPRNG passes the NIST 800-22 and ENT statistical tests as well as correlation analysis.

43 citations
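The seed sensitivity mentioned in the abstract above can be seen with a small experiment. The sketch below uses the logistic map x ← r·x·(1 − x) with r = 4 as a stand-in chaotic map; this choice, the threshold-based bit extraction, and all names are assumptions for illustration, since the entry does not say which map the paper uses. It is not a secure generator and includes neither the PUF seeding nor the refreshing logic.

```python
def logistic_bits(seed, n_bits, r=4.0, burn_in=100):
    """Generate bits by thresholding a logistic-map trajectory.

    seed must lie strictly between 0 and 1. Illustrative only, not secure.
    """
    x = seed
    for _ in range(burn_in):          # discard the initial transient
        x = r * x * (1.0 - x)
    bits = []
    for _ in range(n_bits):
        x = r * x * (1.0 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits


stream_a = logistic_bits(0.3, 64)
stream_b = logistic_bits(0.3 + 1e-12, 64)   # seed perturbed by 1e-12
differing = sum(a != b for a, b in zip(stream_a, stream_b))
```

A perturbation of 1e-12 in the seed is enough for the two bit streams to disagree, which is also why an attacker who learns the exact seed can reproduce the whole sequence.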

••

TL;DR: A novel method to apply the Winograd algorithm to a stride of 2 is presented, valid for one, two, or three dimensions; the accompanying FPGA accelerator achieved digital signal processor (DSP) efficiencies of 1.22 giga operations per second (GOPS)/DSPs and 1.33 GOPS/DSPs for the original and modified VGG-16 architectures, respectively.

Abstract: Convolutional neural networks (CNNs) have been widely adopted for computer vision applications. CNNs require many multiplications, making their use expensive in terms of both computational complexity and hardware. An effective method to reduce the number of required multiplications is via the Winograd algorithm. Previous implementations of CNNs based on Winograd use the 2-D algorithm $F(2 \times 2,3 \times 3)$ , which reduces computational complexity by a factor of 2.25 over regular convolution. However, current Winograd implementations only apply when using a stride (shift displacement of a kernel over an input) of 1. In this article, we presented a novel method to apply the Winograd algorithm to a stride of 2. This method is valid for one, two, or three dimensions. We also introduced new Winograd versions compatible with a kernel of size 3, 5, and 7. The algorithms were successfully implemented on an NVIDIA K20c GPU. Compared to regular convolutions, the implementations for stride 2 are $1.44\times $ faster for a $3 \times 3$ kernel, $2.04\times $ faster for a $5\times 5$ kernel, $2.42\times $ faster for a $7 \times 7$ kernel, and $1.73\times $ faster for a $3 \times 3 \times 3$ kernel. Additionally, a CNN accelerator using a novel processing element (PE) that performs two 2-D Winograd stride-1 operations, or one 2-D Winograd stride-2 operation, per clock cycle was implemented on an Intel Arria-10 field-programmable gate array (FPGA). We accelerated the original and our proposed modified VGG-16 architectures and achieved digital signal processor (DSP) efficiencies of 1.22 giga operations per second (GOPS)/DSPs and 1.33 GOPS/DSPs, respectively.

41 citations
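The factor of 2.25 quoted in the abstract above follows from counting multiplications: a direct 2×2 output tile with a 3×3 kernel needs 2·2·3·3 = 36 multiplications, while $F(2 \times 2,3 \times 3)$ needs (2 + 3 − 1)² = 16, and 36/16 = 2.25. The sketch below shows the standard 1-D stride-1 building block F(2,3), which produces two outputs with four multiplications instead of six. It illustrates the baseline algorithm only, not the stride-2 method the paper proposes; the function names are hypothetical.

```python
from fractions import Fraction


def direct_f23(d, g):
    """Two outputs of a 3-tap filter over four inputs: six multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def winograd_f23(d, g):
    """Same two outputs using the Winograd F(2,3) form: four multiplications."""
    g = [Fraction(v) for v in g]
    # Filter transform; in practice this is precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # The only four data-dependent multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    return [m1 + m2 + m3, m2 - m3 - m4]


d = [1, 2, 3, 4]
g = [1, 1, 1]
```

Exact rational arithmetic is used here so that the two results can be compared for equality; hardware implementations typically fold the division by 2 into the precomputed filter transform.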

••

KAIST^{1}

TL;DR: A deep convolutional neural network (CNN) inference processor based on a novel enhanced output stationary (EOS) dataflow that employs dedicated register files (RFs) for storing reused activation data to eliminate redundant memory accesses for highly energy-consuming SRAM banks.

Abstract: We propose a deep convolutional neural network (CNN) inference processor based on a novel enhanced output stationary (EOS) dataflow. Based on the observation that some activations are commonly used in two successive convolutions, the EOS dataflow employs dedicated register files (RFs) for storing such reused activation data to eliminate redundant memory accesses for highly energy-consuming SRAM banks. In addition, processing elements (PEs) are split into multiple small groups such that each group covers a tile of input activation map to increase the usability of activation RFs (ARFs). The processor has two different voltage/frequency domains. The computation domain with 512 PEs operates at near-threshold voltage (NTV) (0.4 V) and 60-MHz frequency to increase energy efficiency, while the rest of the processors including 848-KB SRAMs run at 0.7 V and 120-MHz frequency to increase both on-chip and off-chip memory bandwidths. The measurement results show that our processor is capable of running AlexNet at 831 GOPS/W, VGG-16 at 1151 GOPS/W, ResNet-18 at 1004 GOPS/W, and MobileNet at 948 GOPS/W energy efficiency.

••

TL;DR: A new architecture for a digital full-adder is presented, which is up to 41% faster than existing IMPLY-based serial designs while requiring up to 78% less area (memristors) compared to the existing parallel design.

Abstract: Passive implementation of memristors has led to several innovative works in the field of electronics. Despite being primarily a candidate for memory applications, memristors have proven to be beneficial in several other circuits and applications as well. One of the use cases is the implementation of digital circuits such as adders. Among several logic implementations using memristors, IMPLY logic is one of the promising candidates. In this brief, we present a new architecture for a digital full-adder, which is up to 41% faster than existing IMPLY-based serial designs while requiring up to 78% less area (memristors) compared to the existing parallel design.
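For readers unfamiliar with IMPLY logic, the operation computes q ← (NOT p) OR q, and together with a FALSE (reset) operation it is functionally complete. The sketch below models this at the Boolean level: it builds NAND from two IMPLY steps and then a full adder from the classic nine-NAND network. This is a generic textbook construction offered as an assumption for illustration, not the adder architecture proposed in the paper, and it says nothing about step counts or memristor counts of that design.

```python
def imply(p, q):
    """Material implication on bits: (NOT p) OR q."""
    return (1 - p) | q


def nand(p, q):
    """NAND built from one FALSE step and two IMPLY steps."""
    s = 0              # FALSE: reset the work bit
    s = imply(p, s)    # s = NOT p
    s = imply(q, s)    # s = (NOT q) OR (NOT p) = NAND(p, q)
    return s


def full_adder(a, b, cin):
    """Nine-NAND full adder; returns (sum, carry_out)."""
    n1 = nand(a, b)
    x = nand(nand(a, n1), nand(b, n1))        # a XOR b
    n4 = nand(x, cin)
    s = nand(nand(x, n4), nand(cin, n4))      # x XOR cin
    cout = nand(n1, n4)                       # a*b OR cin*(a XOR b)
    return s, cout


truth_table = {
    (a, b, c): full_adder(a, b, c)
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
}
```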

••

TL;DR: It is demonstrated that reconfigurable constant coefficient multipliers (RCCMs) offer a better alternative for saving the silicon area than utilizing low-precision arithmetic for deep-learning applications on field-programmable gate arrays (FPGAs).

Abstract: Low-precision arithmetic operations to accelerate deep-learning applications on field-programmable gate arrays (FPGAs) have been studied extensively, because they offer the potential to save silicon area or increase throughput. However, these benefits come at the cost of a decrease in accuracy. In this article, we demonstrate that reconfigurable constant coefficient multipliers (RCCMs) offer a better alternative for saving the silicon area than utilizing low-precision arithmetic. RCCMs multiply input values by a restricted choice of coefficients using only adders, subtractors, bit shifts, and multiplexers (MUXes), meaning that they can be heavily optimized for FPGAs. We propose a family of RCCMs tailored to FPGA logic elements to ensure their efficient utilization. To minimize information loss from quantization, we then develop novel training techniques that map the possible coefficient representations of the RCCMs to neural network weight parameter distributions. This enables the usage of the RCCMs in hardware, while maintaining high accuracy. We demonstrate the benefits of these techniques using AlexNet, ResNet-18, and ResNet-50 networks. The resulting implementations achieve up to 50% resource savings over traditional 8-bit quantized networks, translating to significant speedups and power savings. Our RCCM with the lowest resource requirements exceeds 6-bit fixed point accuracy, while all other implementations with RCCMs achieve at least similar accuracy to an 8-bit uniformly quantized design, while achieving significant resource savings.
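The idea of multiplying by a restricted set of coefficients using only adders, subtractors, shifts, and multiplexers can be sketched in a few lines. The coefficient set {3, 5, 7, 9} and the function name below are hypothetical, chosen only to show the principle; the paper's RCCM family and its mapping to FPGA logic elements are not described in this entry.

```python
COEFFICIENTS = (3, 5, 7, 9)   # hypothetical coefficient set for illustration


def rccm(x, select):
    """Multiply x by COEFFICIENTS[select] using shifts, adds, and a mux."""
    x2 = x << 1
    x4 = x << 2
    x8 = x << 3
    candidates = (
        x2 + x,   # 3x = 2x + x
        x4 + x,   # 5x = 4x + x
        x8 - x,   # 7x = 8x - x
        x8 + x,   # 9x = 8x + x
    )
    return candidates[select]  # the multiplexer


products = [rccm(6, s) for s in range(4)]
```

No general-purpose multiplier appears anywhere in the datapath; the select input plays the role that a stored weight would play in an ordinary multiplier.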

••

TL;DR: An analysis on several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors and outlines directions to maintain high energy efficiency even for small matrix sizes where the vector architecture achieves suboptimal utilization of the available FPUs.

Abstract: In this article, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V’s vector extension, implemented in GlobalFoundries 22FDX fully depleted silicon-on-insulator (FD-SOI) technology. Ara’s microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor’s vector register file and functional units. It achieves up to 97% floating-point unit (FPU) utilization when running a $256\times256$ double-precision matrix multiplication on 16 lanes. Ara runs at more than 1 GHz in the typical corner (TT/0.80 V/25 °C), achieving a performance up to 33 DP–GFLOPS. In terms of energy efficiency, Ara achieves up to 41 DP–GFLOPS $\text {W}^{-1}$ under the same conditions, which is slightly superior to similar vector processors found in the literature. An analysis on several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors and outlines directions to maintain high energy efficiency even for small matrix sizes where the vector architecture achieves suboptimal utilization of the available FPUs.

••

TL;DR: A new DNN accelerator is designed to support configurable multibit activations and large-scale DNNs seamlessly while substantially improving the chip-level energy-efficiency with favorable accuracy tradeoff compared to conventional digital ASIC.

Abstract: To enable essential deep learning computation on energy-constrained hardware platforms, including mobile, wearable, and Internet of Things (IoT) devices, a number of digital ASIC designs have presented customized dataflow and enhanced parallelism. However, in conventional digital designs, the biggest bottleneck for energy-efficient deep neural networks (DNNs) has reportedly been the data access and movement. To eliminate the storage access bottleneck, new SRAM macros that support in-memory computing have been recently demonstrated. Several in-SRAM computing works have used the mix of analog and digital circuits to perform XNOR-and-ACcumulate (XAC) operation without row-by-row memory access and can map a subset of DNNs with binary weights and binary activations. In the single array level, large improvement in energy efficiency (e.g., two orders of magnitude improvement) has been reported in computing XAC over digital-only hardware performing the same operation. In this article, by integrating many instances of such in-memory computing SRAM macros with an ensemble of peripheral digital circuits, we architect a new DNN accelerator, titled Vesti. This new accelerator is designed to support configurable multibit activations and large-scale DNNs seamlessly while substantially improving the chip-level energy-efficiency with favorable accuracy tradeoff compared to conventional digital ASIC. Vesti also employs double-buffering with two groups of in-memory computing SRAMs, effectively hiding the row-by-row write latencies of in-memory computing SRAMs. The Vesti accelerator is fully designed and laid out in 65-nm CMOS, demonstrating ultralow energy consumption of $ for CIFAR-10 classification at 1.0-V supply.

••

TL;DR: A novel architecture based on multiple-parallel-branch with folding (MPBF) technique is proposed, which parallelizes the branches and reuses the multiplier and adder in each folded branch so that the tradeoff between throughput and the usage of the hardware resources is balanced.

Abstract: Multichannel active noise control (MCANC) is widely recognized as an effective and efficient solution for acoustic noise and vibration cancellation, such as in high-dimensional ventilation ducts, open windows, and mechanical structures. The feedforward multichannel filtered-x least mean square (FFMCFxLMS) algorithm is commonly used to dynamically adjust the transfer function of the multichannel controllers for different noise environments. The computational load incurred by the FFMCFxLMS algorithm, however, increases exponentially with increasing channel count, thus requiring high-end field-programmable gate array (FPGA) processors. Nevertheless, such processors still need specific configurations to cope with soaring computing loads as the channel count increases. To achieve a high-efficiency implementation of the FFMCFxLMS algorithm with floating-point arithmetic, a novel architecture based on the multiple-parallel-branch with folding (MPBF) technique is proposed. This architecture parallelizes the branches and reuses the multiplier and adder in each folded branch so that the tradeoff between throughput and the usage of the hardware resources is balanced. The proposed architecture is validated in an experimental setup that implements the FFMCFxLMS algorithm for the MCANC system with 24 reference sensors, 24 secondary sources, and 24 error sensors, at sampling and throughput rates of 25 kHz and 260 Mb/s, respectively.

••

TL;DR: A low resource utilization field-programmable gate array (FPGA)-based long short-term memory (LSTM) network architecture for accelerating the inference phase is presented, which has low-power and high-speed features that are achieved through overlapping the timing of the operations and pipelining the datapath.

Abstract: In this brief, a low resource utilization field-programmable gate array (FPGA)-based long short-term memory (LSTM) network architecture for accelerating the inference phase is presented. The architecture has low-power and high-speed features that are achieved through overlapping the timing of the operations and pipelining the datapath. Moreover, this architecture requires negligible internal memory size for storing the intermediate data leading to low resource utilization and simple routing, which provides lower interconnect delay (higher operating frequency). A designer may adjust the resource utilization (as well as the latency) of the proposed architecture readily at the register-transfer level (RTL) design by adjusting the amount of parallelization. This makes the process of mapping the architecture onto different types of FPGAs, subject to defined constraints, a simple one. The efficacy of the proposed architecture is assessed by implementing an LSTM network on different types of FPGAs. Compared with the recent works, the proposed architecture provides up to about $1.6\times $ , $43.6\times $ , $21.9\times $ , and $114.5\times $ improvements in frequency, power efficiency, GOP/s, and GOP/s/W, respectively. Finally, our proposed architecture operates at 17.64 GOP/s, which is $2.31\times $ faster than the best previously reported results.

••

Anhui University^{1}

TL;DR: Simulation results in Semiconductor Manufacturing International Corporation (SMIC) 65-nm CMOS commercial standard process show that the proposed RHPD-12T cell can tolerate all single-node upsets, and Monte Carlo (MC) simulation has proved that under high frequency and low supply voltage (0.6 V), RHPD-12T has the minimum write failure probability.

Abstract: In this brief, we proposed, based on the polarity upset mechanism of single-event transient voltage of n-channel metal–oxide–semiconductor (nMOS) transistors, a novel radiation hardened by polar design (RHPD) 12T SRAM cell to enhance the reliability and operation speed for space applications. Simulation results in Semiconductor Manufacturing International Corporation (SMIC) 65-nm CMOS commercial standard process show that the proposed RHPD-12T cell can tolerate all single-node upsets. Meanwhile, compared with We-QUATRO, QUATRO, and dual interlocked storage cell (DICE), the write speed of the proposed cell can be reduced by ~41.8 and ~35.3%, and the static power consumption is reduced by ~41.6 and ~46.3%, respectively. Monte Carlo (MC) simulation has proved that under high frequency and low supply (0.6 V) voltage, RHPD-12T has the minimum write failure probability compared with five other SRAM cells.

••

TL;DR: This is the first in-depth study to completely unify the computation process of zero-TCONV, NN-TCONV, and CONV layers, and it is observed that high acceleration performance is also achieved on NN-TCONV networks, the acceleration of which has not been explored before.

Abstract: In this article, we design the first full software/hardware stack, called Uni-OPU, for an efficient uniform hardware acceleration of different types of transposed convolutional (TCONV) networks and conventional convolutional (CONV) networks. Specifically, a software compiler is provided to transform the computation of various TCONV, i.e., zero-inserting-based TCONV (zero-TCONV), nearest-neighbor resizing-based TCONV (NN-TCONV), and CONV layers into the same pattern. The compiler conducts the following optimizations: 1) eliminating up to 98.4% of operations in TCONV by making use of the fixed pattern of TCONV upsampling; 2) decomposing and reformulating TCONV and CONV into streaming parallel vector multiplication with a uniform address generation scheme and data flow pattern; and 3) efficient scheduling and instruction compilation to map networks onto a hardware processor. An instruction-based hardware acceleration processor is developed to efficiently speed up our uniform computation pattern with throughput up to 2.35 TOPS for the TCONV layer, consuming only 2.89 W dynamic power. We evaluate Uni-OPU on a benchmark set composed of six TCONV networks from different application fields. Extensive experimental results indicate that Uni-OPU is able to gain $1.45 \times $ to $3.68 \times $ superior power efficiency compared with state-of-the-art zero-TCONV accelerators. High acceleration performance is also achieved on NN-TCONV networks, the acceleration of which has not been explored before. In summary, we observe $1.90 \times $ and $1.63 \times $ latency reduction, as well as $15.04 \times $ and $12.43 \times $ higher power efficiency on zero-TCONV and NN-TCONV networks compared with Titan Xp GPU on average. To the best of our knowledge, ours is the first in-depth study to completely unify the computation process of zero-TCONV, NN-TCONV, and CONV layers.

••

TL;DR: A high-speed, low-power 10-T xor–xnor circuit is proposed, which provides full swing outputs simultaneously with improved delay performance and 2%–28.13% improvement in terms of PDP than that of other architectures.

Abstract: Hybrid logic style is widely used to implement full adder (FA) circuits. Performance of hybrid FA in terms of delay, power, and driving capability is largely dependent on the performance of xor–xnor circuit. In this article, a high-speed, low-power 10-T xor–xnor circuit is proposed, which provides full swing outputs simultaneously with improved delay performance. The performance of the proposed circuit is measured by simulating it in the Cadence Virtuoso environment using 90-nm CMOS technology. The proposed circuit reduces the power delay product (PDP) by at least 7.5% compared with the available xor–xnor modules. Four different designs of FAs are also proposed in this article utilizing the proposed xor–xnor circuit and available sum and carry modules. The proposed FAs provide 2%–28.13% improvement in terms of PDP over other architectures. To measure the driving capabilities, the proposed FAs are embedded in 2-, 4-, and 8-bit cascaded full adder (CFA) structures. Results show that two of the proposed FAs provide the best performance for a higher number of bits among all the FAs.

••

TL;DR: This article presents a piecewise linear approximation computation (PLAC) method for all nonlinear unary functions, which is an enhanced universal and error-flattened piecewise linear (PWL) approximation approach.

Abstract: This article presents a piecewise linear approximation computation (PLAC) method for all nonlinear unary functions, which is an enhanced universal and error-flattened piecewise linear (PWL) approximation approach. Compared with the previous methods, PLAC features two main parts: an optimized segmenter to seek the minimum number of segments under the predefined software maximum absolute error (MAE), raising the segmentation performance to the highest theoretical level for logarithm, and a novel quantizer to completely simulate the hardware behavior and determine the required bit width and ${\text{MAE}}_{c}$ (MAE in circuits) for hardware implementation. In addition, the hardware architecture is also improved by simplifying the indexing logic, leading to nonredundant hardware overhead. The ASIC implementation results reveal that the proposed PLAC can improve all metrics without any compromise. Compared with the state-of-the-art methods, when computing the logarithmic function, PLAC reduces area by 2.80%, power consumption by 3.77%, and ${\text{MAE}}_{c}$ by 1.83% with the same delay; when approximating the hyperbolic tangent function, PLAC reduces area by 6.25%, power consumption by 4.31%, and ${\text{MAE}}_{c}$ by 18.86% with the same delay; when evaluating the sigmoid function, PLAC reduces area by 16.50% and power consumption by 4.78% with the same delay and ${\text{MAE}}_{c}$; and when calculating the softsign function, PLAC reduces area by 17.28%, power consumption by 11.34%, delay by 12.50%, and ${\text{MAE}}_{c}$ by 33.28%.
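
To make the role of the segmenter concrete, here is a minimal sketch of MAE-bounded PWL segmentation. It is not the PLAC segmenter: it uses a simple greedy scan on a sampled grid, fits each segment with the chord through its endpoints, and shifts that chord so positive and negative errors are balanced. The function names, the grid size `n`, and the greedy strategy are assumptions made for illustration, and the error bound is only checked at the grid points.

```python
import numpy as np

def fit_segment(x, y):
    """Chord through the endpoints, shifted to balance the largest positive and negative error."""
    slope = (y[-1] - y[0]) / (x[-1] - x[0])
    resid = y - (y[0] + slope * (x - x[0]))
    shift = (resid.max() + resid.min()) / 2.0
    err = (resid.max() - resid.min()) / 2.0   # worst-case error after the shift
    intercept = y[0] + shift - slope * x[0]
    return slope, intercept, err

def segment(f, lo, hi, mae, n=2049):
    """Greedily grow each segment until its balanced error would exceed `mae`."""
    x = np.linspace(lo, hi, n)
    y = f(x)
    segs = []
    i = 0
    while i < n - 1:
        j = i + 1
        best = fit_segment(x[i:j + 1], y[i:j + 1])   # two points: always exact
        while j + 1 < n:
            cand = fit_segment(x[i:j + 2], y[i:j + 2])
            if cand[2] > mae:
                break
            best = cand
            j += 1
        segs.append((x[i], x[j], best[0], best[1]))  # (start, end, slope, intercept)
        i = j
    return segs

def evaluate(segs, x):
    """Evaluate the PWL approximation by looking up the segment that contains each x."""
    starts = np.array([s[0] for s in segs])
    idx = np.clip(np.searchsorted(starts, x, side="right") - 1, 0, len(segs) - 1)
    slopes = np.array([s[2] for s in segs])
    intercepts = np.array([s[3] for s in segs])
    return slopes[idx] * x + intercepts[idx]

if __name__ == "__main__":
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    segs = segment(sigmoid, -8.0, 8.0, 1e-3)
    grid = np.linspace(-8.0, 8.0, 2049)
    print(len(segs), np.max(np.abs(evaluate(segs, grid) - sigmoid(grid))))
```

A greedy scan of this kind does not guarantee the minimum number of segments, which is the property the abstract attributes to the optimized segmenter, and it does not model the quantizer or ${\text{MAE}}_{c}$ at all.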

••

Virginia Tech^{1}

TL;DR: This article designs and fabricates a hybrid-structured DNN (hybrid-DNN), combining both depth-in-space (spatial) and depth-in-time (temporal) deep learning characteristics, and demonstrates its high computing parallelism and energy efficiency with low hardware implementation cost.

Abstract: The continued success in the development of neuromorphic computing has immensely pushed today’s artificial intelligence forward. Deep neural networks (DNNs), a brainlike machine learning architecture, rely on intensive vector–matrix computation and deliver extraordinary performance in data-extensive applications. Recently, the nonvolatile memory (NVM) crossbar array has unveiled its intrinsic vector–matrix computation with parallel computing capability in neural network designs. In this article, we design and fabricate a hybrid-structured DNN (hybrid-DNN), combining both depth-in-space (spatial) and depth-in-time (temporal) deep learning characteristics. Our hybrid-DNN employs memristive synapses working in a hierarchical information processing fashion and delay-based spiking neural network (SNN) modules as the readout layer. Our fabricated prototype in 130-nm CMOS technology along with experimental results demonstrates its high computing parallelism and energy efficiency with low hardware implementation cost, making the designed system a candidate for low-power embedded applications. On chaotic time-series forecasting benchmarks, our hybrid-DNN exhibits a $1.16\times$–$13.77\times$ reduction in prediction error compared to the state-of-the-art DNN designs. Moreover, our hybrid-DNN records 99.03% and 99.63% testing accuracy on the handwritten digit classification and the spoken digit recognition tasks, respectively.

••

TL;DR: TiM-DNN, a programmable in-memory accelerator that is specifically designed to execute ternary DNNs, is proposed and evaluated across a suite of state-of-the-art DNN benchmarks including both deep convolutional and recurrent neural networks.

Abstract: The use of lower precision has emerged as a popular technique to optimize the compute and storage requirements of complex deep neural networks (DNNs). In the quest for lower precision, recent studies have shown that ternary DNNs (which represent weights and activations by signed ternary values) represent a promising sweet spot, achieving accuracy close to full-precision networks on complex tasks. We propose TiM-DNN, a programmable in-memory accelerator that is specifically designed to execute ternary DNNs. TiM-DNN supports various ternary representations including unweighted $\{-1,0,1\}$, symmetric weighted $\{-a,0,a\}$, and asymmetric weighted $\{-a,0,b\}$ ternary systems. The building blocks of TiM-DNN are TiM tiles, specialized memory arrays that perform massively parallel signed ternary vector–matrix multiplications with a single access. TiM tiles are in turn composed of ternary processing cells (TPCs), bit-cells that function as both ternary storage units and signed ternary multiplication units. We evaluate an implementation of TiM-DNN in 32-nm technology using an architectural simulator calibrated with SPICE simulations and RTL synthesis. We evaluate TiM-DNN across a suite of state-of-the-art DNN benchmarks including both deep convolutional and recurrent neural networks. A 32-tile instance of TiM-DNN achieves a peak performance of 114 TOPs/s, consumes 0.9-W power, and occupies 1.96 mm$^2$ chip area, representing a $300\times$ and $388\times$ improvement in TOPS/W and TOPS/mm$^2$, respectively, compared to an NVIDIA Tesla V100 GPU. In comparison to specialized DNN accelerators, TiM-DNN achieves $55\times$–$240\times$ and $160\times$–$291\times$ improvement in TOPS/W and TOPS/mm$^2$, respectively. Finally, when compared to a well-optimized near-memory accelerator for ternary DNNs, TiM-DNN demonstrates $3.9\times$–$4.7\times$ improvement in system-level energy and $3.2\times$–$4.2\times$ speedup, underscoring the potential of in-memory computing for ternary DNNs.
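
The arithmetic that a TiM tile carries out in a single access can be written as a signed ternary vector–matrix product. The sketch below is a purely functional model under assumed conventions: weights are stored as codes in {-1, 0, 1} and decoded to $\{-a,0,b\}$, and the function and parameter names are hypothetical. It also derives the peak efficiency figures implied by the numbers quoted in the abstract (114 TOPs/s, 0.9 W, 1.96 mm$^2$).

```python
import numpy as np

def ternary_vmm(x, w_codes, a=1.0, b=1.0):
    """Signed ternary vector-matrix product.

    x       : input vector with entries in {-1, 0, 1}
    w_codes : weight matrix of ternary codes in {-1, 0, 1}
    a, b    : magnitudes for the negative and positive code, so weights are {-a, 0, b}
              (a == b == 1 is the unweighted system, a == b the symmetric one)
    """
    x = np.asarray(x, dtype=float)
    w_codes = np.asarray(w_codes)
    weights = np.where(w_codes > 0, b, np.where(w_codes < 0, -a, 0.0))
    return x @ weights

def peak_efficiency(tops, watts, mm2):
    """Return (TOPS/W, TOPS/mm^2) from peak throughput, power, and area."""
    return tops / watts, tops / mm2

if __name__ == "__main__":
    codes = [[1, 0], [-1, 1], [1, 1]]
    print(ternary_vmm([1, -1, 0], codes))            # unweighted
    print(ternary_vmm([1, -1, 0], codes, a=2, b=3))  # asymmetric weighted
    print(peak_efficiency(114.0, 0.9, 1.96))
```

With the quoted figures the sketch gives roughly 126.7 TOPS/W and 58.2 TOPS/mm$^2$ for the 32-tile instance; the GPU and accelerator baselines are not modelled.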

••

TL;DR: In this article, parameter guidelines and design techniques for ERSFQ circuits are presented, and the proposed guidelines enable more robust circuits resistant to severe variations in supplied bias currents, while providing a means to decrease the size of an FJTL and, thereby, reduce the physical area, power dissipation, and overall bias currents.

Abstract: Rapid single-flux quantum (RSFQ) circuits have recently attracted considerable attention as a promising cryogenic beyond CMOS technology for exascale computing. Energy-efficient RSFQ (ERSFQ) is an energy-efficient, inductive bias scheme for RSFQ circuits, where the power dissipation is drastically lowered by eliminating the bias resistors, while the cell library remains unchanged. An ERSFQ bias scheme requires the introduction of multiple circuit elements—current limiting Josephson junctions, bias inductors, and feeding Josephson transmission lines (FJTLs). In this article, parameter guidelines and design techniques for ERSFQ circuits are presented. The proposed guidelines enable more robust circuits resistant to severe variations in supplied bias currents. Trends are considered, and advantageous tradeoffs are discussed for the different components within a bias network. The guidelines provide a means to decrease the size of an FJTL and, thereby, reduce the physical area, power dissipation, and overall bias currents, supporting further increases in circuit complexity. A distributed approach to ERSFQ FJTL is also presented to simplify placement and minimize the effects of the parasitic inductance of the bias lines. This methodology and related circuit techniques are applicable to automating the synthesis of bias networks to enable large-scale ERSFQ circuits.

••

TL;DR: This article presents a differential vector modulator-based phase rotator (PR) performing 360° full-span phase interpolation over a first-ever decade-wide instantaneous bandwidth from 2 to 24 GHz.

Abstract: This article presents a differential vector modulator-based phase rotator (PR) performing 360° full-span phase interpolation over a first-ever decade-wide instantaneous bandwidth from 2 to 24 GHz. The proposed PR employs a three-stage transformer poly-phase network with high precision and ultra-wide bandwidth, two highly linear 5-bit variable gain amplifiers (VGAs), a differential series–shunt–series inductor peaking load network for bandwidth extension, and an open-drain buffer. It is implemented in a standard 65-nm bulk CMOS process with a chip area of 1.2 mm $\times$ 1.8 mm. The measurement results demonstrate a maximum rms quantization phase error of 1.22° within a 1.5-dB output magnitude variation for full 360° interpolations from 2 to 24 GHz, and a −3-dB magnitude bandwidth of up to 19 GHz. Moreover, due to the wideband high-quality in-phase/quadrature ($I/Q$) signal generation and high-precision $I/Q$ interpolation of the VGAs, the PR can perform full-span phase synthesis with a constant set of phase shift code settings for all the operating frequencies. For interpolating 22.5°/15° phase steps over the 360° full span, the “one-code” setting operation achieves an rms phase error of 1.56°/1.42° from 3.5 to 22.5 GHz without any frequency-dependent code/look-up table (LUT), tunable element, or band-selection switch. Furthermore, with the “one-code” setting operation, the modulation tests demonstrate measured rms error-vector-magnitude (EVM) values below 5% for a 50-kSym/s QPSK signal from 3.3 to 22.3 GHz and for a 16-quadrature amplitude modulation (16-QAM) signal from 2.7 to 22 GHz.

••

TL;DR: CiM-HE, a CiM architecture that can support operations for the Brakerski/Fan–Vercauteren (B/FV) scheme, a somewhat HE scheme for general computation, is introduced and evaluated on homomorphic multiplications and a set of four end-to-end tasks.

Abstract: Homomorphic encryption (HE) allows direct computations on encrypted data. Despite numerous research efforts, the practicality of HE schemes remains to be demonstrated. In this regard, the enormous size of ciphertexts involved in HE computations degrades computational efficiency. Near-memory processing (NMP) and computing-in-memory (CiM)—paradigms where computation is done within the memory boundaries—represent architectural solutions for reducing latency and energy associated with data transfers in data-intensive applications, such as HE. This article introduces CiM-HE, a CiM architecture that can support operations for the Brakerski/Fan–Vercauteren (B/FV) scheme, a somewhat HE scheme for general computation. CiM-HE hardware consists of customized peripherals, such as sense amplifiers, adders, bit shifters, and sequencing circuits. The peripherals are based on CMOS technology and could support computations with memory cells of different technologies. Circuit-level simulations are used to evaluate our CiM-HE framework assuming a 6T-SRAM memory. We compare our CiM-HE implementation against: 1) two optimized CPU HE implementations and 2) a field-programmable gate array (FPGA)-based HE accelerator implementation. Compared with a CPU solution, CiM-HE obtains speedups between $4.6\times $ and $9.1\times $ and energy savings between $266.4\times $ and $532.8\times $ for homomorphic multiplications (the most expensive HE operation). Also, a set of four end-to-end tasks, i.e., mean, variance, linear regression, and inference, are up to $1.1\times $ , $7.7\times $ , $7.1\times $ , and $7.5\times $ faster (and $301.1\times $ , $404.6\times $ , $532.3\times $ , and $532.8\times $ more energy efficient). Compared with CPU-based HE in previous work, CiM-HE obtains $14.3\times $ speedup and $> 2600\times $ energy savings. Finally, our design offers $2.2\times $ speedup with $88.1\times $ energy savings compared with a state-of-the-art FPGA-based accelerator.

••

TL;DR: An energy-efficient accuracy-configurable Dadda (X-Dadda) multiplier is investigated, which employs the voltage overscaling and approximate width setting as the approximation knobs for improving the energy consumption as well as the reliability and lifetime of the multiplier.

Abstract: This article investigates an energy-efficient accuracy-configurable Dadda (X-Dadda) multiplier. The structure employs voltage overscaling and approximate width setting as the approximation knobs for improving the energy consumption as well as the reliability and lifetime of the multiplier. While the former may be set at design time as well as at runtime, the latter may only be invoked at design time. For a given accuracy level, the partial product columns and the overscaled voltage for optimizing the energy are determined. Normally, to keep the error within a tolerable limit, the voltage-overscaled columns are those at lower bit significances, which have higher switching activities. The structure makes use of a low number of level shifters for a low-overhead realization. The approximate columns, which start from the first column, are contiguous. To further improve the efficiency of the multiplier, four-bit truncation of the multiplier output is also suggested. The efficiency of the X-Dadda structure is investigated using a 15-nm FinFET technology. The results indicate that, for example, when the approximate mode with a mean relative error distance (MRED) of 0.11 is considered, up to 43% energy saving is achieved. In addition, for this case, the bias temperature instability (BTI)-induced delay degradation of the multiplier decreases to as low as 9.9%, compared to 50% in the case of the exact mode. Also, the impact of process variations on the accuracy of the X-Dadda is studied. Finally, the efficacy of the X-Dadda multiplier, when used in neural networks for image classification and image-processing applications, is assessed.

••

TL;DR: In this article, the authors proposed an area-efficient SNG by sharing the permuted output of one linear feedback shift register (LFSR) among several SNGs.

Abstract: Stochastic unary computing provides low-area circuits. However, the area-consuming stochastic number generators (SNGs) required in these circuits can diminish their overall gain in area, particularly if several SNGs are required. We propose area-efficient SNGs by sharing the permuted output of one linear feedback shift register (LFSR) among several SNGs. With no hardware overhead, the proposed architecture generates stochastic bit streams with minimum stochastic computing correlation (SCC). Compared to the circular shifting approach presented in prior work, our approach produces stochastic bit streams with 67% less average SCC when a 10-bit LFSR is shared between two SNGs. To generalize our approach, we propose an algorithm to find a set of $m$ permutations ($n > m > 2$) with a minimum pairwise SCC, for an $n$-bit LFSR. The search space for finding permutations with an exact minimum SCC grows rapidly as $n$ increases, and it is intractable to run a search algorithm using accurately calculated pairwise SCC values for $n > 9$. We propose a similarity function that can be used in the proposed search algorithm to quickly find a set of permutations with SCC values close to the minimum one. We evaluate our approach for several applications. The results show that, compared to prior work, it achieves lower mean-squared error (MSE) with the same (or even lower) area. Additionally, based on simulation results, we show that replacing the comparator component of an SNG circuit with a weighted binary generator can reduce SCC.
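
The idea of sharing one LFSR through a bit permutation can be sketched as follows. This is an illustrative model only: the feedback taps (bits 10 and 7), the comparator-based SNG, the bit-reversal permutation, and all function names are assumptions chosen for the example, not the permutations or the search algorithm from the article. The SCC function uses the commonly cited definition of stochastic computing correlation, normalised to the range [-1, 1].

```python
def lfsr_states(n=10, taps=(10, 7), seed=1):
    """Yield one full period (2**n - 1 states) of a Fibonacci LFSR; taps are 1-indexed bit positions."""
    state = seed
    for _ in range(2 ** n - 1):
        yield state
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & ((1 << n) - 1)

def permute_bits(value, perm):
    """Output bit i takes input bit perm[i]; pure rewiring, so no extra logic is modelled."""
    out = 0
    for i, src in enumerate(perm):
        out |= ((value >> src) & 1) << i
    return out

def sng(target, randoms):
    """Comparator-based SNG: emit 1 whenever the random value is below the target."""
    return [1 if r < target else 0 for r in randoms]

def scc(xs, ys):
    """Stochastic computing correlation of two equal-length bit streams."""
    n = len(xs)
    px, py = sum(xs) / n, sum(ys) / n
    pxy = sum(a & b for a, b in zip(xs, ys)) / n
    delta = pxy - px * py
    if delta > 0:
        denom = min(px, py) - px * py
    else:
        denom = px * py - max(px + py - 1.0, 0.0)
    return delta / denom if denom else 0.0

if __name__ == "__main__":
    states = list(lfsr_states())
    reversed_bits = tuple(range(9, -1, -1))          # hypothetical permutation
    shared = [permute_bits(s, reversed_bits) for s in states]
    a = sng(512, states)
    print(scc(a, sng(512, states)))   # same source: fully correlated
    print(scc(a, sng(512, shared)))   # permuted source: close to zero
```

Feeding two SNGs from the same unpermuted LFSR yields an SCC of 1, whereas the permuted copy drives it near zero for this target value; how to pick permutations that keep SCC low across all target values is the problem the proposed search algorithm addresses.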

••

TL;DR: This work proposes a 2T2R resistive random access memory (ReRAM) architecture that supports three types of CIM operations: 1) ternary content addressable memory (TCAM); 2) logic in-memory (LiM) primitives and arithmetic blocks such as full adder (FA) and full subtractor; and 3) in-memory dot-product for neural networks.

Abstract: Nonvolatile memory (NVM)-based computing in-memory (CIM) is a promising solution to data-intensive applications. This work proposes a 2T2R resistive random access memory (ReRAM) architecture that supports three types of CIM operations: 1) ternary content addressable memory (TCAM); 2) logic in-memory (LiM) primitives and arithmetic blocks such as full adder (FA) and full subtractor; and 3) in-memory dot-product for neural networks. The proposed architecture allows the NVM operations in both 2T2R and conventional 1T1R configurations. The proposed LiM full adder (LiM-FA) improves the delay, the static power, and the dynamic power by $3.2\times $ , $1.2\times $ , and $1.6\times $ , respectively, compared with state-of-the-art LiM-FAs. Furthermore, based on different optimization techniques and robustness analysis, a lower precharge voltage is set for each mode. This reduces the TCAM search energy and 1T1R ReRAM access energy by $1.6\times $ and $1.14\times $ , respectively, compared with the case without optimizations.
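
Of the three CIM operations listed, the TCAM search is the easiest to state behaviourally: every stored word is compared with the search key in parallel, and a stored "don't care" symbol matches either key bit. The sketch below models only that behaviour; the string encoding with `X` as the don't-care symbol and the function name are assumptions, and nothing here reflects the 2T2R cell, the precharge voltage, or the energy figures.

```python
def tcam_search(entries, key):
    """Return the indices of all stored words that match the key.

    entries : list of strings over '0', '1', and 'X' (don't care)
    key     : string of '0'/'1' with the same width as the entries
    """
    matches = []
    for index, word in enumerate(entries):
        if len(word) != len(key):
            continue  # width mismatch can never match
        if all(stored in ("X", bit) for stored, bit in zip(word, key)):
            matches.append(index)
    return matches

if __name__ == "__main__":
    table = ["10X1", "1XX0", "0000"]
    for k in ("1011", "1010", "0000", "0111"):
        print(k, tcam_search(table, k))
```

In hardware all rows are evaluated at once on their match lines, whereas this model simply loops over them.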

••

TL;DR: This article presents a highly integrated design flow that encompasses architecture, circuit, and package to build and simulate heterogeneous 2.5-D integrated chip (IC) designs and performs DSE studies for power delivery scheme and interposer technology to investigate the tradeoffs.

Abstract: A new trend in system-on-chip (SoC) design is chiplet-based IP reuse using 2.5-D integration. Complete electronic systems can be created through the integration of chiplets on an interposer, rather than through a monolithic flow. This approach expands access to a large catalog of off-the-shelf intellectual properties (IPs), allows their reuse, and enables heterogeneous integration of blocks in different technologies. In this article, we present a highly integrated design flow that encompasses architecture, circuit, and package to build and simulate heterogeneous 2.5-D designs. Our target design is a 64-core architecture based on the Reduced Instruction Set Computer (RISC)-V processor. We first chipletize each IP by adding logical protocol translators and physical interface modules. We convert a given register transfer level (RTL) description of the 64-core processor into chiplets, which are enhanced with our centralized network-on-chip. Next, we use our tool to obtain physical layouts, which are subsequently used to synthesize chip-to-chip I/O drivers, and these chiplets are placed and routed on a silicon interposer. Our package models are used to calculate the power, performance, and area (PPA) and the reliability of the 2.5-D design. Our design space exploration (DSE) study shows that 2.5-D integration incurs $1.29\times$ power and $2.19\times$ area overheads compared with its 2-D counterpart. Moreover, we perform DSE studies for the power delivery scheme and interposer technology to investigate the tradeoffs in 2.5-D integrated chip (IC) designs.