
Showing papers on "Adder" published in 2020


Journal ArticleDOI
11 Mar 2020-Nature
TL;DR: This work provides a viable platform for scalable all-electric magnetic logic, paving the way for memory-in-logic applications and demonstrates electrical control of magnetic data and device interconnection in logic circuits.
Abstract: Spin-based logic architectures provide nonvolatile data retention, near-zero leakage, and scalability, extending the technology roadmap beyond complementary metal-oxide-semiconductor logic1-13. Architectures based on magnetic domain walls take advantage of the fast motion, high density, non-volatility and flexible design of domain walls to process and store information1,3,14-16. Such schemes, however, rely on domain-wall manipulation and clocking using an external magnetic field, which limits their implementation in dense, large-scale chips. Here we demonstrate a method for performing all-electric logic operations and cascading using domain-wall racetracks. We exploit the chiral coupling between neighbouring magnetic domains induced by the interfacial Dzyaloshinskii-Moriya interaction17-20, which promotes non-collinear spin alignment, to realize a domain-wall inverter, the essential basic building block in all implementations of Boolean logic. We then fabricate reconfigurable NAND and NOR logic gates, and perform operations with current-induced domain-wall motion. Finally, we cascade several NAND gates to build XOR and full adder gates, demonstrating electrical control of magnetic data and device interconnection in logic circuits. Our work provides a viable platform for scalable all-electric magnetic logic, paving the way for memory-in-logic applications.
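The gate cascading described above (NAND gates composed into XOR and a full adder) can be illustrated with a plain Boolean sketch in Python; this mirrors only the logic composition, not the magnetic domain-wall implementation itself.

def nand(a, b):
    return 1 - (a & b)

def xor(a, b):
    # classic four-NAND construction of XOR
    n1 = nand(a, b)
    return nand(nand(a, n1), nand(b, n1))

def full_adder(a, b, cin):
    # sum and carry-out built only from NAND-based gates
    p = xor(a, b)
    s = xor(p, cin)
    cout = nand(nand(a, b), nand(p, cin))  # (a AND b) OR (p AND cin)
    return s, cout

# exhaustive check against integer addition
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin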

247 citations


Journal ArticleDOI
01 Jul 2020
TL;DR: It is shown that a homojunction device made from two-dimensional tungsten diselenide can exhibit diverse field-effect characteristics controlled by polarity combinations of the gate and drain voltage inputs, which suggests that the devices could be cascaded to create complex circuits.
Abstract: Reconfigurable logic and neuromorphic devices are crucial for the development of high-performance computing. However, creating reconfigurable devices based on conventional complementary metal–oxide–semiconductor technology is challenging due to the limited field-effect characteristics of the fundamental silicon devices. Here we show that a homojunction device made from two-dimensional tungsten diselenide can exhibit diverse field-effect characteristics controlled by polarity combinations of the gate and drain voltage inputs. These electrically tunable devices can achieve reconfigurable multifunctional logic and neuromorphic capabilities. With the same logic circuit, we demonstrate a 2:1 multiplexer, D-latch and 1-bit full adder and subtractor. These functions exhibit a full-swing output voltage and the same supply and signal voltage, which suggests that the devices could be cascaded to create complex circuits. We also show that synaptic circuits based on only three homojunction devices can achieve reconfigurable spiking-timing-dependent plasticity and pulse-tunable synaptic potentiation or depression characteristics; the same function using complementary metal–oxide–semiconductor devices would require more than ten transistors. A homojunction device made from two-dimensional tungsten diselenide can be used to create circuits that exhibit multifunctional logic and neuromorphic capabilities with simpler designs than conventional silicon-based systems.

159 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This paper develops a special back-propagation approach for AdderNets by investigating the full-precision gradient, and proposes an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron's gradient.
Abstract: Compared with the cheap addition operation, multiplication is of much higher computational complexity. The widely used convolutions in deep neural networks are in fact cross-correlations that measure the similarity between input features and convolution filters, which involves massive multiplications between floating-point values. In this paper, we present adder networks (AdderNets) that trade these massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, we take the L1-norm distance between filters and input features as the output response. The influence of this new similarity measure on the optimization of neural networks is thoroughly analyzed. To achieve better performance, we develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. We then propose an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron's gradient. As a result, the proposed AdderNets achieve 74.9% Top-1 accuracy and 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplication in the convolutional layers. The code is publicly available at https://github.com/huaweinoah/AdderNet.
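A minimal NumPy sketch (illustrative names and shapes, not the authors' code) of the contrast the abstract draws: a conventional convolution response is a sum of products, whereas the AdderNet response is the negative L1 distance between the filter and the input patch, computed with additions and subtractions only.

import numpy as np

def conv_response(patch, filt):
    # standard cross-correlation: sum of element-wise products
    return float(np.sum(patch * filt))

def adder_response(patch, filt):
    # AdderNet-style response: negative L1-norm distance, no feature multiplications
    return float(-np.sum(np.abs(patch - filt)))

rng = np.random.default_rng(0)
patch = rng.standard_normal((3, 3, 16))   # hypothetical 3x3x16 input patch
filt = rng.standard_normal((3, 3, 16))    # hypothetical filter of the same shape
print(conv_response(patch, filt), adder_response(patch, filt))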

155 citations


Journal ArticleDOI
12 Aug 2020
TL;DR: A comprehensive survey and a comparative evaluation of recently developed approximate arithmetic circuits under different design constraints, synthesized and characterized under optimizations for performance and area.
Abstract: Approximate computing has emerged as a new paradigm for high-performance and energy-efficient design of circuits and systems. For the many approximate arithmetic circuits proposed, it has become critical to understand a design or approximation technique for a specific application to improve performance and energy efficiency with a minimal loss in accuracy. This article aims to provide a comprehensive survey and a comparative evaluation of recently developed approximate arithmetic circuits under different design constraints. Specifically, approximate adders, multipliers, and dividers are synthesized and characterized under optimizations for performance and area. The error and circuit characteristics are then generalized for different classes of designs. The applications of these circuits in image processing and deep neural networks indicate that the circuits with lower error rates or error biases perform better in simple computations, such as the sum of products, whereas more complex accumulative computations that involve multiple matrix multiplications and convolutions are vulnerable to single-sided errors that lead to a large error bias in the computed result. Such complex computations are more sensitive to errors in addition than those in multiplication, so a larger approximation can be tolerated in multipliers than in adders. The use of approximate arithmetic circuits can improve the quality of image processing and deep learning in addition to the benefits in performance and power consumption for these applications.
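The single-sided error bias discussed above can be seen in a simple, well-known approximate adder; the sketch below (an illustrative lower-part-OR style adder, not any specific circuit from the survey) replaces the k least-significant bits of the addition with a bitwise OR, so the result never exceeds the exact sum and the error is one-sided.

def lower_or_add(a, b, k=4):
    # approximate the k LSBs with OR and drop their carry into the upper part
    mask = (1 << k) - 1
    lower = (a | b) & mask
    upper = ((a >> k) + (b >> k)) << k
    return upper | lower

errors = [(a + b) - lower_or_add(a, b) for a in range(256) for b in range(256)]
print(min(errors), max(errors), sum(errors) / len(errors))  # errors are all >= 0: a one-sided, biased error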

143 citations


Journal ArticleDOI
Shin Nishio1, Yulu Pan1, Takahiko Satoh1, Hideharu Amano1, Rodney Van Meter1 
TL;DR: In this paper, the authors focus on the fact that the error rates of individual qubits are not equal, with a goal of maximizing the success probability of real-world subroutines such as an adder circuit.
Abstract: NISQ (Noisy, Intermediate-Scale Quantum) computing requires error mitigation to achieve meaningful computation. Our compilation tool development focuses on the fact that the error rates of individual qubits are not equal, with a goal of maximizing the success probability of real-world subroutines such as an adder circuit. We begin by establishing a metric for choosing among possible paths and circuit alternatives for executing gates between variables placed far apart within the processor, and test our approach on two IBM 20-qubit systems named Tokyo and Poughkeepsie. We find that a single-number metric describing the fidelity of individual gates is a useful but imperfect guide. Our compiler uses this subsystem and maps complete circuits onto the machine using a beam search-based heuristic that will scale as processor and program sizes grow. To evaluate the whole compilation process, we compiled and executed adder circuits, then calculated the Kullback–Leibler divergence (KL-divergence, a measure of the distance between two probability distributions). For a circuit within the capabilities of the hardware, our compilation increases estimated success probability and reduces KL-divergence relative to an error-oblivious placement.
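The KL-divergence figure used in the evaluation above can be computed directly from the ideal and measured output distributions of a compiled circuit; the sketch below uses made-up example counts, not data from the paper.

import math

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); eps guards against empty bins in Q
    return sum(px * math.log(px / max(q.get(x, 0.0), eps)) for x, px in p.items() if px > 0)

ideal = {"011": 1.0}                                              # noiseless adder outcome
measured = {"011": 0.82, "010": 0.07, "111": 0.06, "001": 0.05}   # hypothetical measured frequencies
print(kl_divergence(ideal, measured))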

80 citations


Journal ArticleDOI
TL;DR: In this article, the authors examine Quantum-dot cellular automata (QCA), one of the most promising emerging paradigms proposed as a substitute for current MOSFET technology.
Abstract: Quantum-dot cellular automata (QCA) is one of the most promising emerging paradigms offered for substitution of ongoing MOSFET technology. In order to qualify the QCA technology, all the previously...

78 citations


Journal ArticleDOI
TL;DR: Based on the simulation results, it can be stated that the proposed hybrid FA circuit is an attractive alternative in the data path design of modern high-speed Central Processing Units.
Abstract: A novel design of a hybrid Full Adder (FA) using Pass Transistors (PTs), Transmission Gates (TGs) and Conventional Complementary Metal Oxide Semiconductor (CCMOS) logic is presented. Performance analysis of the circuit has been conducted using the Cadence toolset. For comparative analysis, the performance parameters have been compared with twenty existing FA circuits. The proposed FA has also been extended up to a word length of 64 bits in order to test its scalability. Only the proposed FA and five of the existing designs are able to operate without buffers in the intermediate stages when extended to 64 bits. According to the simulation results, the proposed design demonstrates notable performance in power consumption and delay, which accounts for its low power-delay product. Based on the simulation results, it can be stated that the proposed hybrid FA circuit is an attractive alternative in the data path design of modern high-speed Central Processing Units.

67 citations


Journal ArticleDOI
TL;DR: The proposed ternary full adder has a significant improvement in the power-delay product (PDP) over previous designs and is applicable to both unbalanced (0, 1, 2) and balanced (−1, 0, 1) ternary logic.
Abstract: We propose a logic synthesis methodology with a novel low-power circuit structure for ternary logic. The proposed methodology synthesizes a ternary function as a ternary logic gate using carbon nanotube field-effect transistors (CNTFETs). The circuit structure uses the body effect to mitigate the excessive power consumption for the third logic value. Energy-efficient ternary logic circuits are designed with a combination of synthesized low-power ternary logic gates. The proposed methodology is applicable to both unbalanced (0, 1, 2) and balanced (−1, 0, 1) ternary logic. To verify the improvement in energy efficiency, we have designed various ternary arithmetic logic circuits using the proposed methodology. The proposed ternary full adder has a significant improvement in the power-delay product (PDP) over previous designs. Ternary benchmark circuits have been designed to show that complex ternary functions can be implemented as more efficient circuits with the proposed methodology.
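At the arithmetic level, the function a ternary full adder realizes is simple; the sketch below states it for the unbalanced digit set (0, 1, 2), purely to make the specification concrete (the paper's contribution is the CNTFET gate-level synthesis, not this behavioral model).

def ternary_full_adder(a, b, cin):
    # a, b are trits in {0, 1, 2}; carries are binary in ternary addition
    total = a + b + cin
    return total % 3, total // 3   # (sum trit, carry out)

for a in range(3):
    for b in range(3):
        for cin in range(2):
            s, c = ternary_full_adder(a, b, cin)
            assert 3 * c + s == a + b + cin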

64 citations


Journal ArticleDOI
TL;DR: In this research, an optical half-adder using two-dimensional photonic crystals was designed and simulated; it has a larger power difference between the high and low logic modes, which reduces errors in detecting these two values at the output.
Abstract: The use of optical devices for high-speed data transmission has been considered for some time, and structures that can be used in optical integrated circuits are therefore very important. Photonic crystals have been used as basic structures in the design of optical devices, and especially logic devices. Given the ability of these structures to implement logic gates and circuits, they are expected to serve as base structures in the design of optical integrated circuits. In this research, an optical half-adder using two-dimensional photonic crystals was designed and simulated. One of the features of this circuit is the larger power difference between the high and low logic modes, which reduces errors in detecting these two values at the output. The circuit also provides the functionality of the optical XOR and AND gates. In addition, it has a small structure that makes it suitable for use in optical integrated circuits.

63 citations


Journal ArticleDOI
TL;DR: The effectiveness of the proposed approximate adder is compared with state-of-the-art approximate adders using a cost function based on the energy, delay, area, and output quality and results indicate an average of 50% reduction in terms of the cost function compared to other approximateAdders.
Abstract: In this brief, a low energy consumption block-based carry speculative approximate adder is proposed. Its structure is based on partitioning the adder into some non-overlapped summation blocks whose structures may be selected from both the carry propagate and parallel-prefix adders. Here, the carry output of each block is speculated based on the input operands of the block itself and those of the next block. In this adder, the length of the carry chain is reduced to two blocks (worst case), where in most cases only one block is employed to calculate the carry output leading to a lower average delay. In addition, to increase the accuracy and reduce the output error rate, an error detection and recovery mechanism is proposed. The effectiveness of the proposed approximate adder is compared with state-of-the-art approximate adders using a cost function based on the energy, delay, area, and output quality. The results indicate an average of 50% reduction in terms of the cost function compared to other approximate adders.
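A generic behavioral sketch of block-based carry speculation (block size and speculation rule chosen for illustration, not the authors' exact scheme): the operands are split into k-bit blocks and each block's carry-in is guessed from the neighbouring block's operands only, so the carry chain never spans the full word.

def speculative_block_add(a, b, k=4, width=16):
    mask = (1 << k) - 1
    result = 0
    for lo in range(0, width, k):
        a_blk, b_blk = (a >> lo) & mask, (b >> lo) & mask
        if lo == 0:
            cin = 0
        else:
            # speculate the carry from the previous block's operands alone
            cin = (((a >> (lo - k)) & mask) + ((b >> (lo - k)) & mask)) >> k
        result |= ((a_blk + b_blk + cin) & mask) << lo
    return result

print(hex(0x3FFF + 0x0001), hex(speculative_block_add(0x3FFF, 0x0001)))  # long carry chains can be mispredicted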

58 citations


Journal ArticleDOI
TL;DR: In this article, an ultra-fast all-optical half adder based on nonlinear ring resonators is proposed, which is an appropriate candidate for photonic integrated circuits used in the next generation of alloptical CPUs.
Abstract: Half adders and half subtractors are the basic building blocks of the arithmetic logic unit used in every optical central processing unit (CPU) to provide computational operators. In this paper, we aim to design an ultrafast all-optical half adder based on nonlinear ring resonators. The proposed structure consists of concurrent designs of the AND and XOR logic gates inside a rod-based photonic crystal microstructure. Linear dielectric rods made of silicon and nonlinear dielectric rods composed of doped glass are used to design the nonlinear ring resonators as the fundamental blocks of the half adder. We demonstrate that as the intensity of the incoming light increases, the nonlinear Kerr effect appears and the total refractive index increases. This diverts the direction of light propagation to the desired nonlinear ring resonator depending on the signal wavelength, the radius of the rods and the lattice constant. Finally, after several resonances, the light is coupled to the output. Our numerical simulations using a two-dimensional finite-difference time-domain method reveal that, depending on the light intensity, the maximum and minimum transmissions of the half adder are 100% and 96%, respectively. The calculations also show that the delay of the designed half adder is 3.6 ps. Owing to its small area of 249.75 µm², the proposed half adder is an appropriate candidate for photonic integrated circuits used in the next generation of all-optical CPUs.

Journal ArticleDOI
TL;DR: The proposed approximate adders (AAs) and Approximate Dadda Multipliers (ADMs) are synthesized with the Cadence Register-Transfer Level (RTL) compiler, and their design metrics are compared across three different technology nodes.

Posted Content
Dehua Song1, Yunhe Wang1, Hanting Chen1, Chang Xu2, Chunjing Xu1, Dacheng Tao2 
TL;DR: This paper thoroughly analyzes the relationship between the adder operation and the identity mapping, inserts shortcuts to enhance the performance of SR models using adder networks, and develops a learnable power activation for adjusting the feature distribution and refining details.
Abstract: This paper studies the single image super-resolution problem using adder neural networks (AdderNet). Compared with convolutional neural networks, AdderNet uses additions to calculate the output features, thus avoiding the massive energy consumption of conventional multiplications. However, it is very hard to directly transfer the existing success of AdderNet on large-scale image classification to the image super-resolution task because of the different calculation paradigm. Specifically, the adder operation cannot easily learn the identity mapping, which is essential for image processing tasks. In addition, the functionality of high-pass filters cannot be ensured by AdderNet. To this end, we thoroughly analyze the relationship between the adder operation and the identity mapping and insert shortcuts to enhance the performance of SR models using adder networks. We then develop a learnable power activation for adjusting the feature distribution and refining details. Experiments conducted on several benchmark models and datasets demonstrate that our image super-resolution models using AdderNet can achieve comparable performance and visual quality to their CNN baselines with about a 2× reduction in energy consumption.

Journal ArticleDOI
TL;DR: A novel three-input XOR gate based on a cell-interaction design is presented; it can be used as a multifunctional gate by fixing one of the structure's inputs, which allows two-input XOR or XNOR gates to be easily implemented.

Journal ArticleDOI
TL;DR: A new architecture for a digital full-adder is presented, which is up to 41% faster than existing IMPLY-based serial designs while requiring up to 78% less area (memristors) compared to the existing parallel design.
Abstract: Passive implementation of memristors has led to several innovative works in the field of electronics. Despite being primarily a candidate for memory applications, memristors have proven to be beneficial in several other circuits and applications as well. One of the use cases is the implementation of digital circuits such as adders. Among several logic implementations using memristors, IMPLY logic is one of the promising candidates. In this brief, we present a new architecture for a digital full-adder, which is up to 41% faster than existing IMPLY-based serial designs while requiring up to 78% less area (memristors) compared to the existing parallel design.
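The IMPLY primitive itself is easy to state at the logic level: p IMPLY q equals NOT p OR q, and together with a FALSE (reset) operation it is functionally complete. The sketch below shows the standard two-step NAND construction, only to make the logic family concrete; the paper's contribution is the memristive adder architecture built from such steps.

def imply(p, q):
    # material implication: p IMPLY q = NOT p OR q
    return (1 - p) | q

def nand_via_imply(a, b):
    work = 0                 # work memristor reset to 0 (the FALSE operation)
    t = imply(a, work)       # t = NOT a
    return imply(b, t)       # NOT b OR NOT a = NAND(a, b)

for a in (0, 1):
    for b in (0, 1):
        assert nand_via_imply(a, b) == 1 - (a & b)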

Journal ArticleDOI
TL;DR: This paper presents QCA-based combinational circuit designs, such as a half adder and a full adder, built from only one uniform layer of cells using a novel XOR gate.
Abstract: Quantum-dot Cellular Automata (QCA) is a new technology for designing digital circuits at the nanoscale. This technology utilizes quantum dots rather than diodes and transistors. QCA supplies a new computation platform in which binary data can be represented by polarized cells, defined by the electron configurations inside the cell. This paper presents QCA-based combinational circuit designs, such as a half adder and a full adder, built from only one uniform layer of cells. The proposed design is accomplished using a novel XOR gate. The proposed XOR gate has a 50% speed improvement and a 35% reduction in the number of cells needed over the best reported XOR. The results from the QCADesigner software show that the proposed designs have less complexity and lower power consumption than previous designs.

Journal ArticleDOI
TL;DR: This work uses a physics-based compact model to study an innovative smart IMPLY (SIMPLY) logic scheme which exploits the peripheral circuitry embedded in ordinary IMPLy architectures to solve the mentioned reliability issues, drastically reducing the energy consumption and setting clear design strategies.
Abstract: Low-power smart devices are becoming pervasive in our world. Thus, relevant research efforts are directed at the development of innovative low-power computing solutions that enable in-memory computation of logic operations, thereby avoiding the von Neumann bottleneck, i.e., the known showstopper of traditional computing architectures. Emerging non-volatile memory technologies, in particular Resistive Random Access Memories (RRAMs), have been shown to be particularly suitable for implementing logic-in-memory (LIM) circuits based on material implication logic (IMPLY). However, RRAM device non-idealities, logic-state degradation, and a narrow design space limit the adoption of this logic scheme. In this work, we use a physics-based compact model to study an innovative smart IMPLY (SIMPLY) logic scheme which exploits the peripheral circuitry embedded in ordinary IMPLY architectures to solve the mentioned reliability issues, drastically reducing the energy consumption and setting clear design strategies. We then use SIMPLY to implement a 1-bit full adder and compare the results with other LIM solutions proposed in the literature.

Journal ArticleDOI
TL;DR: It is demonstrated that reconfigurable constant coefficient multipliers (RCCMs) offer a better alternative for saving the silicon area than utilizing low-precision arithmetic for deep-learning applications on field-programmable gate arrays (FPGAs).
Abstract: Low-precision arithmetic operations to accelerate deep-learning applications on field-programmable gate arrays (FPGAs) have been studied extensively, because they offer the potential to save silicon area or increase throughput. However, these benefits come at the cost of a decrease in accuracy. In this article, we demonstrate that reconfigurable constant coefficient multipliers (RCCMs) offer a better alternative for saving silicon area than utilizing low-precision arithmetic. RCCMs multiply input values by a restricted choice of coefficients using only adders, subtractors, bit shifts, and multiplexers (MUXes), meaning that they can be heavily optimized for FPGAs. We propose a family of RCCMs tailored to FPGA logic elements to ensure their efficient utilization. To minimize information loss from quantization, we then develop novel training techniques that map the possible coefficient representations of the RCCMs to neural network weight parameter distributions. This enables the usage of the RCCMs in hardware while maintaining high accuracy. We demonstrate the benefits of these techniques using AlexNet, ResNet-18, and ResNet-50 networks. The resulting implementations achieve up to 50% resource savings over traditional 8-bit quantized networks, translating to significant speedups and power savings. Our RCCM with the lowest resource requirements exceeds 6-bit fixed-point accuracy, while all other RCCM implementations achieve at least similar accuracy to an 8-bit uniformly quantized design, with significant resource savings.
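The shift-and-add idea behind a constant coefficient multiplier can be sketched in a few lines; the coefficient set below is purely illustrative (the paper's RCCMs support a restricted, FPGA-tailored coefficient set selected through MUXes).

def multiply_by_constant(x, coeff):
    # each supported coefficient maps to a fixed shift/add/subtract pattern
    if coeff == 6:
        return (x << 2) + (x << 1)    # 6x = 4x + 2x
    if coeff == 7:
        return (x << 3) - x           # 7x = 8x - x
    if coeff == 10:
        return (x << 3) + (x << 1)    # 10x = 8x + 2x
    raise ValueError("coefficient not in the restricted set")

for x in range(-8, 8):
    for c in (6, 7, 10):
        assert multiply_by_constant(x, c) == x * c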

Journal ArticleDOI
TL;DR: A full-swing, high-speed hybrid Full Adder cell based on the Gate Diffusion Input technique and Conventional Complementary Metal-Oxide-Semiconductor (CCMOS) logic is proposed; it achieves the best performance parameters in large cascaded circuits.

Journal ArticleDOI
TL;DR: An efficient single-layer serial-parallel multiplier (SPM) in Quantum-dot Cellular Automata (QCA) is presented, built around a bit-serial adder that uses a fully utilized majority gate (MV) and a modified E-shaped exclusive-OR (E-XOR) gate.
Abstract: This brief presents an efficient single-layer serial-parallel multiplier (SPM) in Quantum-dot Cellular Automata (QCA). We have designed a bit-serial adder (BSA) using a fully utilized majority gate (MV), and a modified E-shaped exclusive-OR (E-XOR) gate. The cell-interactive properties of the QCA cell have been utilized to realize the proposed E-XOR gate. This new gate leads the proposed SPM to achieve a reduction in cell count and area by 30% and 19%, 29% and 24%, 30% and 22%, 32% and 39%, and 36% and 46% for 4-, 8-, 16-, 32-, and 64-bit multipliers, respectively. All proposed circuits have been simulated and verified by using QCADesigner with a coherence vector simulation engine. In addition, the average switching and leakage energy dissipation are estimated using QCAPro tool.
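A bit-serial adder processes one operand bit pair per clock while a single stored carry is updated, which is why only one adder cell is needed; the behavioral sketch below illustrates that scheduling (the QCA majority-gate realization in the paper is, of course, quite different).

def bit_serial_add(a_bits, b_bits):
    # operand bits arrive least-significant first; one carry register is reused every cycle
    carry, out = 0, []
    for ai, bi in zip(a_bits, b_bits):
        out.append(ai ^ bi ^ carry)
        carry = (ai & bi) | (carry & (ai ^ bi))
    return out + [carry]

# 6 + 3 = 9, streamed LSB-first over four cycles
print(bit_serial_add([0, 1, 1, 0], [1, 1, 0, 0]))   # -> [1, 0, 0, 1, 0]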

Journal ArticleDOI
TL;DR: This work presents a semi-serial IMPLY-based adder and proposes an IMPLY-based multiplier, which is shown to be more than 5× better than other works based on a figure of merit that gives equal weight to the number of steps and the required die area.
Abstract: Memristors are among emerging technologies with many promising features, which makes them suitable not only for storage purposes but also for computation. In this work, focusing on in-memory computation, we first present our semi-serial IMPLY-based adder and perform an extensive analysis of its merits. In addition to providing a favorable balance between the number of steps and the number of memristors, a key property of the presented adder is its compactness compared to state-of-the-art adders. Next, using our semi-serial adder, we propose an IMPLY-based multiplier. We show that the proposed multiplier is more than 5× better than other works based on the figure of merit which gives equal weight to the number of steps (i.e., speed) and the required die area. Additionally, we provide deeper insight into IMPLY-based arithmetic units, their properties, design characteristics, and advantages or disadvantages compared to one another by proposing new figures of merit and performing comprehensive comparative analyses. This facilitates the process of designing, or selecting, suitable units for design engineers and researchers in the field.

Journal ArticleDOI
TL;DR: A novel architecture based on multiple-parallel-branch with folding (MPBF) technique is proposed, which parallelizes the branches and reuses the multiplier and adder in each folded branch so that the tradeoff between throughput and the usage of the hardware resources is balanced.
Abstract: Multichannel active noise control (MCANC) is widely recognized as an effective and efficient solution for acoustic noise and vibration cancellation, such as in high-dimensional ventilation ducts, open windows, and mechanical structures. The feedforward multichannel filtered-x least mean square (FFMCFxLMS) algorithm is commonly used to dynamically adjust the transfer function of the multichannel controllers for different noise environments. The computational load incurred by the FFMCFxLMS algorithm, however, increases exponentially with increasing channel count, thus requiring high-end field-programmable gate array (FPGA) processors. Nevertheless, such processors still need specific configurations to cope with soaring computing loads as the channel count increases. To achieve a high-efficiency implementation of the FFMCFxLMS algorithm with floating-point arithmetic, a novel architecture based on the multiple-parallel-branch with folding (MPBF) technique is proposed. This architecture parallelizes the branches and reuses the multiplier and adder in each folded branch so that the tradeoff between throughput and the usage of hardware resources is balanced. The proposed architecture is validated in an experimental setup that implements the FFMCFxLMS algorithm for an MCANC system with 24 reference sensors, 24 secondary sources, and 24 error sensors, at sampling and throughput rates of 25 kHz and 260 Mb/s, respectively.

Journal ArticleDOI
TL;DR: This paper shows that the ultra-dense co-integration of FeFETs and nFETs (28nm HKMG) with shared active area does not alter the FeFET's switching behavior, nor does it affect the baseline CMOS.
Abstract: Due to their CMOS compatibility, hafnium oxide based ferroelectric field-effect transistors (FeFETs) have gained remarkable attention recently, not only in the context of nonvolatile memory applications but also as an auspicious candidate for novel combined memory and logic applications. In addition to bringing nonvolatility into existing logic circuits (Memory-in-Logic), FeFETs promise to guide the way to compact Logic-in-Memory solutions, where logic computations are performed in memory arrays or array-like structures. To increase the area-efficiency of such circuits, a dense integration of FeFETs and standard FETs is essential. In this paper, we show that the ultra-dense co-integration of FeFETs and nFETs (28nm HKMG) with shared active area does not alter the FeFET's switching behavior, nor does it affect the baseline CMOS. Based on this, we propose the integration of a FeFET-based, 2-input look-up table (memory) directly into a 4-to-1 multiplexer (logic), which is utilized directly in a 2TNOR memory array or as a stand-alone circuit. The latter dramatically reduces the transistor count by at least 33% compared to similar FeFET-based circuits. By storing the values of the look-up table in a nonvolatile manner, no energy is consumed during standby mode, which enables normally-off computing. To take another step towards novel Logic-in-Memory designs, we experimentally demonstrate a very compact in-array 2T half adder and simulate an array-like 14T full adder, which exploit the advantages of the array arrangement: an easy write procedure and a very compact, robust design. The proposed circuits exhibit energy efficiency in the (sub-)fJ range and operating speeds of 1 GHz.
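The Logic-in-Memory idea of storing a 2-input look-up table and reading it out through a 4-to-1 multiplexer can be sketched behaviorally as follows (a plain software model, not the FeFET circuit): the two logic inputs drive the MUX select lines, so the same structure computes any 2-input Boolean function depending on the stored values.

def lut2(stored, a, b):
    # stored = (f(0,0), f(0,1), f(1,0), f(1,1)); the inputs act as the MUX select
    return stored[(a << 1) | b]

XOR_TABLE = (0, 1, 1, 0)
AND_TABLE = (0, 0, 0, 1)
for a in (0, 1):
    for b in (0, 1):
        assert lut2(XOR_TABLE, a, b) == a ^ b
        assert lut2(AND_TABLE, a, b) == a & b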

Posted Content
TL;DR: This paper proposes the first solution that applies different quantization schemes to different rows of the weight matrix, together with an FPGA-centric mixed-scheme quantization (MSQ) that ensembles the proposed SP2 and fixed-point schemes and can maintain, or even increase, accuracy due to better matching with weight distributions.
Abstract: Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model size and computation amount, model compression is a critical step in deploying DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods that use the same quantization scheme for all weights, we propose the first solution that applies different quantization schemes to different rows of the weight matrix. It is motivated by (1) the distributions of the weights in different rows are not the same; and (2) the potential for better utilization of heterogeneous FPGA hardware resources. To achieve that, we first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2) suitable for Gaussian-like weight distributions, in which the multiplication arithmetic can be replaced with a logic shifter and adder, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the existing fixed-point quantization is suitable for uniform-like weight distributions and can be implemented efficiently by DSP. Then, to fully exploit the resources, we propose an FPGA-centric mixed-scheme quantization (MSQ) with an ensemble of the proposed SP2 and the fixed-point schemes. Combining the two schemes can maintain, or even increase, accuracy due to better matching with weight distributions.
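A small sketch of the SP2 idea (assumed exponent range and search, not the paper's exact quantizer): each weight is approximated by a signed sum of two powers of two, so multiplying an activation by it reduces to two power-of-two scalings (shifts in hardware) and one add.

import itertools

def sp2_quantize(w, exponents=range(-4, 3)):
    # choose the sign and exponent pair whose value is closest to w
    best = min(((s, i, j) for s in (1, -1)
                for i, j in itertools.combinations_with_replacement(exponents, 2)),
               key=lambda t: abs(w - t[0] * (2.0 ** t[1] + 2.0 ** t[2])))
    s, i, j = best
    return s, i, j, s * (2.0 ** i + 2.0 ** j)

def sp2_multiply(x, s, i, j):
    # multiplier-free product: two power-of-two scalings plus one addition
    return s * (x * 2.0 ** i + x * 2.0 ** j)

s, i, j, wq = sp2_quantize(0.3)
print(wq, sp2_multiply(1.7, s, i, j), 1.7 * wq)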

Proceedings ArticleDOI
09 Mar 2020
TL;DR: A photonic nonvolatile memory (NVM)-based accelerator, LightBulb, is proposed to process binarized CNNs with high-frequency photonic XNOR gates and popcount units; it adopts photonic racetrack memory to serve as input/output registers to achieve a high operating frequency.
Abstract: Although Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art inference accuracy in various intelligent applications, each CNN inference involves millions of expensive floating-point multiply-accumulate (MAC) operations. To process CNN inferences energy-efficiently, prior work proposes an electro-optical accelerator to process power-of-2 quantized CNNs by electro-optical ripple-carry adders and optical binary shifters. The electro-optical accelerator also uses SRAM registers to store intermediate data. However, electro-optical ripple-carry adders and SRAMs seriously limit the operating frequency and inference throughput of the electro-optical accelerator, due to the long critical path of the adder and the long access latency of SRAMs. In this paper, we propose a photonic nonvolatile memory (NVM)-based accelerator, LightBulb, to process binarized CNNs by high-frequency photonic XNOR gates and popcount units. LightBulb also adopts photonic racetrack memory to serve as input/output registers to achieve a high operating frequency. Compared to prior electro-optical accelerators, on average, LightBulb improves the CNN inference throughput by 17× ~ 173× and the inference throughput per Watt by 175× ~ 660×.
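The XNOR/popcount arithmetic that such binarized-CNN accelerators implement can be written down in a few lines; the sketch below is a generic software model (bit packing and sizes are illustrative), where weights and activations in {-1, +1} are packed as bits and a dot product becomes an XNOR followed by a population count.

def binary_dot(a_bits, w_bits, n):
    # bit value 1 encodes +1 and bit value 0 encodes -1
    matches = bin(~(a_bits ^ w_bits) & ((1 << n) - 1)).count("1")  # XNOR then popcount
    return 2 * matches - n

# 4-element example: a = [+1, -1, +1, +1] -> 0b1011, w = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))   # (+1)(+1) + (-1)(+1) + (+1)(-1) + (+1)(+1) = 0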

Journal ArticleDOI
TL;DR: This work aims to demonstrate the viability of RRAM in the design of ternary logic systems and shows a very small variation in power consumption and energy consumption with variation in process parameters, temperature, output load, supply voltage and operating frequency.
Abstract: In this paper, the design of ternary logic gates (standard ternary inverter, ternary NAND, ternary NOR) based on carbon nanotube field-effect transistors (CNTFETs) and resistive random access memory (RRAM) is proposed. Ternary logic has emerged as a very promising alternative to existing binary logic systems owing to its energy efficiency, operating speed, information density and reduced circuit overheads such as interconnects and chip area. The proposed design employs active-load RRAM and CNTFETs instead of large resistors to implement ternary logic gates. The proposed ternary logic gates are then utilised to carry out basic arithmetic functions and are extendable to implement additional complex functions. The proposed ternary gates show significant advantages in terms of component count, chip area, power consumption, energy consumption and dense fabrication. The results demonstrate the advantage of the proposed models with a 50% reduction in transistor count for the STI, TNAND and TNOR logic gates. For the THA and THS arithmetic modules, a 65.11% reduction in transistor count is observed, while for the TM design a reduction of around 38% is observed. In this work, we aim to demonstrate the viability of RRAM in the design of ternary logic systems; thus the focus is mainly on obtaining the proper functionality of the proposed design. The proposed logic gates also show a very small variation in power consumption and energy consumption with variations in process parameters, temperature, output load, supply voltage and operating frequency. For simulations, the HSPICE tool is used to verify the proposed designs. The ternary half adder, ternary half subtractor and ternary multiplier circuits are then implemented utilising the proposed gates and validated through simulations.

Journal ArticleDOI
TL;DR: The approach shows that the lower-part-OR and error-tolerant adder I approximate adders, as well as truncation to zero, deliver better compression-power trade-offs, with substantial differences from the static analysis.
Abstract: A cross-layer design space exploration (DSE) method based on a proposed co-simulation technique is presented herein. The proposed method is demonstrated by evaluating the impacts on both coding efficiency and power dissipation of applying distinct approximate logic operators in a sum of absolute differences (SAD) kernel that accelerates an H.265/HEVC (high-efficiency video coding) encoder. The proposed method simulates the gate-level circuit dynamically inside the application, with realistic results of the impact of the adder-tree approximate logic implementation on both quality and encoder bit-rate results. A comprehensive DSE is shown herein, with 13 types from 6 classes of approximate adders in the SAD accelerator hardware blocks. Over 3,000 logic variants of gate-level approximations were developed. Actual video sequences are co-simulated as inputs to the x265 software encoder to dynamically capture the video motion-estimation (ME) behavior in the presence of logic approximations. While prior art only estimates the impact of approximate logic on power, area, and quality for static designs with statistical assumptions that are agnostic to the actual data-dependent behavior of the algorithm in the application, our method accurately explores the trade-off between power dissipation and coding efficiency dynamically over the entire HEVC encoding. Our approach shows that the lower-part-OR and error-tolerant adder I approximate adders, as well as truncation to zero, deliver better compression-power trade-offs, with substantial differences from the static analysis.
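For concreteness, the sketch below shows the SAD kernel being approximated in such studies, once with exact additions and once with a lower-part-OR style approximate add (block values and the choice of k are illustrative, not taken from the paper).

def sad_exact(block_a, block_b):
    return sum(abs(pa - pb) for pa, pb in zip(block_a, block_b))

def sad_lower_or(block_a, block_b, k=2):
    # accumulate with an approximate add whose k LSBs are OR-ed instead of added
    def approx_add(x, y):
        mask = (1 << k) - 1
        return (((x >> k) + (y >> k)) << k) | ((x | y) & mask)
    acc = 0
    for pa, pb in zip(block_a, block_b):
        acc = approx_add(acc, abs(pa - pb))
    return acc

a = [12, 200, 34, 90]
b = [10, 198, 40, 95]
print(sad_exact(a, b), sad_lower_or(a, b))   # the approximate accumulation under-estimates the exact SAD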

Journal ArticleDOI
TL;DR: This work focuses on a novel design of an adder/subtractor-based incrementer/decrementer using quantum-dot cellular automata (QCA) technology, which shows an improvement in area usage and latency compared to its existing counterpart.

Journal ArticleDOI
TL;DR: Results show that the improved Booth multiplier-based (radix-4) FIR filter leads to the smallest power and area, and that the proposed multiplier architecture helps to minimize the number of steps in multiplication and also decreases the propagation delay in digital circuits.
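Radix-4 (modified) Booth recoding, the idea behind the multiplier mentioned above, rewrites the multiplier operand into digits from {-2, -1, 0, 1, 2}, roughly halving the number of partial products; the sketch below is a behavioral illustration for non-negative operands.

def booth_radix4_digits(m, width):
    # scan overlapping 3-bit groups (b_{2i+1}, b_{2i}, b_{2i-1}), with b_{-1} = 0
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    digits, prev = [], 0
    for i in range(0, width, 2):
        group = (((m >> i) & 0b11) << 1) | prev
        prev = (m >> (i + 1)) & 1
        digits.append(table[group])
    return digits

def booth_multiply(x, m, width=8):
    # one shifted (and possibly negated) partial product per recoded digit
    return sum(d * x * (4 ** i) for i, d in enumerate(booth_radix4_digits(m, width)))

assert booth_multiply(13, 27) == 13 * 27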

Journal ArticleDOI
07 Jul 2020
TL;DR: A new processing element design is provided as an alternative solution for the hardware implementation of CNN accelerators; it reduces hardware costs by 24.5% and achieves a power efficiency of 61.64 GOP/s/W, outperforming previous designs.
Abstract: Convolutional Neural Networks (CNNs) have attained high accuracy and have been widely employed in image recognition tasks. In recent times, modern deep learning-based applications are evolving, which poses a challenge for the research and development of hardware implementations. Therefore, hardware optimization for efficient CNN accelerator design remains a challenging task. A key component of the accelerator design is the processing element (PE) that implements the convolution operation. To reduce the amount of hardware resources and power consumption, this article provides a new processing element design as an alternative solution for hardware implementation. A modified Booth encoding (MBE) multiplier and Wallace-tree-based adders are proposed to replace bulky MAC units and the typical adder tree, respectively. The proposed CNN accelerator design is tested on a Zynq-706 FPGA board and achieves a throughput of 87.03 GOP/s for the Tiny-YOLO-v2 architecture. The proposed design reduces hardware costs by 24.5% while achieving a power efficiency of 61.64 GOP/s/W, outperforming previous designs.