
Showing papers by "Massoud Pedram published in 2021"


Proceedings Article•DOI•
01 Jan 2021
TL;DR: In this article, a two-step SNN compression technique is proposed to reduce the spiking activity of deep spiking neural networks (SNNs) while maintaining accuracy.
Abstract: The increasing demand for on-chip edge intelligence has motivated the exploration of algorithmic techniques and specialized hardware to reduce the computation energy of current machine learning models. In particular, deep spiking neural networks (SNNs) have gained interest because their event-driven hardware implementations can consume very low energy. However, minimizing average spiking activity, and thus energy consumption, while preserving accuracy in deep SNNs remains a significant challenge and opportunity. This paper proposes a novel two-step SNN compression technique that reduces spiking activity while maintaining accuracy: specifically-designed artificial neural networks (ANNs) are first compressed and then converted into the target SNNs. Our approach uses an ultra-high ANN compression technique that is guided by the attention-maps of an uncompressed meta-model. We then evaluate the firing threshold of each ANN layer and, starting from the trained ANN weights, perform a sparse-learning-based supervised SNN training that minimizes the number of time steps required while retaining compression. To evaluate the merits of the proposed approach, we performed experiments with variants of VGG and ResNet on both CIFAR-10 and CIFAR-100, and with VGG16 on Tiny-ImageNet. SNN models generated through the proposed technique yield state-of-the-art compression ratios of up to 33.4× with no significant drop in accuracy compared to baseline unpruned counterparts. Compared to existing SNN pruning methods, we achieve up to 8.3× better compression with no drop in accuracy. Moreover, compressed SNN models generated by our methods can have up to 12.2× better compute energy-efficiency compared to ANNs that have a similar number of parameters.

65 citations
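
To make the conversion step concrete, here is a minimal numpy sketch of a common threshold-balancing heuristic used when converting a trained ANN into an SNN: each layer's firing threshold is set to the largest pre-activation observed on a calibration batch. This illustrates the general idea only; the paper's pipeline additionally fine-tunes thresholds, weights, and time steps via sparse learning.

```python
import numpy as np

def balance_thresholds(layer_weights, calibration_batch):
    """Set each layer's firing threshold to the maximum pre-activation
    seen on a calibration batch (a common ANN-to-SNN conversion heuristic;
    the paper further optimizes thresholds during sparse SNN training)."""
    thresholds = []
    x = calibration_batch                      # shape: (batch, features)
    for W in layer_weights:                    # each W: (out, in)
        pre_act = x @ W.T                      # pre-activations of this layer
        thresholds.append(float(pre_act.max()))
        x = np.maximum(pre_act, 0.0)           # ReLU output feeds the next layer
    return thresholds
```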


Proceedings Article•DOI•
18 Jan 2021
TL;DR: In this paper, a dynamic network rewiring (DNR) method is proposed to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images.
Abstract: This paper presents a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images. In particular, the proposed DNR method is based on a unified constrained optimization formulation using a hybrid loss function that merges ultra-high model compression with robust adversarial training. This training strategy dynamically adjusts inter-layer connectivity based on per-layer normalized momentum computed from the hybrid loss function. In contrast to existing robust pruning frameworks that require multiple training iterations, the proposed learning strategy achieves an overall target pruning ratio with only a single training iteration and can be tuned to support both irregular and structured channel pruning. To evaluate the merits of DNR, experiments were performed with two widely accepted models, namely VGG16 and ResNet-18, on CIFAR-10 and CIFAR-100, as well as with VGG16 on Tiny-ImageNet. Compared to the baseline uncompressed models, DNR provides over 20× compression on all the datasets with no significant drop in either clean or adversarial classification accuracy. Moreover, our experiments show that DNR consistently finds compressed models with better clean and adversarial image classification performance than what is achievable through state-of-the-art alternatives. Our models and test codes are available at https://github.com/ksouvik52/DNR_ASP_DAC2021.

20 citations
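
The rewiring step can be pictured as a per-layer budget allocation followed by magnitude-based reconnection. The sketch below is an illustrative approximation under that reading, not the paper's exact formulation; the function names and the mean-magnitude importance measure are our assumptions.

```python
import numpy as np

def allocate_pruning_budget(momenta, total_nonzeros):
    """Distribute a global non-zero-weight budget across layers in
    proportion to each layer's normalized momentum magnitude
    (a sketch of DNR-style dynamic rewiring, not the exact method)."""
    norms = np.array([np.abs(m).mean() for m in momenta])
    shares = norms / norms.sum()              # normalized per-layer importance
    return [int(round(s * total_nonzeros)) for s in shares]

def rewire_layer(W, keep):
    """Keep approximately the `keep` largest-magnitude weights of W
    (irregular pruning); everything else is zeroed out."""
    flat = np.abs(W).ravel()
    if keep >= flat.size:
        return W
    thresh = np.partition(flat, -keep)[-keep]
    return np.where(np.abs(W) >= thresh, W, 0.0)
```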


Journal Article•DOI•
TL;DR: In this paper, a 1-b full adder with the sum and carry implemented as two individual single-stage SFQ gates is proposed, and both the sum and carry cells are demonstrated with their schematics and layouts.
Abstract: Single-flux-quantum (SFQ) circuits operate at higher frequencies with much lower power consumption when compared to their CMOS counterparts. Synchronous SFQ circuits require full path balancing by inserting D flip-flops as needed, an operation that tends to greatly increase the number of gates in a circuit. To control this increase in the total number of gates, complex multi-input SFQ logic gates can be built and used to synthesize the circuits. In this work, a 1-b full adder is built with the sum and carry as two individual single-stage SFQ gates. Both the sum and the carry cells are demonstrated with their schematics and layouts. Post-layout simulation, performed with JSIM using circuit parameters extracted by InductEx, demonstrates the correct functionality of the proposed full adder design. An 8-b signed multiplier using this newly designed single-stage full adder is then implemented to illustrate the advantages of the new design. The structure and timing strategy of the multiplier as well as the simulation results are shown. Finally, circuits of different sizes are synthesized with the single-stage full adder circuit, and the results are discussed.

11 citations


Proceedings Article•DOI•
01 May 2021
TL;DR: In this article, the authors discuss field-programmable gate array (FPGA)-based DNN accelerators for ultra-low-latency realization of deep neural networks in applications with stringent, sub-microsecond latency requirements.
Abstract: While there is a large body of research on efficient processing of deep neural networks (DNNs) [1]–[31], ultra-low-latency realization of these models for applications with stringent, sub-microsecond latency requirements continues to be an unresolved, challenging problem. Field-programmable gate array (FPGA)-based DNN accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms considering their performance, flexibility, and energy efficiency. NullaNet (2018) [32], LUTNet (2019) [33], and LogicNets (2020) [34] are among the accelerators specifically designed to benefit from FPGAs’ capabilities.

8 citations


Journal Article•DOI•
TL;DR: A new TEI-inspired SoC platform (TIP) architecture for ultralow-power SoCs for Internet-of-Things (IoT) end nodes is proposed, along with a new electronic design automation tool, RISC-V express (RVX), to accelerate ULP SoC development.
Abstract: Ranging from circuit-level characterization to designing a platform architecture, developing a design automation tool, and fabricating a System on Chip (SoC), this article deals with the entire development process for ultralow-power (ULP) SoCs for Internet-of-Things (IoT) end nodes. More precisely, this article first focuses on a unique characteristic of ULP circuits, temperature effect inversion (TEI), i.e., the delay of ULP circuits decreases with increasing temperature. Existing TEI-aware low-power (TEI-LP) techniques have considerable potential to further reduce the power consumption of conventional ULP SoCs, but a critical limitation has kept them from being widely adopted in real SoCs. To address this limitation and realize ULP SoCs that can fully benefit from TEI-LP techniques, this article proposes a new TEI-inspired SoC platform (TIP) architecture. On top of that, taking into account that the highly complex, time-consuming, and labor-intensive development process of these ULP SoCs may hinder their widespread use for IoT end nodes, this article presents a new electronic design automation tool to accelerate ULP SoC development, RISC-V express (RVX). Finally, using RVX, this article introduces a TIP prototype chip fabricated in 28-nm FD-SOI technology. This chip demonstrates that power savings of up to 35% can be achieved by lowering the supply voltage from 0.54 V to 0.48 V at 25 °C and to 0.44 V at 80 °C while continuing to operate at a target 50-MHz clock frequency.

8 citations
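
As a rough illustration of how a TEI-aware controller might exploit this behavior, the sketch below picks the minimum supply voltage for the 50-MHz target from the two measured (temperature, Vdd) points reported above. The linear interpolation between those points is our assumption, not a detail from the paper.

```python
def tei_aware_vdd(temp_c, vdd_points=((25, 0.48), (80, 0.44))):
    """Pick the lowest supply voltage that still meets the 50-MHz target at
    the current temperature. The two (temperature, min-Vdd) points come from
    the paper's measurements (0.54 V is the temperature-agnostic baseline);
    interpolating linearly between them is our assumption."""
    pts = sorted(vdd_points)
    if temp_c <= pts[0][0]:
        return pts[0][1]
    for (t0, v0), (t1, v1) in zip(pts, pts[1:]):
        if temp_c <= t1:
            # TEI: delay drops as temperature rises, so the minimum Vdd falls
            return v0 + (v1 - v0) * (temp_c - t0) / (t1 - t0)
    return pts[-1][1]

print(tei_aware_vdd(25.0))   # 0.48
print(tei_aware_vdd(80.0))   # 0.44
```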


Journal Article•DOI•
TL;DR: In this paper, the sensitivity of the neural network (NN) outputs to device parameter uncertainties (non-idealities) in inverter-based memristor (IM) crossbar neuromorphic circuits is mathematically modeled and verified using exhaustive circuit- and system-level simulations.
Abstract: In this paper, the sensitivity of the neural network (NN) outputs to device parameter uncertainties (non-idealities) in inverter-based memristor (IM) crossbar neuromorphic circuits is mathematically modeled and verified using exhaustive circuit- and system-level simulations. The NN sensitivity is obtained by modeling the sensitivity of the IM neuron output to the non-idealities of its circuit elements. The analysis reveals a higher sensitivity of the output voltage of the IM neuron to the non-idealities of the inverters compared to the conductance variation of the memristors. Among the inverter non-idealities, a horizontal shift of the inverters' voltage transfer characteristic (VTC) shows the highest impact on the output voltage of the neuron. To reduce the accuracy loss due to these variations, a training approach is suggested that includes a sensitivity term in the cost function of the training phase. The achievable improvements through the said NN training approach are evaluated using the California Housing, MNIST, and Fashion MNIST datasets. The results show up to 50% reduction in the NN output variations in the presence of circuit elements' non-idealities.

6 citations
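
The idea of penalizing sensitivity during training can be sketched with a generic finite-difference estimate standing in for the paper's analytic sensitivity model; the function below and its use in the loss are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def output_sensitivity(forward, params, x, eps=1e-3, seed=0):
    """Finite-difference estimate of how strongly the network output moves
    under small random parameter perturbations; a numerical stand-in for the
    paper's analytic sensitivity model of IM-neuron non-idealities.
    `forward` is any callable mapping (params, x) to outputs."""
    rng = np.random.default_rng(seed)
    base = forward(params, x)
    nudged = [p + eps * rng.standard_normal(p.shape) for p in params]
    return float(np.abs(forward(nudged, x) - base).mean() / eps)

# A sensitivity-aware cost would then be minimized as:
#   loss = task_loss + lam * output_sensitivity(forward, params, x)
```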


Journal Article•DOI•
TL;DR: Therminator 2 is presented, an early-stage, fast, full-device thermal analyzer that generates accurate transient- and steady-state temperature maps of an entire smartphone, from the application processor and other key device components to the skin of the device itself.
Abstract: Maintaining safe chip and device skin temperatures in small form-factor mobile devices (such as smartphones and tablets) while continuing to add new functionalities and provide higher performance has emerged as a key challenge. This article presents Therminator 2, an early-stage, fast, full-device thermal analyzer, which generates accurate transient- and steady-state temperature maps of an entire smartphone, starting from the application processor and other key device components and extending to the skin of the device itself. Therminator 2 uses advanced numerical optimization techniques to perform steady-state simulations 1.6 times faster than the prior-art technique and is capable of performing transient-state simulations in real time and 1.25 times faster than the prior-art method. The thermal analysis is sensitive to detailed device specifications (including material composition and 3-D layout) as well as different use cases (each case specifying the set of active device components and their activity levels). Therminator 2 considers all major components within the device, builds a corresponding compact thermal model for each component and the whole device, and produces their transient- and steady-state temperature maps. Temperature results obtained by using Therminator 2 have been validated against a commercial computational fluid dynamics (CFD)-based tool, namely Autodesk Simulation CFD, as well as thermocouple measurements on a Qualcomm Mobile Developer Platform and a Google Nexus 5. A case study on a Samsung Galaxy S4 using Therminator 2 is provided to relate device performance to skin temperature and investigate the thermal path design.

6 citations
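
The compact thermal models that Therminator 2 builds are, at their core, thermal RC networks. The following minimal sketch shows the underlying formulation with a plain explicit-Euler integrator; Therminator 2's actual solvers are the advanced numerical techniques mentioned above, so this is only an illustration of the compact-model idea.

```python
import numpy as np

def thermal_transient(G, C, P, dt, steps):
    """Explicit-Euler transient solve of a compact thermal RC network:
    C * dT/dt = P - G @ T, where T is the per-node temperature rise above
    ambient, G the thermal conductance matrix, C the per-node heat
    capacities, and P the power injected at each node. The steady-state
    map satisfies G @ T = P."""
    T = np.zeros(len(C))
    history = [T.copy()]
    for _ in range(steps):
        T = T + dt * (P - G @ T) / C
        history.append(T.copy())
    return np.array(history)
```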


Journal Article•DOI•
25 Feb 2021
TL;DR: Coarse2Fine, as discussed by the authors, learns an inverse mapping from the attended feature maps to the informative regions in the raw image, which guides the attention maps to better attend to fine-grained features.
Abstract: Small inter-class and large intra-class variations are the key challenges in fine-grained visual classification. Objects from different classes share visually similar structures, and objects in the same class can have different poses and viewpoints. Therefore, the proper extraction of discriminative local features (e.g., a bird's beak or a car's headlight) is crucial. Most of the recent successes on this problem are based upon attention models, which can localize and attend to the discriminative local object parts. In this work, we propose a training method for visual attention networks, Coarse2Fine, which creates a differentiable path from the attended feature maps to the input space. Coarse2Fine learns an inverse mapping function from the attended feature maps to the informative regions in the raw image, which guides the attention maps to better attend to the fine-grained features. In addition, we propose an initialization method for the attention weights. Our experiments show that Coarse2Fine reduces the classification error by up to 5.1% on common fine-grained datasets.

5 citations


Journal Article•DOI•
TL;DR: Simulation results reveal that LATIM predicts the output voltage of IM-NNs with considerably smaller error than prior methods, and that IM-NNs trained by LATIM consume, on average, 62% and 53% lower energy compared to the PHAX and RIM methods, respectively, owing to proper sizing of the inverters.
Abstract: In this brief, we present a high-accuracy training method for inverter-based memristive neural networks (IM-NNs). The method, which relies on accurate modeling of the circuit element characteristics, is called LATIM (Loading-Aware offline Training method for Inverter-based Memristive NNs). In LATIM, an approximation method is proposed to estimate the effective load of the memristive crossbar (as the synapses), while two NNs are utilized to predict the voltage transfer characteristic (VTC) of the inverters (as the activation functions). The efficacy of the proposed method is compared with recent offline training methods for IM-NNs called PHAX and RIM. Simulation results reveal that LATIM predicts the output voltage of IM-NNs with, on average, 14× (6×) and 29× (4×) smaller error for the MNIST and Fashion MNIST datasets, respectively, compared to the PHAX (RIM) method. In addition, IM-NNs trained by LATIM consume, on average, 62% and 53% lower energy compared to those trained by PHAX and RIM, respectively, due to proper sizing of the inverters.

4 citations


Proceedings Article•DOI•
22 Jun 2021
TL;DR: In this article, a verification framework called qMC, a model checker for single flux quantum (SFQ) circuits using formal techniques is proposed, based on well established open source back-end verification engines for MC of CMOS circuits, including Yosys-SMTBMC and EBMC, and qMC provides an automated process that constructs a SystemVerilog testbench consisting of formal assertions to verify the SFQ-specific properties of the circuits and produce system correctness results and counterexamples.
Abstract: Single flux quantum (SFQ) circuits, as an example of superconducting electronics (SCE), have the potential to replace CMOS circuits, as they possess a theoretical potential of three orders of magnitude reduction in power accompanied by one order of magnitude higher speed. Despite these benefits, the SCE community lacks a reliable open-source formal verification solution. This paper proposes a verification framework called qMC, a model checker for SFQ circuits using formal techniques. qMC offers an automated process that constructs a SystemVerilog testbench consisting of formal assertions to verify the SFQ-specific properties of the circuits and produce system correctness results and counterexamples using model checking (MC). Instead of creating an MC tool from scratch, we have built qMC on top of well-established open-source back-end verification engines for MC of CMOS circuits, including Yosys-SMTBMC and EBMC. qMC allows for properties to be given as SystemVerilog formal assertions, time-limited SystemVerilog assertions, or linear temporal logic (LTL). qMC provides an improvement in verification time and coverage when compared to state-of-the-art semi-formal SFQ verification frameworks. For instance, verification time for a 4-bit array multiplier is sped up by 19.5×.

4 citations


Journal Article•DOI•
TL;DR: A low-energy inference method for convolutional neural networks in image classification applications that makes use of two pruned neural networks, namely mildly and aggressively pruned networks, which are both designed offline.
Abstract: In this article, we present a low-energy inference method for convolutional neural networks in image classification applications. The lower energy consumption is achieved by using a highly pruned (...

Journal Article•DOI•
TL;DR: In this paper, a method for offline training of inverter-based memristive neural networks (IM-NNs), called ERIM, is presented, in which the output voltage of the inverter is modeled very accurately by considering the loading effect of the memristive crossbar.
Abstract: In this paper, a method for offline training of inverter-based memristive neural networks (IM-NNs), called ERIM, is presented. In this method, the output voltage of the inverter is modeled very accurately by considering the loading effect of the memristive crossbar. To properly choose the size of each inverter, its output load and the required slope of its voltage transfer characteristic (VTC) for an acceptable level of resiliency to the circuit element non-idealities are taken into account. The efficacy of ERIM is investigated by comparing its accuracy to those of two recently proposed offline training methods for IM-NNs (RIM and PHAX). The study is performed using IRIS, BCW, MNIST, and Fashion MNIST datasets. Simulation results show that 72% (56%) reduction in average energy consumption of the trained networks is achieved compared to RIM (PHAX) thanks to proper sizing of the inverters. In addition, due to the higher accuracy of the NN mathematical model, ERIM results in significant improvements in the match between the results of high-level modeling and HSPICE simulations while exhibiting lower sensitivity to circuit element variations.

Proceedings Article•DOI•
05 Dec 2021
TL;DR: In this paper, a precise definition of the level of a node in a cyclic digraph and a polynomial-time algorithm for the corresponding level assignment and full path balancing in sequential Single Flux Quantum (SFQ) circuits, including SFQ Finite State Machines (FSMs), are presented.
Abstract: Synthesizing general nonlinear sequential circuits in superconducting Single Flux Quantum (SFQ) technology is a challenging task involving the proper leveling of cyclic digraphs, handling nested feedback loops, and ensuring the full path balancing property throughout the synthesis process. This paper presents a precise definition of the level of a node in a cyclic digraph and a polynomial-time algorithm for the corresponding level assignment and full path balancing in sequential SFQ circuits, including SFQ Finite State Machines (FSMs). A case study is conducted on a 3-bit counter, as an FSM, which has a power consumption of 44.7 μW and 1.4 μW using rapid SFQ (RSFQ) and energy-efficient RSFQ (ERSFQ) cells, respectively, with a local clock frequency of 55 GHz (throughput of 11 GHz), which is significantly higher than typical CMOS clock frequencies. More results on larger SFQ circuits are also presented.
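
For the acyclic special case, level assignment and DFF counting for full path balancing reduce to a longest-path computation, as the sketch below shows. Handling cyclic digraphs and nested feedback loops, the paper's actual contribution, requires the more careful level definition it introduces; the netlist and names here are hypothetical.

```python
def assign_levels(fanins, outputs):
    """Longest-path level assignment for an acyclic SFQ netlist:
    level(v) = 1 + max(level(u) for u in fanins[v]); primary inputs sit
    at level 0. (The paper generalizes this notion to cyclic digraphs.)"""
    levels = {}

    def level(v):
        if v not in levels:
            preds = fanins.get(v, [])
            levels[v] = 0 if not preds else 1 + max(level(u) for u in preds)
        return levels[v]

    for o in outputs:
        level(o)
    return levels

def dff_count(fanins, levels):
    """D flip-flops needed for full path balancing: an edge u -> v whose
    endpoints differ by more than one level absorbs level(v)-level(u)-1 DFFs."""
    return sum(levels[v] - levels[u] - 1
               for v, preds in fanins.items() for u in preds)

# Toy netlist: two gate levels plus an unbalanced pass-through edge a -> out.
fanins = {"g1": ["a", "b"], "g2": ["g1", "cin"], "out": ["g2", "a"]}
lv = assign_levels(fanins, ["out"])  # a,b,cin: 0; g1: 1; g2: 2; out: 3
print(dff_count(fanins, lv))         # 3 (1 on cin->g2, 2 on a->out)
```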

Journal Article•DOI•
TL;DR: In this paper, a low-area-overhead design-for-testability (DFT) circuit is proposed to detect resistive short defects in STT-MRAM arrays, based on monitoring the mismatch between the currents flowing into and out of the cell caused by a weak or strong short defect.
Abstract: This work presents an efficient test technique for detecting resistive short defects in STT-MRAM arrays. The proposed technique is based on monitoring the mismatch between the currents flowing into and out of the cell caused by a weak or strong short defect. This technique is used to propose a low-area-overhead Design-for-Testability (DFT) circuit to employ in STT-MRAM arrays to distinguish defect-free cells from faulty ones. The operation of the proposed test approach is resilient to the parameter uncertainties of the array circuit induced by process variations; the variations, however, may lower defect detection ranges. The efficacy of the proposed DFT technique under both the nominal and the process-variation cases is studied. Simulation results indicate that the proposed DFT circuit reduces the number of test escapes and improves the fault coverage by a factor of at least 10× (5×) for short defects to ground (short defects to V_DD) compared to the corresponding maximum ones detected by conventional test schemes. The technique works through a single read operation with a negligible area overhead, especially in large arrays.

Proceedings Article•DOI•
26 Jul 2021
TL;DR: In this paper, a fast architecture for Barrett modular multiplication is presented, which replaces the integer multiplications in each iteration with carry-save compressions and uses Booth coding plus operation rescheduling to increase parallelism.
Abstract: This paper presents a fast architecture for Barrett modular multiplication. By replacing the integer multiplications in each iteration with carry-save compressions and using Booth coding plus operation rescheduling to increase parallelism, we eliminate costly multiplications while concurrently avoiding large-bitwidth additions. Our detailed error analysis proves that intermediate results are always less than twice the modulus. Experimental results show that the removal of multiplication eliminates the need for any DSPs. Even without accounting for this key benefit, compared to the best prior-art results, the proposed design achieves a 46.8% latency reduction with a similar area.
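
For reference, classic Barrett reduction, the algorithm the architecture accelerates, looks as follows in plain Python. The paper's contribution is replacing the two large multiplications with carry-save compression and Booth coding in hardware, which this software sketch does not model.

```python
def barrett_reduce(x, m, k=None):
    """Barrett reduction of x modulo m (for 0 <= x < m**2), using only
    multiplications, shifts, and a couple of corrective subtractions."""
    if k is None:
        k = m.bit_length()
    mu = (1 << (2 * k)) // m               # precomputed reciprocal estimate
    q = ((x >> (k - 1)) * mu) >> (k + 1)   # quotient estimate, off by <= 2
    r = x - q * m
    while r >= m:                          # at most two corrections needed
        r -= m
    return r

assert barrett_reduce(123456789, 10007) == 123456789 % 10007
```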

Posted Content•
TL;DR: In this paper, a hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of LSTM neural network accelerators is presented, and an FPGA-based platform for efficient execution of the pruned networks based on the proposed algorithm is introduced.
Abstract: In this paper, first, a hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented. Next, an FPGA-based platform for efficient execution of the pruned networks based on the proposed algorithm is introduced. By considering the different sensitivities of the two weight matrices of the LSTM models to pruning, different sparsity ratios (i.e., dual-ratio sparsity) are applied to these weight matrices. To reduce memory accesses, a row-wise sparsity pattern is adopted. The proposed hardware architecture makes use of computation overlapping and pipelining to achieve low power and high speed. The effectiveness of the proposed pruning algorithm and accelerator is assessed on benchmarks for natural language processing, binary sentiment classification, and speech recognition. Results show that, compared to a recently published work in this field, the proposed accelerator provides up to 272% higher effective GOPS/W, while the perplexity error is reduced by up to 1.4% for the PTB dataset.
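
A minimal sketch of the two pruning ingredients, the row-wise pattern and the dual sparsity ratios, might look as follows; the specific ratios and matrix shapes below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rowwise_prune(W, sparsity):
    """Zero out the smallest-magnitude weights of each row, keeping the same
    number of non-zeros per row (a row-wise pattern keeps memory accesses
    regular for the accelerator)."""
    keep = max(1, int(round(W.shape[1] * (1.0 - sparsity))))
    out = np.zeros_like(W)
    for i, row in enumerate(W):
        idx = np.argsort(np.abs(row))[-keep:]
        out[i, idx] = row[idx]
    return out

# Dual-ratio sparsity: the two LSTM weight matrices get different ratios,
# reflecting their different sensitivity to pruning (ratios hypothetical).
W_x_pruned = rowwise_prune(np.random.randn(256, 128), sparsity=0.9)
W_h_pruned = rowwise_prune(np.random.randn(256, 256), sparsity=0.7)
```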

Journal Article•DOI•
TL;DR: In this article, a multicycle input dependency circuit model, which can be used in the verification process of superconducting electronics, is proposed to explicitly capture the dependency of the primary outputs of the circuit on sequences of internal signals and inputs.
Abstract: Traditional logical equivalence checking (LEC), which plays a major role in the entire chip design process, faces challenges in meeting the requirements demanded by the many emerging technologies that are based on logic models different from standard complementary metal-oxide-semiconductor (CMOS). In this article, we propose an LEC framework to be employed in the verification process of superconducting electronics (SCE). Our LEC framework is compatible with the existing CMOS technologies and can also check features and capabilities that are unique to SCE. For instance, the performance of non-resistively biased single-flux-quantum (SFQ) circuits benefits from ultra-deep pipelining, and verification of such circuits requires new models and algorithms. We therefore present the multicycle input dependency circuit model, a novel design representation that explicitly captures the dependency of the primary outputs of the circuit on sequences of internal signals and inputs. Embedding the proposed circuit model and several structural checking modules, the verification process can be made independent of the underlying technology and signaling. We benchmark the proposed framework on post-synthesis SFQ and 4-phase adiabatic quantum-flux-parametron netlists. Results show verification times for SFQ benchmark circuits, including a 16-bit integer divider and the ISCAS’85 circuits, that are comparable to those of the ABC tool for similar CMOS circuits.

Posted Content•
TL;DR: In this article, a non-iterative deep spiking neural network (SNN) training technique is proposed to achieve ultra-high compression with reduced spiking activity while maintaining high inference accuracy.
Abstract: Deep spiking neural networks (SNNs) have emerged as a potential alternative to traditional deep learning frameworks, due to their promise to provide increased compute efficiency on event-driven neuromorphic hardware. However, to perform well on complex vision applications, most SNN training frameworks yield large inference latency, which translates to increased spike activity and reduced energy efficiency. Hence, minimizing average spike activity while preserving accuracy in deep SNNs remains a significant challenge and opportunity. This paper presents a non-iterative SNN training technique that achieves ultra-high compression with reduced spiking activity while maintaining high inference accuracy. In particular, our framework first uses the attention-maps of an uncompressed meta-model to yield compressed ANNs. This step can be tuned to support both irregular and structured channel pruning to leverage computational benefits over a broad range of platforms. The framework then performs sparse-learning-based supervised SNN training using direct inputs. During the training, it jointly optimizes the SNN weight, threshold, and leak parameters to drastically minimize the number of time steps required while retaining compression. To evaluate the merits of our approach, we performed experiments with variants of VGG and ResNet, on both CIFAR-10 and CIFAR-100, and VGG16 on Tiny-ImageNet. The SNN models generated through the proposed technique yield SOTA compression ratios of up to 33.4× with no significant drops in accuracy compared to baseline unpruned counterparts. Compared to existing SNN pruning methods, we achieve up to 8.3× higher compression with improved accuracy.

Journal Article•DOI•
TL;DR: In this paper, an online management of cache approximation level (OPTIMA) is proposed to adjust the approximation levels of the cache memories in the memory hierarchy of an approximate processing system.
Abstract: In this article, we present an approach for adjusting the approximation levels of the cache memories in the memory hierarchy of an approximate processing system. The technique, which is called online management of cache approximation level (OPTIMA), adjusts the approximation levels of the caches under a predefined accuracy constraint. OPTIMA may also be employed for multicore processors, which comprise cores with private and shared caches running applications with different error constraints. To reduce the energy consumption, OPTIMA determines the proper approximation level of each cache memory using heuristic algorithms in two main steps. In the first step, the approximation levels are adjusted to maximize the power efficiency by dropping the application accuracy to a level that still meets a desirable minimum output quality. In the second step, output accuracy variations due to input pattern changes are compensated by fine-tuning. We suggest two algorithms (with different adjustment speeds of the approximation levels) for the first step and another algorithm for the second step. To assess the efficacy of OPTIMA, we integrate it into the gem5 simulator and simulate several multiprocessor configurations running eight approximate benchmarks. The results show that the proposed approach provides up to 44% power consumption reduction in the memory hierarchy.
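
The first-step heuristic can be pictured as a greedy search over per-cache approximation levels under the quality constraint. The sketch below is our reading of that idea with hypothetical interfaces, not the paper's exact algorithms.

```python
def coarse_adjust(caches, quality, min_quality, max_level):
    """Sketch of an OPTIMA-style first step: greedily raise the approximation
    level of one cache at a time (more approximation = less energy) as long
    as measured output quality stays above the constraint. `quality(levels)`
    is assumed to be an online quality monitor."""
    levels = {c: 0 for c in caches}
    changed = True
    while changed:
        changed = False
        for c in caches:
            if levels[c] < max_level:
                levels[c] += 1                 # try a more aggressive level
                if quality(levels) >= min_quality:
                    changed = True             # keep it: saves more energy
                else:
                    levels[c] -= 1             # constraint violated: revert
    return levels

# The second step would periodically re-run a finer version of this loop to
# compensate for quality drift as the input patterns change.
```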

Proceedings Article•DOI•
01 Feb 2021
TL;DR: This work introduces ESPRESSO-GPU, a parallel version of ESPRESSO-II, which takes advantage of the computing capabilities of general-purpose graphics processors to achieve a huge speedup compared to existing serial implementations.
Abstract: Two-level logic minimization has found applications in new problems such as efficient realization of deep neural network inference. Important characteristics of these new applications are that they tend to produce very large Boolean functions (in terms of the supporting variables and/or initial sum-of-products representation) and have don't-care sets that are much larger in size than the on-set and off-set sizes. Applying conventional single-threaded logic minimization heuristics to these problems becomes unwieldy. This work introduces ESPRESSO-GPU, a parallel version of ESPRESSO-II, which takes advantage of the computing capabilities of general-purpose graphics processors to achieve a huge speedup compared to existing serial implementations. Simulation results show that ESPRESSO-GPU achieves an average speedup of 97× compared to ESPRESSO-II.
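
A flavor of why this workload parallelizes well: ESPRESSO-style cube operations are uniform bitwise tests applied to many cubes at once. The numpy sketch below shows single-cube containment checking in the positional-cube encoding; numpy stands in here for the CUDA kernels an actual GPU port would use, and the helper name is our own.

```python
import numpy as np

# Positional-cube encoding, 2 bits per variable per cube:
# 01 = complemented literal, 10 = positive literal, 11 = don't care.
# Cube c covers cube b iff b's bits are a subset of c's at every position,
# i.e., (c & b) == b, which is a purely bitwise, data-parallel test.
def covered_by(cubes, c):
    """Boolean mask of which rows of `cubes` are single-cube-contained in c."""
    return np.all((cubes & c) == cubes, axis=1)

cubes = np.array([[0b10, 0b01],      # cube x * ~y
                  [0b10, 0b11]],     # cube x
                 dtype=np.uint8)
mask = covered_by(cubes, np.array([0b10, 0b11], dtype=np.uint8))  # cube x
print(mask)  # [ True  True ]: both cubes lie inside cube "x"
```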

Posted Content•
TL;DR: In this paper, the authors present NullaNet Tiny, an across-the-stack design and optimization framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
Abstract: While there is a large body of research on efficient processing of deep neural networks (DNNs), ultra-low-latency realization of these models for applications with stringent, sub-microsecond latency requirements continues to be an unresolved, challenging problem. Field-programmable gate array (FPGA)-based DNN accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms considering their performance, flexibility, and energy efficiency. This paper presents NullaNet Tiny, an across-the-stack design and optimization framework for constructing resource- and energy-efficient, ultra-low-latency FPGA-based neural network accelerators. The key idea is to replace expensive operations required to compute various filter/neuron functions in a DNN with Boolean logic expressions that are mapped to the native look-up tables (LUTs) of the FPGA device (examples of such operations are multiply-and-accumulate and batch normalization). At about the same level of classification accuracy, compared to Xilinx's LogicNets, our design achieves 2.36× lower latency and 24.42× lower LUT utilization.
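
The core idea, turning a neuron's function into a truth table that maps onto native LUTs, can be sketched as follows for a binarized neuron with few enough inputs to enumerate. The encoding and thresholding details are simplified assumptions rather than NullaNet Tiny's exact procedure.

```python
import itertools
import numpy as np

def neuron_truth_table(weights, bias):
    """Enumerate every binary input pattern of one binarized neuron and
    record its thresholded output, yielding the truth table that would be
    mapped onto an FPGA LUT (a sketch of the NullaNet idea)."""
    n = len(weights)
    table = {}
    for bits in itertools.product((0, 1), repeat=n):
        x = np.array(bits) * 2 - 1            # map {0,1} -> {-1,+1}
        table[bits] = int(np.dot(weights, x) + bias >= 0)
    return table

lut = neuron_truth_table(weights=[1.0, -0.5, 0.75], bias=0.1)  # fits a 3-LUT
```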

Posted Content•
TL;DR: In this article, an online adaptive approach called A2P-MANN is proposed to limit the number of required attention inference hops in memory-augmented neural networks by exploiting a small neural network classifier.
Abstract: In this work, to limit the number of required attention inference hops in memory-augmented neural networks, we propose an online adaptive approach called A2P-MANN. By exploiting a small neural network classifier, an adequate number of attention inference hops for the input query is determined. The technique eliminates a large number of unnecessary computations in extracting the correct answer. In addition, to further lower the computations in A2P-MANN, we suggest pruning the weights of the final fully-connected (FC) layers. To this end, two pruning approaches, one with negligible accuracy loss and the other with controllable loss on the final accuracy, are developed. The efficacy of the technique is assessed using the twenty question-answering (QA) tasks of the bAbI dataset. The analytical assessment reveals, on average, more than 42% fewer computations compared to the baseline MANN at the cost of less than 1% accuracy loss. In addition, when used along with the previously published zero-skipping technique, a computation count reduction of up to 68% is achieved. Finally, when the proposed approach (without zero-skipping) is implemented on CPU and GPU platforms, up to 43% runtime reduction is achieved.
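
The adaptive-hop mechanism can be sketched as a simple early-exit loop around the attention hops; the callables below are hypothetical stand-ins for the MANN's hop computation and the proposed small classifier.

```python
def answer_with_adaptive_hops(hop, controller, memory, query, max_hops=3):
    """Sketch of the A2P-MANN idea: run attention inference hops over the
    memory until a small classifier predicts that further hops are
    unnecessary. `hop` and `controller` are hypothetical callables standing
    in for the MANN's attention layer and the added hop-count classifier."""
    u = query
    for _ in range(max_hops):
        u = hop(memory, u)            # one attention inference hop
        if controller(u):             # classifier says the answer is stable
            break                     # early exit: skip the remaining hops
    return u
```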

Journal Article•DOI•
TL;DR: In this article, single flux quantum (SFQ) logic is proposed as a promising technology to replace complementary metal-oxide-semiconductor logic for future exa-scale supercomputing but requires the development of reliable EDA.
Abstract: Single flux quantum (SFQ) logic is a promising technology to replace complementary metal-oxide-semiconductor logic for future exa-scale supercomputing but requires the development of reliable EDA t...