Journal ArticleDOI

STT-BSNN: An In-Memory Deep Binary Spiking Neural Network Based on STT-MRAM

TL;DR: In this article, the authors proposed an in-memory binary spiking neural network (BSNN) based on spin-transfer-torque magnetoresistive RAM (STT-MRAM).
Abstract: This paper proposes an in-memory binary spiking neural network (BSNN) based on spin-transfer-torque magnetoresistive RAM (STT-MRAM). We propose residual BSNN learning using a surrogate gradient that shortens the time steps in the BSNN while maintaining sufficient accuracy. At the circuit level, presynaptic spikes are fed to memory units through differential bit lines (BLs), while binarized weights are stored in a subarray of nonvolatile STT-MRAM. When common inputs are fed through the BLs, vector-to-matrix multiplication can be performed in a single memory sensing phase, achieving massive parallelism with low power and low latency. We further introduce the concept of a dynamic threshold to reduce the implementation complexity of the synapse and neuron circuitry. This adjustable threshold also permits a nonlinear batch normalization (BN) function to be incorporated into the integrate-and-fire (IF) neuron circuit. The circuitry greatly improves the overall performance and enables high regularity in circuit layouts. Our proposed netlist circuits are built on a 65-nm CMOS process with a fitted magnetic tunnel junction (MTJ) model for performance evaluation. The hardware/software co-simulation results indicate that the proposed design can deliver a performance of 176.6 TOPS/W for an in-memory computing (IMC) subarray size of 1 × 288. The classification accuracy reaches 97.92% (83.85%) on the MNIST (CIFAR-10) dataset. The impacts of device non-idealities and process variations are also thoroughly covered in the analysis.
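The dynamic-threshold IF neuron and surrogate-gradient training described above can be illustrated with a short behavioral sketch. The code below is an assumption-laden PyTorch illustration, not the authors' circuit or training code: the class names (SpikeFn, DynamicThresholdIF), the rectangular surrogate window, and the soft reset are all illustrative choices; the only idea taken from the abstract is that batch normalization can be viewed as a per-channel adjustment of the firing threshold.

```python
# Behavioral sketch (not the authors' code): an integrate-and-fire neuron with
# a surrogate gradient and a "dynamic threshold" that absorbs the batch-norm
# transform, as the abstract describes conceptually.
import torch


class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a rectangular surrogate gradient."""

    @staticmethod
    def forward(ctx, v_minus_theta):
        ctx.save_for_backward(v_minus_theta)
        return (v_minus_theta >= 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass gradients only in a window around the threshold crossing.
        return grad_out * (x.abs() < 0.5).float()


class DynamicThresholdIF(torch.nn.Module):
    """IF neuron whose effective threshold absorbs the BN affine transform."""

    def __init__(self, num_features, base_theta=1.0):
        super().__init__()
        self.bn = torch.nn.BatchNorm1d(num_features)
        self.base_theta = base_theta

    def forward(self, psp_seq):            # psp_seq: (T, B, F) presynaptic sums
        v = torch.zeros_like(psp_seq[0])
        spikes = []
        for psp in psp_seq:
            # Normalizing the input is equivalent to scaling/shifting the
            # firing threshold per channel ("dynamic threshold").
            v = v + self.bn(psp)
            s = SpikeFn.apply(v - self.base_theta)
            v = v - s * self.base_theta    # soft reset after a spike
            spikes.append(s)
        return torch.stack(spikes)


# Example: 8 time steps, batch of 4, 16 features.
out = DynamicThresholdIF(16)(torch.randn(8, 4, 16))
print(out.shape)                            # torch.Size([8, 4, 16])
```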
Citations
Journal ArticleDOI
01 Mar 2023
TL;DR: In this paper, a multiplexing temporal encoder built on an interspike interval (ISI) encoding scheme is proposed to improve information-processing capability and robustness in noisy environments.
Abstract: From rate to temporal encoding, spiking information processing has demonstrated advantages across diverse neuromorphic applications. In terms of data capacity and robustness, multiplexing encoding outperforms alternative encoding schemes. In this work, we aim to implement a new class of multiplexing temporal encoders that pattern stimuli on multiple timescales to improve the information-processing capability and robustness of systems deployed in noisy environments. Benefiting from an internal reference frame based on subthreshold membrane oscillation (SMO), the encoded spike patterns are less sensitive to input noise, increasing the encoder's robustness. Our design yields substantial savings in power consumption and silicon area compared with power-hungry analog-to-digital converters. Furthermore, a working prototype of the multiplexing temporal encoder, built on an interspike interval (ISI) encoding scheme, is implemented in silicon using a standard 180-nm CMOS process. To the best of our knowledge, our encoder is the first integrated circuit (IC) implementation of neural encoding with a multiplexing topology. Finally, the accuracy and efficiency of our design are evaluated through standard machine learning benchmarks, including Modified National Institute of Standards and Technology (MNIST), Canadian Institute For Advanced Research (CIFAR)-10, Street View House Number (SVHN), and spectrum sensing in high-speed communication networks. Our multiplexing temporal encoder demonstrates higher classification accuracy across all benchmarks, while its power consumption and dissipated energy per spike are merely 2.6 µW and 95 fJ/spike, respectively, at an effective frame rate of 300 MHz. Compared with alternative encoding schemes, our multiplexing temporal encoder achieves up to 100% higher data capacity, 11.4% higher classification accuracy, and 25% better robustness against noise. Compared with state-of-the-art designs, our work achieves up to 105× higher power efficiency without significantly increasing the silicon area.
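As a rough illustration of the interspike-interval idea behind the prototype, the sketch below maps a normalized analog sample to the gap between two spikes (larger input, shorter gap). It is a minimal assumption-based example: the parameter names and time constants (t_min, t_max) are placeholders, and the paper's subthreshold-membrane-oscillation reference frame and multiplexing across timescales are not modeled.

```python
# Illustrative sketch only (not the paper's encoder): map an analog sample to
# an interspike interval (ISI), so a larger input produces a shorter gap
# between two spikes. t_min and t_max are placeholder assumptions.
import numpy as np


def isi_encode(x, t_min=2e-9, t_max=20e-9):
    """Return the interspike interval for a normalized input x in [0, 1]."""
    x = np.clip(x, 0.0, 1.0)
    return t_max - x * (t_max - t_min)


def isi_decode(isi, t_min=2e-9, t_max=20e-9):
    """Invert the mapping to recover the normalized amplitude."""
    return (t_max - isi) / (t_max - t_min)


samples = np.array([0.1, 0.5, 0.9])
intervals = isi_encode(samples)
print(intervals, isi_decode(intervals))   # decoded values match the inputs
```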

2 citations

Proceedings ArticleDOI
21 Sep 2022
TL;DR: In this paper, a Fully Binarized Weight Spiking Neuron Network (FBW-SNN) is proposed, which is built on XNOR arrays with binary weights, where the weights of the first and last layers are represented by a stochastic binary stream.
Abstract: This study proposes a Fully Binarized Weight Spiking Neuron Network (FBW-SNN). The core network is built on XNOR arrays with binary weights, where the weights of the first and last layers are represented by a stochastic binary stream (stochastic numbers). We also introduce an algorithm that combines straight-through and surrogate gradients to train the FBW-SNN. The evaluation results on the CIFAR-10 dataset show that the proposed FBW-SNN achieves an accuracy of 82.68% with only 14 time steps, comparable to the accuracy of a binarized SNN (with real-valued weights in the first and last layers) and conventional SNNs. Being fully binarized, the proposed network model is a promising candidate for edge-AI applications on low-power, resource-constrained devices.
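The stochastic-binary-stream representation of the first- and last-layer weights can be sketched as follows. This is not the FBW-SNN implementation; it only illustrates, under assumed names and a fixed stream length, how a real-valued weight in [-1, 1] can be replaced by a ±1 bit stream whose average approximates the weight, so that multiplications reduce to ±1 (XNOR-style) operations.

```python
# Illustrative sketch (not the FBW-SNN code): a real-valued weight in [-1, 1]
# becomes a stochastic stream of +/-1 bits whose average approximates the
# weight, so even the first/last layers can use +/-1 arithmetic.
import numpy as np

rng = np.random.default_rng(0)


def to_stochastic_stream(w, length=256):
    """Bernoulli stream of +/-1 bits with mean approximately w (w in [-1, 1])."""
    p_plus = (np.clip(w, -1.0, 1.0) + 1.0) / 2.0
    return np.where(rng.random(length) < p_plus, 1.0, -1.0)


w = 0.3
stream = to_stochastic_stream(w)
x = -1.0                                  # a binary input value in {-1, +1}
# The product x*w is approximated by averaging the +/-1 products.
approx = np.mean(stream * x)
print(approx, x * w)                      # approx should be close to -0.3
```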
Journal ArticleDOI
TL;DR: In this paper, a charge-based integrate-and-fire (IF) circuit for in-memory binary spiking neural networks (BSNNs) is proposed, which can mimic both addition and subtraction operations, permitting better incorporation with in-memory XNOR-based synapses.
Abstract: This paper presents a charge-based integrate-and-fire (IF) circuit for in-memory binary spiking neural networks (BSNNs). The proposed IF circuit can mimic both addition and subtraction operations, permitting better incorporation with in-memory XNOR-based synapses to implement the BSNN processing core. To evaluate the proposed design, we have developed a framework that incorporates the effects of circuit imperfections into the system-level simulation. The array circuits use 2T-2J spin-transfer-torque magnetoresistive RAM (STT-MRAM) based on a 65-nm commercial CMOS process and a fitted magnetic tunnel junction (MTJ) model. The system model is described in PyTorch to fit the parameters extracted at the circuit level, including coverage of device non-idealities and process variations. The simulation results show that the proposed design achieves 5.10 fJ/synapse and reaches 82.01% classification accuracy on CIFAR-10 under process variation.
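A behavioral sketch of the add/subtract behavior the circuit mimics is given below: an XNOR-based binary synapse contributes +1 when the input spike and stored weight agree and -1 when they disagree, and the IF neuron adds or subtracts a corresponding unit of charge before comparing with the threshold. The function name, threshold value, and soft reset are illustrative assumptions; the charge-based circuit itself is not modeled.

```python
# Behavioral sketch (not the circuit): XNOR synapses yield +1 per match and
# -1 per mismatch; the IF neuron adds/subtracts "charge" accordingly and
# fires when the accumulated value crosses the threshold.
import numpy as np


def xnor_if_step(spikes, weights, v, theta=4.0):
    """spikes, weights: arrays in {-1, +1}; v: accumulated membrane charge."""
    contrib = np.sum(spikes * weights)     # +1 per match, -1 per mismatch
    v = v + contrib                        # add or subtract charge
    fired = v >= theta
    if fired:
        v -= theta                         # soft reset after firing
    return v, fired


v = 0.0
weights = np.array([+1, -1, +1, +1, -1, +1])
for t, spikes in enumerate([np.array([+1, -1, +1, -1, -1, +1]),
                            np.array([+1, +1, +1, +1, -1, +1])]):
    v, fired = xnor_if_step(spikes, weights, v)
    print(t, v, fired)
```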
Journal ArticleDOI
TL;DR: In this article, the authors review the fundamentals of magnetic tunnel junctions (MTJs) as well as the development of MTJ-based neurons, synapses, and probabilistic bits.
Abstract: The conventional computing method based on the von Neumann architecture is limited by a series of problems, such as high energy consumption and the finite data-exchange bandwidth between processors and storage media, making it difficult to achieve higher computing efficiency. A more efficient, unconventional computing architecture is urgently needed to overcome these problems. Neuromorphic computing and stochastic computing are considered two competitive candidates for unconventional computing, owing to their extraordinary potential for energy-efficient and high-performance computing. Although conventional electronic devices can mimic the topology of the human brain, they require high power consumption and large area. Spintronic devices, represented by magnetic tunnel junctions (MTJs), exhibit remarkably high energy efficiency, non-volatility, and similarity to biological nervous systems, making them promising candidates for unconventional computing. In this work, we review the fundamentals of MTJs as well as the development of MTJ-based neurons, synapses, and probabilistic bits (p-bits). In the section on neuromorphic computing, we review a variety of neural networks composed of MTJ-based neurons and synapses, including multilayer perceptrons, convolutional neural networks, recurrent neural networks, and spiking neural networks, which are the closest to the biological neural system. In the section on stochastic computing, we review the applications of MTJ-based p-bits, including Boltzmann machines, Ising machines, and Bayesian networks. Furthermore, the challenges in developing these novel technologies are briefly discussed at the end of each section.
Proceedings ArticleDOI
21 Sep 2022
TL;DR: In this paper, a novel in-memory matching circuit realizing CAM applications is presented, based on non-volatile resistive memory and a 2T-2R bit-cell structure that provides reliable lookup operations.
Abstract: This paper presents a novel in-memory matching circuit realizing content-addressable memory (CAM) applications based on non-volatile resistive memory and a 2T-2R bit-cell structure that provides reliable lookup operations. Evaluations extended to different NV-RAM types (RRAM, PCRAM, and MRAM) demonstrate the broad applicability of our design architecture. The advantages of the CAM matching circuit are verified by Monte Carlo simulations using 65-nm CMOS process technology. Compared with other conventional approaches, our proposed design reaches relatively low sensing latencies, varying from 0.14 to 0.24 ns, while maintaining low search error rates.
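Functionally, a CAM lookup compares a search key against every stored word in parallel and returns the matching rows. The sketch below models only this behavior; the 2T-2R bit cell, sensing scheme, and the latency/error figures of the paper are not represented.

```python
# Behavioral sketch of a CAM lookup (circuit-level 2T-2R sensing not modeled):
# every stored word is compared against the search key and the indices of
# matching rows are returned.
import numpy as np

stored = np.array([[1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 0, 1, 0]], dtype=np.uint8)


def cam_search(stored_words, key):
    """Return indices of rows that exactly match the search key."""
    matches = np.all(stored_words == key, axis=1)
    return np.flatnonzero(matches)


print(cam_search(stored, np.array([1, 0, 1, 0], dtype=np.uint8)))  # -> [2]
```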
References
Dissertation
01 Jan 2009
TL;DR: In this paper, the authors describe how to train a multi-layer generative model of natural images using a dataset of millions of tiny colour images.
Abstract: In this work we describe how to train a multi-layer generative model of natural images. We use a dataset of millions of tiny colour images, described in the next section. This has been attempted by several groups but without success. The models on which we focus are RBMs (Restricted Boltzmann Machines) and DBNs (Deep Belief Networks). These models learn interesting-looking filters, which we show are more useful to a classifier than the raw pixels. We train the classifier on a labeled subset that we have collected and call the CIFAR-10 dataset.
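For context, the kind of model the thesis trains can be illustrated by a single contrastive-divergence (CD-1) update for a binary restricted Boltzmann machine. This is a generic textbook-style sketch, not code from the thesis: layer sizes and the learning rate are placeholders, and biases are omitted for brevity.

```python
# Minimal sketch of one contrastive-divergence (CD-1) update for a binary RBM.
# Sizes and learning rate are placeholder assumptions; biases are omitted.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 64, 16, 0.01
W = 0.01 * rng.standard_normal((n_visible, n_hidden))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(W, v0):
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)  # sample hidden
    v1_prob = sigmoid(h0 @ W.T)                               # reconstruction
    h1_prob = sigmoid(v1_prob @ W)
    # Positive minus negative phase statistics, averaged over the batch.
    grad = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / v0.shape[0]
    return W + lr * grad


batch = (rng.random((32, n_visible)) < 0.5).astype(float)  # toy binary data
W = cd1_update(W, batch)
```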

15,005 citations

Book ChapterDOI
08 Oct 2016
TL;DR: The Binary-Weight-Network version of AlexNet is compared with recent network binarization methods, BinaryConnect and BinaryNets, and outperforms these methods by a large margin on ImageNet, more than 16% in top-1 accuracy.
Abstract: We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values, resulting in 32× memory savings. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of the number of high-precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as that of the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by a large margin on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.
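The binary-weight approximation described in the abstract replaces a real-valued filter W with alpha · sign(W); in the sketch below the scaling factor alpha is taken as the mean absolute value of W, which is a reasonable and commonly used choice rather than a claim about the paper's exact implementation.

```python
# Sketch of the binary-weight approximation: a real filter W is replaced by
# alpha * sign(W), so convolutions reduce to additions/subtractions plus one
# scaling per filter.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3, 16))          # a real-valued filter

alpha = np.mean(np.abs(W))                   # scaling factor (assumed choice)
B = np.sign(W)                               # binary weights in {-1, +1}
W_approx = alpha * B

print(alpha, np.mean((W - W_approx) ** 2))   # scale and approximation error
```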

3,288 citations

Posted Content
TL;DR: A binary matrix multiplication GPU kernel is written with which it is possible to run the MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
Abstract: We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training time, the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power efficiency. To validate the effectiveness of BNNs, we conduct two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieve nearly state-of-the-art results on the MNIST, CIFAR-10, and SVHN datasets. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available online.
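The binary matrix-multiplication kernel mentioned at the end rests on a simple identity: for vectors with entries in {-1, +1} encoded as bits, the dot product equals 2·popcount(XNOR(a, b)) − n. The sketch below only verifies this identity in plain Python; the packing of bits into machine words and the GPU kernel itself are not reproduced.

```python
# Arithmetic identity behind binary matrix multiplication: encode +1 as bit 1
# and -1 as bit 0; then dot(a, b) = 2 * popcount(XNOR(a, b)) - n.
import numpy as np

rng = np.random.default_rng(0)
n = 32
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)

a_bits = (a > 0)
b_bits = (b > 0)
xnor = ~(a_bits ^ b_bits)                 # True where the signs agree
dot_via_xnor = 2 * int(np.count_nonzero(xnor)) - n

print(dot_via_xnor, int(np.dot(a, b)))    # identical results
```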

2,320 citations

Journal ArticleDOI
TL;DR: Eyeriss, as described in this paper, is an accelerator for state-of-the-art deep convolutional neural networks (CNNs) that optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, by reconfiguring the architecture.
Abstract: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs). It optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, for various CNN shapes by reconfiguring the architecture. CNNs are widely used in modern AI systems but also pose throughput and energy-efficiency challenges to the underlying hardware, because their computation requires a large amount of data, creating significant data movement between on-chip and off-chip memory that is more energy-consuming than the computation itself. Minimizing the energy cost of data movement for any CNN shape is therefore the key to high throughput and energy efficiency. Eyeriss achieves these goals by using a proposed processing dataflow, called row stationary (RS), on a spatial architecture with 168 processing elements. The RS dataflow reconfigures the computation mapping of a given shape, optimizing energy efficiency by maximally reusing data locally to reduce expensive data movement, such as DRAM accesses. Compression and data gating are also applied to further improve energy efficiency. Eyeriss processes the convolutional layers at 35 frames/s and 0.0029 DRAM accesses per multiply-and-accumulate (MAC) operation for AlexNet at 278 mW (batch size N = 4), and at 0.7 frames/s and 0.0035 DRAM accesses/MAC for VGG-16 at 236 mW (N = 3).
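The row-stationary idea can be illustrated with a toy one-dimensional model: a processing element keeps one filter row resident and slides one input row past it, so the filter row is fetched once and reused for every output position. This is only a conceptual sketch; the actual mapping across Eyeriss's 168 PEs, partial-sum accumulation, compression, and data gating are not modeled.

```python
# Toy model of the row-stationary idea (not the actual Eyeriss mapping): one
# PE keeps a filter row resident and reuses it across a sliding input row,
# producing one row of partial sums.
import numpy as np


def pe_1d_conv(input_row, filter_row):
    """One PE: 1-D convolution of an input row with a resident filter row."""
    out_len = len(input_row) - len(filter_row) + 1
    return np.array([np.dot(input_row[i:i + len(filter_row)], filter_row)
                     for i in range(out_len)])


inp = np.arange(8.0)                 # one row of the input feature map
filt = np.array([1.0, 0.0, -1.0])    # one row of a 3x3 filter, kept stationary
print(pe_1d_conv(inp, filt))         # one row of partial sums
```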

2,165 citations

Journal ArticleDOI
TL;DR: In this paper, an alpha-power-law MOS model that includes the carrier velocity saturation effect, which becomes prominent in short-channel MOSFETs, is introduced, and closed-form expressions for the delay, short-circuit power, and transition voltage of CMOS inverters are derived.
Abstract: An alpha-power-law MOS model that includes the carrier velocity saturation effect, which becomes prominent in short-channel MOSFETs, is introduced. The model is an extension of Shockley's square-law MOS model in the saturation region. Since the model is simple, it can be used to handle MOSFET circuits analytically and can predict the circuit behavior in the submicrometer region. Using the model, closed-form expressions for the delay, short-circuit power, and transition voltage of CMOS inverters are derived. The delay expression includes input waveform slope effects and parasitic drain/source resistance effects and can be used in simulation and/or optimization CAD tools. It is found that the CMOS inverter delay becomes less sensitive to the input waveform slope and that short-circuit dissipation increases as the carrier velocity saturation effect in short-channel MOSFETs gets more severe.
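For reference, the core form of the alpha-power law and the resulting delay dependence can be written as follows. The notation is generic rather than the paper's exact symbols, and the delay relation is stated only as a proportionality.

```latex
% Generic form of the alpha-power-law saturation current: alpha = 2 recovers
% Shockley's square law, while alpha -> 1 models full velocity saturation.
\[
  I_D = I_{D0}\left(\frac{V_{GS} - V_{TH}}{V_{DD} - V_{TH}}\right)^{\alpha},
  \qquad 1 \le \alpha \le 2 ,
\]
% The inverter delay scales with the load charge over the drive current:
\[
  t_{pd} \;\propto\; \frac{C_L V_{DD}}{I_{D0}}
        \;\propto\; \frac{C_L V_{DD}}{(V_{DD} - V_{TH})^{\alpha}} .
\]
```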

1,596 citations