
Showing papers in "ACM Journal on Emerging Technologies in Computing Systems in 2019"


Journal ArticleDOI
TL;DR: In this article, the authors present the state of the art of hardware implementations of spiking neural networks and the current trends in algorithm elaboration from model selection to training mechanisms, and describe the strategies employed to leverage the characteristics of these event-driven algorithms at the hardware level.
Abstract: Neuromorphic computing is now a major research field for both academic and industrial actors. As opposed to Von Neumann machines, brain-inspired processors aim to bring the memory and the computational elements closer together to efficiently evaluate machine learning algorithms. Recently, spiking neural networks, a generation of cognitive algorithms employing computational primitives mimicking neuron and synapse operational principles, have become an important part of deep learning. They are expected to improve the computational performance and efficiency of neural networks, but they are best suited for hardware able to support their temporal dynamics. In this survey, we present the state of the art of hardware implementations of spiking neural networks and the current trends in algorithm elaboration from model selection to training mechanisms. The scope of existing solutions is extensive; we thus present the general framework and study on a case-by-case basis the relevant particularities. We describe the strategies employed to leverage the characteristics of these event-driven algorithms at the hardware level and discuss their related advantages and challenges.

112 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose hardware techniques for optimizations of hyperdimensional computing, in a synthesizable open-source VHDL library, to enable co-located implementation of both learning and classification tasks on only a small portion of Xilinx UltraScale FPGAs.
Abstract: Brain-inspired hyperdimensional (HD) computing models neural activity patterns of the very size of the brain’s circuits with points of a hyperdimensional space, that is, with hypervectors. Hypervectors are D-dimensional (pseudo)random vectors with independent and identically distributed (i.i.d.) components constituting ultra-wide holographic words: D=10,000 bits, for instance. At its very core, HD computing manipulates a set of seed hypervectors to build composite hypervectors representing objects of interest. It demands memory optimizations with simple operations for an efficient hardware realization. In this article, we propose hardware techniques for optimizations of HD computing, in a synthesizable open-source VHDL library, to enable co-located implementation of both learning and classification tasks on only a small portion of Xilinx UltraScale FPGAs: (1) We propose simple logical operations to rematerialize the hypervectors on the fly rather than loading them from memory. These operations massively reduce the memory footprint by directly computing the composite hypervectors whose individual seed hypervectors do not need to be stored in memory. (2) Bundling a series of hypervectors over time requires a multibit counter per every hypervector component. We instead propose a binarized back-to-back bundling without requiring any counters. This truly enables on-chip learning with minimal resources as every hypervector component remains binary over the course of training to avoid otherwise multibit components. (3) For every classification event, an associative memory is in charge of finding the closest match between a set of learned hypervectors and a query hypervector by using a distance metric. This operator is proportional to hypervector dimension (D), and hence may take O(D) cycles per classification event. Accordingly, we significantly improve the throughput of classification by proposing associative memories that steadily reduce the latency of classification to the extreme of a single cycle. (4) We perform a design space exploration incorporating the proposed techniques on FPGAs for a wearable biosignal processing application as a case study. Our techniques achieve up to 2.39× area saving, or 2,337× throughput improvement. The Pareto optimal HD architecture is mapped on only 18,340 configurable logic blocks (CLBs) to learn and classify five hand gestures using four electromyography sensors.

53 citations
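
The bind/bundle/compare operations described in the abstract above are easy to picture in software. The Python sketch below is a minimal behavioral illustration of binary hypervector binding (XOR), bundling (majority), and associative-memory lookup via Hamming distance; it is not the paper's VHDL library, and the item names and the toy gesture encoding are invented for illustration.

    import numpy as np

    D = 10_000                       # hypervector dimension, as in the abstract
    rng = np.random.default_rng(0)

    def random_hv():
        # seed hypervector with i.i.d. random binary components
        return rng.integers(0, 2, D, dtype=np.uint8)

    def bind(a, b):
        # binding (component-wise XOR) builds composite hypervectors
        return np.bitwise_xor(a, b)

    def bundle(hvs):
        # bundling = component-wise majority vote over a set of hypervectors
        return (np.sum(hvs, axis=0) * 2 > len(hvs)).astype(np.uint8)

    def hamming(a, b):
        # distance metric used by the associative memory
        return int(np.count_nonzero(a != b))

    # item memory: one seed hypervector per symbol (hypothetical sensor/level names)
    item = {name: random_hv() for name in ["s0", "s1", "lo", "hi"]}

    # class prototypes: bundles of bound (sensor, level) pairs seen during training
    gesture_A = bundle([bind(item["s0"], item["hi"]), bind(item["s1"], item["lo"])])
    gesture_B = bundle([bind(item["s0"], item["lo"]), bind(item["s1"], item["hi"])])

    # classification: nearest prototype under Hamming distance
    query = bundle([bind(item["s0"], item["hi"]), bind(item["s1"], item["lo"])])
    scores = {"A": hamming(query, gesture_A), "B": hamming(query, gesture_B)}
    print(min(scores, key=scores.get))   # -> A

The paper's hardware contributions (on-the-fly rematerialization of seed hypervectors, counter-free bundling, and low-latency associative memory) optimize exactly these three primitives.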


Journal ArticleDOI
TL;DR: In this paper, a probabilistic inference network simulator (PIN-Sim) is developed to realize a circuit-level model of an RBM utilizing resistive crossbar arrays along with differential amplifiers to implement the positive and negative weight values.
Abstract: Magnetoresistive random access memory (MRAM) technologies with thermally unstable nanomagnets are leveraged to develop an intrinsic stochastic neuron as a building block for restricted Boltzmann machines (RBMs) to form deep belief networks (DBNs). The embedded MRAM-based neuron is modeled using precise physics equations. The simulation results exhibit the desired sigmoidal relation between the input voltages and probability of the output state. A probabilistic inference network simulator (PIN-Sim) is developed to realize a circuit-level model of an RBM utilizing resistive crossbar arrays along with differential amplifiers to implement the positive and negative weight values. The PIN-Sim is composed of five main blocks to train a DBN, evaluate its accuracy, and measure its power consumption. The MNIST dataset is leveraged to investigate the energy and accuracy tradeoffs of seven distinct network topologies in SPICE using the 14nm HP-FinFET technology library with the nominal voltage of 0.8V, in which an MRAM-based neuron is used as the activation function. The software and hardware level simulations indicate that a 784×200×10 topology can achieve less than 5% error rates with ∼400pJ energy consumption. The error rates can be reduced to 2.5% by using a 784×500×500×500×10 DBN at the cost of ∼10× higher energy consumption and significant area overhead. Finally, the effects of specific hardware-level parameters on power dissipation and accuracy tradeoffs are identified via the developed PIN-Sim framework.

43 citations
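
As a behavioral aid (not the physics-level MRAM model used in the paper), the sketch below shows how a stochastic neuron whose output probability follows a sigmoid of its input voltage can serve as the activation in an RBM layer; the voltage scale and weight statistics are arbitrary placeholders.

    import numpy as np

    rng = np.random.default_rng(1)

    def stochastic_neuron(v_in, v_scale=0.1):
        # behavioral stand-in for the MRAM neuron: the probability that the output
        # settles in the '1' state follows a sigmoid of the input voltage
        p_one = 1.0 / (1.0 + np.exp(-v_in / v_scale))
        return int(rng.random() < p_one)            # one stochastic evaluation

    def rbm_hidden_sample(visible, weights, bias):
        # sample hidden units of an RBM with the stochastic neuron as activation
        # (the crossbar computes the weighted sum in analog; here it is a dot product)
        pre = visible @ weights + bias
        return np.array([stochastic_neuron(v) for v in pre])

    v = rng.integers(0, 2, 784)                 # e.g., a binarized MNIST image
    W = rng.normal(0, 0.01, (784, 200))         # 784x200 layer from the abstract's topology
    b = np.zeros(200)
    h = rbm_hidden_sample(v, W, b)
    print(h[:10])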


Journal ArticleDOI
TL;DR: An analog-integrated circuit implementation of a long short-term memory network, compatible with digital CMOS technology, is presented; it is shown that using current signals as internal transmission signals can largely reduce computation delay compared to the digital implementation.
Abstract: We present an analog-integrated circuit implementation of a long short-term memory (LSTM) network, which is compatible with digital CMOS technology. We have used multiple-input floating gate MOSFETs as both the front-end to obtain converted analog signals and the differential pairs in the proposed analog multipliers. An analog crossbar built from these analog multipliers performs the matrix and bitwise multiplications. We have shown that using current signals as internal transmission signals can largely reduce computation delay, compared to the digital implementation. We have also introduced analog blocks to work as activation functions for the algorithm. In the back-end of our design, we have used current comparators to make the output readable by external digital systems. We have designed the LSTM network with the matrix size of 16 × 16 in TSMC 180nm CMOS technology. The post-layout simulations show that the latency of one computing cycle is 1.19ns without memory, and power dissipation of the single analog LSTM computing core with 2 kilobytes SRAM at 200MHz is 460.3mW. The overhead of power dissipation due to SRAM access is 8.3%, in which the computing of each LSTM layer requires one computing cycle. The energy efficiency is 0.95TOP/s/W.

25 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present an advanced simulation approach for droplet microfluidics that addresses the shortcomings of state-of-the-art simulation tools and allows simulating practically relevant applications.
Abstract: The complexity of droplet microfluidics grows with the implementation of parallel processes and multiple functionalities on a single device. This poses a severe challenge to the engineer designing the corresponding microfluidic networks. In today’s design processes, the engineer relies on calculations, assumptions, simplifications, as well as his/her experiences and intuitions. To validate the obtained specification of the microfluidic network, a prototype is usually fabricated and physical experiments are conducted. In case the design does not implement the desired functionality, this prototyping iteration is repeated—obviously resulting in an expensive and time-consuming design process. To avoid unnecessary debugging loops involving fabrication and testing, simulation methods could help to initially validate the specification of the microfluidic network before any prototype is fabricated. However, state-of-the-art simulation tools come with severe limitations, which prevent their utilization for practically relevant applications. More precisely, they are often not dedicated to droplet microfluidics, cannot handle the required physical phenomena, are not publicly available, and can hardly be extended. In this work, we present an advanced simulation approach for droplet microfluidics that addresses these shortcomings and, eventually, allows simulating practically relevant applications. To this end, we propose a simulation framework based on the one-dimensional analysis model, which directly works on the specification of the design, supports essential physical phenomena, is publicly available, and is easy to extend. Evaluations and case studies demonstrate the benefits of the proposed simulator: While current state-of-the-art tools were not applicable for practically relevant microfluidic networks, the proposed simulator allows reducing the design time and costs, e.g., of a drug screening device from one person month and USD 1200, respectively, to just a fraction of that.

23 citations
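
One-dimensional analysis models of the kind the abstract refers to typically treat each channel as a hydraulic resistance, so a microfluidic network can be solved like a resistor network (pressure as voltage, volumetric flow as current). The sketch below illustrates that analogy on a toy two-branch network using the standard rectangular-channel resistance approximation; it is not the proposed simulator, and the geometry and pressure values are invented.

    def channel_resistance(mu, L, w, h):
        # approximate hydraulic resistance of a rectangular channel (Pa*s/m^3),
        # a standard 1D-model formula; assumes h <= w
        return (12.0 * mu * L) / (w * h**3 * (1.0 - 0.63 * h / w))

    # toy network: inlet -> node1 via channel A, node1 -> outlet via B and C in parallel
    mu = 1e-3                                   # viscosity of water, Pa*s
    R_A = channel_resistance(mu, L=10e-3, w=100e-6, h=50e-6)
    R_B = channel_resistance(mu, L=20e-3, w=100e-6, h=50e-6)
    R_C = channel_resistance(mu, L=15e-3, w=100e-6, h=50e-6)

    p_in, p_out = 10_000.0, 0.0                 # applied pressures in Pa
    R_par = 1.0 / (1.0 / R_B + 1.0 / R_C)       # parallel branches combine like resistors
    Q_total = (p_in - p_out) / (R_A + R_par)    # total flow, analogous to Ohm's law
    Q_B = Q_total * R_C / (R_B + R_C)           # current-divider rule for the branch flows
    Q_C = Q_total * R_B / (R_B + R_C)
    print(f"Q_total = {Q_total*1e12:.0f} nL/s, Q_B = {Q_B*1e12:.0f} nL/s, Q_C = {Q_C*1e12:.0f} nL/s")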


Journal ArticleDOI
TL;DR: The proposed multi-core power management scheme employs a reinforcement learner to consider the power savings and variations in the application and thermal reliability caused by DVFS, and enables average energy savings of up to ∼20% and up to 4.926°C temperature reduction without degradation in application- and thermal-reliability.
Abstract: Power management through dynamic voltage and frequency scaling (DVFS) is one of the most widely adopted techniques. However, it impacts application reliability (due to soft errors, circuit aging, and deadline misses). Furthermore, increased power density impacts the thermal reliability of the chip, sometimes leading to permanent failure. To balance both application- and thermal-reliability along with achieving power savings and maintaining performance, we propose application- and thermal-reliability-aware reinforcement learning–based multi-core power management in this work. The proposed power management scheme employs a reinforcement learner to consider the power savings and variations in the application and thermal reliability caused by DVFS. To overcome the computational overhead, the power management decisions are determined at the application-level rather than per-core or system-level granularity. Experimental evaluation of the proposed multi-core power management on a microprocessor with up to 32 cores, running PARSEC applications, was done to demonstrate the applicability and efficiency of the proposed technique. Compared to the existing state-of-the-art techniques, the proposed technique enables average energy savings of up to ∼20% and up to 4.926°C temperature reduction without degradation in application- and thermal-reliability.

20 citations
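
To make the reinforcement-learning loop concrete, here is a minimal tabular Q-learning sketch for choosing a voltage/frequency level once per application epoch. The state binning, action set, and reward weighting below are hypothetical stand-ins; the paper's learner, state encoding, and reliability models are more elaborate.

    import random

    V_F_LEVELS = [(0.8, 2.0), (0.9, 2.5), (1.0, 3.0)]   # hypothetical (V, GHz) pairs
    STATES = range(4)                                    # e.g., binned utilization/temperature
    ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

    Q = {(s, a): 0.0 for s in STATES for a in range(len(V_F_LEVELS))}

    def reward(power_saving, perf_loss, temp_excess, soft_error_risk):
        # hypothetical reward: reward power savings, penalize reliability/performance loss
        return power_saving - 2.0 * perf_loss - 1.5 * temp_excess - 1.0 * soft_error_risk

    def choose_action(state):
        if random.random() < EPS:                        # explore
            return random.randrange(len(V_F_LEVELS))
        return max(range(len(V_F_LEVELS)), key=lambda a: Q[(state, a)])  # exploit

    def update(state, action, r, next_state):
        best_next = max(Q[(next_state, a)] for a in range(len(V_F_LEVELS)))
        Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])

    # one (simulated) decision epoch at application-level granularity
    s = 1
    a = choose_action(s)
    r = reward(power_saving=0.2, perf_loss=0.05, temp_excess=0.0, soft_error_risk=0.02)
    update(s, a, r, next_state=2)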


Journal ArticleDOI
TL;DR: This article presents the design and evaluation of an accelerator for CoNNs whose system-level architecture is based on mixed-signal cellular neural networks (CeNNs), including the implementation of different layers such as convolution, ReLU, and pooling in a CoNN using CeNNs.
Abstract: Deep neural network (DNN) accelerators with improved energy and delay are desirable for meeting the requirements of hardware targeted for IoT and edge computing systems. Convolutional neural networks (CoNNs) belong to one of the most popular types of DNN architectures. This article presents the design and evaluation of an accelerator for CoNNs. The system-level architecture is based on mixed-signal, cellular neural networks (CeNNs). Specifically, we present (i) the implementation of different layers, including convolution, ReLU, and pooling, in a CoNN using CeNN, (ii) modified CoNN structures with CeNN-friendly layers to reduce computational overheads typically associated with a CoNN, (iii) a mixed-signal CeNN architecture that performs CoNN computations in the analog and mixed signal domain, and (iv) design space exploration that identifies what CeNN-based algorithm and architectural features fare best compared to existing algorithms and architectures when evaluated over common datasets—MNIST and CIFAR-10. Notably, the proposed approach can lead to 8.7× improvements in energy-delay product (EDP) per digit classification for the MNIST dataset at iso-accuracy when compared with the state-of-the-art DNN engine, while our approach could offer 4.3× improvements in EDP when compared to other network implementations for the CIFAR-10 dataset.

18 citations
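
For readers unfamiliar with cellular neural networks: each CeNN cell integrates a state equation driven by a feedback template A and a feedforward template B, and convolution maps naturally onto the B template. The discrete-time sketch below is a purely behavioral illustration of how convolution, ReLU, and pooling line up with CeNN operations; it does not model the paper's mixed-signal architecture, and the templates, step size, and image are arbitrary.

    import numpy as np
    from scipy.signal import convolve2d

    def cenn_step(x, u, A, B, z, dt=0.1):
        # one forward-Euler step of the CeNN state equation
        #   dx/dt = -x + A*f(x) + B*u + z, with f the saturating output
        y = np.clip(x, -1.0, 1.0)                       # standard CeNN output nonlinearity
        dx = -x + convolve2d(y, A, mode="same") + convolve2d(u, B, mode="same") + z
        return x + dt * dx

    # a 3x3 feedforward template acting as a convolution kernel (e.g., edge detector)
    B = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float)
    A = np.zeros((3, 3))                                # no feedback: pure convolution layer
    z = 0.0
    u = np.random.rand(28, 28)                          # input image (e.g., an MNIST digit)
    x = np.zeros_like(u)
    for _ in range(50):                                 # settle toward steady state
        x = cenn_step(x, u, A, B, z)
    relu_out = np.maximum(x, 0.0)                       # ReLU layer on the settled state
    pooled = relu_out.reshape(14, 2, 14, 2).max(axis=(1, 3))   # 2x2 max pooling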


Journal ArticleDOI
TL;DR: This comparative cross-layer study between DW-RM and SK-RM is intended to provide guidelines, especially for circuit and architecture researchers working on RM, and to answer questions about the relationship between the DW and the skyrmion.
Abstract: Racetrack memory (RM), a new storage scheme in which information flows along a nanotrack, has been considered a potential candidate to replace the hard disk drive (HDD) as a future high-density storage device. The first RM technology, which was proposed in 2008 by IBM, relies on a train of opposite magnetic domains separated by domain walls (DWs), named DW-RM. After 10 years of intensive research, a variety of fundamental advancements have been achieved; unfortunately, no product is available yet. With increasing effort and resources dedicated to the development of DW-RM, it is likely that new materials and mechanisms will soon be discovered for practical applications. However, new concepts might also be on the horizon. Recently, an alternative information carrier, the magnetic skyrmion, which was experimentally discovered in 2009, has been regarded as a promising replacement for the DW in RM, named skyrmion-based RM (SK-RM). Intensive effort has been invested and impressive advances have been made in observing, writing, manipulating, and deleting individual skyrmions. So, what is the relationship between DW and skyrmion? What are the key differences between DW and skyrmion, or between DW-RM and SK-RM? What benefits could SK-RM bring and what challenges need to be addressed before application? In this review article, we intend to answer these questions through a comparative cross-layer study between DW-RM and SK-RM. This work will provide guidelines, especially for circuit and architecture researchers working on RM.

16 citations


Journal ArticleDOI
TL;DR: A taxonomy of hardware-based monitoring techniques against different cyber and hardware attacks is provided, their potentials and unique challenges are highlighted, and it is shown how power-based side-channel instruction-level monitoring can offer suitable solutions to prevailing embedded device security issues.
Abstract: With the rise of Internet of Things (IoT), devices such as smartphones, embedded medical devices, smart home appliances, as well as traditional computing platforms such as personal computers and servers have been increasingly targeted with a variety of cyber attacks. Due to limited hardware resources for embedded devices and difficulty in wide-coverage and on-time software updates, software-only cyber defense techniques, such as traditional anti-virus and malware detectors, do not offer a silver-bullet solution. Hardware-based security monitoring and protection techniques, therefore, have gained significant attention. Monitoring devices using side-channel leakage information, e.g., power supply variation and electromagnetic (EM) radiation, is a promising avenue that promotes multiple directions in security and trust applications. In this article, we provide a taxonomy of hardware-based monitoring techniques against different cyber and hardware attacks, highlight the potentials and unique challenges, and show how power-based side-channel instruction-level monitoring can offer suitable solutions to prevailing embedded device security issues. Further, we delineate directions for future research.

15 citations


Journal ArticleDOI
TL;DR: In this article, a comprehensive neuromemristive crossbar architecture for the spatial pooler and the sparse distributed representation classifier is presented. The proposed design is benchmarked for image recognition tasks using the Modified National Institute of Standards and Technology (MNIST) and Yale faces datasets, and evaluated using different metrics including entropy, sparseness, and noise robustness.
Abstract: Hierarchical temporal memory (HTM) is a biomimetic sequence memory algorithm that holds promise for invariant representations of spatial and spatio-temporal inputs. This article presents a comprehensive neuromemristive crossbar architecture for the spatial pooler (SP) and the sparse distributed representation classifier, which are fundamental to the algorithm. There are several unique features in the proposed architecture that tightly link with the HTM algorithm. A memristor that is suitable for emulating the HTM synapses is identified and a new Z-window function is proposed. The architecture exploits the concept of synthetic synapses to enable potential synapses in the HTM. The crossbar for the SP avoids dark spots caused by unutilized crossbar regions and supports rapid on-chip training within two clock cycles. This research also leverages plasticity mechanisms such as neurogenesis and homeostatic intrinsic plasticity to strengthen the robustness and performance of the SP. The proposed design is benchmarked for image recognition tasks using Modified National Institute of Standards and Technology (MNIST) and Yale faces datasets, and is evaluated using different metrics including entropy, sparseness, and noise robustness. Detailed power analysis at different stages of the SP operations is performed to demonstrate the suitability for mobile platforms.

14 citations
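
A compact software sketch of the spatial-pooler computation that the crossbar accelerates may help: overlaps are dot products with a binary connected-synapse matrix (the crossbar's job), followed by k-winners-take-all inhibition and Hebbian-style permanence updates. The parameters below are illustrative; the memristor model, Z-window function, and the neurogenesis/homeostasis mechanisms from the paper are not represented.

    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, n_columns, sparsity = 256, 128, 0.05

    # synapse permanences; a synapse is "connected" above a threshold
    perm = rng.random((n_columns, n_inputs))
    CONNECTED, P_INC, P_DEC = 0.5, 0.03, 0.015

    def spatial_pooler(x):
        connected = (perm >= CONNECTED).astype(float)     # binary synapse matrix
        overlap = connected @ x                           # crossbar-style dot products
        k = max(1, int(sparsity * n_columns))
        winners = np.argsort(overlap)[-k:]                # k-winners-take-all inhibition
        sdr = np.zeros(n_columns)
        sdr[winners] = 1.0                                # sparse distributed representation
        for c in winners:                                 # Hebbian-style learning on winners only
            perm[c] += np.where(x > 0, P_INC, -P_DEC)
        np.clip(perm, 0.0, 1.0, out=perm)
        return sdr

    x = (rng.random(n_inputs) < 0.2).astype(float)        # binarized input (e.g., an image patch)
    print(spatial_pooler(x).sum())                        # -> about 6 active columns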


Journal ArticleDOI
TL;DR: A new stochastic multiplier built from simple CMOS transistors, called the stochastic hybrid multiplier, is proposed for quantized neural networks; it exploits the characteristics of quantized weights and tremendously reduces the hardware cost of neural networks.
Abstract: With increasing interest in neural networks, hardware implementations of neural networks have been investigated. Researchers pursue low hardware cost by using different technologies such as stochastic computing (SC) and quantization. More specifically, quantization reduces the total number of trained weights and results in low hardware cost. SC aims to lower hardware costs substantially by using simple gates instead of complex arithmetic operations. However, the advantages of combining quantization and SC in neural networks are not well investigated. In this article, we propose a new stochastic multiplier with simple CMOS transistors called the stochastic hybrid multiplier for quantized neural networks. The new design exploits the characteristics of quantized weights and tremendously reduces the hardware cost of neural networks. Experimental results indicate that our stochastic design achieves about 7.7x energy reduction compared to its counterpart binary implementation, at the cost of slightly higher recognition error rates. Compared to previous stochastic neural network implementations, our work achieves at least 4x, 9x, and 10x reductions in terms of area, power, and energy, respectively.
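
For context on how stochastic computing replaces arithmetic with simple gates: in unipolar SC, multiplying two values encoded as independent random bitstreams reduces to a bitwise AND. The sketch below shows only this basic principle; the paper's hybrid multiplier additionally exploits the structure of quantized weights, which is not modeled here.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1024                                   # bitstream length (accuracy grows with N)

    def to_stream(p):
        # encode a value p in [0, 1] as a Bernoulli(p) bitstream
        return (rng.random(N) < p).astype(np.uint8)

    def sc_multiply(p_a, p_b):
        # in unipolar SC, a bitwise AND of independent streams multiplies their values
        stream = to_stream(p_a) & to_stream(p_b)
        return stream.mean()                   # decode: fraction of 1s approximates p_a * p_b

    print(sc_multiply(0.5, 0.25))              # ~0.125, the exact value of 0.5 * 0.25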

Journal ArticleDOI
TL;DR: A flexible spare core placement technique for mesh-based NoCs, evaluated over several benchmark applications, is presented and shown to yield significant reductions in overall communication cost, average network latency, and network power consumption.
Abstract: Network-on-Chip (NoC) has been proposed as a promising solution to overcome the communication challenges of System-on-Chip (SoC) design in nanoscale technologies. With the advancement of nanoscale technology, the integration density of Intellectual Property (IP) cores in a single chip has increased, leading to higher heat dissipation, which in turn makes the system unreliable. Therefore, efficient fault-tolerant methods are necessary at different levels to improve overall system performance and keep the system operating normally. This article presents a flexible spare core placement technique for mesh-based NoC by taking several benchmark applications into consideration. An Integer Linear Programming (ILP)-based solution has been proposed for the spare core placement problem. A Particle Swarm Optimisation (PSO)-based meta-heuristic has also been proposed for the same problem. Experiments have been performed by taking several application benchmarks reported in the literature and the applications generated using the TGFF tool. Comparisons have been carried out between our approach and the approaches followed in the literature (i) by varying the network size with fixed fault percentage in the network, and (ii) by fixing the network size while varying the percentage of faults in the network. We have also compared the overall communication cost and CPU runtime between the ILP and PSO approaches. The results show significant reductions in the overall communication cost, average network latency, and network power consumption across all the cases using our approach over the approaches reported in the literature.
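
At its core, the objective that both the ILP and the PSO optimize is a communication cost of the form "sum over communicating core pairs of traffic volume times hop distance on the mesh". The toy sketch below evaluates that cost for a candidate mapping and exhaustively picks the spare tile that minimizes it when a core must be remapped; the mesh size, traffic values, and mapping are invented, and the real formulations include constraints not shown here.

    from itertools import product

    MESH = 4                                              # 4x4 mesh NoC

    def hops(a, b):
        # XY-routing hop count between two mesh coordinates
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def comm_cost(placement, traffic):
        # placement: core id -> (x, y); traffic: (src, dst) -> volume
        return sum(vol * hops(placement[s], placement[d]) for (s, d), vol in traffic.items())

    # application task-graph traffic (bandwidth units) and a candidate mapping
    traffic = {("c0", "c1"): 10, ("c1", "c2"): 4, ("c0", "c3"): 7}
    placement = {"c0": (0, 0), "c1": (0, 1), "c2": (1, 1), "c3": (1, 0)}
    print(comm_cost(placement, traffic))                  # cost of this mapping

    # when c1's tile is faulty, its task is remapped to a spare tile; the optimizer
    # (ILP or PSO) searches for the spare position that minimizes comm_cost
    spare_candidates = [p for p in product(range(MESH), repeat=2) if p not in placement.values()]
    best = min(spare_candidates, key=lambda p: comm_cost({**placement, "c1": p}, traffic))
    print(best)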

Journal ArticleDOI
TL;DR: QuTiBench is a novel multi-tiered benchmarking methodology that supports algorithmic optimizations such as quantization and helps system developers understand the benefits and limitations of these novel compute architectures in regard to specific neural networks and will help drive future innovation.
Abstract: Neural Networks have become one of the most successful universal machine-learning algorithms. They play a key role in enabling machine vision and speech recognition and are increasingly adopted in other application domains. Their computational complexity is enormous and comes along with equally challenging memory requirements in regards to capacity and access bandwidth, which limits deployment in particular within energy constrained, embedded environments. To address these implementation challenges, a broad spectrum of new customized and heterogeneous hardware architectures have emerged, often accompanied with co-designed algorithms to extract maximum benefit out of the hardware. Furthermore, numerous optimization techniques are being explored for neural networks to reduce compute and memory requirements while maintaining accuracy. This results in an abundance of algorithmic and architectural choices, some of which fit specific use cases better than others. For system-level designers, there is currently no good way to compare the variety of hardware, algorithm, and optimization options. While there are many benchmarking efforts in this field, they cover only subsections of the embedded design space. None of the existing benchmarks support essential algorithmic optimizations such as quantization, an important technique to stay on chip, or specialized heterogeneous hardware architectures. We propose a novel benchmark suite, QuTiBench, that addresses this need. QuTiBench is a novel multi-tiered benchmarking methodology (Ti) that supports algorithmic optimizations such as quantization (Qu) and helps system developers understand the benefits and limitations of these novel compute architectures in regard to specific neural networks and will help drive future innovation. We invite the community to contribute to QuTiBench to support the full spectrum of choices in implementing machine-learning systems.

Journal ArticleDOI
TL;DR: This work presents a comprehensive analytical model to analyze the performance of 3D mesh NoC over variants of different SNN topologies and communication protocols, together with an architecture and a low-latency spike routing algorithm, named shortest path K-means based multicast (SP-KMCR), for a three-dimensional NoC of spiking neurons (3DNoC-SNN).
Abstract: Spiking neural networks (SNNs) are artificial neural network models that more closely mimic biological neural networks. In addition to neuronal and synaptic state, SNNs incorporate the variant time scale into their computational model. Since each neuron in these networks is connected to thousands of others, high bandwidth is required. Moreover, since the spike times are used to encode information in SNN, very low communication latency is also needed. The 2D-NoC was used as a solution to provide a scalable interconnection fabric in large-scale parallel SNN systems. The 3D-ICs have also attracted a lot of attention as a potential solution to resolve the interconnect bottleneck. The combination of these two emerging technologies provides a new horizon for IC designs to satisfy the high requirements of low power and small footprint in emerging AI applications. In this work, we first present a comprehensive analytical model to analyze the performance of 3D mesh NoC over variants of different SNN topologies and communications protocols. Second, we present an architecture and a low-latency spike routing algorithm, named shortest path K-means based multicast (SP-KMCR), for three-dimensional NoC of spiking neurons (3DNoC-SNN). The proposed system was validated based on an RTL-level implementation, while area/power analysis was performed using 45nm CMOS technology.
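
The SP-KMCR idea of grouping a spike's many destinations before multicasting can be pictured with plain k-means over 3D router coordinates, as in the sketch below; the actual routing algorithm, architecture, and tie-breaking rules of the paper are not reproduced, and the mesh size and destination set are invented.

    import numpy as np

    rng = np.random.default_rng(0)

    def kmeans(points, k, iters=20):
        # plain k-means over the 3D router coordinates of the destination nodes
        centers = points[rng.choice(len(points), k, replace=False)].astype(float)
        for _ in range(iters):
            labels = np.argmin(((points[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                if np.any(labels == c):
                    centers[c] = points[labels == c].mean(axis=0)
        return labels, centers

    # destinations of one spike in a 4x4x4 3D mesh: (x, y, z) router coordinates
    dests = rng.integers(0, 4, size=(24, 3))
    labels, centers = kmeans(dests, k=3)
    # one multicast branch is routed toward each cluster and then fans out locally,
    # instead of sending 24 separate unicast spikes
    for c in range(3):
        print(f"branch {c}: {np.count_nonzero(labels == c)} destinations, centroid {centers[c].round(1)}")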

Journal ArticleDOI
TL;DR: A new architecture of stochastic neural networks with a hardware-oriented approximate activation function is investigated; the activation function can be hidden in the proposed architecture and thus reduces the overall hardware cost.
Abstract: Neural networks are becoming prevalent in many areas, such as pattern recognition and medical diagnosis. Stochastic computing is one potential solution for neural networks implemented in low-power back-end devices such as solar-powered devices and Internet of Things (IoT) devices. In this article, we investigate a new architecture of stochastic neural networks with a hardware-oriented approximate activation function. The newly proposed approximate activation function can be hidden in the proposed architecture and thus reduces the overall hardware cost. Additionally, to further reduce the hardware cost of the stochastic implementation, a new hybrid stochastic multiplier is proposed. It contains OR gates and a binary parallel counter, which aims to reduce the number of inputs of the binary parallel counter. The experimental results indicate that the newly proposed approximate architecture without hybrid stochastic multipliers achieves more than 25%, 60%, and 3x reduction compared to previous stochastic neural networks, and more than 30x, 30x, and 52% reduction compared to conventional binary neural networks, in terms of area, power, and energy, respectively, while maintaining error rates similar to those of the conventional neural networks. Furthermore, the stochastic implementation with hybrid stochastic multipliers further reduces area by about 18% to 80%, power by 15% to 113.1%, and energy by about 15% to 131%, respectively.

Journal ArticleDOI
TL;DR: A new NoC router, DAS, is proposed to support MCS based on an accurate WCCT analysis for high-critical flows, together with a multi-abstraction-level evaluation that uses model checking to automatically prove several DAS properties required by critical-systems designers.
Abstract: A Mixed Criticality System (MCS) combines real-time software tasks with different criticality levels. In an MCS, the criticality level specifies the level of assurance against system failure. For high-critical flows of messages, it is imperative to meet deadlines; otherwise, the whole system might fail, leading to catastrophic results, like loss of life or serious damage to the environment. In contrast, low-critical flows may tolerate some delays. Furthermore, in an MCS, flow performance metrics such as the Worst Case Communication Time (WCCT) may vary depending on the criticality level of the applications. Execution platforms must therefore provide different operating modes for applications with different levels of criticality. Finally, in a Network-On-Chip (NoC), sharing resources between communication flows can lead to unpredictable latencies, which makes the implementation of MCS on many-core architectures challenging. In this article, we propose and evaluate a new NoC router to support MCS based on an accurate WCCT analysis for high-critical flows. The proposed router, called Double Arbiter and Switching router (DAS), jointly uses Wormhole and Store And Forward communication techniques for low- and high-critical flows, respectively. It ensures that high-critical flows meet their deadlines while maximizing the bandwidth remaining for the low-critical flows. We also propose a new method for high-critical communication time analysis, applied to the Store And Forward switching mode with virtual channels. For low-critical flow communication time analysis, we adapt an existing wormhole communication time analysis with a sharing policy to our context. The second contribution of this article is a multi-abstraction-level evaluation of DAS. We evaluate the communication time of flows, the system mode change, the cost, and four properties of DAS. Simulations with a cycle-accurate SystemC NoC simulator show that, with a 15% network use rate, the communication delay of high-critical flows is reduced by 80% while the communication delay of low-critical flows is increased by 18% compared to solutions based on routers with multiple virtual channels. For 10% of network interferences, using system mode change, DAS reduces the high-critical communication delays by about 66%. We synthesize our router with a 28nm SOI technology and show that the size overhead is limited to 2.5% compared to a solution based on a virtual channel router. Finally, we applied model checking verification techniques to automatically prove several DAS properties required by critical systems designers.
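
As background for why the two switching modes are assigned to different criticality levels, the textbook zero-load latency models for the two techniques are (generic formulas, not the paper's WCCT analysis):

    T_{\mathrm{SAF}} \approx H \cdot \left(t_r + \frac{L}{b}\right), \qquad
    T_{\mathrm{WH}} \approx H \cdot t_r + \frac{L}{b}

where H is the hop count, t_r the per-hop routing delay, L the packet length, and b the link bandwidth. Store And Forward latency scales with H·L, which makes its worst case straightforward to bound hop by hop (hence its use for high-critical flows in DAS), whereas Wormhole pipelines the payload across hops for lower latency at the cost of blocking behavior that is harder to bound.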

Journal ArticleDOI
TL;DR: The crux of the article is to segment the entire system into smaller clusters of nodes and adopt a hybrid strategy for each segment that includes conventional laser modulation, as well as a novel technique for sharing power across nodes dynamically.
Abstract: This article presents BigBus, a novel design of an on-chip photonic network for a 1,024-node system. For such a large on-chip network, performance and power reduction are two mutually conflicting goals. This article uses a combination of strategies to reduce static power consumption while simultaneously improving performance and the energy-delay² (ED²) product. The crux of the article is to segment the entire system into smaller clusters of nodes and adopt a hybrid strategy for each segment that includes conventional laser modulation, as well as a novel technique for sharing power across nodes dynamically. We represent energy internally as tokens, where one token will allow a node to send a message to any other node in its cluster. We allow optical stations to arbitrate for tokens at a global level, and then we predict the number of token equivalents of power that the off-chip laser needs to generate. Using these techniques, BigBus outperforms other competing proposals. We demonstrate a speedup of 14–34% over state of the art proposals and a 20–61% reduction in ED².

Journal ArticleDOI
TL;DR: GARDENIA is a benchmark suite for studying irregular graph algorithms on massively parallel accelerators and exhibits irregular microarchitectural behavior, which is quite different from structured workloads and straightforward-implemented graph benchmarks.
Abstract: This article presents the Graph Algorithm Repository for Designing Next-generation Accelerators (GARDENIA), a benchmark suite for studying irregular graph algorithms on massively parallel accelerators. Applications with limited control and data irregularity are the main focus of existing generic benchmarks for accelerators, while available graph processing benchmarks do not apply state-of-the-art algorithms and/or optimization techniques. GARDENIA includes emerging graph processing workloads from graph analytics, sparse linear algebra, and machine-learning domains, which mimic massively multithreaded commercial programs running on modern large-scale datacenters. Our characterization shows that GARDENIA exhibits irregular microarchitectural behavior, which is quite different from structured workloads and straightforward-implemented graph benchmarks.

Journal ArticleDOI
TL;DR: Deep neural networks have become the readiest answer to a range of application challenges including image recognition, stock analysis, natural language processing, and biomedical applications such as seizure detection.
Abstract: Deep neural networks have become the readiest answer to a range of application challenges including image recognition, stock analysis, natural language processing, and biomedical applications such as seizure detection, all while outperforming prior leading solutions that relied heavily on hand-engineered techniques. However, deployment of these neural networks often requires highly computational and memory-intensive solutions. These requirements make it challenging to deploy Deep Neural Networks (DNNs) in embedded, real-time low-power applications where classic architectures, GPUs and CPUs, still impose a significant power burden. Systems-on-Chip (SoC) with Field-programmable Gate Arrays (FPGAs) can be used to improve performance and allow more fine-grain control of resources than CPUs or GPUs, but it is difficult to find the optimal balance between hardware and software to improve DNN efficiency. In the current research literature there have been few proposed solutions to address optimizing hardware and software deployments of DNNs in embedded low-power systems. To address the computation resource restriction and low-power needs for deploying these networks, we describe and implement a domain-specific metric model for optimizing task deployment on differing platforms, hardware and software. Next, we propose a DNN hardware accelerator called Scalable Low-power Accelerator for real-time deep neural Networks (SCALENet) that includes multithreaded software workers. Finally, we propose a heterogeneous aware scheduler that uses the DNN-specific metric models and the SCALENet accelerator to allocate a task to a resource based on solving a numerical cost for a series of domain objectives. To demonstrate the applicability of our contribution, we deploy nine modern deep network architectures, each containing a different number of parameters, within the context of two different neural network applications: image processing and biomedical seizure detection. Utilizing the metric modeling techniques integrated into the heterogeneous aware scheduler and the SCALENet accelerator, we demonstrate the ability to meet computational requirements, adapt to multiple architectures, and lower power by providing an optimized task-to-resource allocation. Our heterogeneous aware scheduler improves power saving by decreasing power consumption by 10% of the total system power, does not affect the accuracy of the networks, and still meets the real-time deadlines. We demonstrate the ability to achieve parity with or exceed the energy efficiency of NVIDIA GPUs when evaluated against the Jetson TK1 with embedded GPU SoC and with a 4× power savings in a power envelope of 2.0W. When compared to existing FPGA-based accelerators, SCALENet’s accelerator and heterogeneous aware scheduler achieve a 4× improvement in energy efficiency.

Journal ArticleDOI
TL;DR: In this article, a neural network-based method for laser modulation by predicting optical traffic and a distributed and altruistic algorithm for channel sharing is proposed to reduce the static power consumption.
Abstract: High static power consumption is widely regarded as one of the largest bottlenecks in creating scalable optical NoCs. The standard techniques to reduce static power are based on sharing optical channels and modulating the laser. We show in this article that state-of-the-art techniques in these areas are suboptimal, and there is a significant room for further improvement. We propose two novel techniques—a neural network–based method for laser modulation by predicting optical traffic and a distributed and altruistic algorithm for channel sharing—that are significantly closer to a theoretically ideal scheme. In spite of this, a lot of laser power still gets wasted. We propose to reuse this energy to heat micro-ring resonators (achieve thermal tuning) by efficiently recirculating it. These three methods help us significantly reduce the energy requirements. Our design consumes 4.7× lower laser power as compared to other state-of-the-art proposals. In addition, it results in a 31% improvement in performance and 39% reduction in ED² for a suite of Splash2 and Parsec benchmarks.

Journal ArticleDOI
TL;DR: In this article, the authors prove that the problem of placement and routing for tile-based FCN circuits is NP-complete and provide a theoretical foundation for the further development of corresponding design methods.
Abstract: Field-coupled Nanocomputing (FCN) technologies provide an alternative to conventional CMOS-based computation technologies and are characterized by intriguingly low-energy dissipation. Accordingly, their design received significant attention in the recent past. FCN circuit implementations like Quantum-dot Cellular Automata (QCA) or Nanomagnet Logic (NML) have already been built in labs and basic operations such as inverters, Majority, AND, OR, and so on, are already available. The design problem basically boils down to the question of how to place basic operations and route their connections so that the desired function results while, at the same time, further constraints (related to timing, clocking, path lengths, etc.) are satisfied. While several solutions for this problem have been proposed, interestingly no clear understanding about the complexity of the underlying task exists thus far. In this research note, we consider this problem and eventually prove that placement and routing for tile-based FCN circuits is NP-complete. By this, we provide a theoretical foundation for the further development of corresponding design methods.

Journal ArticleDOI
TL;DR: This article explores bio-plausible spike-timing-dependent-plasticity (STDP) mechanisms to train liquid state machine models with and without supervision and pursues efficient hardware implementation of FPGA LSM accelerators by performing algorithm-level optimization of the two proposed training rules and exploiting the self-organizing behaviors naturally induced by STDP.
Abstract: The liquid state machine (LSM) is a model of recurrent spiking neural networks (SNNs) and provides an appealing brain-inspired computing paradigm for machine-learning applications. Moreover, operated by processing information directly on spiking events, the LSM is amenable to efficient event-driven hardware implementation. However, training SNNs is, in general, a difficult task as synaptic weights shall be updated based on neural firing activities while achieving a learning objective. In this article, we explore bio-plausible spike-timing-dependent-plasticity (STDP) mechanisms to train liquid state machine models with and without supervision. First, we employ a supervised STDP rule to train the output layer of the LSM while delivering good classification performance. Furthermore, a hardware-friendly unsupervised STDP rule is leveraged to train the recurrent reservoir to further boost the performance. We pursue efficient hardware implementation of FPGA LSM accelerators by performing algorithm-level optimization of the two proposed training rules and exploiting the self-organizing behaviors naturally induced by STDP. Several recurrent spiking neural accelerators are built on a Xilinx Zynq ZC706 platform and trained for speech recognition with the TI46 speech corpus as the benchmark. Adopting the two proposed unsupervised and supervised STDP rules outperforms the recognition accuracy of a competitive non-STDP baseline training algorithm by up to 3.47%.
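
A minimal sketch of the pair-based STDP update that underlies such training rules (potentiate when the presynaptic spike precedes the postsynaptic one, depress otherwise); the amplitudes and time constants are illustrative, and the paper's supervised variant and hardware-friendly simplifications are not reproduced here.

    import numpy as np

    A_PLUS, A_MINUS = 0.01, 0.012      # potentiation/depression amplitudes (illustrative)
    TAU_PLUS, TAU_MINUS = 20.0, 20.0   # STDP time constants in ms
    W_MAX = 1.0

    def stdp_dw(t_pre, t_post):
        # pair-based STDP: pre-before-post potentiates, post-before-pre depresses
        dt = t_post - t_pre
        if dt > 0:
            return A_PLUS * np.exp(-dt / TAU_PLUS)
        return -A_MINUS * np.exp(dt / TAU_MINUS)

    def update_weight(w, t_pre, t_post):
        return float(np.clip(w + stdp_dw(t_pre, t_post), 0.0, W_MAX))

    w = 0.5
    w = update_weight(w, t_pre=10.0, t_post=13.0)   # causal pair -> weight increases
    w = update_weight(w, t_pre=25.0, t_post=22.0)   # anti-causal pair -> weight decreases
    print(w)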

Journal ArticleDOI
TL;DR: MV-Net significantly improves the performance flexibility of current CNN models and makes it possible to provide always-guaranteed QoS in a multitask environment; it satisfies the quality-of-results (QoR) requirement, significantly outperforms the baseline implementation, and improves the system energy efficiency at the same time.
Abstract: Recently the development of deep learning has been propelling the sheer growth of vision and speech applications on lightweight embedded and mobile systems. However, the limitation of computation resource and power delivery capability in embedded platforms is recognized as a significant bottleneck that prevents the systems from providing real-time deep learning ability, since the inference of deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs) involves large quantities of weights and operations. Particularly, how to provide quality-of-services (QoS)-guaranteed neural network inference ability in the multitask execution environment of multicore SoCs is even more complicated due to the existence of resource contention. In this article, we present a novel deep neural network architecture, MV-Net, which provides performance elasticity and contention-aware self-scheduling ability for QoS enhancement in mobile computing systems. When the constraints of QoS, output accuracy, and resource contention status of the system change, MV-Net can dynamically reconfigure the corresponding neural network propagation paths and thus achieves an effective tradeoff between neural network computational complexity and prediction accuracy via approximate computing. The experimental results show that (1) MV-Net significantly improves the performance flexibility of current CNN models and makes it possible to provide always-guaranteed QoS in a multitask environment, and (2) it satisfies the quality-of-results (QoR) requirement, outperforming the baseline implementation significantly, and improves the system energy efficiency at the same time.

Journal ArticleDOI
Abstract: Near zero-energy computing describes the concept of executing logic operations below the (k_B T ln 2) energy limit. Landauer discussed that it is impossible to break this limit as long as the computations are performed in the conventional, non-reversible way. But even if reversible computations were performed, the basic energy needed for operating circuits realized in conventional technologies is still far above the (k_B T ln 2) energy limit (i.e., the circuits do not operate in a physically reversible manner). In contrast, novel nanotechnologies like Quantum-dot Cellular Automata (QCA) allow for computations with very low energy dissipation and hence are promising candidates for breaking this limit. Accordingly, the design of reversible QCA circuits is an active field of research. But whether QCA in general and the proposed circuits in particular are indeed able to operate in a logically and physically reversible fashion is unknown thus far, because neither physical realizations nor appropriate simulation approaches are available. In this work, we address this gap by utilizing an established theoretical model that has been implemented in a physics simulator enabling a precise consideration of how energy is dissipated in QCA designs. Our results provide strong evidence that QCA is indeed a suitable technology for near zero-energy computing. Further, the first design of a logically and physically reversible adder circuit is presented, which serves as proof of concept for future circuits with the ability of near zero-energy computing.
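
For reference, the Landauer bound quoted in the abstract evaluates at room temperature to (standard physics, not a result of the paper):

    E_{\min} = k_B T \ln 2 \approx (1.38 \times 10^{-23}\,\mathrm{J/K}) \times (300\,\mathrm{K}) \times 0.693
             \approx 2.87 \times 10^{-21}\,\mathrm{J} \approx 0.018\,\mathrm{eV}

per bit of information erased. Conventional CMOS gates dissipate many orders of magnitude more than this per switching event, which is why only logically and physically reversible operation in technologies such as QCA can approach the limit.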

Journal ArticleDOI
TL;DR: This work proposes an energy-efficient use of Magnetic Tunnel Junctions (MTJs), spintronic devices that exhibit probabilistic switching behavior, as Stochastic Number Generators (SNGs), which forms the basis of the NN implementation in the SC domain, and develops a heuristic approach for approximating multi-layer NNs.
Abstract: Hardware implementations of Artificial Neural Networks (ANNs) using conventional binary arithmetic units are computationally expensive, energy-intensive, and have large area overheads. Stochastic Computing (SC) is an emerging paradigm that replaces these conventional units with simple logic circuits and is particularly suitable for fault-tolerant applications. We propose an energy-efficient use of Magnetic Tunnel Junctions (MTJs), a spintronic device that exhibits probabilistic switching behavior, as Stochastic Number Generators (SNGs), which forms the basis of our NN implementation in the SC domain. Further, the error resilience of target applications of NNs allows approximating the synaptic weights in our MTJ-based NN implementation, in ways brought about by properties of the MTJ-SNG, to achieve energy-efficiency. An algorithm is designed that, given an error tolerance, can perform such approximations in a single-layer NN in an optimal way owing to the convexity of the problem formulation. We then use this algorithm and develop a heuristic approach for approximating multi-layer NNs. Classification problems were evaluated on the optimized NNs and results showed substantial savings in energy for little loss in accuracy.

Journal ArticleDOI
TL;DR: The generalization of the binary controlled-NOT to the controlled-modulo-addition gate, the concept of partial versus maximal entanglement, and architectures for generating higher-radix entangled states for the partial and maximal case are presented.
Abstract: Quantum information processing and communication techniques rely heavily upon entangled quantum states, and this dependence motivates the development of methods and systems to generate entanglement. Much research has been dedicated to state preparation for radix-2 qubits, and due to the pursuit of entangled states, the Bell state generator and its generalized forms where the number of entangled qubits is greater than two have been defined. In this work, we move beyond radix-2 and propose techniques for quantum state entanglement in high-dimensional systems through the generalization of the binary bipartite entanglement states. These higher-radix quantum informatic systems are composed of n quantum digits, or qudits, that are each mathematically characterized as elements of an r-dimensioned Hilbert vector space where r > 2. Consequently, the wave function is a time-dependent state vector of dimension r^n. The generalization of the binary controlled-NOT to the controlled-modulo-addition gate, the concept of partial versus maximal entanglement, and architectures for generating higher-radix entangled states for the partial and maximal case are all presented.
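
A small numeric sketch of the radix-r construction the abstract describes: applying the single-qudit Fourier gate (the radix-r generalization of the Hadamard) to the control qudit and then a controlled-modulo-addition gate to the pair turns |00> into a maximally entangled two-qudit state. This is the standard generalized-Bell construction written in NumPy for illustration, not the paper's architecture.

    import numpy as np

    r = 3                                            # qudit radix (r > 2)

    # single-qudit Fourier gate F: the radix-r generalization of the Hadamard
    omega = np.exp(2j * np.pi / r)
    F = np.array([[omega ** (j * k) for k in range(r)] for j in range(r)]) / np.sqrt(r)

    # controlled-modulo-addition on two qudits: |c, t> -> |c, (t + c) mod r>
    CADD = np.zeros((r * r, r * r))
    for c in range(r):
        for t in range(r):
            CADD[c * r + ((t + c) % r), c * r + t] = 1.0

    # prepare |0,0>, apply F to the control qudit, then the controlled addition
    state = np.zeros(r * r, dtype=complex)
    state[0] = 1.0
    state = CADD @ np.kron(F, np.eye(r)) @ state

    # result: (|00> + |11> + |22>) / sqrt(3), a maximally entangled radix-3 pair
    print(np.round(state, 3))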

Journal ArticleDOI
TL;DR: In this article, the authors proposed a resource-efficient, end-to-end iris recognition flow, which consists of FCN-based segmentation and a contour fitting module, followed by Daugman normalization and encoding.
Abstract: Applications of fully convolutional networks (FCN) in iris segmentation have shown promising advances. For mobile and embedded systems, a significant challenge is that the proposed FCN architectures are extremely computationally demanding. In this article, we propose a resource-efficient, end-to-end iris recognition flow, which consists of FCN-based segmentation and a contour fitting module, followed by Daugman normalization and encoding. To attain accurate and efficient FCN models, we propose a three-step SW/HW co-design methodology consisting of FCN architectural exploration, precision quantization, and hardware acceleration. In our exploration, we propose multiple FCN models, and in comparison to previous works, our best-performing model requires 50× fewer floating-point operations per inference while achieving a new state-of-the-art segmentation accuracy. Next, we select the most efficient set of models and further reduce their computational complexity through weights and activations quantization using an 8-bit dynamic fixed-point format. Each model is then incorporated into an end-to-end flow for true recognition performance evaluation. A few of our end-to-end pipelines outperform the previous state of the art on two datasets evaluated. Finally, we propose a novel dynamic fixed-point accelerator and fully demonstrate the SW/HW co-design realization of our flow on an embedded FPGA platform. In comparison with the embedded CPU, our hardware acceleration achieves up to 8.3× speedup for the overall pipeline while using less than 15% of the available FPGA resources. We also provide comparisons between the FPGA system and an embedded GPU showing different benefits and drawbacks for the two platforms.
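
A sketch of 8-bit dynamic fixed-point quantization as commonly practiced (one shared fractional length per tensor, chosen from the tensor's range); the paper's exact quantization procedure, rounding mode, and per-layer choices may differ, and the example weights are random.

    import numpy as np

    def quantize_dynamic_fixed_point(x, bits=8):
        # quantize a tensor to `bits`-bit dynamic fixed point: one shared fractional
        # length per tensor, chosen so the largest magnitude still fits
        max_abs = np.max(np.abs(x))
        int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))) + 1)   # sign + integer part
        frac_bits = bits - int_bits                                     # remaining bits -> fraction
        scale = 2.0 ** frac_bits
        q = np.clip(np.round(x * scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return q.astype(np.int8), frac_bits

    def dequantize(q, frac_bits):
        return q.astype(np.float32) / (2.0 ** frac_bits)

    w = np.random.randn(64, 3, 3, 3).astype(np.float32) * 0.2    # example conv weights
    q, fl = quantize_dynamic_fixed_point(w)
    err = np.max(np.abs(dequantize(q, fl) - w))
    print(f"fractional bits: {fl}, max abs quantization error: {err:.5f}")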

Journal ArticleDOI
TL;DR: Directed Placement is introduced, a physical design algorithm that leverages the natural “directedness” in most modern microfluidic designs: fluid enters at designated inputs, flows through a linear or tree-based network of channels and fluidic components, and exits the device at dedicated outputs.
Abstract: Continuous-flow microfluidic devices based on integrated channel networks are becoming increasingly prevalent in research in the biological sciences. At present, these devices are physically laid out by hand by domain experts who understand both the underlying technology and the biological functions that will execute on fabricated devices. The lack of a design science that is specific to microfluidic technology creates a substantial barrier to entry. To address this concern, this article introduces Directed Placement, a physical design algorithm that leverages the natural “directedness” in most modern microfluidic designs: fluid enters at designated inputs, flows through a linear or tree-based network of channels and fluidic components, and exits the device at dedicated outputs. Directed placement creates physical layouts that share many principal similarities to those created by domain experts. Directed placement allows components to be placed closer to their neighbors compared to existing layout algorithms based on planar graph embedding or simulated annealing, leading to an average reduction in laid-out fluid channel length of 91% while improving area utilization by 8% on average. Directed placement is compatible with both passive and active microfluidic devices and is compatible with a variety of mainstream manufacturing technologies.

Journal ArticleDOI
TL;DR: A new CNN training and implementation approach implements weights using a trained biased number representation, which can achieve near full-precision model accuracy with as little as 2-bit weights and 2-bit activations on the CIFAR datasets.
Abstract: Recent works have demonstrated the promise of using resistive random access memory (ReRAM) to perform neural network computations in memory. In particular, ReRAM-based crossbar structures can perform matrix-vector multiplication directly in the analog domain, but the resolutions of ReRAM cells and digital/analog converters limit the precisions of inputs and weights that can be directly supported. Although convolutional neural networks (CNNs) can be trained with low-precision weights and activations, previous quantization approaches are either not amenable to ReRAM-based crossbar implementations or have poor accuracies when applied to deep CNNs on complex datasets. In this article, we propose a new CNN training and implementation approach that implements weights using a trained biased number representation, which can achieve near full-precision model accuracy with as little as 2-bit weights and 2-bit activations on the CIFAR datasets. The proposed approach is compatible with a ReRAM-based crossbar implementation. We also propose an activation-side coalescing technique that combines the steps of batch normalization, non-linear activation, and quantization into a single stage that simply performs a clipped-rounding operation. Experiments demonstrate that our approach outperforms previous low-precision number representations for VGG-11, VGG-13, and VGG-19 models on both the CIFAR-10 and CIFAR-100 datasets.
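
The activation-side coalescing idea can be pictured as folding batch normalization, the non-linearity, and quantization into one clipped-rounding step applied to the crossbar's raw output. The sketch below is a behavioral illustration under simplifying assumptions (scalar per-channel BN statistics, uniform activation levels); the exact transform used in the paper may differ.

    import numpy as np

    def coalesce_bn_relu_quant(gamma, beta, mean, var, n_bits=2, act_max=1.0, eps=1e-5):
        # precompute an affine transform so that BN + ReLU + quantization collapse
        # into a single clipped-rounding operation at inference time
        a = gamma / np.sqrt(var + eps)            # BN scale folded into one multiplier
        b = beta - a * mean                       # BN shift folded into one offset
        levels = 2 ** n_bits - 1
        step = act_max / levels                   # uniform activation step size
        return a / step, b / step, levels

    def activation_stage(crossbar_out, scale, offset, levels):
        # the single clipped-rounding step replacing BN, ReLU, and quantization:
        # the lower clip at 0 plays the role of ReLU, the upper clip saturates
        return np.clip(np.round(crossbar_out * scale + offset), 0, levels)

    # toy per-channel BN statistics and a raw matrix-vector product from the crossbar
    gamma, beta, mean, var = 1.2, 0.1, 0.05, 0.4
    scale, offset, levels = coalesce_bn_relu_quant(gamma, beta, mean, var, n_bits=2)
    print(activation_stage(np.array([-0.3, 0.0, 0.2, 0.9]), scale, offset, levels))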

Journal ArticleDOI
TL;DR: A novel approximate floating point multiplier, called CMUL, is proposed, which significantly reduces energy and improves performance of multiplication while allowing for a controllable amount of error.
Abstract: Many applications, such as machine learning and data sensing, are statistical in nature and can tolerate some level of inaccuracy in their computation. A variety of designs have been put forward exploiting the statistical nature of machine learning through approximate computing, with approximate multipliers being the main focus due to their high usage in machine-learning designs. In this article, we propose a novel approximate floating point multiplier, called CMUL, which significantly reduces energy and improves performance of multiplication while allowing for a controllable amount of error. Our design approximately models multiplication by replacing the most costly step of the operation with a lower energy alternative. To tune the level of approximation, CMUL dynamically identifies the inputs that produce the largest approximation error and processes them in precise mode. To use CMUL for deep neural network (DNN) acceleration, we propose a framework that modifies the trained DNN model to make it suitable for approximate hardware. Our framework adjusts the DNN weights to a set of “potential weights” that are suitable for approximate hardware. Then, it compensates for the possible quality loss by iteratively retraining the network. Our evaluation with four DNN applications shows that CMUL can achieve 60.3% energy efficiency improvement and 3.2× energy-delay product (EDP) improvement as compared to the baseline GPU, while ensuring less than 0.2% quality loss. These results are 38.7% and 2.0× higher than the energy efficiency and EDP improvements of CMUL without the proposed framework.
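
The abstract does not spell out which step CMUL approximates. A common choice in approximate floating-point multipliers, shown here purely as an illustrative stand-in, is Mitchell-style logarithmic multiplication, which skips the mantissa product by using (1+m_a)(1+m_b) ≈ 1+m_a+m_b, combined with a simple input check that falls back to precise multiplication for the most error-prone operands (mirroring CMUL's precise mode). The threshold and overall structure below are invented for illustration, not CMUL itself.

    import math

    def decompose(x):
        # sign, unbiased exponent, and mantissa fraction m in [0, 1) of a float,
        # so that abs(x) = (1 + m) * 2**e
        sign = math.copysign(1.0, x)
        m, e = math.frexp(abs(x))          # abs(x) = m * 2**e with m in [0.5, 1)
        return sign, e - 1, 2.0 * m - 1.0

    def approx_fp_mul(a, b, frac_threshold=0.75):
        # approximate multiply: add exponents, approximate (1+ma)(1+mb) by 1+ma+mb;
        # operand pairs whose mantissa fractions are both large (worst-case error ma*mb)
        # are detected and processed in precise mode instead
        if a == 0.0 or b == 0.0:
            return 0.0
        sa, ea, ma = decompose(a)
        sb, eb, mb = decompose(b)
        if ma > frac_threshold and mb > frac_threshold:
            return a * b                           # precise mode for error-prone inputs
        frac = ma + mb                             # drops the ma*mb cross term
        exp = ea + eb
        if frac >= 1.0:                            # carry into the exponent, as in
            frac, exp = frac - 1.0, exp + 1        # Mitchell's logarithmic multiplication
        return sa * sb * (1.0 + frac) * 2.0 ** exp

    print(approx_fp_mul(1.25, 3.0), 1.25 * 3.0)    # approximate vs exact product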