
Showing papers in "ACM Journal on Emerging Technologies in Computing Systems in 2018"


Journal ArticleDOI
TL;DR: An approximate multiplier that exploits the inherent application resilience to error and utilizes the notion of computation sharing to achieve improved energy consumption for neural networks and a Multiplier-less Artificial Neuron (MAN), which is even more compact and energy efficient.
Abstract: Neural networks, with their remarkable ability to derive meaning from a large volume of complicated or imprecise data, can be used to extract patterns and detect trends that are too complex for the von Neumann computing paradigm. Their considerable computational requirements stretch the capabilities of even modern computing platforms. We propose an approximate multiplier that exploits the inherent application resilience to error and utilizes the notion of computation sharing to achieve improved energy consumption for neural networks. We also propose a Multiplier-less Artificial Neuron (MAN), which is even more compact and energy efficient. We also propose a network retraining methodology to recover some of the accuracy loss due to the use of these approximate multipliers. We evaluated the proposed algorithm/design on several recognition applications. The results show that we achieve ∼33%, ∼32%, and ∼25% reduction in power consumption and ∼33%, ∼34%, and ∼27% reduction in area, respectively, for 12-, 8-, and 4-bit MAN, with a maximum ∼2.4% loss in accuracy compared to a conventional neuron implementation of equivalent bit precision. These comparisons were performed under iso-speed conditions.
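
The abstract does not spell out the multiplier's internals, so the sketch below is only a rough software illustration of the computation-sharing idea it refers to: precompute a small set of shared "alphabet" multiples of the multiplicand once, assemble each product from shifted alphabets, and drop the lowest-order groups to trade accuracy for energy. The function name, the 4-bit grouping, and the skip_low_nibbles knob are illustrative assumptions, not the paper's design.

# Hypothetical sketch of a computation-sharing multiplier, not the authors'
# circuit: precompute "alphabet" multiples of x once, then build each product
# from shifted alphabets; dropping low-order nibbles is the approximation.
ALPHABET_ODDS = (1, 3, 5, 7, 9, 11, 13, 15)

def cshm_multiply(x, y, bits=12, skip_low_nibbles=1):
    """Approximate x*y by assembling 4-bit groups of y from shared alphabets."""
    alphabets = {odd: odd * x for odd in ALPHABET_ODDS}   # shared precomputation
    product = 0
    for i in range(bits // 4):
        nibble = (y >> (4 * i)) & 0xF
        if nibble == 0 or i < skip_low_nibbles:
            continue                        # approximation: ignore low groups
        shift = 0
        while nibble % 2 == 0:              # write the nibble as odd * 2^shift
            nibble //= 2
            shift += 1
        product += alphabets[nibble] << (shift + 4 * i)
    return product

# Relative error stays small because only the low-order contribution is dropped.
exact = 1234 * 2875
approx = cshm_multiply(1234, 2875)
print(exact, approx, abs(exact - approx) / exact)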

68 citations


Journal ArticleDOI
TL;DR: In this article, the authors evaluate deep learning models using three different computing architectures: quantum computing to train complex topologies, high performance computing to automatically determine network topology, and neuromorphic computing for a low-power hardware implementation.
Abstract: Current deep learning approaches have been very successful using convolutional neural networks trained on large graphical-processing-unit-based computers. Three limitations of this approach are that (1) they are based on a simple layered network topology, i.e., highly connected layers, without intra-layer connections; (2) the networks are manually configured to achieve optimal results, and (3) the implementation of the network model is expensive in both cost and power. In this article, we evaluate deep learning models using three different computing architectures to address these problems: quantum computing to train complex topologies, high performance computing to automatically determine network topology, and neuromorphic computing for a low-power hardware implementation. We use the MNIST dataset for our experiment, due to input size limitations of current quantum computers. Our results show the feasibility of using the three architectures in tandem to address the above deep learning limitations. We show that a quantum computer can find high quality values of intra-layer connection weights in a tractable time as the complexity of the network increases, a high performance computer can find optimal layer-based topologies, and a neuromorphic computer can represent the complex topology and weights derived from the other architectures in low power memristive hardware.

50 citations


Journal ArticleDOI
TL;DR: This article proposes an optimized fully mapped FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline stages and is on a par with a Titan X GPU in terms of throughput and energy efficiency.
Abstract: FPGA-based hardware accelerators for convolutional neural networks (CNNs) have received attention due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this article, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized fully mapped FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline stages. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size, while the performance of GPU acceleration varies largely depending on the batch size of the data. Experimental results show that the proposed accelerator architecture for binary CNNs running on a Virtex-7 FPGA is 8.3× faster and 75× more energy-efficient than a Titan X GPU for processing online individual requests in small batch sizes. For processing static data in large batch sizes, the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5× higher energy efficiency.
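
Bitwise convolution for a binary CNN reduces each multiply-accumulate to XNOR plus popcount; the short software model below (a sketch, not the accelerator's FPGA datapath) shows the arithmetic identity the pipeline exploits.

import numpy as np

# Software model of bitwise convolution for a binary CNN (weights/activations
# in {-1,+1}): one output value from packed 0/1 bits (0 encodes -1, 1 encodes +1)
# is 2*popcount(XNOR) - N. Illustrative only, not the accelerator's datapath.

def binary_conv2d_patch(patch_bits, weight_bits):
    n = patch_bits.size
    xnor = ~(patch_bits ^ weight_bits) & 1          # elementwise XNOR
    return 2 * int(xnor.sum()) - n                  # back to +/-1 arithmetic

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=3 * 3 * 64)             # 3x3x64 input patch as 0/1 bits
w = rng.integers(0, 2, size=3 * 3 * 64)
ref = int(np.dot(2 * a - 1, 2 * w - 1))              # reference in the +/-1 domain
assert binary_conv2d_patch(a, w) == ref
print(ref)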

46 citations


Journal ArticleDOI
TL;DR: This work proposes Spike Timing Dependent Plasticity-based unsupervised feature learning using convolution-over-time in Spiking Neural Network (SNN), and uses shared weight kernels that are convolved with the input patterns over time to encode representative input features, thereby improving the sparsity as well as the robustness of the learning model.
Abstract: Brain-inspired learning models attempt to mimic the computations performed in the neurons and synapses constituting the human brain to achieve its efficiency in cognitive tasks. In this work, we propose Spike Timing Dependent Plasticity-based unsupervised feature learning using convolution-over-time in Spiking Neural Network (SNN). We use shared weight kernels that are convolved with the input patterns over time to encode representative input features, thereby improving the sparsity as well as the robustness of the learning model. We show that the Convolutional SNN self-learns several visual categories for object recognition with a limited number of training patterns while yielding comparable classification accuracy relative to the fully connected SNN. Further, we quantify the energy benefits of the Convolutional SNN over the fully connected SNN on neuromorphic hardware implementation.
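
As background for the learning rule named in the abstract, a minimal pair-based STDP update is sketched below; the amplitudes and time constants are assumed placeholders, and the paper's exact window and weight bounds may differ.

import math

# Minimal pair-based STDP weight update (illustrative; the paper's exact STDP
# window, bounds, and parameters are not specified in the abstract).
A_PLUS, A_MINUS = 0.01, 0.012      # assumed potentiation/depression amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0   # assumed time constants (ms)

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt >= 0:   # pre before post -> potentiate
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    else:         # post before pre -> depress
        return -A_MINUS * math.exp(dt / TAU_MINUS)

w = 0.5
for t_pre, t_post in [(10, 15), (40, 33), (60, 62)]:
    w = min(1.0, max(0.0, w + stdp_dw(t_pre, t_post)))   # clip to [0, 1]
print(round(w, 4))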

36 citations


Journal ArticleDOI
Kangjun Bai1, Yang Yi1
TL;DR: This work designs and fabricates an energy-efficient analog delayed feedback reservoir (DFR) computing system, which is built upon a temporal encoding scheme, a nonlinear transfer function, and a dynamic delayed feedback loop, and represents the first analog integrated circuit (IC) implementation of the DFR computing system.
Abstract: Neuromorphic computing, which is built on a brain-inspired silicon chip, is uniquely applied to keep pace with the explosive escalation of algorithms and data density in machine learning. Reservoir computing, an emerging computing paradigm based on the recurrent neural network with proven benefits across multifaceted applications, offers an alternative training mechanism only at the readout stage. In this work, we successfully design and fabricate an energy-efficient analog delayed feedback reservoir (DFR) computing system, which is built upon a temporal encoding scheme, a nonlinear transfer function, and a dynamic delayed feedback loop. Measurement results demonstrate its high energy efficiency with rich dynamic behaviors, making the designed system a candidate for low power embedded applications. The system performance, as well as the robustness, is studied and analyzed through Monte Carlo simulation. The chaotic time series prediction benchmark, NARMA10, is examined through the proposed DFR computing system, and exhibits a 36%–85% reduction in the error rate compared to state-of-the-art DFR computing system designs. To the best of our knowledge, our work represents the first analog integrated circuit (IC) implementation of the DFR computing system.
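
A DFR of the kind described can be modeled in software as one nonlinear node feeding a delay line of virtual nodes, with only a linear readout trained; the sketch below uses an assumed tanh nonlinearity, a random mask, arbitrary parameters, and a toy recall task purely for illustration.

import numpy as np

# Software sketch of a delayed feedback reservoir (DFR): one nonlinear node,
# a delay line of N virtual nodes, and a random input mask. Parameters, mask,
# and the tanh nonlinearity are assumptions for illustration only.
N_VIRTUAL, ETA, GAMMA = 50, 0.5, 0.8
rng = np.random.default_rng(1)
mask = rng.choice([-1.0, 1.0], size=N_VIRTUAL)

def dfr_states(u):
    """Return the reservoir state matrix (len(u) x N_VIRTUAL) for input u."""
    x = np.zeros(N_VIRTUAL)                  # the delay line (virtual nodes)
    states = []
    for u_t in u:
        for i in range(N_VIRTUAL):
            feedback = x[i]                  # value from one delay period ago
            x[i] = np.tanh(ETA * mask[i] * u_t + GAMMA * feedback)
        states.append(x.copy())
    return np.array(states)

u = rng.uniform(0.0, 0.5, size=200)
X = dfr_states(u)
# Only the linear readout is trained, here by ridge regression.
target = np.roll(u, 1)                       # toy task: recall the previous input
w_out = np.linalg.solve(X.T @ X + 1e-3 * np.eye(N_VIRTUAL), X.T @ target)
print(float(np.mean((X @ w_out - target) ** 2)))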

31 citations


Journal ArticleDOI
TL;DR: In this article, a T-count optimized quantum square root circuit with only 2n + 1 qubits and no garbage output was presented, which achieves an average T-count savings of 43.44%, 98.95%, 41.06%, and 20.28% as well as qubit savings of 85.46%, 95.16%, 90.59%, and 86.77% compared to existing works.
Abstract: Quantum circuits for basic mathematical functions such as the square root are required to implement scientific computing algorithms on quantum computers. Quantum circuits that are based on Clifford+T gates can easily be made fault tolerant, but the T gate is very costly to implement. As a result, reducing T-count has become an important optimization goal. Further, quantum circuits with many qubits are difficult to realize, making designs that save qubits and produce no garbage outputs desirable. In this work, we present a T-count optimized quantum square root circuit with only 2n + 1 qubits and no garbage output. To make a fair comparison against existing work, Bennett's garbage removal scheme is used to remove garbage output from existing works. We determined that our proposed design achieves an average T-count savings of 43.44%, 98.95%, 41.06%, and 20.28% as well as qubit savings of 85.46%, 95.16%, 90.59%, and 86.77% compared to existing works.
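
The circuit itself cannot be reconstructed from the abstract, but quantum square-root designs in this line of work are typically built on a classical digit-by-digit (restoring or non-restoring) square-root recurrence; the plain, non-reversible version below shows the arithmetic such a circuit implements, as background rather than the paper's construction.

# Classical digit-by-digit integer square root (restoring variant shown; the
# non-restoring variant defers the restore step). Background sketch only, not
# the paper's reversible Clifford+T circuit.

def digit_by_digit_isqrt(n, bits=16):
    """Return floor(sqrt(n)) for an unsigned 'bits'-bit integer n."""
    root, remainder = 0, 0
    for i in range(bits // 2 - 1, -1, -1):
        # Bring down the next two bits of n.
        remainder = (remainder << 2) | ((n >> (2 * i)) & 0b11)
        trial = (root << 2) | 1              # candidate subtrahend 4*root + 1
        root <<= 1
        if remainder >= trial:               # accept this root bit
            remainder -= trial
            root |= 1
    return root

for n in (0, 1, 2, 144, 1000, 65535):
    assert digit_by_digit_isqrt(n) == int(n ** 0.5)
print(digit_by_digit_isqrt(1000))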

29 citations


Journal ArticleDOI
TL;DR: In this article, instead of optimizing hardware parameters to a given neural network, the authors propose a methodology of reconstructing the neural network itself to be optimized to resistive memory crossbar arrays.
Abstract: Artificial Neural Network computation relies on intensive vector-matrix multiplications. Recently, the emerging nonvolatile memory (NVM) crossbar array showed a feasibility of implementing such operations with high energy efficiency. Thus, there have been many works on efficiently utilizing emerging NVM crossbar arrays as analog vector-matrix multipliers. However, nonlinear I-V characteristics of NVM restrain critical design parameters, such as the read voltage and weight range, resulting in substantial accuracy loss. In this article, instead of optimizing hardware parameters to a given neural network, we propose a methodology of reconstructing the neural network itself to be optimized to resistive memory crossbar arrays. To verify the validity of the proposed method, we simulated various neural networks with the MNIST and CIFAR-10 datasets using two different Resistive Random Access Memory models. Simulation results show that our proposed neural network produces inference accuracies significantly higher than a conventional neural network when the network is mapped to synapse devices with nonlinear I-V characteristics.
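
The crossbar vector-matrix multiply underlying this work can be modeled by mapping each signed weight to a differential conductance pair inside the device range; the idealized sketch below uses placeholder ranges and a linear device, omitting the nonlinear I-V behavior that motivates the paper's network reconstruction.

import numpy as np

# Idealized model of an NVM crossbar as an analog vector-matrix multiplier:
# each signed weight maps to a differential conductance pair (G+, G-) within
# the device range, and column currents are I = G^T V. The ranges and read
# voltage are assumed placeholders; nonlinear I-V effects are omitted.
G_MIN, G_MAX = 1e-6, 1e-4          # siemens, assumed device range
V_READ = 0.2                       # volts, assumed read voltage

def weights_to_conductance(W):
    scale = (G_MAX - G_MIN) / np.abs(W).max()
    g_pos = G_MIN + scale * np.clip(W, 0, None)
    g_neg = G_MIN + scale * np.clip(-W, 0, None)
    return g_pos, g_neg, scale

def crossbar_matvec(W, x):
    g_pos, g_neg, scale = weights_to_conductance(W)
    v = V_READ * x                              # inputs encoded as read voltages
    i_out = v @ g_pos - v @ g_neg               # differential column currents
    return i_out / (scale * V_READ)             # decode back to weight units

rng = np.random.default_rng(0)
W, x = rng.normal(size=(4, 3)), rng.normal(size=4)
print(np.allclose(crossbar_matvec(W, x), x @ W))   # matches the ideal product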

22 citations


Journal ArticleDOI
TL;DR: A multi-level optimization framework for energy-efficient CeNN implementations on FPGAs is proposed, featuring three levels of optimization: system-, module-, and design-space-level, with a focus on computational redundancy and attainable performance.
Abstract: Cellular Neural Network (CeNN) is considered as a powerful paradigm for embedded devices. Its analog and mixed-signal hardware implementations have proved applicable to high-speed image processing, video analysis, and medical signal processing, although their efficiency and popularity are limited by small implementation size and low precision. Recently, digital implementations of CeNNs on FPGA have attracted researchers from both academia and industry due to its high flexibility and short time-to-market. However, most existing implementations are not well optimized to fully utilize the advantages of the FPGA platform, leaving unnecessary design and computational redundancy that prevents speedup. We propose a multi-level optimization framework for energy-efficient CeNN implementations on FPGAs. In particular, the optimization framework features three levels of optimization: system-, module-, and design-space-level, with a focus on computational redundancy and attainable performance. Experimental results show that with various configurations our framework can achieve an energy-efficiency improvement of 3.54× and up to 3.88× speedup compared with existing implementations with similar accuracy.

20 citations


Journal ArticleDOI
TL;DR: This article proposes a two-phase technique that uses the order of path delays in path pairs to detect HTs; the efficiency and accuracy of the technique are confirmed by a series of experiments.
Abstract: Many fabrication-less design houses are outsourcing their designs to third-party foundries for fabrication to lower cost. This IC development process, however, raises serious security concerns on Hardware Trojans (HTs). Many design-for-trust techniques have been proposed to detect HTs through observing erroneous output or abnormal side-channel characteristics. Side-channel characteristics such as path delay have been widely used for HT detection and functionality verification, as the changes of the characteristics of the host circuit incurred by the inserted HT can be identified through proper methods. In this article, for the first time, we propose a two-phase technique, which uses the order of the path delay in path pairs to detect HTs. In the design phase, a full-cover path set that covers all the nets of the design is generated; meanwhile, in the set, the relative order of paths in path pairs is determined according to their delay. The order of the paths in path pairs serves as the fingerprint of the design. In the test phase, the actual delay of the paths in the full-cover set is extracted from the fabricated circuits, and the order of paths in path pairs is compared with the fingerprint generated in the design phase. A mismatch between them indicates the existence of HTs. Both process variations and measurement noise are taken into consideration. The efficiency and accuracy of the proposed technique are confirmed by a series of experiments, including the examination of both violated path pairs incurred by HTs and their false alarm rate.
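
A toy software model of the two-phase idea is sketched below: record at design time which path of each pair is slower, then flag a fabricated chip whose measured delays reverse a recorded order. The single guard band stands in for the paper's statistical treatment of process variation and measurement noise.

# Toy model of the two-phase path-pair delay-order fingerprint. Design-phase
# delays fix the expected order of each pair; a fabricated chip whose measured
# delays confidently reverse any recorded order is flagged.
GUARD = 0.05   # assumed margin (ns) so small variations do not trigger alarms

def fingerprint(design_delays, pairs):
    """Phase 1: record which path of each pair is faster at design time."""
    return {(a, b): design_delays[a] < design_delays[b] for a, b in pairs}

def detect_trojan(measured_delays, fp):
    """Phase 2: a confidently reversed order suggests an inserted HT."""
    suspects = []
    for (a, b), a_faster in fp.items():
        diff = measured_delays[b] - measured_delays[a]
        if (a_faster and diff < -GUARD) or (not a_faster and diff > GUARD):
            suspects.append((a, b))
    return suspects

design = {"p1": 1.20, "p2": 1.45, "p3": 0.90}
fp = fingerprint(design, [("p1", "p2"), ("p3", "p1")])
golden   = {"p1": 1.22, "p2": 1.47, "p3": 0.93}        # normal variation
trojaned = {"p1": 1.60, "p2": 1.47, "p3": 0.93}        # extra load on p1
print(detect_trojan(golden, fp), detect_trojan(trojaned, fp))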

20 citations


Journal ArticleDOI
TL;DR: A compressed CeNN framework for efficient FPGA implementations that involves various techniques, such as incremental quantization and early exit, which significantly reduces computation demands while maintaining an acceptable performance.
Abstract: Cellular neural networks (CeNNs) have been widely adopted in image processing tasks. Recently, various hardware implementations of CeNNs have emerged in the literature, with Field Programmable Gate Array (FPGA) being one of the most popular choices due to its high flexibility and low time-to-market. However, CeNNs typically involve extensive computations in a recursive manner. As an example, simply processing an image of 1,920 × 1,080 pixels requires 4--8 Giga floating point multiplications (for 3 × 3 templates and 50–100 iterations), which needs to be done in a timely manner for real-time applications. To address this issue, in this article, we propose a compressed CeNN framework for efficient FPGA implementations. It involves various techniques, such as incremental quantization and early exit, which significantly reduce computation demands while maintaining acceptable performance. Particularly, incremental quantization quantizes the numbers in CeNN templates to powers of two, so that complex and expensive multiplications can be converted to simple and cheap shift operations, which only require a minimum number of registers and logical elements (LEs). While a similar concept has been explored in hardware implementations of Convolutional Neural Networks (CNNs), CeNNs have completely different computation patterns, which require different quantization and implementation strategies. Experimental results on FPGAs show that incremental quantization and early exit can achieve a speedup of up to 7.8× and 8.3×, respectively, compared with the state-of-the-art implementations, while incurring almost no performance loss on four widely adopted applications. We also discover that different from CNNs, the optimal quantization strategies of CeNNs depend heavily on the applications. We hope that our work can serve as a pioneer in the hardware optimization of CeNNs.
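
The incremental-quantization step rounds template coefficients to powers of two so each multiplication becomes a shift; the minimal sketch below shows only that rounding (not the incremental schedule or the early-exit logic), with an example 3 × 3 template chosen arbitrarily.

import numpy as np

# Minimal sketch of quantizing CeNN template coefficients to powers of two so
# each multiply becomes a shift; the paper's full scheme quantizes the template
# incrementally and adds early exit, which this sketch omits.

def to_power_of_two(W):
    """Round nonzero entries of W to the closest power of two in log scale."""
    sign = np.sign(W)
    mag = np.abs(W)
    exponents = np.where(mag > 0, np.round(np.log2(np.where(mag > 0, mag, 1))), 0)
    q = np.where(mag > 0, sign * np.exp2(exponents), 0.0)
    return q, exponents.astype(int)

template = np.array([[0.1, 0.15, 0.1],
                     [0.15, 2.0, 0.15],
                     [0.1, 0.15, 0.1]])      # arbitrary example template
q, exps = to_power_of_two(template)
print(q)                                     # every entry is now +/- 2^k (a shift)
print(exps)                                  # the shift amounts k
print(np.abs(q - template).max())            # worst-case quantization error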

20 citations


Journal ArticleDOI
TL;DR: Experimental results obtained with an optoelectronic circuit simulator show the advantages of the optical parallel adder circuit over a traditional CMOS-based parallelAdder circuit.
Abstract: Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. With light interference and resonance phenomena, the nanophotonic device works as a voltage-controlled optical pass-gate like a pass-transistor. This article first introduces the concept of optical pass-gate logic and then proposes a parallel adder circuit based on optical pass-gate logic. Experimental results obtained with an optoelectronic circuit simulator show the advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit.

Journal ArticleDOI
TL;DR: In the proposed technique, transformable interconnects enable an IC chip to maintain functioning in normal use and to transform its physical structure into another pattern when exposed to invasive attacks.
Abstract: Protection of intellectual property (IP) is increasingly critical for IP vendors in the semiconductor industry. However, advanced reverse engineering techniques can physically disassemble the chip and derive the IPs at a much lower cost than the value of IP design that chips carry. This invasive hardware attack—obtaining information from IC chips—always violates the IP rights of vendors. The intent of this article is to present a chip-level reverse engineering resilient design technique. In the proposed technique, transformable interconnects enable an IC chip to maintain functioning in normal use and to transform its physical structure into another pattern when exposed to invasive attacks. The newly created pattern will significantly increase the difficulty of reverse engineering. Furthermore, to improve the effectiveness of the proposed technique, a systematic design method is developed targeting integrated circuits with multiple design constraints. Simulations have been conducted to demonstrate the capability of the proposed technique, which generates extremely large complexity for reverse engineering with manageable overhead.

Journal ArticleDOI
TL;DR: This article proposes a hardware-friendly Spike-Timing Dependent Plasticity (STDP) mechanism for on-chip tuning, and a novel runtime correlation-based neuron gating scheme to minimize the power dissipated by reservoir neurons.
Abstract: The Liquid State Machine (LSM) is a promising model of recurrent spiking neural networks that provides an appealing brain-inspired computing paradigm for machine-learning applications such as pattern recognition. Moreover, processing information directly on spiking events makes the LSM well suited for cost- and energy-efficient hardware implementation. In this article, we systematically present three techniques for optimizing energy efficiency while maintaining good performance of the proposed LSM neural processors from both an algorithmic and hardware implementation point of view. First, to realize adaptive LSM neural processors and thus boost learning performance, we propose a hardware-friendly Spike-Timing Dependent Plasticity (STDP) mechanism for on-chip tuning. Then, the LSM processor incorporates a novel runtime correlation-based neuron gating scheme to minimize the power dissipated by reservoir neurons. Furthermore, an activity-dependent clock gating approach is presented to address the energy inefficiency due to the memory-intensive nature of the proposed neural processors. Using two different real-world tasks, speech and image recognition, as benchmarks, we demonstrate that the proposed architecture boosts the average learning performance by up to 2.0% while reducing energy dissipation by up to 29% compared to a baseline LSM with little extra hardware overhead on a Xilinx Virtex-6 FPGA.

Journal ArticleDOI
TL;DR: This work has developed a memristor crossbar-based approach, inspired by memristor crossbar neuromorphic circuits, for a low-power, low-area, and high-throughput DPI system that examines both the header and body of a packet.
Abstract: Deep packet inspection (DPI) is a critical component of intrusion detection and prevention. This requires a detailed analysis of each network packet header and body. Although this is often done on dedicated high-power servers in most networked systems, mobile systems could potentially be vulnerable to attack if utilized on an unprotected network. In this case, having DPI hardware on the mobile system would be highly beneficial. Unfortunately, DPI hardware is generally area and power consuming, making its implementation difficult in mobile systems. We developed a memristor crossbar-based approach, inspired by memristor crossbar neuromorphic circuits, for a low-power, low-area, and high-throughput DPI system that examines both the header and body of a packet. Two key types of circuits are presented: static pattern matching and regular expression circuits. This system is able to reduce execution time and power consumption due to its high-density grid and massive parallelism. Independent searches are performed using low-power memristor crossbar arrays, giving rise to a throughput of 160Gbps with no loss in classification accuracy.
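
Static pattern matching on a crossbar can be abstracted as a dot product between a bipolar-encoded input window and stored pattern columns, with a threshold at each column output; the sketch below models only this parallel-match behavior, not the memristor circuit or the regular-expression engine.

import numpy as np

# Abstract model of crossbar-style static pattern matching: each column stores
# one byte pattern as a +/-1 template over bit positions, the input window is
# applied as +/-1 values, and an exact match drives the column sum to its
# maximum. Models the parallelism only, not the memristor hardware.

def bits_pm1(data):
    b = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    return b.astype(np.int16) * 2 - 1            # 0/1 -> -1/+1

def build_columns(patterns):
    return np.stack([bits_pm1(p) for p in patterns], axis=1)   # bits x patterns

def match(window, columns):
    scores = bits_pm1(window) @ columns          # all patterns checked in parallel
    return scores == columns.shape[0]            # exact match hits the maximum

columns = build_columns([b"GET ", b"POST", b"EVIL"])
print(match(b"POST", columns))                   # [False  True False]
print(match(b"GETX", columns))                   # no exact match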

Journal ArticleDOI
TL;DR: Simulation results of an 8x8 mesh-based optical NoC show that the proposed initial device-setting and thermal-tuning mechanism confines the worst-case thermal-induced optical energy consumption to be on the order of tens of pJ/bit, by avoiding significant thermal- induced optical power loss caused by temperature-dependent wavelength shifts.
Abstract: Optical networks-on-chip (NoCs) based on silicon photonics have been proposed as emerging on-chip communication architectures for chip multiprocessors with large core counts. However, due to the thermal sensitivity of optical devices used in optical NoCs, on-chip temperature variations cause significant thermal-induced optical power loss, which would counteract the power advantages of optical NoCs. To tackle this problem, in this work, we propose a learning-based thermal-sensitive power optimization approach for mesh- or torus-based optical NoCs in the presence of temperature variations. The key techniques proposed include an initial device-setting and thermal-tuning mechanism that is a device-level optimization technique, and a learning-based thermal-sensitive adaptive routing algorithm that is a network-level optimization technique. Simulation results of an 8x8 mesh-based optical NoC show that the proposed initial device-setting and thermal-tuning mechanism confines the worst-case thermal-induced optical energy consumption to be on the order of tens of pJ/bit, by avoiding significant thermal-induced optical power loss caused by temperature-dependent wavelength shifts. Besides, it shows that the learning-based thermal-sensitive adaptive routing algorithm is able to find an optimal path with the minimum estimated thermal-induced optical power consumption for each communication pair. The proposed routing has a greater space for optimization, especially for applications with more long-distance traffic.

Journal ArticleDOI
TL;DR: A methodology for the design space exploration of optical NoC mapping solutions, which automatically assigns IPs/cores to the network tiles such that the laser power consumption is minimized, allowing improved energy efficiency.
Abstract: To face the complex communication problems that arise as the number of on-chip components grows up, photonic networks-on-chip (NoCs) have been recently proposed to replace electronic interconnects. However, photonic NoCs lack efficient laser sources, possibly resulting in an inefficient or inoperable architecture. In this article, we introduce a methodology for the design space exploration of optical NoC mapping solutions, which automatically assigns IPs/cores to the network tiles such that the laser power consumption is minimized. The experimental evaluation shows average reductions of 34.7% and 27.3% in the power consumption compared to, respectively, application-oblivious and randomly mapped photonic NoCs, allowing improved energy efficiency.

Journal ArticleDOI
TL;DR: A simulation flow is proposed to accurately simulate TTSV effects on 3D ICs using a detailed 3D thermal model, full-system simulation, and a validated thermal simulator, which shows accurate thermal analysis of 3D ICs.
Abstract: 3D stacking of integrated circuits (ICs) provides significant advantages in saving device footprints, improving power management, and continuing performance enhancement, particularly for many-core systems. However, the stacked structure makes the heat dissipation a challenging issue. While Thermal Through Silicon Via (TTSV) is a promising way of lowering the thermal resistance of dies, past research has either overestimated or underestimated the effects of TTSVs as a consequence of the lack of detailed 3D IC models or system-level simulations. Here, we propose a simulation flow to accurately simulate TTSV effects on 3D ICs. We adopt benchmarks from Splash-2 running on a full-system mode of the gem5 simulator, which generates all the system component activities. McPAT is used to generate the corresponding power consumption and the power traces are fed to HotSpot for thermal simulation. The temperature profiles of 2D and 3D Nehalem-like x86 processors are compared. TTSVs are later placed close to hotspot regions to facilitate heat dissipation; the peak temperature of 3D Nehalem is reduced by 5--25% with a small area overhead of 6%. By using a detailed 3D thermal model, full-system simulation, and a validated thermal simulator, our results show accurate thermal analysis of 3D ICs.

Journal ArticleDOI
TL;DR: This article proposes an off-line approach that concurrently optimizes the laser power scaling and execution time of a global application and highlights most promising solutions for mapping a defined application onto a 16-core ring-based WDM ONoC.
Abstract: Optical Network-on-Chip (ONoC) is a promising communication medium for large-scale multiprocessor systems-on-chips. Indeed, ONoC can outperform classical electrical NoCs in terms of energy efficiency and bandwidth density, in particular, because this medium can support multiple transactions at the same time on different wavelengths by using Wavelength Division Multiplexing (WDM). However, multiple signals sharing simultaneously the same part of a waveguide can lead to inter-channel crosstalk noise. This problem impacts the signal-to-noise ratio of the optical signals, which leads to an increase in the Bit Error Rate (BER) at the receiver side. If a specific BER is targeted, an increase of laser power should be necessary to satisfy the SNR. In this context, an important issue is to evaluate the laser power needed to satisfy the various desired communication bandwidths based on the BER performance requirements. In this article, we propose an off-line approach that concurrently optimizes the laser power scaling and execution time of a global application. A set of different levels of power is introduced for each laser, to ensure that optical signals can be emitted with just-enough power to ensure targeted BER. As a result, most promising solutions are highlighted for mapping a defined application onto a 16-core ring-based WDM ONoC.
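
The BER/laser-power relationship referred to here can be illustrated for on-off keying under Gaussian noise, where BER = 0.5 · erfc(Q/√2); the noise level, loss figure, and the simple Q = P_rx/σ link model below are placeholders rather than values or models from the paper.

from math import sqrt
from scipy.special import erfc, erfcinv

# Illustration of the BER / laser-power link assumed for on-off keying with
# Gaussian noise: BER = 0.5 * erfc(Q / sqrt(2)). Noise level, path loss, and
# the Q = P_rx / sigma receiver model are placeholders, not the paper's model.

def required_q(ber_target):
    return sqrt(2.0) * erfcinv(2.0 * ber_target)

def required_laser_power_mw(ber_target, path_loss_db, noise_mw=1e-4):
    received_mw = required_q(ber_target) * noise_mw     # P_rx = Q * sigma
    return received_mw * 10 ** (path_loss_db / 10.0)    # undo waveguide loss

for ber in (1e-9, 1e-12):
    q = required_q(ber)
    assert abs(0.5 * erfc(q / sqrt(2.0)) / ber - 1.0) < 1e-6   # round-trip check
    print(ber, round(q, 2), required_laser_power_mw(ber, path_loss_db=15.0))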

Journal ArticleDOI
TL;DR: This work presents, for the first time, a systematic design approach to control the accuracy of SCs and balance it against other design parameters, using the theory of Monte Carlo simulation.
Abstract: Stochastic circuits (SCs) offer considerable area- and power-consumption benefits in various applications at the expense of computational inaccuracies. Unlike conventional logic synthesis, managing accuracy is a central problem in SC design. It is usually tackled in ad hoc fashion by multiple trial-and-error simulations that vary relevant parameters like the stochastic number length n. We present, for the first time, a systematic design approach to controlling the accuracy of SCs and balancing it against other design parameters. We express the (in)accuracy of a circuit processing n-bit stochastic numbers by the numerical deviation of the computed value from the expected result, in conjunction with a confidence level. Using the theory of Monte Carlo simulation, we derive expressions for the stochastic number length required for a desired level of accuracy or vice versa. We discuss the integration of the theory into a design framework that is applicable to both combinational and sequential SCs. We show that for combinational SCs, accuracy is independent of the circuit’s size or complexity, a surprising result. We also show how the analysis can identify subtle errors in both combinational and sequential designs. Finally, we apply the proposed methods to a case study on filtering noisy EKG signals.
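
The flavor of the result can be reproduced with the standard Monte Carlo bound: estimating a probability from n stochastic bits has worst-case standard deviation at most 1/(2√n), so a target error ε at a given confidence needs roughly n ≈ (z/(2ε))² bits, z being the normal quantile. The sketch below uses this textbook bound; the paper's own expressions may differ in detail.

from math import ceil, sqrt
from statistics import NormalDist

# Standard Monte Carlo sizing for a unipolar stochastic number: estimating a
# probability p from n bits has worst-case std sqrt(p(1-p)/n) <= 1/(2*sqrt(n)),
# so for error eps at confidence c, n ~ (z / (2*eps))^2 with z the normal
# quantile. Illustrative of the general theory the article builds on.

def required_length(eps, confidence=0.95):
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return ceil((z / (2.0 * eps)) ** 2)

def achievable_eps(n, confidence=0.95):
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z / (2.0 * sqrt(n))

print(required_length(0.01, 0.95))        # about 9604 bits for +/-0.01 at 95%
print(round(achievable_eps(1024, 0.95), 4))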

Journal ArticleDOI
TL;DR: In this article, a stochastic logic time delay reservoir design in FPGA hardware is presented and compared to a deterministic design, and a novel re-seeding method is introduced to reduce the adverse effects of stochastic noise.
Abstract: This article presents and demonstrates a stochastic logic time delay reservoir design in FPGA hardware. The reservoir network approach is analyzed using a number of metrics, such as kernel quality, generalization rank, and performance on simple benchmarks and is also compared to a deterministic design. A novel re-seeding method is introduced to reduce the adverse effects of stochastic noise, which may also be implemented in other stochastic logic reservoir computing designs, such as echo state networks. Benchmark results indicate that the proposed design performs well on noise-tolerant classification problems, but more work needs to be done to improve the stochastic logic time delay reservoir's robustness for regression problems. In addition, we show that the stochastic design can significantly reduce area cost if the conversion between binary and stochastic representations is implemented efficiently.

Journal ArticleDOI
TL;DR: This article introduces MFNW, a Flip-N-Write algorithm explicitly tailored for MLC NVMs, and introduces and investigates two possible variations of the MFNW algorithm: cell Hamming distance (CHD) MFNW and energy Hamming distance (EHD) MFNW.
Abstract: The increased capacity of multi-level cells (MLC) and triple-level cells (TLC) in emerging non-volatile memory (NVM) technologies comes at the cost of higher cell write energies and lower cell endurance. In this article, we describe MFNW, a Flip-N-Write encoding that effectively reduces the write energy and improves the endurance of MLC NVMs. Two MFNW modes are analyzed: cell Hamming distance mode and energy Hamming distance mode. We derive an approximate model that accurately predicts the average number of cell writes that is proportional to the energy consumption, enabling word length optimization to maximize energy reduction subject to memory space overhead constraints. In comparison to state-of-the-art MLC NVM encodings, our simulation results indicate that MFNW achieves up to 7%--39% saving for 1.56%--50% NVM space overhead. Extra energy saving (up to 19%--47%) can be achieved for the same NVM space overhead using our proposed variations of MFNW, i.e., MFNW2 and MFNW3. For TLC NVMs, we propose TFNW that can achieve up to 53% energy saving in comparison to state-of-the-art TLC NVM encodings. Endurance simulations indicate that MFNW (TFNW) is capable of extending MLC (TLC) NVM life by up to 100% (87%).
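
MFNW builds on the basic Flip-N-Write idea, which is easy to state for SLC: if writing a new word would change more than half of the cells, store its complement plus a one-bit flag. The sketch below shows only that base case; the MLC cell/energy Hamming-distance variants analyzed in the article are not reproduced.

# Sketch of the base (SLC) Flip-N-Write idea that MFNW generalizes to MLC:
# if the new word would change more than half the cells, store its complement
# plus a one-bit flag instead.

def hamming(a, b, width):
    return bin((a ^ b) & ((1 << width) - 1)).count("1")

def fnw_encode(old_word, new_word, width=32):
    """Return (stored_word, flip_flag) minimizing changed cells."""
    flipped = ~new_word & ((1 << width) - 1)
    if hamming(old_word, new_word, width) > hamming(old_word, flipped, width):
        return flipped, 1
    return new_word, 0

def fnw_decode(stored_word, flip_flag, width=32):
    return (~stored_word & ((1 << width) - 1)) if flip_flag else stored_word

old, new = 0xFFFF0000, 0x0000FFF0
stored, flag = fnw_encode(old, new)
assert fnw_decode(stored, flag) == new
# 28 cells would have changed; storing the complement changes only 4.
print(hex(stored), flag, hamming(old, new, 32), hamming(old, stored, 32))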

Journal ArticleDOI
TL;DR: It is demonstrated that for a chosen set of benchmark graphs, the spike responses generated on a current generation neuromorphic processor can improve the stability of graph partitions and non-overlapping communities can be identified even with the loss of higher-order spiking behavior if the graphs are sufficiently dense.
Abstract: We study the applicability of spiking neural networks and neuromorphic hardware for solving general optimization problems without the use of adaptive training or learning algorithms. We leverage the dynamics of Hopfield networks and spin-glass systems to construct a fully connected spiking neural system to generate synchronous spike responses indicative of the underlying community structure in an undirected, unweighted graph. Mapping this fully connected system to current generation neuromorphic hardware is done by embedding sparse tree graphs to generate only the leading-order spiking dynamics. We demonstrate that for a chosen set of benchmark graphs, the spike responses generated on a current generation neuromorphic processor can improve the stability of graph partitions and non-overlapping communities can be identified even with the loss of higher-order spiking behavior if the graphs are sufficiently dense. For sparse graphs, the loss of higher-order spiking behavior improves the stability of certain graph partitions but does not retrieve the known community memberships.

Journal ArticleDOI
TL;DR: A novel arbitrated all-optical path-setup scheme for tiled CMPs adopting circuit-switched optical networks that aims at significantly reducing path- setup latency and overall energy consumption, and a logically clustered architecture in which an arbiter is allocated in each logical core-clusters and an ad hoc distributed reservation protocol coordinates arbiters to manage inter-cluster path reservations is proposed.
Abstract: Nanophotonics is a promising solution for on-chip interconnection due to its intrinsic low-latency and low-power features, which can be useful for performance and energy in future Chip Multi-Processors (CMPs). This article proposes a novel arbitrated all-optical path-setup scheme for tiled CMPs adopting circuit-switched optical networks. It aims at significantly reducing path-setup latency and overall energy consumption. The proposed arbitrated scheme is able to configure multiple photonic switches simultaneously, instead of sequentially as it is done in state-of-the-art proposals. The proposed fast optical path-setup solution reduces the overhead in each transmission and, most importantly, allows optical circuit-switched networks to effectively serve cache coherence traffic, which is mainly composed of relatively small messages. Specifically, we propose a single-arbiter scheme where the whole topology is managed by a central module (single-arbiter) that takes care of the path-setup procedures. Then, to tackle scalability, we propose a logically clustered architecture (multi-arbiter) in which an arbiter is allocated in each logical core-cluster and an ad hoc distributed reservation protocol coordinates arbiters to manage inter-cluster path reservations. We show that our proposed single-arbiter architecture outperforms a state-of-the-art optical network with sequential path-setup (optical baseline) in the case of 8- and 16-core tiled CMP setups. However, due to serialization issues, the single-arbiter solution is not able to compete with a reference electronic baseline for bigger 32- and 64-core setups even if still performing much better than the optical baseline. Conversely, our multi-arbiter hierarchical solution allows us to improve performance up to almost 20% and 40% for 32- and 64-core setups, respectively, demonstrating a wide applicability of the proposed technique. Energy-wise, the analyzed solutions enable significant savings compared to both the optical baseline with sequential path setup, and to the electronic counterpart. Specifically, results show more than 25% average improvement for the single-arbiter in the 8- and 16-core cases, and more than 40% and 15% savings for the multi-arbiter in the 32- and 64-core cases, respectively.

Journal ArticleDOI
TL;DR: A novel Markov random field sound source separation algorithm that uses expectation-maximization and Gibbs sampling to learn MRF parameters on the fly and infer the best separation of sources is developed, intended for deployment on a mobile phone.
Abstract: Machine learning (ML) has revolutionized a wide range of recognition tasks, ranging from text analysis to speech to vision, most notably in cloud deployments. However, mobile deployment of these ideas involves a very different category of design problems. In this article, we develop a hardware architecture for a sound source separation task, intended for deployment on a mobile phone. We focus on a novel Markov random field (MRF) sound source separation algorithm that uses expectation-maximization and Gibbs sampling to learn MRF parameters on the fly and infer the best separation of sources. The intrinsically iterative algorithm suggests challenges for both speed and power. A real-time streaming FPGA implementation runs at 150MHz with 207KB RAM, achieves a speed-up of 22× over a software reference, performs with an SDR of up to 7.021dB with 1.601ms latency, and exhibits excellent perceived audio quality. A 45nm CMOS ASIC virtual prototype simulated at 20MHz shows that this architecture is small (

Journal ArticleDOI
TL;DR: A memristor-CMOS analog co-processor architecture that can handle floating point computation and offers superior performance compared to other processors is developed.
Abstract: Vector matrix multiplication computation underlies major applications in machine vision, deep learning, and scientific simulation. These applications require high computational speed and are run on platforms that are size, weight, and power constrained. With transistor scaling coming to an end, existing digital hardware architectures will not be able to meet this increasing demand. Analog computation with its rich set of primitives and inherent parallel architecture can be faster, more efficient, and compact for some of these applications. One such primitive is a memristor-CMOS crossbar array-based vector matrix multiplication. In this article, we develop a memristor-CMOS analog coprocessor architecture that can handle floating-point computation. To demonstrate the working of the analog coprocessor at a system level, we use a new electronic design automation tool called PSpice Systems Option, which performs integrated cosimulation of MATLAB/Simulink and PSpice. It is shown that the analog coprocessor has superior performance when compared to other processors, and a speedup of up to 12× compared to projected GPU performance is observed. Using the new PSpice Systems Option tool, various application simulations for image processing and solutions to partial differential equations are performed on the analog coprocessor model.

Journal ArticleDOI
TL;DR: A novel mapping scheme for in-memory Kogge-Stone adder has been presented and the correctness of the proposed scheme is verified by means of TaOx device model-based memristive simulations.
Abstract: Low operating voltage, high storage density, non-volatile storage capabilities, and relatively low access latencies have popularized memristive devices as storage devices. Memristors can be ideally used for in-memory computing in the form of hybrid CMOS nano-crossbar arrays. In-memory serial adders have been theoretically and experimentally proven for crossbar arrays. To harness the parallelism of memristive arrays, parallel-prefix adders can be effective. In this work, a novel mapping scheme for an in-memory Kogge-Stone adder has been presented. The number of cycles increases logarithmically with the bit width N of the operands, i.e., O(log2 N), and the device count is 5N. We verify the correctness of the proposed scheme by means of TaOx device model-based memristive simulations. We compare the proposed scheme with other proposed schemes in terms of the number of cycles and the number of devices.
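
The Kogge-Stone structure targeted by the mapping computes carries with generate/propagate prefix merges in ⌈log2 N⌉ stages, which is where the O(log2 N) cycle count comes from; the plain software model below shows that logic only and does not model the crossbar mapping or its 5N devices.

# Plain software model of a Kogge-Stone parallel-prefix adder: carries come
# from log2(N) stages of (generate, propagate) merges. The in-memory crossbar
# mapping itself is not modeled here.

def kogge_stone_add(a, b, n=8):
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]     # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]     # propagate
    dist = 1
    while dist < n:                       # ceil(log2(n)) prefix stages
        g_new, p_new = g[:], p[:]
        for i in range(dist, n):
            g_new[i] = g[i] | (p[i] & g[i - dist])
            p_new[i] = p[i] & p[i - dist]
        g, p = g_new, p_new
        dist *= 2
    carries = [0] + g[:-1] + [g[-1]]      # carry into bit i, plus carry-out
    s = 0
    for i in range(n):
        s |= ((((a >> i) & 1) ^ ((b >> i) & 1)) ^ carries[i]) << i
    return s | (carries[n] << n)

for a, b in [(0, 0), (5, 9), (200, 100), (255, 255)]:
    assert kogge_stone_add(a, b) == a + b
print(kogge_stone_add(200, 100))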

Journal ArticleDOI
TL;DR: In this article, an on-device training circuitry for threshold-current memristors integrated in a crossbar structure is proposed, and alternate approaches of mapping the synaptic weights into fully trained and semi-trained crossbars are investigated.
Abstract: On-device intelligence is gaining significant attention recently as it offers local data processing and low power consumption. In this research, an on-device training circuitry for threshold-current memristors integrated in a crossbar structure is proposed. Furthermore, alternate approaches of mapping the synaptic weights into fully trained and semi-trained crossbars are investigated. In a semi-trained crossbar, a confined subset of memristors is tuned and the remaining memristors are not programmed. This translates to optimal resource utilization and power consumption, compared to a fully programmed crossbar. The semi-trained crossbar architecture is applicable to a broad class of neural networks. System level verification is performed with an extreme learning machine for binomial and multinomial classification. The total power for a single 4 × 4 layer network, when implemented in the IBM 65nm node, is estimated to be 42.16μW and the area is estimated to be 26.48μm × 22.35μm.
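
The extreme learning machine used for system-level verification trains only its readout, by a single least-squares solve over a fixed random hidden layer; the software sketch below shows that structure on a toy binomial task, with no memristor or crossbar model.

import numpy as np

# Software sketch of an extreme learning machine: the hidden weights are random
# and fixed (in the paper they live in the memristor crossbar, partly untrained
# in the semi-trained variant), and only the readout is solved by least squares.
rng = np.random.default_rng(0)

def elm_train(X, Y, n_hidden=64, reg=1e-3):
    W_in = rng.normal(size=(X.shape[1], n_hidden))     # fixed random layer
    H = np.tanh(X @ W_in)
    W_out = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W_in, W_out

def elm_predict(X, W_in, W_out):
    return np.tanh(X @ W_in) @ W_out

# Toy binomial classification: two Gaussian blobs.
X = np.vstack([rng.normal(-1, 1, (100, 4)), rng.normal(1, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100, dtype=float).reshape(-1, 1)
W_in, W_out = elm_train(X, y)
acc = np.mean((elm_predict(X, W_in, W_out) > 0.5) == (y > 0.5))
print(acc)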

Journal ArticleDOI
TL;DR: This study examines the influence of key parameters in DNN hardware implementations on their performance and energy efficiency, including DNN architectural choices, underlying workloads, and tier partitioning choices in M3D designs, which produce power savings and performance improvement beyond what can be achieved using conventional 2D ICs.
Abstract: In recent years, deep learning has become widespread for various real-world recognition tasks. In addition to recognition accuracy, energy efficiency and speed (i.e., performance) are other grand challenges to enable local intelligence in edge devices. In this article, we investigate the adoption of monolithic three-dimensional (3D) IC (M3D) technology for deep learning hardware design, using speech recognition as a test vehicle. M3D has recently proven to be one of the leading contenders to address the power, performance, and area (PPA) scaling challenges in advanced technology nodes. Our study encompasses the influence of key parameters in DNN hardware implementations towards their performance and energy efficiency, including DNN architectural choices, underlying workloads, and tier partitioning choices in M3D designs. Our post-layout M3D designs, together with hardware-efficient sparse algorithms, produce power savings and performance improvement beyond what can be achieved using conventional 2D ICs. Experimental results show that M3D offers 22.3% iso-performance power saving and 6.2% performance improvement, convincingly demonstrating its entitlement as a solution for DNN ASICs. We further present architectural and physical design guidelines for M3D DNNs to maximize the benefits.

Journal ArticleDOI
TL;DR: An algorithm that minimizes the number of checkpoints and determines their locations to cover every path in a given droplet-routing solution is proposed, which provides reliability-hardening mechanisms for a wide class of cyber-physical DMFBs.
Abstract: In the area of biomedical engineering, digital-microfluidic biochips (DMFBs) have received considerable attention because of their capability of providing an efficient and reliable platform for conducting point-of-care clinical diagnostics. System reliability, in turn, mandates error-recoverability while implementing biochemical assays on-chip for medical applications. Unfortunately, the technology of DMFBs is not yet fully equipped to handle error-recovery from various microfluidic operations involving droplet motion and reaction. Recently, a number of cyber-physical systems have been proposed to provide real-time checking and error-recovery in assays based on the feedback received from a few on-chip checkpoints. However, to synthesize robust feedback systems for different types of DMFBs, certain practical issues need to be considered such as co-optimization of checkpoint placement, error-recoverability, and layout of droplet-routing pathways. For application-specific DMFBs, we propose here an algorithm that minimizes the number of checkpoints and determines their locations to cover every path in a given droplet-routing solution. Next, for general-purpose DMFBs, where the checkpoints are pre-deployed in specific locations, we present a checkpoint-aware routing algorithm such that every droplet-routing path passes through at least one checkpoint to enable error-recovery and to ensure physical routability of all droplets. Furthermore, we also propose strategies for executing the algorithms in reliable mode to enhance error-recoverability. The proposed methods thus provide reliability-hardening mechanisms for a wide class of cyber-physical DMFBs.
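
Choosing a minimum set of checkpoints that touches every droplet route is a covering problem; the greedy set-cover heuristic below is an illustrative stand-in for the selection step only, since the paper's algorithm additionally handles checkpoint layout, routability, and error-recovery constraints.

# The checkpoint-minimization step can be viewed as a covering problem: choose
# electrode cells so that every droplet-routing path contains at least one
# chosen cell. Greedy set cover shown for illustration only.

def greedy_checkpoints(paths):
    """paths: list of sets of grid cells; returns a small covering cell set."""
    uncovered = set(range(len(paths)))
    chosen = set()
    while uncovered:
        # Pick the cell that lies on the most still-uncovered paths.
        best_cell, best_hits = None, set()
        candidates = set().union(*(paths[i] for i in uncovered))
        for cell in candidates:
            hits = {i for i in uncovered if cell in paths[i]}
            if len(hits) > len(best_hits):
                best_cell, best_hits = cell, hits
        chosen.add(best_cell)
        uncovered -= best_hits
    return chosen

routes = [{(0, 0), (0, 1), (1, 1)},
          {(1, 1), (2, 1), (2, 2)},
          {(3, 0), (3, 1), (2, 1)}]
print(greedy_checkpoints(routes))   # e.g., {(1, 1), (2, 1)} covers all routes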

Journal ArticleDOI
TL;DR: Experiments show that the proposed LDPC designs can improve the STT-RAM reliability by at least 10^2 (10^4) when compared to the existing error correction codes (ECCs) for the SLC (MLC) design, demonstrating the feasibility of LDPC solutions on STT-RAM.
Abstract: Spin-transfer torque random access memory (STT-RAM) is a promising emerging memory technology in the future memory hierarchy. However, its unique reliability challenges, i.e., the asymmetric bit failure mechanism at different bit flippings, have raised significant concerns in its real applications. Recent studies even show that the common memory error repair “remedies” cannot efficiently address them. In this article, we for the first time systematically study the potentials of the strong low-density parity-check (LDPC) code for combating such unique asymmetric errors in both single-level-cell (SLC) and multi-level-cell (MLC) STT-RAM designs. A generic STT-RAM channel model suitable for the SLC/MLC designs, is developed to analytically calibrate all the accumulated asymmetric factors of the write/read operations. The key initial information for LDPC decoding, namely asymmetric log-likelihood ratio (A-LLR), is redesigned and extracted from the proposed channel model, to unleash the LDPC’s asymmetric error correcting capability. LDPC codec is also carefully designed to lower the hardware cost by leveraging the systematic-structured parity check matrix. Then two customized short-length LDPC codes—(585,512) and (683,512)—augmented from the semi-random parity check matrix and the A-LLR based asymmetric decoding, are proposed for SLC and MLC STT-RAM designs, respectively. Experiments show that our proposed LDPC designs can improve the STT-RAM reliability by at least 102 (104) when compared to the existing error correction codes (ECCs) for the SLC (MLC) design, demonstrating the feasibility of LDPC solutions on STT-RAM.