
Showing papers in "ACM Journal on Emerging Technologies in Computing Systems in 2018"


Journal ArticleDOI
TL;DR: An approximate multiplier that exploits the inherent application resilience to error and utilizes the notion of computation sharing to achieve improved energy consumption for neural networks and a Multiplier-less Artificial Neuron (MAN), which is even more compact and energy efficient.
Abstract: Neural networks, with their remarkable ability to derive meaning from a large volume of complicated or imprecise data, can be used to extract patterns and detect trends that are too complex for the von Neumann computing paradigm. Their considerable computational requirements stretch the capabilities of even modern computing platforms. We propose an approximate multiplier that exploits the inherent application resilience to error and utilizes the notion of computation sharing to achieve improved energy consumption for neural networks. We also propose a Multiplier-less Artificial Neuron (MAN), which is even more compact and energy efficient. We also propose a network retraining methodology to recover some of the accuracy loss due to the use of these approximate multipliers. We evaluated the proposed algorithm/design on several recognition applications. The results show that we achieve ∼33%, ∼32%, and ∼25% reduction in power consumption and ∼33%, ∼34%, and ∼27% reduction in area, respectively, for 12-, 8-, and 4-bit MAN, with a maximum ∼2.4% loss in accuracy compared to a conventional neuron implementation of equivalent bit precision. These comparisons were performed under iso-speed conditions.
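
The abstract does not spell out the multiplier's internals, so the sketch below is only a rough software illustration of the computation-sharing idea it refers to: precompute a small set of shared "alphabet" multiples of the multiplicand once, assemble each product from shifted alphabets, and drop the lowest-order groups to trade accuracy for energy. The function name, the 4-bit grouping, and the skip_low_nibbles knob are illustrative assumptions, not the paper's design.

# Hypothetical sketch of a computation-sharing multiplier, not the authors'
# circuit: precompute "alphabet" multiples of x once, then build each product
# from shifted alphabets; dropping low-order nibbles is the approximation.
ALPHABET_ODDS = (1, 3, 5, 7, 9, 11, 13, 15)

def cshm_multiply(x, y, bits=12, skip_low_nibbles=1):
    """Approximate x*y by assembling 4-bit groups of y from shared alphabets."""
    alphabets = {odd: odd * x for odd in ALPHABET_ODDS}   # shared precomputation
    product = 0
    for i in range(bits // 4):
        nibble = (y >> (4 * i)) & 0xF
        if nibble == 0 or i < skip_low_nibbles:
            continue                        # approximation: ignore low groups
        shift = 0
        while nibble % 2 == 0:              # write the nibble as odd * 2^shift
            nibble //= 2
            shift += 1
        product += alphabets[nibble] << (shift + 4 * i)
    return product

# Relative error stays small because only the low-order contribution is dropped.
exact = 1234 * 2875
approx = cshm_multiply(1234, 2875)
print(exact, approx, abs(exact - approx) / exact)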

68 citations


Journal ArticleDOI
TL;DR: In this article, the authors evaluate deep learning models using three different computing architectures: quantum computing to train complex topologies, high performance computing to automatically determine network topology, and neuromorphic computing for a low-power hardware implementation.
Abstract: Current deep learning approaches have been very successful using convolutional neural networks trained on large graphical-processing-unit-based computers. Three limitations of this approach are that (1) they are based on a simple layered network topology, i.e., highly connected layers, without intra-layer connections; (2) the networks are manually configured to achieve optimal results, and (3) the implementation of the network model is expensive in both cost and power. In this article, we evaluate deep learning models using three different computing architectures to address these problems: quantum computing to train complex topologies, high performance computing to automatically determine network topology, and neuromorphic computing for a low-power hardware implementation. We use the MNIST dataset for our experiment, due to input size limitations of current quantum computers. Our results show the feasibility of using the three architectures in tandem to address the above deep learning limitations. We show that a quantum computer can find high quality values of intra-layer connection weights in a tractable time as the complexity of the network increases, a high performance computer can find optimal layer-based topologies, and a neuromorphic computer can represent the complex topology and weights derived from the other architectures in low power memristive hardware.

50 citations


Journal ArticleDOI
TL;DR: This article proposes an optimized fully mapped FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline stages and is on a par with a Titan X GPU in terms of throughput and energy efficiency.
Abstract: FPGA-based hardware accelerators for convolutional neural networks (CNNs) have received attention due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this article, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized fully mapped FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline stages. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size, while the performance of GPU acceleration varies largely depending on the batch size of the data. Experimental results show that the proposed accelerator architecture for binary CNNs running on a Virtex-7 FPGA is 8.3× faster and 75× more energy-efficient than a Titan X GPU for processing online individual requests in small batch sizes. For processing static data in large batch sizes, the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5× higher energy efficiency.
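
Bitwise convolution for a binary CNN reduces each multiply-accumulate to XNOR plus popcount; the short software model below (a sketch, not the accelerator's FPGA datapath) shows the arithmetic identity the pipeline exploits.

import numpy as np

# Software model of bitwise convolution for a binary CNN (weights/activations
# in {-1,+1}): one output value from packed 0/1 bits (0 encodes -1, 1 encodes +1)
# is 2*popcount(XNOR) - N. Illustrative only, not the accelerator's datapath.

def binary_conv2d_patch(patch_bits, weight_bits):
    n = patch_bits.size
    xnor = ~(patch_bits ^ weight_bits) & 1          # elementwise XNOR
    return 2 * int(xnor.sum()) - n                  # back to +/-1 arithmetic

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=3 * 3 * 64)             # 3x3x64 input patch as 0/1 bits
w = rng.integers(0, 2, size=3 * 3 * 64)
ref = int(np.dot(2 * a - 1, 2 * w - 1))              # reference in the +/-1 domain
assert binary_conv2d_patch(a, w) == ref
print(ref)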

46 citations


Journal ArticleDOI
TL;DR: This work proposes Spike Timing Dependent Plasticity-based unsupervised feature learning using convolution-over-time in Spiking Neural Network (SNN), and uses shared weight kernels that are convolved with the input patterns over time to encode representative input features, thereby improving the sparsity as well as the robustness of the learning model.
Abstract: Brain-inspired learning models attempt to mimic the computations performed in the neurons and synapses constituting the human brain to achieve its efficiency in cognitive tasks. In this work, we propose Spike Timing Dependent Plasticity-based unsupervised feature learning using convolution-over-time in Spiking Neural Network (SNN). We use shared weight kernels that are convolved with the input patterns over time to encode representative input features, thereby improving the sparsity as well as the robustness of the learning model. We show that the Convolutional SNN self-learns several visual categories for object recognition with a limited number of training patterns while yielding comparable classification accuracy relative to the fully connected SNN. Further, we quantify the energy benefits of the Convolutional SNN over the fully connected SNN on neuromorphic hardware implementation.
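
As background for the learning rule named in the abstract, a minimal pair-based STDP update is sketched below; the amplitudes and time constants are assumed placeholders, and the paper's exact window and weight bounds may differ.

import math

# Minimal pair-based STDP weight update (illustrative; the paper's exact STDP
# window, bounds, and parameters are not specified in the abstract).
A_PLUS, A_MINUS = 0.01, 0.012      # assumed potentiation/depression amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0   # assumed time constants (ms)

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt >= 0:   # pre before post -> potentiate
        return A_PLUS * math.exp(-dt / TAU_PLUS)
    else:         # post before pre -> depress
        return -A_MINUS * math.exp(dt / TAU_MINUS)

w = 0.5
for t_pre, t_post in [(10, 15), (40, 33), (60, 62)]:
    w = min(1.0, max(0.0, w + stdp_dw(t_pre, t_post)))   # clip to [0, 1]
print(round(w, 4))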

36 citations


Journal ArticleDOI
Kangjun Bai1, Yang Yi1
TL;DR: This work designs and fabricates an energy-efficient analog delayed feedback reservoir (DFR) computing system, which is built upon a temporal encoding scheme, a nonlinear transfer function, and a dynamic delayed feedback loop, and represents the first analog integrated circuit (IC) implementation of the DFR computing system.
Abstract: Neuromorphic computing, which is built on a brain-inspired silicon chip, is uniquely applied to keep pace with the explosive escalation of algorithms and data density in machine learning. Reservoir computing, an emerging computing paradigm based on the recurrent neural network with proven benefits across multifaceted applications, offers an alternative training mechanism only at the readout stage. In this work, we successfully design and fabricate an energy-efficient analog delayed feedback reservoir (DFR) computing system, which is built upon a temporal encoding scheme, a nonlinear transfer function, and a dynamic delayed feedback loop. Measurement results demonstrate its high energy efficiency with rich dynamic behaviors, making the designed system a candidate for low power embedded applications. The system performance, as well as the robustness, is studied and analyzed through Monte Carlo simulation. The chaotic time series prediction benchmark, NARMA10, is examined through the proposed DFR computing system, and exhibits a 36%–85% reduction in the error rate compared to state-of-the-art DFR computing system designs. To the best of our knowledge, our work represents the first analog integrated circuit (IC) implementation of the DFR computing system.
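
A DFR of the kind described can be modeled in software as one nonlinear node feeding a delay line of virtual nodes, with only a linear readout trained; the sketch below uses an assumed tanh nonlinearity, a random mask, arbitrary parameters, and a toy recall task purely for illustration.

import numpy as np

# Software sketch of a delayed feedback reservoir (DFR): one nonlinear node,
# a delay line of N virtual nodes, and a random input mask. Parameters, mask,
# and the tanh nonlinearity are assumptions for illustration only.
N_VIRTUAL, ETA, GAMMA = 50, 0.5, 0.8
rng = np.random.default_rng(1)
mask = rng.choice([-1.0, 1.0], size=N_VIRTUAL)

def dfr_states(u):
    """Return the reservoir state matrix (len(u) x N_VIRTUAL) for input u."""
    x = np.zeros(N_VIRTUAL)                  # the delay line (virtual nodes)
    states = []
    for u_t in u:
        for i in range(N_VIRTUAL):
            feedback = x[i]                  # value from one delay period ago
            x[i] = np.tanh(ETA * mask[i] * u_t + GAMMA * feedback)
        states.append(x.copy())
    return np.array(states)

u = rng.uniform(0.0, 0.5, size=200)
X = dfr_states(u)
# Only the linear readout is trained, here by ridge regression.
target = np.roll(u, 1)                       # toy task: recall the previous input
w_out = np.linalg.solve(X.T @ X + 1e-3 * np.eye(N_VIRTUAL), X.T @ target)
print(float(np.mean((X @ w_out - target) ** 2)))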

31 citations


Journal ArticleDOI
TL;DR: In this article, a T-count optimized quantum square root circuit with only 2n + 1 qubits and no garbage output was presented, which achieves an average T-count savings of 43.44%, 98.95%, 41.06%, and 20.28% as well as qubit savings of 85.46%, 95.16%, 90.59%, and 86.77% compared to existing works.
Abstract: Quantum circuits for basic mathematical functions such as the square root are required to implement scientific computing algorithms on quantum computers. Quantum circuits that are based on Clifford+T gates can easily be made fault tolerant, but the T gate is very costly to implement. As a result, reducing T-count has become an important optimization goal. Further, quantum circuits with many qubits are difficult to realize, making designs that save qubits and produce no garbage outputs desirable. In this work, we present a T-count optimized quantum square root circuit with only 2n + 1 qubits and no garbage output. To make a fair comparison against existing work, Bennett's garbage removal scheme is used to remove garbage output from existing works. We determined that our proposed design achieves an average T-count savings of 43.44%, 98.95%, 41.06%, and 20.28% as well as qubit savings of 85.46%, 95.16%, 90.59%, and 86.77% compared to existing works.
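
The circuit itself cannot be reconstructed from the abstract, but quantum square-root designs in this line of work are typically built on a classical digit-by-digit (restoring or non-restoring) square-root recurrence; the plain, non-reversible version below shows the arithmetic such a circuit implements, as background rather than the paper's construction.

# Classical digit-by-digit integer square root (restoring variant shown; the
# non-restoring variant defers the restore step). Background sketch only, not
# the paper's reversible Clifford+T circuit.

def digit_by_digit_isqrt(n, bits=16):
    """Return floor(sqrt(n)) for an unsigned 'bits'-bit integer n."""
    root, remainder = 0, 0
    for i in range(bits // 2 - 1, -1, -1):
        # Bring down the next two bits of n.
        remainder = (remainder << 2) | ((n >> (2 * i)) & 0b11)
        trial = (root << 2) | 1              # candidate subtrahend 4*root + 1
        root <<= 1
        if remainder >= trial:               # accept this root bit
            remainder -= trial
            root |= 1
    return root

for n in (0, 1, 2, 144, 1000, 65535):
    assert digit_by_digit_isqrt(n) == int(n ** 0.5)
print(digit_by_digit_isqrt(1000))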

29 citations


Journal ArticleDOI
TL;DR: In this article, instead of optimizing hardware parameters to a given neural network, the authors propose a methodology of reconstructing the neural network itself to be optimized to resistive memory crossbar arrays.
Abstract: Artificial Neural Network computation relies on intensive vector-matrix multiplications. Recently, the emerging nonvolatile memory (NVM) crossbar array showed a feasibility of implementing such operations with high energy efficiency. Thus, there have been many works on efficiently utilizing emerging NVM crossbar arrays as analog vector-matrix multipliers. However, nonlinear I-V characteristics of NVM restrain critical design parameters, such as the read voltage and weight range, resulting in substantial accuracy loss. In this article, instead of optimizing hardware parameters to a given neural network, we propose a methodology of reconstructing the neural network itself to be optimized to resistive memory crossbar arrays. To verify the validity of the proposed method, we simulated various neural networks with the MNIST and CIFAR-10 datasets using two different Resistive Random Access Memory models. Simulation results show that our proposed neural network produces inference accuracies significantly higher than a conventional neural network when the network is mapped to synapse devices with nonlinear I-V characteristics.
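
The crossbar vector-matrix multiply underlying this work can be modeled by mapping each signed weight to a differential conductance pair inside the device range; the idealized sketch below uses placeholder ranges and a linear device, omitting the nonlinear I-V behavior that motivates the paper's network reconstruction.

import numpy as np

# Idealized model of an NVM crossbar as an analog vector-matrix multiplier:
# each signed weight maps to a differential conductance pair (G+, G-) within
# the device range, and column currents are I = G^T V. The ranges and read
# voltage are assumed placeholders; nonlinear I-V effects are omitted.
G_MIN, G_MAX = 1e-6, 1e-4          # siemens, assumed device range
V_READ = 0.2                       # volts, assumed read voltage

def weights_to_conductance(W):
    scale = (G_MAX - G_MIN) / np.abs(W).max()
    g_pos = G_MIN + scale * np.clip(W, 0, None)
    g_neg = G_MIN + scale * np.clip(-W, 0, None)
    return g_pos, g_neg, scale

def crossbar_matvec(W, x):
    g_pos, g_neg, scale = weights_to_conductance(W)
    v = V_READ * x                              # inputs encoded as read voltages
    i_out = v @ g_pos - v @ g_neg               # differential column currents
    return i_out / (scale * V_READ)             # decode back to weight units

rng = np.random.default_rng(0)
W, x = rng.normal(size=(4, 3)), rng.normal(size=4)
print(np.allclose(crossbar_matvec(W, x), x @ W))   # matches the ideal product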

22 citations


Journal ArticleDOI
TL;DR: A multi-level optimization framework for energy-efficient CeNN implementations on FPGAs is proposed, featuring three levels of optimization: system-, module-, and design-space-level, with a focus on computational redundancy and attainable performance.
Abstract: Cellular Neural Network (CeNN) is considered as a powerful paradigm for embedded devices. Its analog and mixed-signal hardware implementations have proved applicable to high-speed image processing, video analysis, and medical signal processing, although their efficiency and popularity are limited by small implementation size and low precision. Recently, digital implementations of CeNNs on FPGA have attracted researchers from both academia and industry due to its high flexibility and short time-to-market. However, most existing implementations are not well optimized to fully utilize the advantages of the FPGA platform, leaving unnecessary design and computational redundancy that prevents speedup. We propose a multi-level optimization framework for energy-efficient CeNN implementations on FPGAs. In particular, the optimization framework features three levels of optimization: system-, module-, and design-space-level, with a focus on computational redundancy and attainable performance. Experimental results show that with various configurations our framework can achieve an energy-efficiency improvement of 3.54× and up to 3.88× speedup compared with existing implementations with similar accuracy.

20 citations


Journal ArticleDOI
TL;DR: This article proposes a two-phase technique that uses the order of path delays in path pairs to detect HTs; the efficiency and accuracy of the technique are confirmed by a series of experiments.
Abstract: Many fabrication-less design houses are outsourcing their designs to third-party foundries for fabrication to lower cost. This IC development process, however, raises serious security concerns on Hardware Trojans (HTs). Many design-for-trust techniques have been proposed to detect HTs through observing erroneous output or abnormal side-channel characteristics. Side-channel characteristics such as path delay have been widely used for HT detection and functionality verification, as the changes of the characteristics of the host circuit incurred by the inserted HT can be identified through proper methods. In this article, for the first time, we propose a two-phase technique, which uses the order of the path delay in path pairs to detect HTs. In the design phase, a full-cover path set that covers all the nets of the design is generated; meanwhile, in the set, the relative order of paths in path pairs is determined according to their delay. The order of the paths in path pairs serves as the fingerprint of the design. In the test phase, the actual delay of the paths in the full-cover set is extracted from the fabricated circuits, and the order of paths in path pairs is compared with the fingerprint generated in the design phase. A mismatch between them indicates the existence of HTs. Both process variations and measurement noise are taken into consideration. The efficiency and accuracy of the proposed technique are confirmed by a series of experiments, including the examination of both violated path pairs incurred by HTs and their false alarm rate.
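
A toy software model of the two-phase idea is sketched below: record at design time which path of each pair is slower, then flag a fabricated chip whose measured delays reverse a recorded order. The single guard band stands in for the paper's statistical treatment of process variation and measurement noise.

# Toy model of the two-phase path-pair delay-order fingerprint. Design-phase
# delays fix the expected order of each pair; a fabricated chip whose measured
# delays confidently reverse any recorded order is flagged.
GUARD = 0.05   # assumed margin (ns) so small variations do not trigger alarms

def fingerprint(design_delays, pairs):
    """Phase 1: record which path of each pair is faster at design time."""
    return {(a, b): design_delays[a] < design_delays[b] for a, b in pairs}

def detect_trojan(measured_delays, fp):
    """Phase 2: a confidently reversed order suggests an inserted HT."""
    suspects = []
    for (a, b), a_faster in fp.items():
        diff = measured_delays[b] - measured_delays[a]
        if (a_faster and diff < -GUARD) or (not a_faster and diff > GUARD):
            suspects.append((a, b))
    return suspects

design = {"p1": 1.20, "p2": 1.45, "p3": 0.90}
fp = fingerprint(design, [("p1", "p2"), ("p3", "p1")])
golden   = {"p1": 1.22, "p2": 1.47, "p3": 0.93}        # normal variation
trojaned = {"p1": 1.60, "p2": 1.47, "p3": 0.93}        # extra load on p1
print(detect_trojan(golden, fp), detect_trojan(trojaned, fp))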

20 citations


Journal ArticleDOI
TL;DR: A compressed CeNN framework for efficient FPGA implementations that involves various techniques, such as incremental quantization and early exit, which significantly reduces computation demands while maintaining an acceptable performance.
Abstract: Cellular neural networks (CeNNs) have been widely adopted in image processing tasks. Recently, various hardware implementations of CeNNs have emerged in the literature, with Field Programmable Gate Array (FPGA) being one of the most popular choices due to its high flexibility and low time-to-market. However, CeNNs typically involve extensive computations in a recursive manner. As an example, simply processing an image of 1,920 × 1,080 pixels requires 4--8 Giga floating point multiplications (for 3 × 3 templates and 50–100 iterations), which needs to be done in a timely manner for real-time applications. To address this issue, in this article, we propose a compressed CeNN framework for efficient FPGA implementations. It involves various techniques, such as incremental quantization and early exit, which significantly reduce computation demands while maintaining acceptable performance. Particularly, incremental quantization quantizes the numbers in CeNN templates to powers of two, so that complex and expensive multiplications can be converted to simple and cheap shift operations, which only require a minimum number of registers and logical elements (LEs). While a similar concept has been explored in hardware implementations of Convolutional Neural Networks (CNNs), CeNNs have completely different computation patterns, which require different quantization and implementation strategies. Experimental results on FPGAs show that incremental quantization and early exit can achieve a speedup of up to 7.8× and 8.3×, respectively, compared with the state-of-the-art implementations, while incurring almost no performance loss on four widely adopted applications. We also discover that different from CNNs, the optimal quantization strategies of CeNNs depend heavily on the applications. We hope that our work can serve as a pioneer in the hardware optimization of CeNNs.
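
The incremental-quantization step rounds template coefficients to powers of two so each multiplication becomes a shift; the minimal sketch below shows only that rounding (not the incremental schedule or the early-exit logic), with an example 3 × 3 template chosen arbitrarily.

import numpy as np

# Minimal sketch of quantizing CeNN template coefficients to powers of two so
# each multiply becomes a shift; the paper's full scheme quantizes the template
# incrementally and adds early exit, which this sketch omits.

def to_power_of_two(W):
    """Round nonzero entries of W to the closest power of two in log scale."""
    sign = np.sign(W)
    mag = np.abs(W)
    exponents = np.where(mag > 0, np.round(np.log2(np.where(mag > 0, mag, 1))), 0)
    q = np.where(mag > 0, sign * np.exp2(exponents), 0.0)
    return q, exponents.astype(int)

template = np.array([[0.1, 0.15, 0.1],
                     [0.15, 2.0, 0.15],
                     [0.1, 0.15, 0.1]])      # arbitrary example template
q, exps = to_power_of_two(template)
print(q)                                     # every entry is now +/- 2^k (a shift)
print(exps)                                  # the shift amounts k
print(np.abs(q - template).max())            # worst-case quantization error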

20 citations


Journal ArticleDOI
TL;DR: Experimental results obtained with an optoelectronic circuit simulator show the advantages of the optical parallel adder circuit over a traditional CMOS-based parallelAdder circuit.
Abstract: Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. With light interference and resonance phenomena, the nanophotonic device works as a voltage-controlled optical pass-gate like a pass-transistor. This article first introduces the concept of optical pass-gate logic and then proposes a parallel adder circuit based on optical pass-gate logic. Experimental results obtained with an optoelectronic circuit simulator show the advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit.

Journal ArticleDOI
TL;DR: In the proposed technique, transformable interconnects enable an IC chip to maintain functioning in normal use and to transform its physical structure into another pattern when exposed to invasive attacks.
Abstract: Protection of intellectual property (IP) is increasingly critical for IP vendors in the semiconductor industry. However, advanced reverse engineering techniques can physically disassemble the chip and derive the IPs at a much lower cost than the value of IP design that chips carry. This invasive hardware attack—obtaining information from IC chips—always violates the IP rights of vendors. The intent of this article is to present a chip-level reverse engineering resilient design technique. In the proposed technique, transformable interconnects enable an IC chip to maintain functioning in normal use and to transform its physical structure into another pattern when exposed to invasive attacks. The newly created pattern will significantly increase the difficulty of reverse engineering. Furthermore, to improve the effectiveness of the proposed technique, a systematic design method is developed targeting integrated circuits with multiple design constraints. Simulations have been conducted to demonstrate the capability of the proposed technique, which generates extremely large complexity for reverse engineering with manageable overhead.

Journal ArticleDOI
TL;DR: This article proposes a hardware-friendly Spike-Timing Dependent Plasticity (STDP) mechanism for on-chip tuning, and a novel runtime correlation-based neuron gating scheme to minimize the power dissipated by reservoir neurons.
Abstract: The Liquid State Machine (LSM) is a promising model of recurrent spiking neural networks that provides an appealing brain-inspired computing paradigm for machine-learning applications such as pattern recognition. Moreover, processing information directly on spiking events makes the LSM well suited for cost- and energy-efficient hardware implementation. In this article, we systematically present three techniques for optimizing energy efficiency while maintaining good performance of the proposed LSM neural processors from both an algorithmic and hardware implementation point of view. First, to realize adaptive LSM neural processors and thus boost learning performance, we propose a hardware-friendly Spike-Timing Dependent Plasticity (STDP) mechanism for on-chip tuning. Then, the LSM processor incorporates a novel runtime correlation-based neuron gating scheme to minimize the power dissipated by reservoir neurons. Furthermore, an activity-dependent clock gating approach is presented to address the energy inefficiency due to the memory-intensive nature of the proposed neural processors. Using two different real-world tasks, speech and image recognition, as benchmarks, we demonstrate that the proposed architecture boosts the average learning performance by up to 2.0% while reducing energy dissipation by up to 29% compared to a baseline LSM with little extra hardware overhead on a Xilinx Virtex-6 FPGA.

Journal ArticleDOI
TL;DR: This work has developed a memristor crossbar-based approach, inspired by memristor crossbar neuromorphic circuits, for a low-power, low-area, and high-throughput DPI system that examines both the header and body of a packet.
Abstract: Deep packet inspection (DPI) is a critical component of intrusion detection and prevention. This requires a detailed analysis of each network packet header and body. Although this is often done on dedicated high-power servers in most networked systems, mobile systems could potentially be vulnerable to attack if utilized on an unprotected network. In this case, having DPI hardware on the mobile system would be highly beneficial. Unfortunately, DPI hardware is generally area and power consuming, making its implementation difficult in mobile systems. We developed a memristor crossbar-based approach, inspired by memristor crossbar neuromorphic circuits, for a low-power, low-area, and high-throughput DPI system that examines both the header and body of a packet. Two key types of circuits are presented: static pattern matching and regular expression circuits. This system is able to reduce execution time and power consumption due to its high-density grid and massive parallelism. Independent searches are performed using low-power memristor crossbar arrays, giving rise to a throughput of 160Gbps with no loss in classification accuracy.
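
Static pattern matching on a crossbar can be abstracted as a dot product between a bipolar-encoded input window and stored pattern columns, with a threshold at each column output; the sketch below models only this parallel-match behavior, not the memristor circuit or the regular-expression engine.

import numpy as np

# Abstract model of crossbar-style static pattern matching: each column stores
# one byte pattern as a +/-1 template over bit positions, the input window is
# applied as +/-1 values, and an exact match drives the column sum to its
# maximum. Models the parallelism only, not the memristor hardware.

def bits_pm1(data):
    b = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    return b.astype(np.int16) * 2 - 1            # 0/1 -> -1/+1

def build_columns(patterns):
    return np.stack([bits_pm1(p) for p in patterns], axis=1)   # bits x patterns

def match(window, columns):
    scores = bits_pm1(window) @ columns          # all patterns checked in parallel
    return scores == columns.shape[0]            # exact match hits the maximum

columns = build_columns([b"GET ", b"POST", b"EVIL"])
print(match(b"POST", columns))                   # [False  True False]
print(match(b"GETX", columns))                   # no exact match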

Journal ArticleDOI
TL;DR: Simulation results of an 8x8 mesh-based optical NoC show that the proposed initial device-setting and thermal-tuning mechanism confines the worst-case thermal-induced optical energy consumption to be on the order of tens of pJ/bit, by avoiding significant thermal- induced optical power loss caused by temperature-dependent wavelength shifts.
Abstract: Optical networks-on-chip (NoCs) based on silicon photonics have been proposed as emerging on-chip communication architectures for chip multiprocessors with large core counts. However, due to the thermal sensitivity of optical devices used in optical NoCs, on-chip temperature variations cause significant thermal-induced optical power loss, which would counteract the power advantages of optical NoCs. To tackle this problem, in this work, we propose a learning-based thermal-sensitive power optimization approach for mesh- or torus-based optical NoCs in the presence of temperature variations. The key techniques proposed include an initial device-setting and thermal-tuning mechanism that is a device-level optimization technique, and a learning-based thermal-sensitive adaptive routing algorithm that is a network-level optimization technique. Simulation results of an 8x8 mesh-based optical NoC show that the proposed initial device-setting and thermal-tuning mechanism confines the worst-case thermal-induced optical energy consumption to be on the order of tens of pJ/bit, by avoiding significant thermal-induced optical power loss caused by temperature-dependent wavelength shifts. Besides, it shows that the learning-based thermal-sensitive adaptive routing algorithm is able to find an optimal path with the minimum estimated thermal-induced optical power consumption for each communication pair. The proposed routing has a greater space for optimization, especially for applications with more long-distance traffic.

Journal ArticleDOI
TL;DR: A methodology for the design space exploration of optical NoC mapping solutions, which automatically assigns IPs/cores to the network tiles such that the laser power consumption is minimized, allowing improved energy efficiency.
Abstract: To face the complex communication problems that arise as the number of on-chip components grows up, photonic networks-on-chip (NoCs) have been recently proposed to replace electronic interconnects. However, photonic NoCs lack efficient laser sources, possibly resulting in an inefficient or inoperable architecture. In this article, we introduce a methodology for the design space exploration of optical NoC mapping solutions, which automatically assigns IPs/cores to the network tiles such that the laser power consumption is minimized. The experimental evaluation shows average reductions of 34.7% and 27.3% in the power consumption compared to, respectively, application-oblivious and randomly mapped photonic NoCs, allowing improved energy efficiency.

Journal ArticleDOI
TL;DR: A simulation flow is proposed to accurately simulate TTSV effects on 3D ICs using a detailed 3D thermal model, full-system simulation, and a validated thermal simulator, which shows accurate thermal analysis of 3D ICs.
Abstract: 3D stacking of integrated circuits (ICs) provides significant advantages in saving device footprints, improving power management, and continuing performance enhancement, particularly for many-core systems. However, the stacked structure makes the heat dissipation a challenging issue. While Thermal Through Silicon Via (TTSV) is a promising way of lowering the thermal resistance of dies, past research has either overestimated or underestimated the effects of TTSVs as a consequence of the lack of detailed 3D IC models or system-level simulations. Here, we propose a simulation flow to accurately simulate TTSV effects on 3D ICs. We adopt benchmarks from Splash-2 running on a full-system mode of the gem5 simulator, which generates all the system component activities. McPAT is used to generate the corresponding power consumption and the power traces are fed to HotSpot for thermal simulation. The temperature profiles of 2D and 3D Nehalem-like x86 processors are compared. TTSVs are later placed close to hotspot regions to facilitate heat dissipation; the peak temperature of 3D Nehalem is reduced by 5--25% with a small area overhead of 6%. By using a detailed 3D thermal model, full-system simulation, and a validated thermal simulator, our results show accurate thermal analysis of 3D ICs.

Journal ArticleDOI
TL;DR: This article proposes an off-line approach that concurrently optimizes the laser power scaling and execution time of a global application and highlights most promising solutions for mapping a defined application onto a 16-core ring-based WDM ONoC.
Abstract: Optical Network-on-Chip (ONoC) is a promising communication medium for large-scale multiprocessor systems-on-chips. Indeed, ONoC can outperform classical electrical NoCs in terms of energy efficiency and bandwidth density, in particular, because this medium can support multiple transactions at the same time on different wavelengths by using Wavelength Division Multiplexing (WDM). However, multiple signals sharing simultaneously the same part of a waveguide can lead to inter-channel crosstalk noise. This problem impacts the signal-to-noise ratio of the optical signals, which leads to an increase in the Bit Error Rate (BER) at the receiver side. If a specific BER is targeted, an increase of laser power should be necessary to satisfy the SNR. In this context, an important issue is to evaluate the laser power needed to satisfy the various desired communication bandwidths based on the BER performance requirements. In this article, we propose an off-line approach that concurrently optimizes the laser power scaling and execution time of a global application. A set of different levels of power is introduced for each laser, to ensure that optical signals can be emitted with just-enough power to ensure targeted BER. As a result, most promising solutions are highlighted for mapping a defined application onto a 16-core ring-based WDM ONoC.
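
The BER/laser-power relationship referred to here can be illustrated for on-off keying under Gaussian noise, where BER = 0.5 · erfc(Q/√2); the noise level, loss figure, and the simple Q = P_rx/σ link model below are placeholders rather than values or models from the paper.

from math import sqrt
from scipy.special import erfc, erfcinv

# Illustration of the BER / laser-power link assumed for on-off keying with
# Gaussian noise: BER = 0.5 * erfc(Q / sqrt(2)). Noise level, path loss, and
# the Q = P_rx / sigma receiver model are placeholders, not the paper's model.

def required_q(ber_target):
    return sqrt(2.0) * erfcinv(2.0 * ber_target)

def required_laser_power_mw(ber_target, path_loss_db, noise_mw=1e-4):
    received_mw = required_q(ber_target) * noise_mw     # P_rx = Q * sigma
    return received_mw * 10 ** (path_loss_db / 10.0)    # undo waveguide loss

for ber in (1e-9, 1e-12):
    q = required_q(ber)
    assert abs(0.5 * erfc(q / sqrt(2.0)) / ber - 1.0) < 1e-6   # round-trip check
    print(ber, round(q, 2), required_laser_power_mw(ber, path_loss_db=15.0))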

Journal ArticleDOI
TL;DR: This work presents, for the first time, a systematic design approach to control the accuracy of SCs and balance it against other design parameters, using the theory of Monte Carlo simulation.
Abstract: Stochastic circuits (SCs) offer considerable area- and power-consumption benefits in various applications at the expense of computational inaccuracies. Unlike conventional logic synthesis, managing accuracy is a central problem in SC design. It is usually tackled in ad hoc fashion by multiple trial-and-error simulations that vary relevant parameters like the stochastic number length n. We present, for the first time, a systematic design approach to controlling the accuracy of SCs and balancing it against other design parameters. We express the (in)accuracy of a circuit processing n-bit stochastic numbers by the numerical deviation of the computed value from the expected result, in conjunction with a confidence level. Using the theory of Monte Carlo simulation, we derive expressions for the stochastic number length required for a desired level of accuracy or vice versa. We discuss the integration of the theory into a design framework that is applicable to both combinational and sequential SCs. We show that for combinational SCs, accuracy is independent of the circuit’s size or complexity, a surprising result. We also show how the analysis can identify subtle errors in both combinational and sequential designs. Finally, we apply the proposed methods to a case study on filtering noisy EKG signals.
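
The flavor of the result can be reproduced with the standard Monte Carlo bound: estimating a probability from n stochastic bits has worst-case standard deviation at most 1/(2√n), so a target error ε at a given confidence needs roughly n ≈ (z/(2ε))² bits, z being the normal quantile. The sketch below uses this textbook bound; the paper's own expressions may differ in detail.

from math import ceil, sqrt
from statistics import NormalDist

# Standard Monte Carlo sizing for a unipolar stochastic number: estimating a
# probability p from n bits has worst-case std sqrt(p(1-p)/n) <= 1/(2*sqrt(n)),
# so for error eps at confidence c, n ~ (z / (2*eps))^2 with z the normal
# quantile. Illustrative of the general theory the article builds on.

def required_length(eps, confidence=0.95):
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return ceil((z / (2.0 * eps)) ** 2)

def achievable_eps(n, confidence=0.95):
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z / (2.0 * sqrt(n))

print(required_length(0.01, 0.95))        # about 9604 bits for +/-0.01 at 95%
print(round(achievable_eps(1024, 0.95), 4))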

Journal ArticleDOI
TL;DR: In this article, a stochastic logic time delay reservoir design in FPGA hardware is presented and compared to a deterministic design, and a novel re-seeding method is introduced to reduce the adverse effects of stochastic noise.
Abstract: This article presents and demonstrates a stochastic logic time delay reservoir design in FPGA hardware. The reservoir network approach is analyzed using a number of metrics, such as kernel quality, generalization rank, and performance on simple benchmarks and is also compared to a deterministic design. A novel re-seeding method is introduced to reduce the adverse effects of stochastic noise, which may also be implemented in other stochastic logic reservoir computing designs, such as echo state networks. Benchmark results indicate that the proposed design performs well on noise-tolerant classification problems, but more work needs to be done to improve the stochastic logic time delay reservoir's robustness for regression problems. In addition, we show that the stochastic design can significantly reduce area cost if the conversion between binary and stochastic representations is implemented efficiently.

Journal ArticleDOI
TL;DR: This article introduces MFNW, a Flip-N-Write algorithm explicitly tailored for MLC NVMs, and introduces and investigates two possible variations of the MFNW algorithm: cell Hamming distance (CHD) MFNW and energy Hamming distance (EHD) MFNW.
Abstract: The increased capacity of multi-level cells (MLC) and triple-level cells (TLC) in emerging non-volatile memory (NVM) technologies comes at the cost of higher cell write energies and lower cell endurance. In this article, we describe MFNW, a Flip-N-Write encoding that effectively reduces the write energy and improves the endurance of MLC NVMs. Two MFNW modes are analyzed: cell Hamming distance mode and energy Hamming distance mode. We derive an approximate model that accurately predicts the average number of cell writes that is proportional to the energy consumption, enabling word length optimization to maximize energy reduction subject to memory space overhead constraints. In comparison to state-of-the-art MLC NVM encodings, our simulation results indicate that MFNW achieves up to 7%--39% saving for 1.56%--50% NVM space overhead. Extra energy saving (up to 19%--47%) can be achieved for the same NVM space overhead using our proposed variations of MFNW, i.e., MFNW2 and MFNW3. For TLC NVMs, we propose TFNW that can achieve up to 53% energy saving in comparison to state-of-the-art TLC NVM encodings. Endurance simulations indicate that MFNW (TFNW) is capable of extending MLC (TLC) NVM life by up to 100% (87%).
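
MFNW builds on the basic Flip-N-Write idea, which is easy to state for SLC: if writing a new word would change more than half of the cells, store its complement plus a one-bit flag. The sketch below shows only that base case; the MLC cell/energy Hamming-distance variants analyzed in the article are not reproduced.

# Sketch of the base (SLC) Flip-N-Write idea that MFNW generalizes to MLC:
# if the new word would change more than half the cells, store its complement
# plus a one-bit flag instead.

def hamming(a, b, width):
    return bin((a ^ b) & ((1 << width) - 1)).count("1")

def fnw_encode(old_word, new_word, width=32):
    """Return (stored_word, flip_flag) minimizing changed cells."""
    flipped = ~new_word & ((1 << width) - 1)
    if hamming(old_word, new_word, width) > hamming(old_word, flipped, width):
        return flipped, 1
    return new_word, 0

def fnw_decode(stored_word, flip_flag, width=32):
    return (~stored_word & ((1 << width) - 1)) if flip_flag else stored_word

old, new = 0xFFFF0000, 0x0000FFF0
stored, flag = fnw_encode(old, new)
assert fnw_decode(stored, flag) == new
# 28 cells would have changed; storing the complement changes only 4.
print(hex(stored), flag, hamming(old, new, 32), hamming(old, stored, 32))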

Journal ArticleDOI
TL;DR: It is demonstrated that for a chosen set of benchmark graphs, the spike responses generated on a current generation neuromorphic processor can improve the stability of graph partitions and non-overlapping communities can be identified even with the loss of higher-order spiking behavior if the graphs are sufficiently dense.
Abstract: We study the applicability of spiking neural networks and neuromorphic hardware for solving general optimization problems without the use of adaptive training or learning algorithms. We leverage the dynamics of Hopfield networks and spin-glass systems to construct a fully connected spiking neural system to generate synchronous spike responses indicative of the underlying community structure in an undirected, unweighted graph. Mapping this fully connected system to current generation neuromorphic hardware is done by embedding sparse tree graphs to generate only the leading-order spiking dynamics. We demonstrate that for a chosen set of benchmark graphs, the spike responses generated on a current generation neuromorphic processor can improve the stability of graph partitions and non-overlapping communities can be identified even with the loss of higher-order spiking behavior if the graphs are sufficiently dense. For sparse graphs, the loss of higher-order spiking behavior improves the stability of certain graph partitions but does not retrieve the known community memberships.

Journal ArticleDOI
TL;DR: A novel arbitrated all-optical path-setup scheme for tiled CMPs adopting circuit-switched optical networks that aims at significantly reducing path- setup latency and overall energy consumption, and a logically clustered architecture in which an arbiter is allocated in each logical core-clusters and an ad hoc distributed reservation protocol coordinates arbiters to manage inter-cluster path reservations is proposed.
Abstract: Nanophotonics is a promising solution for on-chip interconnection due to its intrinsic low-latency and low-power features, which can be useful for performance and energy in future Chip Multi-Processors (CMPs). This article proposes a novel arbitrated all-optical path-setup scheme for tiled CMPs adopting circuit-switched optical networks. It aims at significantly reducing path-setup latency and overall energy consumption. The proposed arbitrated scheme is able to configure multiple photonic switches simultaneously, instead of sequentially as it is done in state-of-the-art proposals. The proposed fast optical path-setup solution reduces the overhead in each transmission and, most importantly, allows optical circuit-switched networks to effectively serve cache coherence traffic, which is mainly composed of relatively small messages. Specifically, we propose a single-arbiter scheme where the whole topology is managed by a central module (single-arbiter) that takes care of the path-setup procedures. Then, to tackle scalability, we propose a logically clustered architecture (multi-arbiter) in which an arbiter is allocated in each logical core-cluster and an ad hoc distributed reservation protocol coordinates arbiters to manage inter-cluster path reservations. We show that our proposed single-arbiter architecture outperforms a state-of-the-art optical network with sequential path-setup (optical baseline) in the case of 8- and 16-core tiled CMP setups. However, due to serialization issues, the single-arbiter solution is not able to compete with a reference electronic baseline for bigger 32- and 64-core setups even if still performing much better than the optical baseline. Conversely, our multi-arbiter hierarchical solution allows us to improve performance up to almost 20% and 40% for 32- and 64-core setups, respectively, demonstrating a wide applicability of the proposed technique. Energy-wise, the analyzed solutions enable significant savings compared to both the optical baseline with sequential path setup, and to the electronic counterpart. Specifically, results show more than 25% average improvement for the single-arbiter in the 8- and 16-core cases, and more than 40% and 15% savings for the multi-arbiter in the 32- and 64-core cases, respectively.

Journal ArticleDOI
TL;DR: A novel Markov random field sound source separation algorithm that uses expectation-maximization and Gibbs sampling to learn MRF parameters on the fly and infer the best separation of sources is developed, intended for deployment on a mobile phone.
Abstract: Machine learning (ML) has revolutionized a wide range of recognition tasks, ranging from text analysis to speech to vision, most notably in cloud deployments. However, mobile deployment of these ideas involves a very different category of design problems. In this article, we develop a hardware architecture for a sound source separation task, intended for deployment on a mobile phone. We focus on a novel Markov random field (MRF) sound source separation algorithm that uses expectation-maximization and Gibbs sampling to learn MRF parameters on the fly and infer the best separation of sources. The intrinsically iterative algorithm suggests challenges for both speed and power. A real-time streaming FPGA implementation runs at 150MHz with 207KB RAM, achieves a speed-up of 22× over a software reference, performs with an SDR of up to 7.021dB with 1.601ms latency, and exhibits excellent perceived audio quality. A 45nm CMOS ASIC virtual prototype simulated at 20MHz shows that this architecture is small (

Journal ArticleDOI
TL;DR: A memristor-CMOS analog co-processor architecture that can handle floating point computation and offers superior performance compared to other processors is developed.
Abstract: Vector matrix multiplication computation underlies major applications in machine vision, deep learning, and scientific simulation. These applications require high computational speed and are run on platforms that are size, weight, and power constrained. With transistor scaling coming to an end, existing digital hardware architectures will not be able to meet this increasing demand. Analog computation with its rich set of primitives and inherent parallel architecture can be faster, more efficient, and compact for some of these applications. One such primitive is a memristor-CMOS crossbar array-based vector matrix multiplication. In this article, we develop a memristor-CMOS analog coprocessor architecture that can handle floating-point computation. To demonstrate the working of the analog coprocessor at a system level, we use a new electronic design automation tool called PSpice Systems Option, which performs integrated cosimulation of MATLAB/Simulink and PSpice. It is shown that the analog coprocessor has superior performance when compared to other processors, and a speedup of up to 12× compared to projected GPU performance is observed. Using the new PSpice Systems Option tool, various application simulations for image processing and solutions to partial differential equations are performed on the analog coprocessor model.

Journal ArticleDOI
TL;DR: A novel mapping scheme for in-memory Kogge-Stone adder has been presented and the correctness of the proposed scheme is verified by means of TaOx device model-based memristive simulations.
Abstract: Low operating voltage, high storage density, non-volatile storage capabilities, and relatively low access latencies have popularized memristive devices as storage devices. Memristors can be ideally used for in-memory computing in the form of hybrid CMOS nano-crossbar arrays. In-memory serial adders have been theoretically and experimentally proven for crossbar arrays. To harness the parallelism of memristive arrays, parallel-prefix adders can be effective. In this work, a novel mapping scheme for an in-memory Kogge-Stone adder has been presented. The number of cycles increases logarithmically with the bit width N of the operands, i.e., O(log2 N), and the device count is 5N. We verify the correctness of the proposed scheme by means of TaOx device model-based memristive simulations. We compare the proposed scheme with other proposed schemes in terms of the number of cycles and the number of devices.
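
The Kogge-Stone structure targeted by the mapping computes carries with generate/propagate prefix merges in ⌈log2 N⌉ stages, which is where the O(log2 N) cycle count comes from; the plain software model below shows that logic only and does not model the crossbar mapping or its 5N devices.

# Plain software model of a Kogge-Stone parallel-prefix adder: carries come
# from log2(N) stages of (generate, propagate) merges. The in-memory crossbar
# mapping itself is not modeled here.

def kogge_stone_add(a, b, n=8):
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]     # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]     # propagate
    dist = 1
    while dist < n:                       # ceil(log2(n)) prefix stages
        g_new, p_new = g[:], p[:]
        for i in range(dist, n):
            g_new[i] = g[i] | (p[i] & g[i - dist])
            p_new[i] = p[i] & p[i - dist]
        g, p = g_new, p_new
        dist *= 2
    carries = [0] + g[:-1] + [g[-1]]      # carry into bit i, plus carry-out
    s = 0
    for i in range(n):
        s |= ((((a >> i) & 1) ^ ((b >> i) & 1)) ^ carries[i]) << i
    return s | (carries[n] << n)

for a, b in [(0, 0), (5, 9), (200, 100), (255, 255)]:
    assert kogge_stone_add(a, b) == a + b
print(kogge_stone_add(200, 100))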

Journal ArticleDOI
TL;DR: In this article, an on-device training circuitry for threshold-current memristors integrated in a crossbar structure is proposed, and alternate approaches of mapping the synaptic weights into fully trained and semi-trained crossbars are investigated.
Abstract: On-device intelligence is gaining significant attention recently as it offers local data processing and low power consumption. In this research, an on-device training circuitry for threshold-current memristors integrated in a crossbar structure is proposed. Furthermore, alternate approaches of mapping the synaptic weights into fully trained and semi-trained crossbars are investigated. In a semi-trained crossbar, a confined subset of memristors is tuned and the remaining memristors are not programmed. This translates to optimal resource utilization and power consumption, compared to a fully programmed crossbar. The semi-trained crossbar architecture is applicable to a broad class of neural networks. System level verification is performed with an extreme learning machine for binomial and multinomial classification. The total power for a single 4 × 4 layer network, when implemented in the IBM 65nm node, is estimated to be 42.16μW and the area is estimated to be 26.48μm × 22.35μm.
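
The extreme learning machine used for system-level verification trains only its readout, by a single least-squares solve over a fixed random hidden layer; the software sketch below shows that structure on a toy binomial task, with no memristor or crossbar model.

import numpy as np

# Software sketch of an extreme learning machine: the hidden weights are random
# and fixed (in the paper they live in the memristor crossbar, partly untrained
# in the semi-trained variant), and only the readout is solved by least squares.
rng = np.random.default_rng(0)

def elm_train(X, Y, n_hidden=64, reg=1e-3):
    W_in = rng.normal(size=(X.shape[1], n_hidden))     # fixed random layer
    H = np.tanh(X @ W_in)
    W_out = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W_in, W_out

def elm_predict(X, W_in, W_out):
    return np.tanh(X @ W_in) @ W_out

# Toy binomial classification: two Gaussian blobs.
X = np.vstack([rng.normal(-1, 1, (100, 4)), rng.normal(1, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100, dtype=float).reshape(-1, 1)
W_in, W_out = elm_train(X, y)
acc = np.mean((elm_predict(X, W_in, W_out) > 0.5) == (y > 0.5))
print(acc)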

Journal ArticleDOI
TL;DR: This study examines the influence of key parameters in DNN hardware implementations on their performance and energy efficiency, including DNN architectural choices, underlying workloads, and tier partitioning choices in M3D designs, which produce power savings and performance improvement beyond what can be achieved using conventional 2D ICs.
Abstract: In recent years, deep learning has become widespread for various real-world recognition tasks. In addition to recognition accuracy, energy efficiency and speed (i.e., performance) are other grand challenges to enable local intelligence in edge devices. In this article, we investigate the adoption of monolithic three-dimensional (3D) IC (M3D) technology for deep learning hardware design, using speech recognition as a test vehicle. M3D has recently proven to be one of the leading contenders to address the power, performance, and area (PPA) scaling challenges in advanced technology nodes. Our study encompasses the influence of key parameters in DNN hardware implementations towards their performance and energy efficiency, including DNN architectural choices, underlying workloads, and tier partitioning choices in M3D designs. Our post-layout M3D designs, together with hardware-efficient sparse algorithms, produce power savings and performance improvement beyond what can be achieved using conventional 2D ICs. Experimental results show that M3D offers 22.3% iso-performance power saving and 6.2% performance improvement, convincingly demonstrating its entitlement as a solution for DNN ASICs. We further present architectural and physical design guidelines for M3D DNNs to maximize the benefits.

Journal ArticleDOI
TL;DR: An algorithm that minimizes the number of checkpoints and determines their locations to cover every path in a given droplet-routing solution is proposed, which provides reliability-hardening mechanisms for a wide class of cyber-physical DMFBs.
Abstract: In the area of biomedical engineering, digital-microfluidic biochips (DMFBs) have received considerable attention because of their capability of providing an efficient and reliable platform for conducting point-of-care clinical diagnostics. System reliability, in turn, mandates error-recoverability while implementing biochemical assays on-chip for medical applications. Unfortunately, the technology of DMFBs is not yet fully equipped to handle error-recovery from various microfluidic operations involving droplet motion and reaction. Recently, a number of cyber-physical systems have been proposed to provide real-time checking and error-recovery in assays based on the feedback received from a few on-chip checkpoints. However, to synthesize robust feedback systems for different types of DMFBs, certain practical issues need to be considered such as co-optimization of checkpoint placement, error-recoverability, and layout of droplet-routing pathways. For application-specific DMFBs, we propose here an algorithm that minimizes the number of checkpoints and determines their locations to cover every path in a given droplet-routing solution. Next, for general-purpose DMFBs, where the checkpoints are pre-deployed in specific locations, we present a checkpoint-aware routing algorithm such that every droplet-routing path passes through at least one checkpoint to enable error-recovery and to ensure physical routability of all droplets. Furthermore, we also propose strategies for executing the algorithms in reliable mode to enhance error-recoverability. The proposed methods thus provide reliability-hardening mechanisms for a wide class of cyber-physical DMFBs.
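
Choosing a minimum set of checkpoints that touches every droplet route is a covering problem; the greedy set-cover heuristic below is an illustrative stand-in for the selection step only, since the paper's algorithm additionally handles checkpoint layout, routability, and error-recovery constraints.

# The checkpoint-minimization step can be viewed as a covering problem: choose
# electrode cells so that every droplet-routing path contains at least one
# chosen cell. Greedy set cover shown for illustration only.

def greedy_checkpoints(paths):
    """paths: list of sets of grid cells; returns a small covering cell set."""
    uncovered = set(range(len(paths)))
    chosen = set()
    while uncovered:
        # Pick the cell that lies on the most still-uncovered paths.
        best_cell, best_hits = None, set()
        candidates = set().union(*(paths[i] for i in uncovered))
        for cell in candidates:
            hits = {i for i in uncovered if cell in paths[i]}
            if len(hits) > len(best_hits):
                best_cell, best_hits = cell, hits
        chosen.add(best_cell)
        uncovered -= best_hits
    return chosen

routes = [{(0, 0), (0, 1), (1, 1)},
          {(1, 1), (2, 1), (2, 2)},
          {(3, 0), (3, 1), (2, 1)}]
print(greedy_checkpoints(routes))   # e.g., {(1, 1), (2, 1)} covers all routes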

Journal ArticleDOI
TL;DR: Experiments show that the proposed LDPC designs can improve the STT-RAM reliability by at least 10^2 (10^4) when compared to the existing error correction codes (ECCs) for the SLC (MLC) design, demonstrating the feasibility of LDPC solutions on STT-RAM.
Abstract: Spin-transfer torque random access memory (STT-RAM) is a promising emerging memory technology in the future memory hierarchy. However, its unique reliability challenges, i.e., the asymmetric bit failure mechanism at different bit flippings, have raised significant concerns in its real applications. Recent studies even show that the common memory error repair “remedies” cannot efficiently address them. In this article, we for the first time systematically study the potentials of the strong low-density parity-check (LDPC) code for combating such unique asymmetric errors in both single-level-cell (SLC) and multi-level-cell (MLC) STT-RAM designs. A generic STT-RAM channel model suitable for the SLC/MLC designs, is developed to analytically calibrate all the accumulated asymmetric factors of the write/read operations. The key initial information for LDPC decoding, namely asymmetric log-likelihood ratio (A-LLR), is redesigned and extracted from the proposed channel model, to unleash the LDPC’s asymmetric error correcting capability. LDPC codec is also carefully designed to lower the hardware cost by leveraging the systematic-structured parity check matrix. Then two customized short-length LDPC codes—(585,512) and (683,512)—augmented from the semi-random parity check matrix and the A-LLR based asymmetric decoding, are proposed for SLC and MLC STT-RAM designs, respectively. Experiments show that our proposed LDPC designs can improve the STT-RAM reliability by at least 102 (104) when compared to the existing error correction codes (ECCs) for the SLC (MLC) design, demonstrating the feasibility of LDPC solutions on STT-RAM.