
Showing papers in "ACM Journal on Emerging Technologies in Computing Systems in 2015"


Journal ArticleDOI
TL;DR: This review gives an overview of the status and prospects of spin-based devices and circuits that are currently under intense investigation and development across the world, and particularly addresses their merits and challenges for practical applications.
Abstract: Conventional MOS integrated circuits and systems suffer severe power and scalability challenges as technology scales into ultra-deep-submicron nodes (e.g., below 40nm). Both static and dynamic power dissipations are increasing, caused mainly by the intrinsic leakage currents and large data traffic. Alternative approaches beyond charge-only-based electronics, and in particular, spin-based devices, show promising potential to overcome these issues by adding the spin freedom of electrons to electronic circuits. Spintronics provides data non-volatility, fast data access, and low-power operation, and has now become a hot topic in both academia and industry for achieving ultra-low-power circuits and systems. The ITRS report on emerging research devices identified the magnetic tunnel junction (MTJ) nanopillar (one of the Spintronics nanodevices) as one of the most promising technologies to be part of future micro-electronic circuits. In this review we will give an overview of the status and prospects of spin-based devices and circuits that are currently under intense investigation and development across the world, and address particularly their merits and challenges for practical applications. We will also show that, with the rapid development of Spintronics, some novel computing architectures and paradigms beyond the classic von Neumann architecture have recently been emerging for next-generation ultra-low-power circuits and systems.

168 citations


Journal ArticleDOI
TL;DR: Experimental results show that highly efficient checkpointing translates to a significant speedup in program execution time and a reduction in application-level energy consumption.
Abstract: Transiently Powered Computers (TPCs) are a new class of batteryless embedded systems that depend solely on energy harvested from external sources for performing computations. Enabling long-running computations on TPCs is a major challenge due to the highly intermittent nature of the power supply. We present QuickRecall, an in-situ checkpointing technique for TPCs using FRAM that consumes only 30nJ while decreasing the time taken for saving and restoring a checkpoint to only 21.06μs, which is over two orders of magnitude lower than the corresponding overhead using flash. We have implemented and evaluated QuickRecall using the TI MSP430FR5739 FRAM-enabled microcontroller. Experimental results show that our highly efficient checkpointing translates to a significant speedup (1.25x-8.4x) in program execution time and a reduction (∼3x) in application-level energy consumption.

76 citations


Journal ArticleDOI
TL;DR: A brain-inspired reconfigurable digital neuromorphic processor (DNP) architecture for large-scale spiking neural networks is presented and the functionality of the proposed DNP architecture is demonstrated by realizing an unsupervised-learning based character recognition system.
Abstract: This article presents a brain-inspired reconfigurable digital neuromorphic processor (DNP) architecture for large-scale spiking neural networks. The proposed architecture integrates an arbitrary number of N digital leaky integrate-and-fire (LIF) silicon neurons to mimic their biological counterparts and on-chip learning circuits to realize spike-timing-dependent plasticity (STDP) learning rules. We leverage memristor nanodevices to build an N×N crossbar array to store not only multibit synaptic weight values but also network configuration data with significantly reduced area overhead. Additionally, the crossbar array is designed to be accessible both column- and row-wise to expedite the synaptic weight update process for learning. The proposed digital pulse width modulator (PWM) produces binary pulses with various durations for reading and writing the multilevel memristive crossbar. The proposed column based analog-to-digital conversion (ADC) scheme efficiently accumulates the presynaptic weights of each neuron and reduces silicon area overhead by using a shared arithmetic unit to process the LIF operations of all N neurons. With 256 silicon neurons, learning circuits and 64K synapses, the power dissipation and area of our DNP are 6.45 mW and 1.86 mm2, respectively, when implemented in a 90-nm CMOS technology. The functionality of the proposed DNP architecture is demonstrated by realizing an unsupervised-learning based character recognition system.
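The abstract names two standard building blocks, leaky integrate-and-fire neurons and STDP learning. As a rough illustration of those generic rules only (not of the paper's fixed-point digital circuits or its memristive crossbar), here is a minimal discrete-time sketch in Python, where the weight clipping loosely stands in for bounded multilevel synapses:

```python
import numpy as np

def lif_step(v, i_syn, v_rest=0.0, v_thresh=1.0, leak=0.9):
    """One discrete-time leaky integrate-and-fire update for a vector of neurons.
    Returns the new membrane potentials and a boolean spike vector."""
    v = v_rest + leak * (v - v_rest) + i_syn   # leak toward rest, add synaptic input
    spikes = v >= v_thresh
    v[spikes] = v_rest                          # reset neurons that fired
    return v, spikes

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP: potentiate when pre fires before post, depress otherwise.
    t_pre and t_post are the last spike times of the pre/post neurons."""
    dt = t_post - t_pre
    dw = np.where(dt >= 0, a_plus * np.exp(-dt / tau), -a_minus * np.exp(dt / tau))
    return np.clip(w + dw, 0.0, 1.0)            # keep weights in a bounded (multilevel) range
```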

75 citations


Journal ArticleDOI
TL;DR: This article patterns neural activities across multiple timescales and encodes sensory information using time-dependent temporal scales; the proposed spiking neuron is compact, low power, and robust.
Abstract: This article presents our research towards developing novel and fundamental methodologies for data representation using spike-timing-dependent encoding. Time encoding efficiently maps a signal's amplitude information into a spike time sequence that represents the input data and offers perfect recovery for band-limited stimuli. In this article, we pattern the neural activities across multiple timescales and encode the sensory information using time-dependent temporal scales. The spike encoding methodologies for autonomous classification of time-series signatures are explored using near-chaotic reservoir computing. The proposed spiking neuron is compact, low power, and robust. These results are expected to lead to an agile hardware implementation of time encoding as a signal conditioner for dynamical neural processor designs.
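The claim that time encoding maps amplitude into spike times with perfect recovery for band-limited stimuli is usually associated with integrate-and-fire time encoding machines. The sketch below shows only that generic idea, not the authors' encoder; the bias, kappa, and delta parameters are illustrative assumptions:

```python
import numpy as np

def if_time_encoder(x, dt, bias=1.0, kappa=1.0, delta=0.05):
    """Integrate-and-fire time encoding: accumulate (x + bias)/kappa and emit a
    spike time whenever the integral crosses delta, then subtract the threshold.
    Amplitude information thus ends up in the spacing of the spike times."""
    spike_times, acc = [], 0.0
    for n, sample in enumerate(x):
        acc += (sample + bias) * dt / kappa
        if acc >= delta:
            spike_times.append(n * dt)
            acc -= delta
    return spike_times

# Example: encode a slow sinusoid sampled at 1 kHz.
t = np.arange(0.0, 1.0, 1e-3)
spikes = if_time_encoder(0.5 * np.sin(2 * np.pi * 3 * t), dt=1e-3)
```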

48 citations


Journal ArticleDOI
TL;DR: It is proved that determining an optimal embedding is coNP-hard even for restricted cases, and heuristic and exact methods are proposed for determining both the number of additional lines and a corresponding embedding.
Abstract: Reversible logic represents the basis for many emerging technologies and has recently been intensively studied. However, most of the Boolean functions of practical interest are irreversible and must be embedded into a reversible function before they can be synthesized. Thus far, an optimal embedding is guaranteed only for small functions, whereas a significant overhead results when large functions are considered. We study this issue in this article. We prove that determining an optimal embedding is coNP-hard even for restricted cases. Then, we propose heuristic and exact methods for determining both the number of additional lines and a corresponding embedding. For these approaches, we consider sums of products and binary decision diagrams as function representations. Experimental evaluations show the applicability of the approaches for large functions. Consequently, the reversible embedding of large functions is enabled as a precursor to subsequent synthesis.
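For intuition on the "number of additional lines": a commonly used lower bound says an irreversible function needs at least ceil(log2(mu)) garbage outputs, where mu is the largest number of input patterns sharing one output pattern. The sketch below computes only that bound; the article's contribution lies in the hardness result and in actually constructing embeddings:

```python
from collections import Counter
from math import ceil, log2

def min_garbage_lines(truth_table):
    """Lower bound on the garbage outputs needed to embed an irreversible
    function into a reversible one: ceil(log2(mu)), with mu the maximum
    number of inputs mapped to the same output pattern."""
    mu = max(Counter(truth_table.values()).values())
    return ceil(log2(mu)) if mu > 1 else 0

# Example: a 2-input AND maps three inputs to 0, so mu = 3 and at least
# ceil(log2(3)) = 2 garbage outputs are required.
and_table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
print(min_garbage_lines(and_table))  # -> 2
```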

48 citations


Journal ArticleDOI
TL;DR: In this article, the authors report a complete performance simulation software tool capable of searching the hardware design space by varying resource architecture and technology parameters, synthesizing and scheduling a fault-tolerant quantum algorithm within the hardware constraints, quantifying the performance metrics such as the execution time and the failure probability of the algorithm, and analyzing the breakdown of these metrics to highlight the performance bottlenecks and visualizing resource utilization.
Abstract: The optimal design of a fault-tolerant quantum computer involves finding an appropriate balance between the burden of large-scale integration of noisy components and the load of improving the reliability of hardware technology. This balance can be evaluated by quantitatively modeling the execution of quantum logic operations on a realistic quantum hardware containing limited computational resources. In this work, we report a complete performance simulation software tool capable of (1) searching the hardware design space by varying resource architecture and technology parameters, (2) synthesizing and scheduling a fault-tolerant quantum algorithm within the hardware constraints, (3) quantifying the performance metrics such as the execution time and the failure probability of the algorithm, and (4) analyzing the breakdown of these metrics to highlight the performance bottlenecks and visualizing resource utilization to evaluate the adequacy of the chosen design. Using this tool, we investigate a vast design space for implementing key building blocks of Shor’s algorithm to factor a 1,024-bit number with a baseline budget of 1.5 million qubits. We show that a trapped-ion quantum computer designed with twice as many qubits and one-tenth of the baseline infidelity of the communication channel can factor a 2,048-bit integer in less than 5 months.
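As a back-of-the-envelope companion to the two metrics the tool quantifies, execution time and failure probability compose roughly as below. The numbers in the example are hypothetical and not taken from the paper, whose tool derives both metrics from detailed scheduling on a concrete resource architecture:

```python
def first_order_estimates(n_logical_ops, circuit_depth, cycle_time_s, p_logical):
    """Crude estimates: total time scales with circuit depth and cycle time,
    and the algorithm fails if any of its logical operations fails."""
    exec_time = circuit_depth * cycle_time_s
    p_fail = 1.0 - (1.0 - p_logical) ** n_logical_ops
    return exec_time, p_fail

# Hypothetical numbers, for illustration only:
t, p = first_order_estimates(n_logical_ops=1e12, circuit_depth=1e10,
                             cycle_time_s=1e-6, p_logical=1e-15)
print(f"~{t / 86400:.1f} days, failure probability ~{p:.1e}")
```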

30 citations


Journal ArticleDOI
TL;DR: This article surveys several recent techniques that aim to offset these challenges and fully leverage the potential of near-threshold voltage computing, classifying them along several dimensions and highlighting their similarities and differences.
Abstract: Energy efficiency has now become the primary obstacle in scaling the performance of all classes of computing systems. Low-voltage computing, specifically, near-threshold voltage computing (NTC), which involves operating the transistor very close to and yet above its threshold voltage, holds the promise of providing many-fold improvement in energy efficiency. However, use of NTC also presents several challenges such as increased parametric variation, failure rate, and performance loss. This article surveys several recent techniques that aim to offset these challenges for fully leveraging the potential of NTC. By classifying these techniques along several dimensions, we also highlight their similarities and differences. It is hoped that this article will provide insights into state-of-the-art NTC techniques to researchers and system designers and inspire further research in this field.

27 citations


Journal ArticleDOI
TL;DR: The article demonstrates how the proposed programmable neuromorphic system can be configured to implement specific spike-based synaptic plasticity rules and depicts how it can be utilised in a cognitive task.
Abstract: Hardware implementations of spiking neural networks offer promising solutions for computational tasks that require compact and low-power computing technologies. As these solutions depend on both the specific network architecture and the type of learning algorithm used, it is important to develop spiking neural network devices that offer the possibility to reconfigure their network topology and to implement different types of learning mechanisms. Here we present a neuromorphic multi-neuron VLSI device with on-chip programmable event-based hybrid analog/digital circuits; the event-based nature of the input/output signals allows the use of address-event representation infrastructures for configuring arbitrary network architectures, while the programmable synaptic efficacy circuits allow the implementation of different types of spike-based learning mechanisms. The main contributions of this article are to demonstrate how the programmable neuromorphic system proposed can be configured to implement specific spike-based synaptic plasticity rules and to depict how it can be utilised in a cognitive task. Specifically, we explore the implementation of different spike-timing plasticity learning rules online in a hybrid system comprising a workstation and the neuromorphic VLSI device interfaced to it, and we demonstrate how, after training, the VLSI device can perform, as a standalone component (i.e., without requiring a computer), binary classification of correlated patterns.

22 citations


Journal ArticleDOI
TL;DR: An efficient energy harvesting system compatible with various environmental sources, such as light, heat, or wind energy, is proposed; it takes advantage of double-level capacitors not only to prolong system lifetime but also to enable robust booting of the system from an exhausted-energy state.
Abstract: To design autonomous wireless sensor networks (WSNs) with a theoretically infinite lifetime, energy harvesting (EH) techniques have recently been considered as promising approaches. Ambient sources can provide everlasting additional energy for WSN nodes and exclude their dependence on batteries. In this article, an efficient energy harvesting system which is compatible with various environmental sources, such as light, heat, or wind energy, is proposed. Our platform takes advantage of double-level capacitors not only to prolong system lifetime but also to enable robust booting of the system from an exhausted-energy state. Simulations and experiments show that our multiple-energy-sources converter (MESC) can achieve a booting time on the order of seconds. Although capacitors offer virtually unlimited recharge cycles, they suffer higher leakage than rechargeable batteries. Increasing their size can decrease the system performance due to leakage energy. Therefore, an energy-neutral design framework providing a methodology to determine the minimum size of those storage devices satisfying energy-neutral operation (ENO) and maximizing system quality-of-service (QoS) in EH nodes, when using a given energy source, is proposed. Experiments validating this framework are performed on a real WSN platform with both photovoltaic cells and thermal generators in an indoor environment. Moreover, simulations on OMNET++ show that the energy storage optimized from our design framework is utilized up to 93.86%.
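A simplified reading of the sizing problem behind the energy-neutral framework: the storage capacitor must cover the worst cumulative deficit between harvested and consumed energy within its usable voltage window. The sketch below implements only that first-order condition with made-up trace values; the paper's framework additionally models capacitor leakage and QoS:

```python
def min_storage_capacitance(harvest_w, load_w, dt_s, v_max, v_min):
    """Size C so that discharging from v_max to v_min covers the worst
    cumulative energy shortfall between load and harvest over the trace."""
    deficit, worst = 0.0, 0.0
    for p_in, p_out in zip(harvest_w, load_w):
        deficit = max(0.0, deficit + (p_out - p_in) * dt_s)  # shortfall so far (J)
        worst = max(worst, deficit)
    return 2.0 * worst / (v_max ** 2 - v_min ** 2)

# Hypothetical indoor trace: 1 mW harvested during the day, nothing at night,
# 0.5 mW average load, 1 s resolution over 24 h.
harvest = [1e-3] * 43200 + [0.0] * 43200
load = [0.5e-3] * 86400
print(min_storage_capacitance(harvest, load, 1.0, v_max=5.0, v_min=2.0))  # ~2 F
```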

22 citations


Journal ArticleDOI
TL;DR: This article presents a simulation framework for power-performance analysis of multicore architectures, with specific focus on the NoC, that integrates accurate power gating and DVFS models also encompassing their timing and power overheads.
Abstract: Networks-on-chip (NoCs) are a widely recognized viable interconnection paradigm to support the multi-core revolution. One of the major design issues of multicore architectures is still power, which can no longer be attributed mainly to the cores, since the NoC contribution to the overall energy budget is significant. To face both static and dynamic power while balancing NoC performance, different actuators have been exploited in the literature, mainly dynamic voltage frequency scaling (DVFS) and power gating. Typically, simulation-based tools are employed to explore the huge design space by adopting simplified models of the components. As a consequence, the majority of state-of-the-art works on NoC power-performance optimization do not accurately consider timing and power overheads of actuators, or (even worse) do not consider them at all, with the risk of overestimating the benefits of the proposed methodologies. This article presents a simulation framework for power-performance analysis of multicore architectures with specific focus on the NoC. It integrates accurate power gating and DVFS models encompassing also their timing and power overheads. The value added by our proposal is manifold: (i) DVFS and power gating actuators are modeled starting from SPICE-level simulations; (ii) such models have been integrated in the simulation environment; (iii) policy analysis support is plugged into the framework to enable assessment of different policies; (iv) a flexible GALS (globally asynchronous locally synchronous) support is provided, covering both handshake and FIFO re-synchronization schemes. To demonstrate both the flexibility and extensibility of our proposal, two simple policies exploiting the modeled actuators are discussed in the article.
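A toy example of the kind of policy such a framework is meant to evaluate: power-gate a router only when the predicted idle period repays the gating overheads. The coefficients below are placeholders; in the article those overheads come from SPICE-level characterization:

```python
def should_power_gate(predicted_idle_s, e_overhead_j, p_leak_w, wakeup_s):
    """Gate only if the idle period exceeds the break-even time implied by the
    gating energy overhead, treating the wakeup latency as unusable idle time."""
    break_even_s = e_overhead_j / p_leak_w   # time for leakage savings to repay the overhead
    return predicted_idle_s - wakeup_s > break_even_s

# Hypothetical numbers: 2 nJ overhead per gate/ungate cycle, 5 mW leakage,
# 100 ns wakeup, 1 us predicted idle window.
print(should_power_gate(1e-6, 2e-9, 5e-3, 100e-9))  # True: 400 ns break-even
```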

22 citations


Journal ArticleDOI
TL;DR: Two fault-tolerant protocols are presented, the trade-off in computational overhead is shown using the ten-bit quantum carry-lookahead adder as an example, and it is suggested that Alice send input qubits encoded with an error correction code instead of single input qubits.
Abstract: Blind quantum computation is an appealing use of quantum information technology because it can conceal both the client's data and the algorithm itself from the server. However, problems need to be solved in the practical use of blind quantum computation and fault-tolerance is a major challenge. Broadbent et al. proposed running error correction over blind quantum computation, and Morimae and Fujii proposed using fault-tolerant entangled qubits as the resource for blind quantum computation. Both approaches impose severe demands on the teleportation channel, the former requiring unrealistic data rates and the latter near-perfect fidelity. To extend the application range of blind quantum computation, we suggest that Alice send input qubits encoded with an error correction code instead of single input qubits. Two fault-tolerant protocols are presented and we show the trade-off in computational overhead using the ten-bit quantum carry-lookahead adder as an example. Though these two fault-tolerant protocols require the client to have more quantum computing ability than the approaches from prior work, they provide better fault-tolerance when the client and the server are connected by realistic quantum repeater networks.

Journal ArticleDOI
TL;DR: In this article, the authors argue that the neural circuit approach, in which networks of neuronal elements that model brain circuitry are constructed, allows the development of practical applications and the exploration of brain function, and they present case studies of spiking neural networks in vision and recognition tasks based on one instantiation of a simulation environment.
Abstract: Neuromorphic engineering is a fast growing field with great potential in both understanding the function of the brain, and constructing practical artifacts that build upon this understanding. For these novel chips and hardware to be useful, hardware-compatible applications and simulation tools are needed. We argue that the neural circuit approach, in which networks of neuronal elements that model brain circuitry are constructed, allows the development of practical applications and the exploration of brain function. At this level of abstraction, networks of 10^5 neurons or larger can be efficiently simulated, but still preserve the neuronal and synaptic dynamics that appear to be important for brain function. Because the neural circuit level supports spiking neural networks and the prevalent Addressable Event Representation (AER) communication scheme, it fits well with many existing neuromorphic hardware and simulation tools. To show how this approach can be applied, we present case studies of spiking neural networks in vision and recognition tasks based on one instantiation of a simulation environment. However, there are now many hardware options, simulation environments, and applications in this emerging field. These approaches and other considerations are discussed.

Journal ArticleDOI
TL;DR: A mixed integer nonlinear programming model is proposed for the placement and scheduling of quantum circuits such that latency is minimized; the model is proved reducible to a quadratic assignment problem, a well-known NP-complete combinatorial optimization problem.
Abstract: Recent works on quantum physical design have pushed the scheduling and placement of quantum circuits into prominent positions. In this article, a mixed integer nonlinear programming model is proposed for the placement and scheduling of quantum circuits in such a way that latency is minimized. The proposed model determines locations of gates and the sequence of operations. The proposed model is proved reducible to a quadratic assignment problem, which is a well-known NP-complete combinatorial optimization problem. Since it is impossible to find the optimal solution of this NP-complete problem for large quantum circuits within a reasonable amount of time, a metaheuristic solution method is developed for the proposed model. Some experiments are conducted to evaluate the performance of the developed solution approach. Experimental results show that the proposed approach improves average latency by about 24.09% for the attempted benchmarks.
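To make the reduction concrete: a quadratic assignment problem scores a placement by summing interaction weights times the distances between assigned locations, and a metaheuristic such as simulated annealing searches over placements. The sketch below is a generic illustration under those assumptions, not the paper's specific model, objective, or solution method:

```python
import math, random

def qap_cost(assign, flow, dist):
    """Quadratic-assignment cost: interaction weight between operations i and j
    times the distance between the locations they are assigned to."""
    n = len(assign)
    return sum(flow[i][j] * dist[assign[i]][assign[j]]
               for i in range(n) for j in range(n))

def anneal_placement(flow, dist, iters=20000, t0=10.0, alpha=0.9995, seed=0):
    """Bare-bones simulated annealing over placements: propose swaps of two
    locations, accept improvements always and worsenings with a temperature-
    dependent probability."""
    rng = random.Random(seed)
    n = len(flow)
    cur = list(range(n))
    cur_cost = qap_cost(cur, flow, dist)
    best, best_cost, t = cur[:], cur_cost, t0
    for _ in range(iters):
        i, j = rng.sample(range(n), 2)
        cur[i], cur[j] = cur[j], cur[i]                 # propose swapping two locations
        new_cost = qap_cost(cur, flow, dist)
        if new_cost <= cur_cost or rng.random() < math.exp((cur_cost - new_cost) / t):
            cur_cost = new_cost                         # accept the move
            if new_cost < best_cost:
                best, best_cost = cur[:], new_cost
        else:
            cur[i], cur[j] = cur[j], cur[i]             # reject: undo the swap
        t *= alpha
    return best, best_cost
```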

Journal ArticleDOI
TL;DR: This article proposes PROTON+, a fast tool for placement and routing of 3D ONoCs minimizing the total laser power, and studies optimal positions of memory controllers for the first time.
Abstract: Optical Networks-on-Chip (ONoCs) are a promising technology to overcome the bottleneck of low bandwidth of electronic Networks-on-Chip. Recent research discusses power and performance benefits of ONoCs based on their system-level design, while layout effects are typically overlooked. As a consequence, laser power requirements are inaccurately computed from the logic scheme but do not consider the layout. In this article, we propose PROTON+, a fast tool for placement and routing of 3D ONoCs minimizing the total laser power. Using our tool, the required laser power of the system can be decreased by up to 94% compared to a state-of-the-art manually designed layout. In addition, with the help of our tool, we study the physical design space of ONoC topologies. For this purpose, topology synthesis methods (e.g., global connectivity and network partitioning) as well as different objective function weights are analyzed in order to minimize the maximum insertion loss and ultimately the system’s laser power consumption. For the first time, we study optimal positions of memory controllers. A comparison of our algorithm to a state-of-the-art placer for electronic circuits shows the need for a different set of tools custom-tailored for the particular requirements of optical interconnects.
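The link between layout and laser power can be sketched as follows: each optical path accumulates propagation, crossing, and bend losses, and the laser must overcome the worst-case path loss down to the detector sensitivity. All coefficients below are illustrative placeholders rather than values from the paper:

```python
def path_loss_db(length_cm, n_crossings, n_bends,
                 prop_db_per_cm=1.0, crossing_db=0.15, bend_db=0.005):
    """Insertion loss of one optical path as a sum of per-element losses (dB)."""
    return (length_cm * prop_db_per_cm
            + n_crossings * crossing_db
            + n_bends * bend_db)

def laser_power_mw(worst_loss_db, detector_sensitivity_dbm=-20.0, margin_db=1.0):
    """The laser must be strong enough that the weakest signal still reaches
    the detector sensitivity; layout-aware placement and routing shrink the
    worst-case loss and hence the required laser power."""
    p_dbm = detector_sensitivity_dbm + worst_loss_db + margin_db
    return 10 ** (p_dbm / 10.0)

worst = max(path_loss_db(2.5, 40, 6), path_loss_db(1.8, 55, 4))
print(f"required laser power ~{laser_power_mw(worst):.2f} mW per wavelength")
```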

Journal ArticleDOI
TL;DR: A transversal survey on energy-efficient techniques ranging from devices to architectures is presented, showing how novel and emerging devices provide new opportunities to extend the trend toward low-power design.
Abstract: Nowadays, power consumption is one of the main limitations of electronic systems. In this context, novel and emerging devices provide new opportunities to extend the trend toward low-power design. In this survey article, we present a transversal survey on energy-efficient techniques ranging from devices to architectures. The current trends of device research, with fully depleted planar devices, tri-gate geometries, and gate-all-around structures, allow us to reach increasingly higher levels of performance while reducing the associated power. In addition, beyond the simple device property enhancements, emerging devices also lead to innovations at the circuit and architectural levels. In particular, devices whose properties can be tuned through additional terminals enable a fine and dynamic control of the device threshold. They also enable designers to realize logic gates and to implement power-related techniques in a compact way unreachable with standard technologies. These innovations reduce power consumption at the gate level and unlock new means of actuation in architectural solutions like adaptive voltage and frequency scaling.

Journal ArticleDOI
TL;DR: This article investigates the design of a neuro-inspired logic block (NLB) dedicated to on-chip function learning and proposes a learning strategy and supervised learning methods to demonstrate the ability to learn logic functions with memristive nanodevices.
Abstract: Scaling down beyond CMOS transistors requires the combination of new computing paradigms and novel devices. In this context, neuromorphic architectures are developed to achieve robust and ultra-low-power computing systems. Memristive nanodevices are often associated with this architecture to efficiently implement synapses at ultra-high density. In this article, we investigate the design of a neuro-inspired logic block (NLB) dedicated to on-chip function learning and propose a learning strategy. It is composed of an array of memristive nanodevices acting as synapses associated with neuronal circuits. Supervised learning methods are proposed for different types of memristive nanodevices, and simulations are performed to demonstrate the ability to learn logic functions with memristive nanodevices. Benefiting from a compact implementation of neuron circuits and the optimization of the learning process, this architecture requires a small number of nanodevices and moderate power consumption.

Journal ArticleDOI
TL;DR: This article presents a fully autonomous and battery-less circuit solution for piezoelectric energy harvesting based on discrete components in a low-cost PCB technology, which achieves a comparable performance in a 32 × 43 mm2 footprint.
Abstract: In the field of energy harvesting there is a growing interest in power management circuits with intrinsic sub-μA current consumptions, in order to operate efficiently with very low levels of available power. In this context, integrated circuits proved to be a viable solution with high associated nonrecurring costs and design risks. As an alternative, this article presents a fully autonomous and battery-less circuit solution for piezoelectric energy harvesting based on discrete components in a low-cost PCB technology, which achieves a comparable performance in a 32 × 43 mm2 footprint. The power management circuit implements synchronous electric charge extraction (SECE) with a passive bootstrap circuit from fully discharged states. Circuit characterization showed that the circuit consumes less than 1μA with a 3V output and may achieve energy conversion efficiencies of up to 85%. In addition, the circuit is specifically designed for operating with input and output voltages up to 20V, which grants a significant flexibility in the choice of transducers and energy storage capacitors.

Journal ArticleDOI
TL;DR: This work proposes a write-aware STTRAM-based RF architecture (WarRF), which contains two techniques: Split Bank Write modifies the arbitrator design to increase the parallelism of read and write accesses in the same bank; Write Pool reduces the number of repeated write accesses to RFs.
Abstract: The massively parallel processing capacity of GPGPUs requires a large register file (RF), and its size keeps increasing to support more concurrent threads from generation to generation. Using traditional SRAM-based RFs, there are concerns in both area cost and energy consumption, and soon they will become unrealistic. In this work, we analyze the feasibility of using STTRAM-based RF designs, which have benefits in terms of smaller silicon area and zero standby leakage power. However, STTRAM long write latency and high write energy bring new challenges. Therefore, we propose a write-aware STTRAM-based RF architecture (WarRF), which contains two techniques: Split Bank Write modifies the arbitrator design to increase the parallelism of read and write accesses in the same bank; Write Pool reduces the number of repeated write accesses to RFs. Our experiment shows that the performance of STTRAM-based RF is improved by 13% and up to 23% after adopting WarRF. In addition, the energy consumption is reduced by 38% on average compared to SRAM-based RFs.

Journal ArticleDOI
TL;DR: A methodology based on both ab-initio simulations and post-processing of data for analyzing an mQCA system from an electronic point of view is identified, and an assessment of the energy of an mQCA, one of the most promising features of this technology, is started.
Abstract: Molecular quantum-dot cellular automata (mQCA) is an emerging paradigm for nanoscale computation. Its revolutionary features are the expected operating frequencies (THz), the high device densities, the noncryogenic working temperature, and, above all, the limited power densities. The main drawback of this technology is a consequence of one of its very main advantages, that is, the extremely small size of a single molecule. Device prototyping and the fabrication of a simple circuit are limited by lack of control in the technological process [Pulimeno et al. 2013a]. Moreover, high defectivity might strongly impact the correct behavior of mQCA devices. Another challenging point is the lack of a solid method for analyzing and simulating mQCA behavior and performance, either in ideal or defective conditions. Our contribution in this article is threefold: (i) We identify a methodology based on both ab-initio simulations and post-processing of data for analyzing an mQCA system adopting an electronic point of view (we baptized this method as “MoSQuiTo”); (ii) we assess the performance of an mQCA device (in this case, a bis-ferrocene molecule) working in nonideal conditions, using as a reference the information on fabrication-critical issues and on the possible defects that we are obtaining while conducting our own ongoing experiments on mQCA; (iii) we determine and assess the electrostatic energy stored in a bis-ferrocene molecule both in an oxidized and reduced form. Results presented here consist of quantitative information for an mQCA device working in manifold driving conditions and subjected to defects. This information is given in terms of: (a) output voltage; (b) safe operating area (SOA); (c) electrostatic energy; and (d) relation between SOA and energy, that is, possible energy reduction subject to reliability and functionality constraints. The whole analysis is a first fundamental step toward the study of a complex mQCA circuit. It gives important suggestions on possible improvements of the technological processes. Moreover, it starts an interesting assessment on the energy of an mQCA, one of the most promising features of this technology.

Journal ArticleDOI
TL;DR: An adaptive write scheme adjusts the write pulses to address variations in memristive arrays, resulting in 7×–11× average energy saving in case studies, and helps shorten the test time of memory march algorithms.
Abstract: Recent advances in access-transistor-free memristive crossbars have demonstrated the potential of memristor arrays as high-density and ultra-low-power memory. However, with considerable variations in the write-time characteristics of individual memristors, conventional fixed-pulse write schemes cannot guarantee reliable completion of the write operations and waste a significant amount of energy. We propose an adaptive write scheme that adaptively adjusts the write pulses to address such variations in memristive arrays, resulting in 7×–11× average energy saving in our case studies. Our scheme embeds an online monitor to detect the completion of a write operation and takes into account the parasitic effect of line-shared devices in access-transistor-free crossbars. This feature also helps shorten the test time of memory march algorithms by eliminating the need for a verifying read right after a write, which is commonly employed in the test sequences of march algorithms.
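A rough sketch of completion-triggered writing under stated assumptions: the write voltage stays applied only until an online monitor detects switching, so slow devices get longer effective pulses and fast ones are not over-driven. The cell and monitor objects are hypothetical handles; the paper's monitor is an analog circuit that also accounts for line-shared parasitics:

```python
def adaptive_write(cell, monitor, v_write, dt=10e-9, timeout=2e-6):
    """Apply the write voltage and poll an online completion monitor,
    terminating the pulse as soon as the device has switched."""
    t, energy = 0.0, 0.0
    cell.assert_voltage(v_write)
    try:
        while t < timeout:
            t += dt
            energy += v_write * cell.current() * dt   # energy delivered so far
            if monitor.switch_detected():              # completion: end the pulse early
                return t, energy
        raise RuntimeError("write did not complete before timeout")
    finally:
        cell.assert_voltage(0.0)                       # always remove the write voltage
```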

Journal ArticleDOI
TL;DR: This article proposes modeling cortical plasticity in mammalian brains as a problem of estimating a probability density function that would correspond to the nature and the richness of the environment perceived through multiple modalities and defines and develops a novel neural model solving the problem in a distributed and sparse manner.
Abstract: Neurobiological systems have often been a source of inspiration for computational science and engineering, but in the past their impact has also been limited by the understanding of biological models. Today, new technologies lead to an equilibrium situation where powerful and complex computers bring new biological knowledge of the brain behavior. At this point, we possess sufficient understanding both to imagine new brain-inspired computing paradigms and to sustain a classical paradigm which is reaching its programming and intellectual limitations. In this context, we propose to reconsider the computation problem first in the specific domain of mobile robotics. Our main proposal consists in considering computation as part of a global adaptive system, composed of sensors, actuators, a source of energy and a controlling unit. During the adaptation process, the proposed brain-inspired computing structure not only executes the tasks of the application but also reacts to the external stimulation and acts on the emergent behavior of the system. This approach is inspired by cortical plasticity in mammalian brains and suggests developing the computation architecture along with the system's experience. This article proposes modeling this plasticity as a problem of estimating a probability density function. This function would correspond to the nature and the richness of the environment perceived through multiple modalities. We define and develop a novel neural model solving the problem in a distributed and sparse manner. We then integrate this neural map into a bio-inspired hardware substrate that brings the plasticity property into parallel many-core architectures. The approach is then called Hardware Plasticity. The results show that the self-organization properties of our model solve the problem of multimodal sensory data clusterization. The properties of the proposed model allow envisaging the deployment of this adaptation layer into hardware architectures embedded into the robot's body in order to build intelligent controllers.

Journal ArticleDOI
TL;DR: A FinFET-based low-swing clocking methodology is introduced to preserve the dynamic power savings of low-swing clocking while minimizing these three negative effects, facilitated through an efficient use of FinFET technology.
Abstract: A low-swing clocking methodology is introduced to achieve low-power operation at the 20nm FinFET technology node. Low-swing clock trees are used in existing methodologies in order to decrease the dynamic power consumption, in a trade-off involving three issues: (1) the effect of leakage power consumption, which becomes more dominant as the process scales below 32nm; (2) the increase in insertion delay, resulting in a high clock skew; and (3) the difficulty in driving the existing DFF sinks with a low-swing clock signal without a timing violation. In this article, a FinFET-based low-swing clocking methodology is introduced to preserve the dynamic power savings of low-swing clocking while minimizing these three negative effects, facilitated through an efficient use of FinFET technology. At scaled performance constraints, the proposed methodology at 20nm FinFET leads to 42% total power savings (clock network+DFF) compared to a FinFET-based full-swing counterpart at the same frequency (3 GHz), thanks to the dynamic power savings of low-swing clocking, and 3% power savings compared to a CMOS-based low-swing implementation running at half the frequency (1.5 GHz), thanks to the leakage power savings of FinFET technology.

Journal ArticleDOI
TL;DR: A Single Writer Multiple Reader (SWMR) bus-based crossbar mNoC is presented that can achieve more than 88% reduction in energy for a 64×64 crossbar compared to similar ring resonator based designs and can scale to a 256×256 crossbar with an average 10% performance improvement and 54% energy reduction.
Abstract: Moore's law and the continuity of device scaling have led to an increasing number of cores/nodes on a chip, creating a need for new mechanisms to achieve high-performance and power-efficient Network-on-Chip (NoC). Nanophotonics-based NoCs provide for higher bandwidth and more power-efficient designs than electronic networks. Present approaches often use an external laser source, ring resonators, and waveguides. However, they still suffer from important limitations: large static power consumption, and limited network scalability. In this article, we explore the use of emerging molecular-scale devices to construct nanophotonic networks: the Molecular-scale Network-on-Chip (mNoC). We leverage on-chip emitters such as quantum dot LEDs, which provide electrical-to-optical signal modulation, and chromophores, which provide optical signal filtering for receivers. These devices replace the ring resonators and the external laser source used in contemporary nanophotonic NoCs. They reduce energy consumption or enable scaling to larger crossbars for a reduced energy budget. We present a Single Writer Multiple Reader (SWMR) bus-based crossbar mNoC. Our evaluation shows that an mNoC can achieve more than 88% reduction in energy for a 64×64 crossbar compared to similar ring-resonator-based designs. Additionally, an mNoC can scale to a 256×256 crossbar with an average 10% performance improvement and 54% energy reduction.

Journal ArticleDOI
TL;DR: This article aims to define a reliability criterion for NoC and provide a framework for quantifying this reliability as it relates to TSV issues, and for the first time, the reliability criterion is reduced to a tractable closed-form expression that requires a single Monte Carlo simulation.
Abstract: The network-on-chip (NoC) technology allows for integration of a manycore design on a single chip for higher efficiency and scalability. Three-dimensional (3D) NoCs offer several advantages over two-dimensional (2D) NoCs. Through-silicon via (TSV) technology is one of the candidates for implementation of 3D NoCs. TSV reliability analysis is still challenging for 3D NoC designers because of their unique electrical, thermal, and physical characteristics. After providing an overview of common TSV issues, this article aims to define a reliability criterion for NoC and provide a framework for quantifying this reliability as it relates to TSV issues. TSV issues are modeled as a time-invariant failure probability. Also, a reliability criterion for TSV-based NoC is defined. The relationship between NoC reliability and TSV failure is quantified. For the first time, the reliability criterion is reduced to a tractable closed-form expression that requires a single Monte Carlo simulation. Importantly, the Monte Carlo simulation depends only on network geometry. To demonstrate our proposed method, the reliability criterion of a simple 8×8×8 NoC supported by an 8×8×7 network of TSVs is calculated.
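As a small illustration of a geometry-only Monte Carlo of the kind the article relies on, the sketch below samples independent TSV failures over the 8×8×7 TSV network and uses a simple stand-in criterion (every layer interface keeps at least one working TSV); the paper's actual reliability criterion is different, but it is likewise evaluated from the network geometry and a time-invariant failure probability:

```python
import random

def mc_vertical_connectivity(p_fail, tsvs_per_interface=64, interfaces=7,
                             trials=100_000, seed=1):
    """Estimate the probability that every layer-to-layer interface retains at
    least one functional TSV when each TSV fails independently with p_fail.
    The estimate depends only on the network geometry and p_fail."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        if all(any(rng.random() > p_fail for _ in range(tsvs_per_interface))
               for _ in range(interfaces)):
            ok += 1
    return ok / trials

print(mc_vertical_connectivity(p_fail=0.05))  # ~1.0: 64 TSVs per interface is ample
```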

Journal ArticleDOI
TL;DR: The proposed MN-MATE, an elastic resource management architecture for a single cloud node with manycores, on-chip DRAM, and a large amount of off-chip DRAM and NVRAM, improves system performance and reduces energy consumption.
Abstract: The recent advent of manycore systems increases the need for a larger but faster memory hierarchy. Emerging next-generation memories such as on-chip DRAM and nonvolatile memory (NVRAM) are promising candidates for the replacement of DRAM-only main memory. Combined with the manycore trend, this gives an opportunity to rethink the conventional resource management system with a memory hierarchy for a single cloud node. In an attempt to mitigate the energy and memory problems, we propose MN-MATE, an elastic resource management architecture for a single cloud node with manycores, on-chip DRAM, and a large amount of off-chip DRAM and NVRAM. In MN-MATE, the hypervisor places consolidated VMs and balances memory among them. Based on the monitored information about the allocated memory, a guest OS co-schedules tasks accessing different types of memory with complementary access intensity. Polymorphic management of the DRAM hierarchy accelerates average memory access speed inside each guest OS. A guest OS reduces energy consumption with small performance loss based on the NVRAM-aware data placement policy and the hybrid page cache. A new lightweight kernel is developed to reduce the overhead from the guest OS for scientific applications. Experimental results show that our techniques in the MN-MATE platform improve system performance and reduce energy consumption.

Journal ArticleDOI
TL;DR: This article introduces a hardware-friendly model adapted from the CNFT, namely the RSDNF model (randomly spiking dynamic neural fields), which achieves scalable parallel implementations on digital hardware while maintaining the behavioral properties of CNFT models.
Abstract: Bio-inspired neural computation attracts a lot of attention as a possible solution for the future challenges in designing computational resources. Dynamic neural fields (DNF) provide cortically inspired models of neural populations to which computation can be applied for a wide variety of tasks, such as perception and sensorimotor control. DNFs are often derived from continuous neural field theory (CNFT). In spite of the parallel structure and regularity of CNFT models, few studies of hardware implementations have been carried out targeting embedded real-time processing. In this article, a hardware-friendly model adapted from the CNFT is introduced, namely the RSDNF model (randomly spiking dynamic neural fields). Thanks to their simplified 2D structure, RSDNFs achieve scalable parallel implementations on digital hardware while maintaining the behavioral properties of CNFT models. Spike-based computations within neurons in the field are introduced to reduce interneuron connection bandwidth. Additionally, local stochastic spike propagation ensures inhibition and excitation broadcast without a fully connected network. The behavioral soundness and robustness of the model in the presence of noise and distracters is fully validated through software and hardware. A field programmable gate array (FPGA) implementation shows how the RSDNF model ensures a level of density and scalability out of reach for previous hardware implementations of dynamic neural field models.
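For reference, dynamic neural field models of this kind are typically derived from the Amari-style CNFT equation below, where u is the field potential, w a local-excitation/broader-inhibition lateral kernel, f a firing-rate nonlinearity, s the external input, and h a resting level; the RSDNF model can be read as replacing the dense lateral integral with randomized spike propagation between neighbors:

```latex
\tau \,\frac{\partial u(\mathbf{x},t)}{\partial t}
  = -u(\mathbf{x},t)
  + \int_{\Omega} w(\mathbf{x}-\mathbf{x}')\, f\!\left(u(\mathbf{x}',t)\right)\, d\mathbf{x}'
  + s(\mathbf{x},t) + h
```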

Journal ArticleDOI
TL;DR: This article proposes computational simplifications and architectural optimizations of the original GBNN that lead to significant complexity and area reduction without affecting either memorizing or retrieving performance.
Abstract: The brain processes information through a complex hierarchical associative memory organization that is distributed across a complex neural network. The GBNN associative memory model has recently been proposed as a new class of recurrent clustered neural network that presents higher efficiency than the classical models. In this article, we propose computational simplifications and architectural optimizations of the original GBNN. This work leads to significant complexity and area reduction without affecting either memorizing or retrieving performance. The obtained results open new perspectives in the design of neuromorphic hardware to support large-scale general-purpose neural algorithms.

Journal ArticleDOI
TL;DR: This work proposes a new device--multilevel DWM with shift-based write (ML-DWM-SW)--that is capable of storing 2 bits in a single device that achieves improved write efficiency and features decoupled read-write paths, enabling independent optimizations of read and write operations.
Abstract: Spintronic memories are considered to be promising candidates for future on-chip memories due to their high density, nonvolatility, and near-zero leakage. However, they also face challenges such as high write energy and latency and limited read speed due to single-ended sensing. Further, the conflicting requirements of read and write operations lead to stringent design constraints that severely compromise their benefits. Recently, domain wall memory was proposed as a spintronic memory that has a potential for very high density by storing multiple bits in the domains of a ferromagnetic nanowire. While reliable operation of DWM with multiple domains faces many challenges, single-bit cells that utilize domain wall motion for writes have been experimentally demonstrated [Fukami et al. 2009]. This bit-cell, which we refer to as Domain Wall Memory with Shift-based Write (DWM-SW), achieves improved write efficiency and features decoupled read-write paths, enabling independent optimizations of read and write operations. However, these benefits are achieved at the cost of sacrificing the original goal of improved density. In this work, we explore multilevel storage as a new direction to enhance the density benefits of DWM-SW. At the device level, we propose a new device--multilevel DWM with shift-based write (ML-DWM-SW)--that is capable of storing 2 bits in a single device. At the circuit level, we propose an ML-DWM-SW based bit-cell design and layout. The ML-DWM-SW bit-cell incurs no additional area overhead compared to the DWM-SW bit-cell despite storing an additional bit, thereby achieving roughly twice the density. However, it requires a two-step write operation and has data-dependent read and write energies, which pose unique challenges. To address these issues, we propose suitable architectural optimizations: (i) intra-word interleaving and (ii) bit encoding. We design “all-spin” cache architectures using the proposed ML-DWM-SW bit-cell for both general purpose processors as well as general purpose graphics processing units (GPGPUs). We perform an iso-capacity replacement of SRAM with spintronic memories and study the energy and area benefits at iso-performance conditions. For general purpose processors, the ML-DWM-SW cache achieves 10X reduction in energy and 4.4X reduction in cache area compared to an SRAM cache and 2X and 1.7X reduction in energy and area, respectively, compared to an STT-MRAM cache. For GPGPUs, the ML-DWM-SW cache achieves 5.3X reduction in energy and 3.6X area reduction compared to SRAM and 3.5X energy reduction and 1.9X area reduction compared to STT-MRAM.

Journal ArticleDOI
TL;DR: An energy-efficient deadlock-free routing algorithm is proposed for 3D mesh topologies where vertical connections partially exist; by introducing some rules for selecting elevators, it eliminates the dedicated virtual channel requirement.
Abstract: 3D integrated circuits (3D ICs) using through-silicon vias (TSVs) make it possible to envision the stacking of dies with different functions and technologies, using a 3D network-on-chip (NoC) as an interconnect backbone. However, partial vertical connection in 3D NoCs seems unavoidable because of the large overhead of the TSV itself (e.g., large footprint, low fabrication yield, additional fabrication processes) as well as the heterogeneity in dimension. This article proposes an energy-efficient deadlock-free routing algorithm for 3D mesh topologies where vertical connections partially exist. By introducing some rules for selecting elevators (i.e., vertical links between dies), the routing algorithm can eliminate the dedicated virtual channel requirement. In this article, the rules themselves as well as the proof of deadlock freedom are given. By eliminating the virtual channels for deadlock avoidance, the proposed routing algorithm reduces the energy consumption by 38.9% compared to a conventional routing algorithm. When the virtual channel is used for reducing head-of-line blocking, the proposed routing algorithm increases performance by up to 23.1% and by 6.9% on average.
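The overall shape of elevator-based routing on a partially connected 3D mesh can be sketched as below: route in-plane to a chosen elevator, traverse it vertically, then finish in-plane. The elevator_for selection function is a hypothetical placeholder; the article's contribution is precisely the selection rules that make such routing deadlock-free without a dedicated virtual channel:

```python
def _steps(a, b):
    """Intermediate coordinates strictly after a, up to and including b."""
    if a == b:
        return []
    step = 1 if b > a else -1
    return list(range(a + step, b + step, step))

def xy_route(x0, y0, x1, y1, z):
    """Dimension-ordered XY routing on a single layer: move along X, then Y."""
    return [(x, y0, z) for x in _steps(x0, x1)] + [(x1, y, z) for y in _steps(y0, y1)]

def route_partial_3d(src, dst, elevator_for):
    """Build the hop list: XY to the elevator, ride it vertically, XY to dst.
    elevator_for(src_layer, dst_layer, xy) is a hypothetical selection rule."""
    (x, y, z), (dx, dy, dz) = src, dst
    hops = []
    if z != dz:
        ex, ey = elevator_for(z, dz, (x, y))
        hops += xy_route(x, y, ex, ey, z)                     # reach the elevator column
        hops += [(ex, ey, layer) for layer in _steps(z, dz)]  # ride it vertically
        x, y = ex, ey
    hops += xy_route(x, y, dx, dy, dz)                        # finish on the destination layer
    return hops
```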

Journal ArticleDOI
TL;DR: A 3D multicore architecture that provides poolable cache resources and a runtime management policy to improve energy efficiency in 3D systems by utilizing the flexible heterogeneity of cache resources are introduced.
Abstract: Resource pooling, where multiple architectural components are shared among cores, is a promising technique for improving system energy efficiency and reducing total chip area. 3D stacked multicore processors enable efficient pooling of cache resources owing to the short interconnect latency between vertically stacked layers. This article first introduces a 3D multicore architecture that provides poolable cache resources. We then propose a runtime management policy to improve energy efficiency in 3D systems by utilizing the flexible heterogeneity of cache resources. Our policy dynamically allocates jobs to cores on the 3D system while partitioning cache resources based on cache hungriness of the jobs. We investigate the impact of the proposed cache resource pooling architecture and management policy in 3D systems, both with and without on-chip DRAM. We evaluate the performance, energy efficiency, and thermal behavior for a wide range of workloads running on 3D systems. Experimental results demonstrate that the proposed architecture and policy reduce system energy-delay product (EDP) and energy-delay-area product (EDAP) by 18.8% and 36.1% on average, respectively, in comparison to 3D processors with static cache sizes.
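A deliberately simplified stand-in for a hungriness-driven partitioning step (the article's runtime policy also handles job-to-core allocation and thermal behavior): give every core a minimum share of the pooled cache and hand the remaining ways to the hungriest jobs first. Field names and numbers are illustrative assumptions:

```python
def partition_pooled_cache(jobs, total_ways):
    """Greedy partition: one guaranteed way per core, then grant spare ways to
    jobs in decreasing order of their measured cache hungriness, capped by
    each job's demand."""
    shares = {job["core"]: 1 for job in jobs}               # at least one way per core
    spare = total_ways - len(jobs)
    for job in sorted(jobs, key=lambda j: j["hungriness"], reverse=True):
        grant = min(spare, job["demand"] - 1)               # top up the hungriest jobs first
        shares[job["core"]] += grant
        spare -= grant
        if spare == 0:
            break
    return shares

jobs = [{"core": 0, "hungriness": 0.9, "demand": 8},
        {"core": 1, "hungriness": 0.2, "demand": 4},
        {"core": 2, "hungriness": 0.6, "demand": 6}]
print(partition_pooled_cache(jobs, total_ways=16))          # {0: 8, 1: 2, 2: 6}
```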