
Showing papers by "Massoud Pedram" published in 2019


Journal ArticleDOI
TL;DR: The proposed approximate multiplier has an almost Gaussian error distribution with a near-zero mean value and is exploited in the structure of a JPEG encoder, sharpening, and classification applications, indicating that the quality degradation of the output is negligible.
Abstract: A scalable approximate multiplier, called truncation- and rounding-based scalable approximate multiplier (TOSAM) is presented, which reduces the number of partial products by truncating each of the input operands based on their leading one-bit position. In the proposed design, multiplication is performed by shift, add, and small fixed-width multiplication operations resulting in large improvements in the energy consumption and area occupation compared to those of the exact multiplier. To improve the total accuracy, input operands of the multiplication part are rounded to the nearest odd number. Because input operands are truncated based on their leading one-bit positions, the accuracy becomes weakly dependent on the width of the input operands and the multiplier becomes scalable. Higher improvements in design parameters (e.g., area and energy consumption) can be achieved as the input operand widths increase. To evaluate the efficiency of the proposed approximate multiplier, its design parameters are compared with those of an exact multiplier and some other recently proposed approximate multipliers. Results reveal that the proposed approximate multiplier with a mean absolute relative error in the range of 11%–0.3% improves delay, area, and energy consumption up to 41%, 90%, and 98%, respectively, compared to those of the exact multiplier. It also outperforms other approximate multipliers in terms of speed, area, and energy consumption. The proposed approximate multiplier has an almost Gaussian error distribution with a near-zero mean value. We exploit it in the structure of a JPEG encoder, sharpening, and classification applications. The results indicate that the quality degradation of the output is negligible. In addition, we suggest an accuracy configurable TOSAM where the energy consumption of the multiplication operation can be adjusted based on the minimum required accuracy.

99 citations
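The truncate-then-round idea in the abstract above can be illustrated with a small sketch. This is a simplified illustration only, not the TOSAM architecture: the single parameter `h` (number of bits kept below the leading one) and the function names are assumptions made for this example, and the real design splits the work into shift, add, and small fixed-width multiply stages that are not modeled here.

```python
from typing import Tuple


def truncate_round_odd(a: int, h: int) -> Tuple[int, int]:
    """Approximate a > 0 as m << s, keeping the leading one plus h bits.

    A trailing 1 is appended to the kept bits, which places the
    approximation at the midpoint of the dropped range (hypothetical
    simplification of "rounding to the nearest odd number").
    """
    k = a.bit_length() - 1          # position of the leading one
    if k <= h:
        return a, 0                 # short operands are kept exactly
    dropped = k - h                 # number of low bits discarded
    kept = a >> dropped             # leading one + h following bits
    m = (kept << 1) | 1             # append a 1 -> always odd
    return m, dropped - 1


def approx_mul(a: int, b: int, h: int = 3) -> int:
    """Multiply two non-negative integers using truncated operands."""
    if a == 0 or b == 0:
        return 0
    ma, sa = truncate_round_odd(a, h)
    mb, sb = truncate_round_odd(b, h)
    # Small fixed-width multiply followed by a shift.
    return (ma * mb) << (sa + sb)


if __name__ == "__main__":
    exact = 1000 * 1000
    approx = approx_mul(1000, 1000, h=3)
    print(approx, abs(approx - exact) / exact)
```

Because truncation is anchored to the leading one rather than to a fixed bit position, the per-operand relative error in this sketch is bounded by 2^-(h+1) regardless of operand width, which mirrors the abstract's point that accuracy depends only weakly on input width.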


Journal ArticleDOI
TL;DR: An overview of the current and planned activities related to the ColdFlux project is presented and the design assumptions and decisions that were made to allow the development of design tools for million-gate circuits are justified.
Abstract: The IARPA SuperTools program requires the development of superconducting electronic design automation (S-EDA) and superconducting technology computer-aided design (S-TCAD) tools aimed at enabling the reliable design of complex superconducting digital circuits with millions of Josephson junctions. Within the SuperTools program, the ColdFlux project addresses S-EDA and S-TCAD tool research and development in four areas: 1) RTL synthesis, architectures and verification; 2) analog design and layout synthesis; 3) physical design and test; and 4) device and process modeling/simulation and cell library design. Capabilities include, but are not limited to, the following: device-level modeling and simulation of Josephson junctions, modeling and simulation of superconducting manufacturing processes, powerful new electrical circuit simulation, parameterized schematic and layout libraries, optimization, compact SPICE-like model extraction, timing analysis, behavioral, register-transfer-level and logic syntheses, clock tree synthesis, placement and routing, layout-versus-schematic extraction, functional verification, and the evaluation of designs in the presence of magnetic fields and trapped flux. ColdFlux consists of six research groups from four continents. Here, we present an overview of the current and planned activities related to the project and justify the design assumptions and decisions that were made to allow the development of design tools for million-gate circuits.

54 citations


Journal ArticleDOI
TL;DR: A signal processing theoretical modeling approach for describing the power of the approximation noise, which is the integral of error spectral density over the bandwidth, is developed and a mathematical optimization approach based on Lagrange multipliers for optimizing design parameters is presented.
Abstract: In this paper, we present a framework for analytically estimating the output quality of common digital signal processing (DSP) blocks that utilize approximate adders. The framework is based on considering the error of approximate adders as an additive noise (approximation noise) that disturbs the output of the DSP block in question. A signal processing theoretical modeling approach for describing the power of the approximation noise, which is the integral of error spectral density over the bandwidth, is developed. The output qualities of DSP blocks, such as finite impulse response filter, discrete cosine transform, and fast Fourier transform, which utilize approximate adders, are thus estimated. The accuracy of the proposed framework is evaluated by comparing mathematical model predictions to simulation results by using the signal-to-noise ratio (SNR) metric. The inaccuracy of the SNRs predicted by the framework was, on average, less than 2.5 dB compared with that obtained from simulations. Therefore, a mathematical optimization approach based on Lagrange multipliers for optimizing design parameters is also presented. The optimization is realized by choosing a proper configuration of the target block, such as determining the data width of the inexact computation part for each approximate adder in the design.

34 citations
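To make the "approximation noise" view concrete, the sketch below measures the SNR of an approximate adder by simulation, which is the kind of reference the abstract compares its analytical model against. The adder used here (bitwise OR on the low `k` bits with no carry into the upper part) is an assumption chosen for brevity; the abstract does not name a specific adder, and the analytical spectral-density model itself is not reproduced.

```python
import math
import random


def or_approx_add(a: int, b: int, k: int) -> int:
    """Approximate adder: OR the k low bits, add the upper bits exactly.

    Hypothetical example adder; the carry from the low part is dropped.
    """
    mask = (1 << k) - 1
    low = (a | b) & mask
    high = ((a >> k) + (b >> k)) << k
    return high | low


def measure_snr_db(k: int, samples: int = 2000, seed: int = 0) -> float:
    """Empirical SNR (dB) of the approximate sum over random 16-bit inputs."""
    rng = random.Random(seed)
    signal_power = 0.0
    noise_power = 0.0
    for _ in range(samples):
        a = rng.randrange(1 << 16)
        b = rng.randrange(1 << 16)
        exact = a + b
        error = exact - or_approx_add(a, b, k)   # the additive "noise"
        signal_power += exact * exact
        noise_power += error * error
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    for k in (2, 4, 6, 8):
        print(k, round(measure_snr_db(k), 1))
```

Widening the inexact part (larger `k`) raises the noise power and lowers the SNR, which is the trade-off the optimization step in the abstract tunes per adder.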


Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization, which completely removes the energy-hungry step of accessing memory for obtaining model parameters.
Abstract: Deep neural networks have been successfully deployed in a wide variety of applications including computer vision and speech recognition. To cope with computational and storage complexity of these models, this paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization. The aforementioned realization completely removes the energy-hungry step of accessing memory for obtaining model parameters, consumes about two orders of magnitude fewer computing resources compared to realizations that use floating-point operations, and has a substantially lower latency.

30 citations


Proceedings ArticleDOI
25 Mar 2019
TL;DR: A scalable framework for gate-level circuit recognition that leverages deep learning and a convolutional neural network (CNN)-based circuit representation is presented and a data structure, termed level-dependent decaying sum (LDDS) existence vector, which can compactly represent information about the circuit topology is proposed.
Abstract: Efficiently recognizing the functionality of a circuit is key to many applications, such as formal verification, reverse engineering, and security. We present a scalable framework for gate-level circuit recognition that leverages deep learning and a convolutional neural network (CNN)-based circuit representation. Given a standard cell library, we present a sparse mapping algorithm to improve the time and memory efficiency of the CNN-based circuit representation. Sparse mapping allows encoding only the logic cell functionality, independently of implementation parameters such as timing or area. We further propose a data structure, termed level-dependent decaying sum (LDDS) existence vector, which can compactly represent information about the circuit topology. Given a reference gate in the circuit, an LDDS vector can capture the function of the gates in the input and output cones as well as their distance (number of stages) from the reference. Compared to the baseline approach, our framework obtains more than an order-of-magnitude reduction in the average training time and a 2× improvement in the average runtime for generating CNN-based representations from gate-level circuits, while achieving 10% higher accuracy on a set of benchmarks including EPFL and ISCAS’85 circuits.

22 citations


Journal ArticleDOI
TL;DR: A synchronous minimum-skew clock tree synthesis algorithm for single flux quantum circuits considering splitter delays and placement blockages and creating a fully balanced clock tree structure in which the number of clock splitters from the clock source to all the sink nodes is identical.
Abstract: This article presents a synchronous minimum-skew clock tree synthesis algorithm for single flux quantum circuits considering splitter delays and placement blockages. The proposed methodology improves the state-of-the-art by accounting for splitter delays and creating a fully balanced clock tree structure in which the number of clock splitters from the clock source to all the sink nodes is identical. Additionally, a mixed integer linear programming based algorithm is presented that removes the overlaps among the clock splitters and placed cells (i.e., placement blockages) and minimizes the clock skew, simultaneously. Using the proposed method, the average clock skew for 17 benchmark circuits is 4.6 ps, improving the state-of-the-art algorithm by 70%. Finally, a clock tree synthesis algorithm for imbalanced topologies is presented that reduces the clock skew and the number of clock splitters in the clock network by 56% and 37%, respectively, compared with a fully balanced clock tree solution.

16 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper presents a dynamic programming-based technology mapping algorithm that generates a minimum-area mapping solution guaranteed to be fully path balanced, which is essential to conventional superconductive single flux quantum circuits, which will fail otherwise.
Abstract: Path balancing technology mapping is a method of mapping a technology-independent logical description of a circuit, such as a Boolean network, into a technology-dependent, gate-level netlist. For a gate-level netlist generated by the path balancing mapper, the difference between lengths of the longest and the shortest paths in the circuit is minimized. To achieve full path balancing, it may be necessary to add buffers on signal paths, and in such a case, the cost of buffers must be properly accounted for. This paper presents a dynamic programming-based technology mapping algorithm that generates a minimum-area mapping solution which is guaranteed to be fully path balanced. The fully path balanced mapping solution is essential to conventional superconductive single flux quantum circuits, which will fail otherwise. The balanced mapping solution is also useful in CMOS circuits to avoid (or minimize) unwanted hazard activity and the resulting wasteful dynamic power dissipation as well as to achieve the maximum throughput in a wave-pipelined circuit. Experimental results show that our path balancing technology mapping algorithm decreases total area, static power consumption, and path balancing overhead of single flux quantum circuits by large factors. For example, it reduces the circuit area by up to 111% and by an average of 26.3% compared to state-of-the-art technology mappers.

16 citations
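The buffer cost that a path balancing mapper must account for can be sketched as follows. This is not the dynamic programming mapper from the abstract; it only counts the buffers needed to fully balance an already-mapped netlist, assuming unit-delay gates. The netlist encoding (`fanins` dictionary, `outputs` list) and the function name are hypothetical.

```python
from typing import Dict, List


def balancing_buffers(fanins: Dict[str, List[str]], outputs: List[str]) -> int:
    """Count buffers needed so every input-to-output path has equal length.

    fanins maps each gate to the nodes driving it; nodes that never
    appear as keys are treated as primary inputs at level 0.
    """
    levels: Dict[str, int] = {}

    def level(node: str) -> int:
        # Level = length of the longest path from any primary input.
        if node not in levels:
            ins = fanins.get(node, [])
            levels[node] = 0 if not ins else 1 + max(level(i) for i in ins)
        return levels[node]

    # An edge that skips levels needs one buffer per skipped level.
    edge_buffers = sum(
        level(gate) - level(src) - 1
        for gate, ins in fanins.items()
        for src in ins
    )
    # Shallow outputs are padded up to the depth of the deepest output.
    depth = max(level(o) for o in outputs)
    output_buffers = sum(depth - level(o) for o in outputs)
    return edge_buffers + output_buffers


if __name__ == "__main__":
    # g1 = AND(a, b); g2 = OR(g1, c): input c skips one level.
    netlist = {"g1": ["a", "b"], "g2": ["g1", "c"]}
    print(balancing_buffers(netlist, ["g2"]))
```

A mapper that ignores this count can pick a small-area cover whose buffer overhead dominates the final area, which is why the abstract stresses that buffer cost must be part of the mapping objective.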


Journal ArticleDOI
TL;DR: Simulation results for current recycling ERSFQ circuits are presented along with a strategy for implementing large superconducting circuits, and an innovative clock-choking mechanism using magnetic Josephson junctions is proposed.
Abstract: Energy-efficient rapid single flux quantum (ERSFQ) circuits have become a viable alternative for the implementation of superconducting circuits due to a large amount of static power consumption in RSFQ circuits. ERSFQ circuits are built upon the popular RSFQ logic circuits by replacing the power-dissipating resistor bias network with a bias network consisting of active devices. In this paper, a simulation study of ERSFQ biasing scheme is carried out by building simulation test benches for both synchronous and asynchronous ERSFQ circuits. A study is carried out to present the optimum value of biasing inductance, influence of the feeding Josephson transmission line (FJTL) and the effect of its size, the effect of the feeding clock frequency, and the effect of the circuit operating frequency. An innovative clock-choking mechanism using magnetic Josephson junctions is also proposed for the FJTL in the case of no logic circuit activity for a current-recycling circuit block, which would help in eliminating the dynamic power consumed due to the switching of bias junctions in a logic circuit. Simulation results for current recycling ERSFQ circuits are presented along with a strategy for implementing large superconducting circuits.

15 citations


Journal ArticleDOI
TL;DR: An integrated global and detailed router for the SFQ circuits, qGDR, which aims at reducing the impedance mismatch during signal transfer by minimizing the total number of used vias by resorting to a maze routing algorithm.
Abstract: Single-flux-quantum (SFQ) circuit technologies are promising digital circuit technologies with high-speed and extremely low-power characteristics. However, heavy wire routing tasks are finished either by considerable human effort or by commercial routing tools with few physical considerations for the SFQ circuits. In this paper, we present an integrated global and detailed router for the SFQ circuits, qGDR, which aims at reducing the impedance mismatch during signal transfer by minimizing the total number of used vias. The global router allocates routing resources while minimizing the via usage by a dynamic layer assignment algorithm. The detailed router follows the global routing results to complete the routing task by resorting to a maze routing algorithm. Following the MIT-LL SFQ5ee process technology, qGDR can use only two routing layers to route an 8-bit integer divider with more than 40 000 Josephson junctions in less than one hour.

15 citations


Journal ArticleDOI
TL;DR: A new timing characterization method is presented here for SFQ logic cells, which relies on low-dimensional lookup tables (LUTs) to store the clock-to-output delay, setup, and hold times of clocked cells and input-to-output delay of nonclocked cells in an SFQ standard cell library.
Abstract: Single flux quantum (SFQ) logic families require the development of electronic design automation tools to generate large-scale circuits. The available methodologies or tools for performing timing analysis of SFQ circuits do not have a load-dependent timing characterization method for calculating the context-dependent delay of cells, such as the nonlinear delay model for complementary metal–oxide–semiconductor (CMOS) circuits. A new timing characterization method is presented here for SFQ logic cells, which relies on low-dimensional lookup tables (LUTs) to store the clock-to-output delay, setup, and hold times of clocked cells and input-to-output delay of nonclocked cells in an SFQ standard cell library. Although the delay of Josephson junction based logic cells depends on many parameters, this paper shows that it is possible to reduce this dependency to only a small number of well-chosen parameters. All LUTs are obtained from JSIM simulations for a given target process technology. The accuracy of the proposed LUT-based timing characterization method is compared against JSIM simulations, which shows a maximum error of only 2.1% of the tested clocked cells with different loads.

15 citations


Proceedings ArticleDOI
06 Mar 2019
TL;DR: VeriSFQ as discussed by the authors is a semi-formal verification framework for single-flux quantum (SFQ) circuits using the Universal Verification Methodology (UVM) standard.
Abstract: In this paper, we propose a semi-formal verification framework for single-flux quantum (SFQ) circuits called VeriSFQ, using the Universal Verification Methodology (UVM) standard. The considered SFQ technology comprises superconducting digital electronic devices that operate at cryogenic temperatures with active circuit elements called Josephson junctions, which operate at high switching speeds and low switching energy, allowing SFQ circuits to operate at frequencies over 300 gigahertz. Due to key differences between SFQ and CMOS logic, verification techniques for the former are not as advanced as those for the latter. Thus, it is crucial to develop efficient verification techniques as the complexity of SFQ circuits scales. The VeriSFQ framework focuses on verifying the key circuit and gate-level properties of SFQ logic: fanout, gate-level pipeline, path balancing, and input-to-output latency. The combinational circuits considered in analyzing the performance of VeriSFQ are: Kogge-Stone adders (KSA), array multipliers, integer dividers, and select ISCAS’85 combinational benchmark circuits. We experimented with methods of introducing bugs into SFQ circuit designs to test their detection, including stuck-at faults, fanout errors, unbalanced paths, and functional bugs like incorrect logic gates. In addition, we propose an SFQ verification benchmark consisting of combinational SFQ circuits that exemplify SFQ logic properties and present the performance of the VeriSFQ framework on these benchmark circuits. The portability and reusability of the UVM standard allows the VeriSFQ framework to serve as a foundation for future SFQ semi-formal verification techniques.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper presents a novel approach for modeling idle intervals in MPSoC platforms which leads to a mixed integer linear programming (MILP) formulation integrating DPM, DVFS, and task scheduling of periodic task graphs subject to a hard deadline.
Abstract: Energy efficiency is one of the most critical design criteria for modern embedded systems such as multiprocessor system-on-chips (MPSoCs). Dynamic voltage and frequency scaling (DVFS) and dynamic power management (DPM) are two major techniques for reducing energy consumption in such embedded systems. Furthermore, MPSoCs are becoming more popular for many real-time applications. One of the challenges of integrating DPM with DVFS and task scheduling of real-time applications on MPSoCs is the modeling of idle intervals on these platforms. In this paper, we present a novel approach for modeling idle intervals in MPSoC platforms which leads to a mixed integer linear programming (MILP) formulation integrating DPM, DVFS, and task scheduling of periodic task graphs subject to a hard deadline. We also present a heuristic approach for solving the MILP and compare its results with those obtained from solving the MILP.

Proceedings ArticleDOI
25 Mar 2019
TL;DR: This paper starts by describing key differences between SFQ logic and conventional CMOS and concludes by listing key challenges that must be overcome to achieve the very large scale integration of SFQ circuits and make the demonstration of a superconductive CPU a reality.
Abstract: Design and manufacturing of superconductive electronics have been evolving over the past three decades with significant progress made in related fields. Rapid single flux quantum (RSFQ) logic circuits have become popular among superconductive logic families and their energy-efficient variants (ERSFQ and eSFQ) have shown promise as an ultra-low-power and high-speed circuit fabric. SFQ circuits have been demonstrated at tens of GHz with an energy consumption of an attojoule per gate. There are many differences between SFQ and conventional CMOS circuits. SFQ logic circuits are based on the manipulation of the quantized magnetic flux pulses. Most of the logic gates are sequential in nature requiring the clock to be distributed to every logic gate. SFQ logic gates have no gain and hence splitters are needed to drive multiple fanouts. Design and successful demonstration of a controllable superconducting switch and a compact reliable memory element have evaded researchers so far. This paper starts by describing key differences between SFQ logic and conventional CMOS and concludes by listing key challenges that must be overcome to achieve the very large scale integration of SFQ circuits and make the demonstration of a superconductive CPU a reality.

Proceedings ArticleDOI
13 May 2019
TL;DR: Experimental results show that a combination of balanced factorization and rewriting algorithms reduces the path balancing overhead by an average of 63% for 15 benchmark circuits, and area by up to 23% compared to state-of-the-art logic synthesis tools.
Abstract: Single Flux Quantum (SFQ) logic with switching energy of 100 zJ and switching delay of 1 ps is a promising post-CMOS candidate. Logic synthesis of these magnetic-pulse-based circuits is a very important step in their design flow with a big impact on the total area, power consumption, and critical path delay. SFQ circuits have some properties different from CMOS which should be taken into consideration in the design and implementation flow of these circuits. One of these properties is the requirement of path balancing in the standard SFQ circuit design. Standard CMOS-based rewriting and factorization algorithms fail to preserve the balancing property of SFQ circuits. Therefore, they end up generating circuits with huge path balancing overheads. Our proposed balanced factorization and rewriting algorithms are designed specifically to solve this problem. Experimental results show that a combination of balanced factorization and rewriting algorithms reduces the path balancing overhead by an average of 63% for 15 benchmark circuits, and area by up to 23% compared to state-of-the-art logic synthesis tools.

Journal ArticleDOI
TL;DR: The usefulness of the proposed algorithm is verified by training some neuromorphic circuits for different applications, and it is found that the accuracy of the networks trained by OCTAN is, on average, about 46% higher than those of RWC and SLMS algorithms.
Abstract: In this paper, we propose a hardware-friendly On-Chip Training Algorithm for memristive Neuromorphic circuits (OCTAN). Although the hardware required by the proposed algorithm is as simple as that of the random weight change (RWC) algorithm, it is much more efficient in terms of convergence speed and accuracy. In this algorithm, weights of the circuit are updated individually by a small value and the effect of each individual weight update is assessed. If the weight change causes an increase in the error of the network, the weight update is reversed by applying the same change in the reverse direction twice. The usefulness of the proposed algorithm is verified by training some neuromorphic circuits for different applications. Compared to RWC and stochastic least-mean-squares (SLMS) training algorithms, our proposed algorithm needs, on average, 329× fewer epochs to find the minimum error point. Moreover, the accuracy of the networks trained by OCTAN is, on average, about 46% higher than those of RWC and SLMS algorithms. Additionally, a hardware implementation of OCTAN is presented. This hardware provides a speedup of 172× (61×) compared to that of the RWC (SLMS) algorithm. Finally, the impact of PVT (process, voltage, and temperature) variations on the proposed training hardware is studied, indicating an average training error increase of less than 3.27% in the presence of variations.
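The update rule described in the abstract above (perturb one weight, keep the change if the error does not increase, otherwise apply the opposite change twice) can be sketched in software. This is an illustrative sketch only: the linear model, the squared-error measure, the fixed positive step `delta`, and all names are assumptions for the example and are not taken from the paper's hardware.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]


def network_error(weights: Sequence[float], data: Sequence[Sample]) -> float:
    """Sum of squared errors of a linear model (stand-in for the network)."""
    total = 0.0
    for x, y in data:
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total


def train_perturb_and_reverse(
    weights: List[float], data: Sequence[Sample],
    delta: float = 0.01, epochs: int = 200,
) -> List[float]:
    """Update weights one at a time, reversing updates that raise the error."""
    w = list(weights)
    for _ in range(epochs):
        for i in range(len(w)):
            before = network_error(w, data)
            w[i] += delta                      # trial update of one weight
            if network_error(w, data) > before:
                # Reverse by applying the opposite change twice, so the
                # weight ends one step on the other side of where it began.
                w[i] -= 2 * delta
    return w


if __name__ == "__main__":
    # Targets generated from hypothetical weights (0.5, -0.3).
    samples = [((1, 0), 0.5), ((0, 1), -0.3), ((1, 1), 0.2), ((1, -1), 0.8)]
    print(train_perturb_and_reverse([0.0, 0.0], samples))
```

Each step needs only two error evaluations and no gradient, which is what makes this style of rule attractive for on-chip training; with a fixed step the weights settle into a small oscillation around the minimum rather than converging exactly.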

Journal ArticleDOI
TL;DR: This work presents a heuristic for scheduling tasks with potentially imprecise computations, represented with directed acyclic graphs, on multiprocessor platforms, and presents a mixed integer linear program formulation of the same problem, which provides the optimal reference scheduling solutions.
Abstract: Imprecise computations allow scheduling algorithms developed for energy-constrained computing devices to trade off output quality with utilization of system resources. The goal of such scheduling algorithms is to utilize imprecise computations to find a feasible schedule for a given task graph while maximizing the quality of service (QoS) and satisfying a hard deadline and an energy bound. This work presents a heuristic for scheduling tasks with potentially imprecise computations, represented with directed acyclic graphs, on multiprocessor platforms. Furthermore, it presents a mixed integer linear program (MILP) formulation of the same problem, which provides the optimal reference scheduling solutions, enabling evaluation of the efficacy of the proposed heuristic. Both the heuristic and the mathematical program take into account the effect of potentially imprecise task inputs on output quality. Furthermore, the presented heuristic is capable of finding feasible schedules even under tight energy budgets. Through extensive experiments, it is shown that in some cases, the proposed heuristic is capable of finding the same QoS as the ones found by MILP. Furthermore, for those task graphs on which MILP outperforms the proposed heuristic, QoS values obtained with the proposed heuristic are, on average, within 1.24% of the optimal solutions while improving the runtime by a factor of about 100. This clearly demonstrates the advantage of the proposed heuristic over the exact solution, especially for large task graphs where solving the mathematical problem is hampered by its lengthy runtime.

Journal ArticleDOI
TL;DR: Simulation results with the latest commercial CMOS process technologies for ULP designs demonstrate the effectiveness of the BB technique along with the TEI-aware voltage scaling method and TEI-aware frequency scaling method.
Abstract: Temperature effect inversion (TEI) phenomenon in ultralow power (ULP) very large scale integration circuits has been identified as an important effect by both academia and industry. Although a number of ULP methods that attempt to exploit the TEI phenomenon have been proposed, the small size of the design exploration space when applying these methods to ULP circuits hinders them from achieving their full potential. This is mainly due to the limited granularity of the supply voltage level control. Starting with an intuition that the body biasing (BB) technique is a key to overcome this limitation, this paper exploits the BB technique along with the TEI-aware voltage scaling (TEI-VS) method and TEI-aware frequency scaling (TEI-FS) method, so as to substantially increase the design spaces of these methods. Techniques for optimally combining the BB technique with TEI-VS and TEI-FS are introduced. Simulation results with the latest commercial CMOS process technologies for ULP designs demonstrate the effectiveness of the proposed methodology.

Proceedings ArticleDOI
06 Mar 2019
TL;DR: This paper presents the design of kNN-CAM, a k-Nearest Neighbors (kNN)-based Configurable Approximate floating point Multiplier that utilizes approximate computing opportunities to deliver significant area and energy savings.
Abstract: In many real computations such as arithmetic operations in hidden layers of a neural network, some amount of inaccuracy can be tolerated without degrading the final results (e.g., maintaining the same level of accuracy for image classification). This paper presents the design of kNN-CAM, a k-Nearest Neighbors (kNN)-based Configurable Approximate floating point Multiplier. kNN-CAM utilizes approximate computing opportunities to deliver significant area and energy savings. A kNN engine is trained on a sufficiently large set of input data to learn the quantity of bit truncation that can be performed in each floating point input with the goal of minimizing energy and area. Next, this trained engine is used to predict the level of approximation for unseen data. Experimental results show that kNN-CAM provides about 67% area saving and 19% speedup while losing only 4.86% accuracy when compared to a 100% accurate multiplier. Furthermore, the application of kNN-CAM in the implementation of handwritten digit recognition provides 47.2% area saving while the accuracy drops by only 0.3%.

Journal ArticleDOI
TL;DR: A novel deep-learning based framework is presented that employs a genetic algorithm to efficiently guide exploration through the large design space while utilizing deep learning methods to provide fast performance prediction of design points instead of relying on slow full system simulations.
Abstract: As throughput-oriented processors incur a significant number of data accesses, the placement of memory controllers (MCs) has a critical impact on overall performance. However, due to the lack of a systematic way to explore the huge design space of MC placements, only a few ad-hoc placements have been proposed, leaving much of the opportunity unexploited. In this paper, we present a novel deep-learning based framework that explores this opportunity intelligently and automatically. The proposed framework employs a genetic algorithm to efficiently guide exploration through the large design space while utilizing deep learning methods to provide fast performance prediction of design points instead of relying on slow full system simulations. Evaluation shows that the proposed deep learning models achieve a speedup of 282X for the search process, and the MC placement found by our framework improves the average performance (IPC) of 18 benchmarks by 19.3 percent over the best-known placement found by human intuition.

Journal ArticleDOI
TL;DR: Simulation results in an industrial and a predictive CMOS technology show that the proposed design for SRAM reduces the energy consumption of read and write operations considerably for some standard test images as input data to the memory.
Abstract: This study presents a new energy-efficient design for static random access memory (SRAM) using low-power input data encoding and output data decoding stages. A data bit reordering algorithm is applied to the input data to increase the number of 0s that are going to be written into the SRAM array. The design benefits from SRAM cells that are more energy-efficient in writing a ‘0’ than a ‘1’, resulting in a reduction in the total power and energy consumption of the whole memory. The input data encoding is performed using a simple circuit, which is built of multiplexers and inverters. After the read operation, the data are restored to their initial form using a low-power data decoding circuit. Simulation results in an industrial and a predictive CMOS technology show that the proposed design for SRAM reduces the energy consumption of read and write operations considerably for some standard test images as input data to the memory. For instance, in writing pixels of the Lenna test image into this SRAM and reading them back, 15% and 20% savings are observed for the energy consumption of write and read operations, respectively, compared with the normal write and read operations in standard SRAMs.

Posted Content
TL;DR: A novel deep reinforcement framework is proposed, taking routerless networks-on-chip (NoC) as an evaluation case study, and successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions.
Abstract: Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel deep reinforcement framework, taking routerless networks-on-chip (NoC) as an evaluation case study. The new framework successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs with various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with a Monte Carlo search tree. Experimental results show that, compared with conventional mesh, the proposed deep reinforcement learning (DRL) routerless design achieves a 3.25x increase in throughput, 1.6x reduction in packet latency, and 5x reduction in power. Compared with the state-of-the-art routerless NoC, DRL achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, and 1.14x reduction in average hop count albeit with slightly more power overhead.

Proceedings ArticleDOI
15 Jul 2019
TL;DR: CSrram is presented, an efficient ex-situ training framework for hybrid CMOS-memristive neuromorphic circuits that includes a pre-defined block diagonal clustered (BDC) sparsity algorithm to significantly reduce area and power consumption.
Abstract: Artificial Neural Networks (ANNs) play a key role in many machine learning (ML) applications but pose arduous challenges in terms of storage and computation of network parameters. Memristive crossbar arrays (MCAs) are capable of both computation and storage, making them promising for neural network accelerators based on in-memory computing. At the same time, the presence of a significant number of zero weights in ANNs has motivated research into a variety of parameter reduction techniques. However, for crossbar-based architectures, the study of efficient methods to take advantage of network sparsity is still at an early stage. This paper presents CSrram, an efficient ex-situ training framework for hybrid CMOS-memristive neuromorphic circuits. CSrram includes a pre-defined block diagonal clustered (BDC) sparsity algorithm to significantly reduce area and power consumption. The proposed framework is verified on a wide range of datasets including MNIST handwritten digit recognition, Fashion MNIST, breast cancer prediction (BCW), IRIS, and mobile health monitoring. Compared to state-of-the-art fully connected memristive neuromorphic circuits, CSrram with only 25% density of weights in the first junction improves power and area efficiency by 1.5x and 2.6x (averaged over five datasets), respectively, without any significant loss of test accuracy.
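As a rough illustration of pre-defined block diagonal sparsity, the sketch below builds a 0/1 mask in which only weights on diagonal blocks are kept. It covers only the mask structure; the clustering step of the BDC algorithm and the training flow are not described in the abstract and are not modelled here. The function names and layer sizes are hypothetical.

```python
def block_diagonal_mask(n_in, n_out, n_blocks):
    """Return an n_in x n_out 0/1 mask that keeps only the diagonal blocks."""
    if n_in % n_blocks or n_out % n_blocks:
        raise ValueError("layer sizes must be divisible by n_blocks")
    rows_per = n_in // n_blocks    # inputs handled by each block
    cols_per = n_out // n_blocks   # outputs handled by each block
    return [
        [1 if r // rows_per == c // cols_per else 0 for c in range(n_out)]
        for r in range(n_in)
    ]

def density(mask):
    """Fraction of weights that the mask keeps."""
    kept = sum(sum(row) for row in mask)
    return kept / (len(mask) * len(mask[0]))

mask = block_diagonal_mask(8, 8, 4)   # toy layer: four 2x2 blocks
for row in mask:
    print("".join(str(bit) for bit in row))
print(density(mask))
```

With four equal blocks the mask keeps one quarter of the weights, which corresponds to the 25% density quoted above; each block could then map to its own smaller crossbar.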

Proceedings ArticleDOI
29 Jul 2019
TL;DR: A new TEI-inspired SoC platform (called TIP) is presented, which relies on a network-on-chip architecture (called µNoC) to realize system interconnects and thereby reduces the total number and length of global wires.
Abstract: Researchers have been trying to exploit the temperature effect inversion (TEI) phenomenon to improve the energy efficiency of system-on-chip (SoC) designs without sacrificing their performance. However, TEI-aware low-power methods have a critical limitation in that they can only be applied to components within the SoC that do not contain long (global) wires. This is because wire delays continue to increase with rising temperatures irrespective of the operating supply voltage level, which tends to cancel out the positive effects of the TEI phenomenon in SoCs. To tackle this limitation and fully utilize the TEI-aware methods, this paper presents a new TEI-inspired SoC platform (called TIP), which relies on a network-on-chip architecture (called µNoC) to realize system interconnects. The µNoC successfully reduces the total number and length of global wires. By fabricating a TIP prototype chip in Samsung 28 nm FD-SOI technology, we verify the effectiveness of TIP. Extensive post-fabrication measurements demonstrate that the chip, while continuing to operate at a target 50 MHz clock frequency, can lower its supply voltage from 0.54 V to 0.48 V at 25°C and to 0.44 V at 80°C, which results in up to 35% power saving.
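As a first-order plausibility check on the reported saving, one can assume (an assumption not stated in the abstract) that dynamic power dominates and scales as $P \propto C V^2 f$ with capacitance and clock frequency held fixed. The voltage reductions quoted above then give:

```latex
\[
\frac{P_{80^\circ\mathrm{C}}}{P_{\mathrm{ref}}}
  = \left(\frac{0.44}{0.54}\right)^{2}
  = \frac{0.1936}{0.2916} \approx 0.664
  \quad\Rightarrow\quad \text{saving} \approx 34\%,
\]
\[
\frac{P_{25^\circ\mathrm{C}}}{P_{\mathrm{ref}}}
  = \left(\frac{0.48}{0.54}\right)^{2}
  = \frac{0.2304}{0.2916} \approx 0.790
  \quad\Rightarrow\quad \text{saving} \approx 21\%.
\]
```

The estimate at 80°C is close to the measured figure of up to 35%; leakage power and other temperature-dependent effects, which this simple model ignores, would account for any difference.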

Proceedings ArticleDOI
01 Jul 2019
TL;DR: A framework for logical equivalence checking (LEC) of SFQ circuits called qEC is proposed, built on the ABC tool but with the added ability to check properties of SFQ superconducting circuits, and its verification time is reported for the Sport lab SFQ logic circuit benchmark suite.
Abstract: Superconducting devices have emerged as one of the most promising beyond-CMOS technologies, with a switching delay of 1 ps and a switching energy of $10^{-19}\,\mathrm{J}$, to achieve high-performance, energy-efficient systems and make quantum computing a reality. Design and verification methodologies for single flux quantum (SFQ) logic differ fundamentally from those for CMOS logic, due to key differences such as the pulse signal type, ultra-deep (gate-level) pipelining, and path balancing in SFQ circuits. In this paper, we propose a framework for logical equivalence checking (LEC) of SFQ circuits called qEC. qEC is built on the ABC tool but adds the ability to check properties of SFQ superconducting circuits. Several timing and structural checks are embedded in our framework. We benchmark the framework on post-synthesis netlists in an SFQ technology. Results show that the verification time for the Sport lab SFQ logic circuit benchmark suite, including a 16-bit array multiplier, a 16-bit integer divider, and ISCAS'85 circuits, is comparable to that of the ABC tool for similar CMOS circuits.

Journal ArticleDOI
08 Feb 2019
TL;DR: The authors consider a realistic BSS framework in which EVs arrive at the BSS at time-of-day-dependent rates and with different battery states of charge, and investigate the battery charging scheduling problem in the BSS under dynamic energy pricing.
Abstract: Further popularisation of electric vehicles (EVs) is hindered by their relatively short driving distance and long battery charging time. To overcome these shortcomings, the battery swapping station (BSS) has been proposed as a means of satisfying the increasing demand for fast EV battery recharging. At a BSS, (partially) depleted batteries from EVs can be replaced with partially or fully charged ones almost instantaneously. Recharging scheduling and maintenance of batteries are done by the operator of the BSS, with the target of minimising electrical energy costs while satisfying customer demands. In this study, the authors consider a realistic BSS framework in which EVs arrive at the BSS at time-of-day-dependent rates and with different battery states of charge. They investigate the battery charging scheduling problem in the BSS under dynamic energy pricing. They solve (i) an online optimal BSS control problem to minimise the energy cost with a quality-of-service (QoS) guarantee, and (ii) an offline optimal BSS design problem to determine the optimal number of stored batteries so as to achieve a desirable tradeoff between flexibility in charging and amortised battery costs. The experimental results show that the total charging energy cost can be reduced significantly under different traffic scenarios.
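To make the idea of charging under dynamic energy pricing concrete, the toy sketch below picks the cheapest time slots in which to charge a single battery before it is needed. It is not the authors' online control or offline design formulation: it ignores stochastic EV arrivals, the QoS guarantee, and charger limits, and it assumes one unit of energy per slot. The function name and the price values are hypothetical.

```python
def cheapest_slots(prices, slots_needed, deadline):
    """Choose the cheapest charging slots before the deadline.

    prices       -- energy price per time slot (assumed known in advance)
    slots_needed -- number of slots required to finish charging one battery
    deadline     -- charging must use slots with index < deadline
    Returns (chosen slot indices in time order, total cost).
    """
    window = range(min(deadline, len(prices)))
    if slots_needed > len(window):
        raise ValueError("not enough slots before the deadline")
    # Sort by price, breaking ties in favour of earlier slots.
    by_price = sorted(window, key=lambda t: (prices[t], t))
    chosen = sorted(by_price[:slots_needed])
    return chosen, sum(prices[t] for t in chosen)

hourly_prices = [5, 3, 8, 2, 6, 1]   # made-up prices, arbitrary units
slots, cost = cheapest_slots(hourly_prices, slots_needed=2, deadline=5)
print(slots, cost)
```

Holding more spare batteries in stock would allow later deadlines and therefore more freedom to pick cheap slots, which is the flexibility versus battery cost tradeoff the abstract mentions.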

Proceedings ArticleDOI
01 Jul 2019
TL;DR: A bootstrap-based statistical static timing analysis tool called qSSTA is presented that can reasonably estimate the minimum workable clock period by executing a large number of bootstrap iterations over the discrete sampling spaces of all gates under a given correlation specification.
Abstract: As a beyond-CMOS technology, superconducting single-flux-quantum (SFQ) technology promises fast processing speed and excellent energy efficiency. With the increasing complexity of SFQ circuits, accurate and fast estimation of the workable clock period under process variation becomes more urgent. However, estimating the minimum workable clock period is difficult due to the spatial correlation of physical parameters and the non-normal distribution of timing parameters (propagation delay, setup time, and hold time). Therefore, a good statistical static timing analysis (SSTA) tool for SFQ circuits is necessary. This paper presents a bootstrap-based statistical static timing analysis tool called qSSTA. qSSTA can reasonably estimate the minimum workable clock period by executing a large number of bootstrap iterations over the discrete sampling spaces of all gates under a given correlation specification. By applying path pruning methods, qSSTA skips the calculations on unimportant paths and hence reduces run time and memory usage. Experimental results show that the number of important paths can be small. Among the 19114 paths of the 16-bit integer divider, only 73 paths are important for estimating the minimum workable clock period. Only 84.21 seconds are needed to run 10,000 iterations.
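The resampling idea can be sketched in a few lines. The code below is not qSSTA: it assumes independent gates (no spatial correlation), considers propagation delay only (no setup or hold times), and does no path pruning. In each iteration it draws one delay per gate, with replacement, from that gate's discrete sample set, records the slowest path, and finally reports a high quantile of those worst-case delays as a clock period estimate. All names and delay values are hypothetical.

```python
import random

def bootstrap_clock_period(paths, iterations=1000, quantile=0.99, seed=0):
    """Estimate a workable clock period by bootstrap resampling.

    paths -- list of paths; each path is a list of gates; each gate is a
             non-empty list of observed delay samples (its sampling space).
    Returns the requested quantile of the per-iteration worst path delay.
    """
    rng = random.Random(seed)          # fixed seed for repeatable runs
    worst_delays = []
    for _ in range(iterations):
        # One bootstrap draw: resample a delay for every gate on every path.
        worst = max(sum(rng.choice(gate) for gate in path) for path in paths)
        worst_delays.append(worst)
    worst_delays.sort()
    index = min(len(worst_delays) - 1, int(quantile * len(worst_delays)))
    return worst_delays[index]

# Two toy paths with made-up delay samples in picoseconds.
toy_paths = [
    [[4.8, 5.0, 5.3], [6.1, 6.4]],           # path A: two gates
    [[3.9, 4.2], [3.0, 3.1, 3.6], [2.5]],    # path B: three gates
]
print(bootstrap_clock_period(toy_paths))
```

A path whose largest possible delay is below another path's smallest possible delay can never determine the result, which gives an intuition for why pruning unimportant paths saves run time.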

Posted Content
TL;DR: This work proposes a training method for visual attention networks, Coarse2Fine, which creates a differentiable path from the input space to the attended feature maps, which will guide the attention maps to better attend the fine-grained features.
Abstract: Small inter-class and large intra-class variations are the main challenges in fine-grained visual classification. Objects from different classes share visually similar structures, and objects in the same class can have different poses and viewpoints. Therefore, the proper extraction of discriminative local features (e.g. a bird's beak or a car's headlight) is crucial. Most of the recent successes on this problem are based on attention models that can localize and attend to the discriminative local object parts. In this work, we propose a training method for visual attention networks, Coarse2Fine, which creates a differentiable path from the input space to the attended feature maps. Coarse2Fine learns an inverse mapping function from the attended feature maps to the informative regions in the raw image, which guides the attention maps to better attend to the fine-grained features. We show that Coarse2Fine and orthogonal initialization of the attention weights can surpass the state-of-the-art accuracies on common fine-grained classification tasks.

Proceedings ArticleDOI
13 May 2019
TL;DR: A novel hybrid verification framework (HVF) which uses Reinforcement Learning (RL) and Deep Neural Networks (DNNs) to accelerate the verification of complex systems.
Abstract: In this paper, we propose a novel hybrid verification framework (HVF) which uses Reinforcement Learning (RL) and Deep Neural Networks (DNNs) to accelerate the verification of complex systems. More precisely, our HVF incorporates RL to generate all possible sequences of vectors needed to approach a target state as well as the corresponding path to the target state which contains a potential design error. Furthermore, HVF utilizes DNNs to accelerate the verification of complex data paths in the target states. We have tested our framework on several circuits including multi-core designs as well as bus-arbiters and confirmed its significant verification speedup when compared to prior work. For example, HVF provides a total speedup of 4.5x for a quad-core MIPS processor verification.

Journal ArticleDOI
TL;DR: An accuracy-aware design framework, which synthesizes a high-level description of an input application with the objective of minimizing the energy consumption of the synthesized circuit, is presented, and the results show that the relative coverage of large errors may be increased from 21% to 55% by employing the synthetic minority oversampling technique.
Abstract: In this paper, we present an accuracy-aware design framework, called accuracy-aware high-level synthesis (Achilles), which synthesizes a high-level description of an input application with the objective of minimizing the energy consumption of the synthesized circuit. The proposed framework has two main parts: Achilles and light-weight predictor selection. The framework leverages light-weight error predictors (i.e., machine learning-based classifiers) to achieve greater energy reduction by dynamically managing the output quality level (exact or approximate) of the synthesized circuit. To synthesize the input application, we first use a heuristic algorithm to determine the quality level required for each operation in the data flow graph (DFG) representation of the input application. Next, we propose an effective Achilles algorithm that utilizes the flexibility of the available multiquality arithmetic units in a high-level cell library to synthesize the datapath. To improve efficiency, the process starts by iteratively reducing the number of functional units required for synthesizing the DFG. Then, a suitable light-weight error predictor satisfying the user-expected quality is chosen from the predictors available in the framework. Based on the quality requirements, three different quality management modes are considered. The efficacy of the proposed framework is assessed on benchmarks from the image processing, signal processing, and robotics domains. The study of these benchmarks indicates that Achilles may reduce the energy consumption by up to 51% (36% on average), up to 72% (51% on average), and up to 57% (33% on average) in threshold, average, and hybrid modes, respectively, for the studied cases. Moreover, the results show that the relative coverage of large errors may be increased from 21% to 55% by employing the synthetic minority oversampling technique.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: qCG is a multi-domain design and verification framework that utilizes clock gating and frequency scaling to optimize dynamic power dissipation, not only for SFQ circuits but also for their clock networks and cooling systems.
Abstract: In this paper, we propose qCG, a multi-domain design and verification framework that utilizes clock gating and frequency scaling to optimize dynamic power dissipation. SFQ circuits are ultra-deep pipelined at the logic level, resulting in large clock distribution networks that account for a considerable part of the overall power dissipation. We show that qCG significantly increases power efficiency, not only for SFQ circuits but also for their clock networks and, inherently, their cooling systems. The verification engine of qCG learns to increase the quality of results in terms of verification time and coverage. Datapath and coverage meters are embedded to verify the pulse integrity of clock signals, SFQ fanout, and path-balancing properties. Our experiments on several SFQ benchmark circuits show that qCG provides a 3X power reduction for the chip. Results also confirm that, compared to a traditional random-based coverage-driven approach, qCG provides a significant improvement in verification quality, including a 2.33X verification speedup.