Showing papers by "Massoud Pedram" published in 2019

Open Access

Journal Article•DOI•

TOSAM: An Energy-Efficient Truncation- and Rounding-Based Scalable Approximate Multiplier

Shaghayegh Vahdat¹, Mehdi Kamal¹, Ali Afzali-Kusha¹, Massoud Pedram²•Institutions (2)

University of Tehran¹, University of Southern California²

25 Jan 2019-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: The proposed approximate multiplier has an almost Gaussian error distribution with a near-zero mean value and is exploited in the structure of a JPEG encoder, sharpening, and classification applications, indicating that the quality degradation of the output is negligible.

Abstract: A scalable approximate multiplier, called truncation- and rounding-based scalable approximate multiplier (TOSAM) is presented, which reduces the number of partial products by truncating each of the input operands based on their leading one-bit position. In the proposed design, multiplication is performed by shift, add, and small fixed-width multiplication operations resulting in large improvements in the energy consumption and area occupation compared to those of the exact multiplier. To improve the total accuracy, input operands of the multiplication part are rounded to the nearest odd number. Because input operands are truncated based on their leading one-bit positions, the accuracy becomes weakly dependent on the width of the input operands and the multiplier becomes scalable. Higher improvements in design parameters (e.g., area and energy consumption) can be achieved as the input operand widths increase. To evaluate the efficiency of the proposed approximate multiplier, its design parameters are compared with those of an exact multiplier and some other recently proposed approximate multipliers. Results reveal that the proposed approximate multiplier with a mean absolute relative error in the range of 11%–0.3% improves delay, area, and energy consumption up to 41%, 90%, and 98%, respectively, compared to those of the exact multiplier. It also outperforms other approximate multipliers in terms of speed, area, and energy consumption. The proposed approximate multiplier has an almost Gaussian error distribution with a near-zero mean value. We exploit it in the structure of a JPEG encoder, sharpening, and classification applications. The results indicate that the quality degradation of the output is negligible. In addition, we suggest an accuracy configurable TOSAM where the energy consumption of the multiplication operation can be adjusted based on the minimum required accuracy.

99 citations
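
The TOSAM abstract above describes truncating each operand relative to its leading one, rounding the kept bits to an odd value, and multiplying the resulting narrow operands with shifts. A minimal Python sketch of that general idea follows. It is not the paper's architecture: the function names, the `keep` parameter, and the choice to append a single 1 bit after truncation are assumptions made for illustration only.

```python
def truncate_and_round(x: int, keep: int) -> tuple[int, int]:
    """Return (mantissa, shift) so that x is approximated by mantissa << shift.

    Hypothetical helper: keeps `keep` bits starting at the leading one, then
    appends a 1 bit, which makes the mantissa odd and places the approximation
    at the midpoint of the truncated interval.
    """
    if x.bit_length() <= keep:
        return x, 0                      # small operands are kept exactly
    drop = x.bit_length() - keep         # number of low bits discarded
    mantissa = ((x >> drop) << 1) | 1    # truncated bits plus the appended 1
    return mantissa, drop - 1


def approx_multiply(a: int, b: int, keep: int = 4) -> int:
    """Approximate a * b for non-negative integers (illustrative only)."""
    ma, sa = truncate_and_round(a, keep)
    mb, sb = truncate_and_round(b, keep)
    # One small fixed-width multiplication followed by a shift.
    return (ma * mb) << (sa + sb)
```

With `keep = 4`, `approx_multiply(255, 255)` returns 61504 against an exact product of 65025, an error of about 5.4%. Because the kept window follows the leading one, each operand's relative error in this sketch is bounded by 2^-keep regardless of operand width, which mirrors the abstract's point that accuracy depends only weakly on input width.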

Journal Article•DOI•

ColdFlux Superconducting EDA and TCAD Tools Project: Overview and Progress

Coenrad J. Fourie¹, Kyle Jackman¹, Matthys M. Botha¹, Sasan Razmkhah², Pascal Febvre², Christopher L. Ayala, Qiuyun Xu, Nobuyuki Yoshikawa, Erin Patrick³, Mark E. Law³, Yanzi Wang⁴, Murali Annavaram⁵, Peter A. Beerel⁵, Sandeep K. S. Gupta⁵, Shaheen Nazarian⁵, Massoud Pedram⁵•Institutions (5)

Stellenbosch University¹, Los Angeles Harbor College², University of Florida³, Northeastern University⁴, University of Southern California⁵

10 Jan 2019-IEEE Transactions on Applied Superconductivity

TL;DR: An overview of the current and planned activities related to the ColdFlux project is presented and the design assumptions and decisions that were made to allow the development of design tools for million-gate circuits are justified.

Abstract: The IARPA SuperTools program requires the development of superconducting electronic design automation (S-EDA) and superconducting technology computer-aided design (S-TCAD) tools aimed at enabling the reliable design of complex superconducting digital circuits with millions of Josephson junctions. Within the SuperTools program, the ColdFlux project addresses S-EDA and S-TCAD tool research and development in four areas: 1) RTL synthesis, architectures and verification; 2) analog design and layout synthesis; 3) physical design and test; and 4) device and process modeling/simulation and cell library design. Capabilities include, but are not limited to, the following: device-level modeling and simulation of Josephson junctions, modeling and simulation of superconducting manufacturing processes, powerful new electrical circuit simulation, parameterized schematic and layout libraries, optimization, compact SPICE-like model extraction, timing analysis, behavioral, register-transfer-level and logic syntheses, clock tree synthesis, placement and routing, layout-versus-schematic extraction, functional verification, and the evaluation of designs in the presence of magnetic fields and trapped flux. ColdFlux consists of six research groups from four continents. Here, we present an overview of the current and planned activities related to the project and justify the design assumptions and decisions that were made to allow the development of design tools for million-gate circuits.

54 citations

Journal Article•DOI•

A Theoretical Framework for Quality Estimation and Optimization of DSP Applications Using Low-Power Approximate Adders

Masoud Pashaeifar¹, Mehdi Kamal¹, Ali Afzali-Kusha¹, Massoud Pedram²•Institutions (2)

University of Tehran¹, University of Southern California²

01 Jan 2019-IEEE Transactions on Circuits and Systems I-regular Papers

TL;DR: A signal processing theoretical modeling approach for describing the power of the approximation noise, which is the integral of the error spectral density over the bandwidth, is developed, and a mathematical optimization approach based on Lagrange multipliers for optimizing design parameters is presented.

Abstract: In this paper, we present a framework for analytically estimating the output quality of common digital signal processing (DSP) blocks that utilize approximate adders. The framework is based on considering the error of approximate adders as an additive noise (approximation noise) that disturbs the output of the DSP block in question. A signal processing theoretical modeling approach for describing the power of the approximation noise, which is the integral of the error spectral density over the bandwidth, is developed. The output qualities of DSP blocks, such as finite impulse response filter, discrete cosine transform, and fast Fourier transform, which utilize approximate adders, are thus estimated. The accuracy of the proposed framework is evaluated by comparing mathematical model predictions to simulation results by using the signal-to-noise ratio (SNR) metric. The inaccuracy of the SNRs predicted by the framework was, on average, less than 2.5 dB compared with that obtained from simulations. Therefore, a mathematical optimization approach based on Lagrange multipliers for optimizing design parameters is also presented. The optimization is realized by choosing a proper configuration of the target block, such as determining the data width of the inexact computation part for each approximate adder in the design.

34 citations
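
The abstract above treats the adder error as additive noise and scores output quality by SNR. The simulation side of that comparison can be sketched in Python as below. The abstract does not name a specific adder, so `approx_add` is an assumed, simplified lower-part-OR style adder (the low `k` bits are ORed and no carry passes into the upper part), and both function names are hypothetical. The noise power is measured directly in the time domain, which by Parseval's theorem equals the integral of the error spectral density.

```python
import math


def approx_add(a: int, b: int, k: int) -> int:
    """Assumed approximate adder: OR the low k bits, add the rest exactly."""
    mask = (1 << k) - 1
    low = (a | b) & mask                 # inexact part, no carry generated
    high = ((a >> k) + (b >> k)) << k    # exact part
    return high | low


def snr_db(exact: list[int], approx: list[int]) -> float:
    """Signal-to-noise ratio in dB, treating (exact - approx) as noise."""
    signal = sum(e * e for e in exact)
    noise = sum((e - x) ** 2 for e, x in zip(exact, approx))
    if noise == 0:
        return math.inf                  # identical outputs: no noise
    return 10 * math.log10(signal / noise)


# Exhaustive sweep over 6-bit operands for two inexact-part widths.
pairs = [(a, b) for a in range(64) for b in range(64)]
exact = [a + b for a, b in pairs]
snr_k2 = snr_db(exact, [approx_add(a, b, 2) for a, b in pairs])
snr_k4 = snr_db(exact, [approx_add(a, b, 4) for a, b in pairs])
```

Widening the inexact part from 2 to 4 bits lowers the SNR in this sketch, which is the kind of quality-versus-width trade-off the abstract's optimization step chooses among.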

Proceedings Article•DOI•

Energy-efficient, low-latency realization of neural networks through boolean logic minimization

Mahdi Nazemi¹, Ghasem Pasandi¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

21 Jan 2019

TL;DR: This paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization, which completely removes the energy-hungry step of accessing memory for obtaining model parameters.

Abstract: Deep neural networks have been successfully deployed in a wide variety of applications including computer vision and speech recognition. To cope with computational and storage complexity of these models, this paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization. The aforementioned realization completely removes the energy-hungry step of accessing memory for obtaining model parameters, consumes about two orders of magnitude fewer computing resources compared to realizations that use floating-point operations, and has a substantially lower latency.

30 citations

Proceedings Article•DOI•

Deep Learning-Based Circuit Recognition Using Sparse Mapping and Level-Dependent Decaying Sum Circuit Representations

Arash Fayyazi¹, Soheil Shababi¹, Pierluigi Nuzzo¹, Shahin Nazarian¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

25 Mar 2019

TL;DR: A scalable framework for gate-level circuit recognition that leverages deep learning and a convolutional neural network (CNN)-based circuit representation is presented and a data structure, termed level-dependent decaying sum (LDDS) existence vector, which can compactly represent information about the circuit topology is proposed.

Abstract: Efficiently recognizing the functionality of a circuit is key to many applications, such as formal verification, reverse engineering, and security. We present a scalable framework for gate-level circuit recognition that leverages deep learning and a convolutional neural network (CNN)-based circuit representation. Given a standard cell library, we present a sparse mapping algorithm to improve the time and memory efficiency of the CNN-based circuit representation. Sparse mapping allows encoding only the logic cell functionality, independently of implementation parameters such as timing or area. We further propose a data structure, termed level-dependent decaying sum (LDDS) existence vector, which can compactly represent information about the circuit topology. Given a reference gate in the circuit, an LDDS vector can capture the function of the gates in the input and output cones as well as their distance (number of stages) from the reference. Compared to the baseline approach, our framework obtains more than an order-of-magnitude reduction in the average training time and a 2× improvement in the average runtime for generating CNN-based representations from gate-level circuits, while achieving 10% higher accuracy on a set of benchmarks including EPFL and ISCAS’85 circuits.

22 citations

Journal Article•DOI•

A Minimum-Skew Clock Tree Synthesis Algorithm for Single Flux Quantum Logic Circuits

Soheil Nazar Shahsavani¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

26 Sep 2019-IEEE Transactions on Applied Superconductivity

TL;DR: A synchronous minimum-skew clock tree synthesis algorithm for single flux quantum circuits considering splitter delays and placement blockages and creating a fully balanced clock tree structure in which the number of clock splitters from the clock source to all the sink nodes is identical.

Abstract: This article presents a synchronous minimum-skew clock tree synthesis algorithm for single flux quantum circuits considering splitter delays and placement blockages. The proposed methodology improves the state-of-the-art by accounting for splitter delays and creating a fully balanced clock tree structure in which the number of clock splitters from the clock source to all the sink nodes is identical. Additionally, a mixed integer linear programming based algorithm is presented that removes the overlaps among the clock splitters and placed cells (i.e., placement blockages) and minimizes the clock skew, simultaneously. Using the proposed method, the average clock skew for 17 benchmark circuits is 4.6 ps, improving the state-of-the-art algorithm by 70%. Finally, a clock tree synthesis algorithm for imbalanced topologies is presented that reduces the clock skew and the number of clock splitters in the clock network by 56% and 37%, respectively, compared with a fully balanced clock tree solution.

16 citations

Proceedings Article•DOI•

A Dynamic Programming-Based, Path Balancing Technology Mapping Algorithm Targeting Area Minimization

Ghasem Pasandi¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

01 Nov 2019

TL;DR: This paper presents a dynamic programming-based technology mapping algorithm that generates a minimum-area mapping solution which is guaranteed to be fully path balanced, a property essential to conventional superconductive single flux quantum circuits, which will fail otherwise.

Abstract: Path balancing technology mapping is a method of mapping a technology-independent logical description of a circuit, such as a Boolean network, into a technology-dependent, gate-level netlist. For a gate-level netlist generated by the path balancing mapper, the difference between lengths of the longest and the shortest paths in the circuit is minimized. To achieve full path balancing, it may be necessary to add buffers on signal paths, and in such a case, the cost of buffers must be properly accounted for. This paper presents a dynamic programming-based technology mapping algorithm that generates a minimum-area mapping solution which is guaranteed to be fully path balanced. The fully path balanced mapping solution is essential to conventional superconductive single flux quantum circuits, which will fail otherwise. The balanced mapping solution is also useful in CMOS circuits to avoid (or minimize) unwanted hazard activity and the resulting wasteful dynamic power dissipation as well as to achieve the maximum throughput in a wave-pipelined circuit. Experimental results show that our path balancing technology mapping algorithm decreases total area, static power consumption, and path balancing overhead of single flux quantum circuits by large factors. For example, it reduces the circuit area by up to 111% and by an average of 26.3% compared to state-of-the-art technology mappers.

16 citations
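
The abstract above notes that full path balancing may require buffers on signal paths and that their cost must be accounted for. The Python sketch below shows only that accounting step on an already mapped netlist; it is not the paper's dynamic programming mapper. The function name, the `fanins` dictionary format, and the unit-depth gate model are assumptions, and fanout splitters are ignored.

```python
from graphlib import TopologicalSorter


def count_balancing_buffers(fanins: dict[str, list[str]]) -> int:
    """Count buffers needed so every input-to-output path has equal length.

    `fanins` maps each gate to the nodes driving it; nodes that appear only
    as drivers are treated as primary inputs at level 0 (hypothetical format).
    """
    level: dict[str, int] = {}
    # Visit drivers before the gates they feed.
    for node in TopologicalSorter(fanins).static_order():
        level[node] = 1 + max((level[f] for f in fanins.get(node, [])), default=-1)

    # An edge that skips levels needs one buffer per skipped level.
    buffers = sum(level[g] - level[f] - 1 for g, ins in fanins.items() for f in ins)

    # Outputs (nodes that drive nothing) are padded up to the circuit depth.
    driven = {f for ins in fanins.values() for f in ins}
    depth = max(level.values(), default=0)
    buffers += sum(depth - lvl for node, lvl in level.items() if node not in driven)
    return buffers
```

For example, `count_balancing_buffers({"g1": ["a", "b"], "g2": ["g1", "c"]})` returns 1, because input `c` reaches `g2` one level earlier than the path through `g1`. A mapper that minimizes area under full path balancing has to weigh such buffer counts against gate area when it picks among candidate mappings.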

Journal Article•DOI•

Simulation Analysis and Energy-Saving Techniques for ERSFQ Circuits

Naveen Kumar Katam¹, Oleg A. Mukhanov, Massoud Pedram¹•Institutions (1)

University of Southern California¹

12 Mar 2019-IEEE Transactions on Applied Superconductivity

TL;DR: Simulation results for current recycling ERSFQ circuits are presented along with a strategy for implementing large superconducting circuits, and an innovative clock-choking mechanism using magnetic Josephson junctions is proposed.

Abstract: Energy-efficient rapid single flux quantum (ERSFQ) circuits have become a viable alternative for the implementation of superconducting circuits due to a large amount of static power consumption in RSFQ circuits. ERSFQ circuits are built upon the popular RSFQ logic circuits by replacing the power-dissipating resistor bias network with a bias network consisting of active devices. In this paper, a simulation study of ERSFQ biasing scheme is carried out by building simulation test benches for both synchronous and asynchronous ERSFQ circuits. A study is carried out to present the optimum value of biasing inductance, influence of the feeding Josephson transmission line (FJTL) and the effect of its size, the effect of the feeding clock frequency, and the effect of the circuit operating frequency. An innovative clock-choking mechanism using magnetic Josephson junctions is also proposed for the FJTL in the case of no logic circuit activity for a current-recycling circuit block, which would help in eliminating the dynamic power consumed due to the switching of bias junctions in a logic circuit. Simulation results for current recycling ERSFQ circuits are presented along with a strategy for implementing large superconducting circuits.

15 citations

Journal Article•DOI•

qGDR: A Via-Minimization-Oriented Routing Tool for Large-Scale Superconductive Single-Flux-Quantum Circuits

Ting-Ru Lin¹, Tim Edwards, Massoud Pedram¹•Institutions (1)

University of Southern California¹

09 May 2019-IEEE Transactions on Applied Superconductivity

TL;DR: An integrated global and detailed router for SFQ circuits, qGDR, is presented, which aims at reducing the impedance mismatch during signal transfer by minimizing the total number of vias used.

Abstract: Single-flux-quantum (SFQ) circuit technologies are promising digital circuit technologies with high-speed and extremely low-power characteristics. However, heavy wire routing tasks are finished either by considerable human effort or by commercial routing tools with few physical considerations for the SFQ circuits. In this paper, we present an integrated global and detailed router for the SFQ circuits, qGDR, which aims at reducing the impedance mismatch during signal transfer by minimizing the total number of used vias. The global router allocates routing resources while minimizing the via usage by a dynamic layer assignment algorithm. The detailed router follows the global routing results to complete the routing task by resorting to a maze routing algorithm. Following the MIT-LL SFQ5ee process technology, qGDR can use only two routing layers to route an 8-bit integer divider with more than 40 000 Josephson junctions in less than one hour.

15 citations

Journal Article•DOI•

Timing Characterization for Static Timing Analysis of Single Flux Quantum Circuits

Naveen Kumar Katam¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

07 Jan 2019-IEEE Transactions on Applied Superconductivity

TL;DR: A new timing characterization method is presented here for SFQ logic cells, which relies on low-dimensional lookup tables (LUTs) to store the clock-to-output delay, setup, and hold times of clocked cells and input-to-output delay of nonclocked cells in an SFQ standard cell library.

Abstract: Single flux quantum (SFQ) logic families require the development of electronic design automation tools to generate large-scale circuits. The available methodologies or tools for performing timing analysis of SFQ circuits do not have a load-dependent timing characterization method for calculating the context-dependent delay of cells, such as the nonlinear delay model for complementary metal–oxide–semiconductor (CMOS) circuits. A new timing characterization method is presented here for SFQ logic cells, which relies on low-dimensional lookup tables (LUTs) to store the clock-to-output delay, setup, and hold times of clocked cells and input-to-output delay of nonclocked cells in an SFQ standard cell library. Although the delay of Josephson junction based logic cells depends on many parameters, this paper shows that it is possible to reduce this dependency to only a small number of well-chosen parameters. All LUTs are obtained from JSIM simulations for a given target process technology. The accuracy of the proposed LUT-based timing characterization method is compared against JSIM simulations, which shows a maximum error of only 2.1% of the tested clocked cells with different loads.

15 citations

Proceedings Article•DOI•

VeriSFQ: A Semi-formal Verification Framework and Benchmark for Single Flux Quantum Technology

Alvin D. Wong¹, Kevin Su¹, Hang Sun¹, Arash Fayyazi¹, Massoud Pedram¹, Shahin Nazarian¹•Institutions (1)

University of Southern California¹

06 Mar 2019

TL;DR: VeriSFQ is a semi-formal verification framework for single-flux quantum (SFQ) circuits using the Universal Verification Methodology (UVM) standard.

Abstract: In this paper, we propose a semi-formal verification framework for single-flux quantum (SFQ) circuits called VeriSFQ, using the Universal Verification Methodology (UVM) standard. The SFQ technology considered comprises superconducting digital electronic devices that operate at cryogenic temperatures, with active circuit elements called Josephson junctions that operate at high switching speeds and low switching energy, allowing SFQ circuits to operate at frequencies over 300 gigahertz. Due to key differences between SFQ and CMOS logic, verification techniques for the former are not as advanced as those for the latter. Thus, it is crucial to develop efficient verification techniques as the complexity of SFQ circuits scales. The VeriSFQ framework focuses on verifying the key circuit and gate-level properties of SFQ logic: fanout, gate-level pipeline, path balancing, and input-to-output latency. The combinational circuits considered in analyzing the performance of VeriSFQ are: Kogge-Stone adders (KSA), array multipliers, integer dividers, and select ISCAS’85 combinational benchmark circuits. We experimented with methods of introducing bugs into SFQ circuit designs for detection during verification, including stuck-at faults, fanout errors, unbalanced paths, and functional bugs like incorrect logic gates. In addition, we propose an SFQ verification benchmark consisting of combinational SFQ circuits that exemplify SFQ logic properties and present the performance of the VeriSFQ framework on these benchmark circuits. The portability and reusability of the UVM standard allow the VeriSFQ framework to serve as a foundation for future SFQ semi-formal verification techniques.

Proceedings Article•DOI•

Modeling processor idle times in MPSoC platforms to enable integrated DPM, DVFS, and task scheduling subject to a hard deadline

Amirhossein Esmaili¹, Mahdi Nazemi¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

21 Jan 2019

TL;DR: This paper presents a novel approach for modeling idle intervals in MPSoC platforms which leads to a mixed integer linear programming (MILP) formulation integrating DPM, DVFS, and task scheduling of periodic task graphs subject to a hard deadline.

Abstract: Energy efficiency is one of the most critical design criteria for modern embedded systems such as multiprocessor system-on-chips (MPSoCs). Dynamic voltage and frequency scaling (DVFS) and dynamic power management (DPM) are two major techniques for reducing energy consumption in such embedded systems. Furthermore, MPSoCs are becoming more popular for many real-time applications. One of the challenges of integrating DPM with DVFS and task scheduling of real-time applications on MPSoCs is the modeling of idle intervals on these platforms. In this paper, we present a novel approach for modeling idle intervals in MPSoC platforms which leads to a mixed integer linear programming (MILP) formulation integrating DPM, DVFS, and task scheduling of periodic task graphs subject to a hard deadline. We also present a heuristic approach for solving the MILP and compare its results with those obtained from solving the MILP.

Proceedings Article•DOI•

Challenges and the status of superconducting single flux quantum technology

Naveen Kumar Katam¹, Jamil Kawa², Massoud Pedram¹•Institutions (2)

University of Southern California¹, Synopsys²

25 Mar 2019

TL;DR: This paper starts by describing key differences between SFQ logic and conventional CMOS and concludes by listing key challenges that must be overcome to achieve the very large scale integration of SFQ circuits and make the demonstration of a superconductive CPU a reality.

Abstract: Design and manufacturing of superconductive electronics have been evolving over the past three decades with significant progress made in related fields. Rapid single flux quantum (RSFQ) logic circuits have become popular among superconductive logic families and their energy-efficient variants (ERSFQ and eSFQ) have shown promise as an ultra-low-power and high-speed circuit fabric. SFQ circuits have been demonstrated at tens of GHz with an energy consumption of an attojoule per gate. There are many differences between SFQ and conventional CMOS circuits. SFQ logic circuits are based on the manipulation of the quantized magnetic flux pulses. Most of the logic gates are sequential in nature requiring the clock to be distributed to every logic gate. SFQ logic gates have no gain and hence splitters are needed to drive multiple fanouts. Design and successful demonstration of a controllable superconducting switch and a compact reliable memory element have evaded researchers so far. This paper starts by describing key differences between SFQ logic and conventional CMOS and concludes by listing key challenges that must be overcome to achieve the very large scale integration of SFQ circuits and make the demonstration of a superconductive CPU a reality.

Proceedings Article•DOI•

Balanced Factorization and Rewriting Algorithms for Synthesizing Single Flux Quantum Logic Circuits

Ghasem Pasandi¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

13 May 2019

TL;DR: Experimental results show that a combination of balanced factorization and rewriting algorithms reduces the path balancing overhead by an average of 63% for 15 benchmark circuits, and area by up to 23% compared to state-of-the-art logic synthesis tools.

Abstract: Single Flux Quantum (SFQ) logic with switching energy of 100 zJ and switching delay of 1 ps is a promising post-CMOS candidate. Logic synthesis of these magnetic-pulse-based circuits is a very important step in their design flow with a big impact on the total area, power consumption, and critical path delay. SFQ circuits have some properties different from CMOS which should be taken into consideration in the design and implementation flow of these circuits. One of these properties is the requirement of path balancing in the standard SFQ circuit design. Standard CMOS-based rewriting and factorization algorithms fail to preserve the balancing property of SFQ circuits. Therefore, they end up generating circuits with huge path balancing overheads. Our proposed balanced factorization and rewriting algorithms are designed specifically to solve this problem. Experimental results show that a combination of balanced factorization and rewriting algorithms reduces the path balancing overhead by an average of 63% for 15 benchmark circuits, and area by up to 23% compared to state-of-the-art logic synthesis tools.

Journal Article•DOI•

OCTAN: An On-Chip Training Algorithm for Memristive Neuromorphic Circuits

Mohammad Javad Alemzadeh Ansari¹, Arash Fayyazi², Mehdi Kamal¹, Ali Afzali-Kusha¹, Massoud Pedram²•Institutions (2)

University of Tehran¹, University of Southern California²

04 Oct 2019-IEEE Transactions on Circuits and Systems I-regular Papers

TL;DR: The usefulness of the proposed algorithm is verified by training some neuromorphic circuits for different applications, and it is found that the accuracy of the networks trained by OCTAN is, on average, about 46% higher than those of RWC and SLMS algorithms.

Abstract: In this paper, we propose a hardware-friendly On-Chip Training Algorithm for the memristive Neuromorphic circuits (OCTAN). Although the proposed algorithm has a simple hardware like that of the random weight change (RWC) algorithm, it is much more efficient in terms of convergence speed and accuracy. In this algorithm, weights of the circuit are updated individually by a small value and the effect of each individual weight update is assessed. If the weight change causes an increase in the error of the network, the weight update is reversed by applying the same change in the reverse direction twice. The usefulness of the proposed algorithm is verified by training some neuromorphic circuits for different applications. Compared to RWC and stochastic least-mean-squares (SLMS) training algorithms, our proposed algorithm needs, on average, 329× fewer epochs to find the minimum error point. Moreover, the accuracy of the networks trained by OCTAN is, on average, about 46% higher than those of RWC and SLMS algorithms. Additionally, a hardware implementation for OCTAN is presented. This hardware provides a speedup of 172× (61×) compared to that of the RWC (SLMS) algorithm. Finally, the impact of PVT (process, voltage, and temperature) variations is studied on the proposed training hardware indicating an average training error increase of less than 3.27% in the presence of variations.
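
The update rule described in this abstract (perturb one weight by a small step, keep the change if the error does not rise, otherwise apply the opposite step twice) can be sketched in software as follows. This is only a Python analogue of the rule as the abstract states it, not the paper's on-chip implementation; `octan_epoch`, `error_fn`, and `delta` are hypothetical names, and the quadratic error used in the demonstration is an assumption.

```python
from typing import Callable


def octan_epoch(weights: list[float],
                error_fn: Callable[[list[float]], float],
                delta: float) -> float:
    """Run one pass over all weights in place and return the final error."""
    err = error_fn(weights)
    for i in range(len(weights)):
        weights[i] += delta                  # trial update of a single weight
        new_err = error_fn(weights)
        if new_err > err:
            # Reverse by applying the opposite change twice: once to undo the
            # trial step, once more to move in the other direction.
            weights[i] -= 2 * delta
            new_err = error_fn(weights)
        err = new_err
    return err


# Demonstration on an assumed toy problem: match a fixed target vector.
target = [1.0, -2.0]


def squared_error(w: list[float]) -> float:
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


w = [0.0, 0.0]
initial_error = squared_error(w)
for _ in range(20):
    final_error = octan_epoch(w, squared_error, delta=0.25)
```

In this toy run each weight walks toward its target in steps of `delta` and then hovers within one step of it, since the rule as stated always moves a weight in one direction or the other.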

Journal Article•DOI•

Energy-aware Scheduling of Task Graphs with Imprecise Computations and End-to-end Deadlines

Amirhossein Esmaili¹, Mahdi Nazemi¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

26 Nov 2019-ACM Transactions on Design Automation of Electronic Systems

TL;DR: This work presents a heuristic for scheduling tasks with potentially imprecise computations, represented with directed acyclic graphs, on multiprocessor platforms, and presents a mixed integer linear program formulation of the same problem, which provides the optimal reference scheduling solutions.

Abstract: Imprecise computations allow scheduling algorithms developed for energy-constrained computing devices to trade off output quality with utilization of system resources. The goal of such scheduling algorithms is to utilize imprecise computations to find a feasible schedule for a given task graph while maximizing the quality of service (QoS) and satisfying a hard deadline and an energy bound. This work presents a heuristic for scheduling tasks with potentially imprecise computations, represented with directed acyclic graphs, on multiprocessor platforms. Furthermore, it presents a mixed integer linear program (MILP) formulation of the same problem, which provides the optimal reference scheduling solutions, enabling evaluation of the efficacy of the proposed heuristic. Both the heuristic and the mathematical program take into account the effect of potentially imprecise task inputs on output quality. Furthermore, the presented heuristic is capable of finding feasible schedules even under tight energy budgets. Through extensive experiments, it is shown that in some cases, the proposed heuristic is capable of achieving the same QoS as that found by the MILP. Furthermore, for those task graphs on which the MILP outperforms the proposed heuristic, QoS values obtained with the proposed heuristic are, on average, within 1.24% of the optimal solutions while improving the runtime by a factor of 100 or so. This clearly demonstrates the advantage of the proposed heuristic over the exact solution, especially for large task graphs where solving the mathematical problem is hampered by its lengthy runtime.

Journal Article•DOI•

TEI-ULP: Exploiting Body Biasing to Improve the TEI-Aware Ultralow Power Methods

Woojoo Lee¹, Tae Wook Kang², Jae-Jin Lee², Kyuseung Han², Joongheon Kim¹, Massoud Pedram³•Institutions (3)

Chung-Ang University¹, Electronics and Telecommunications Research Institute², University of Southern California³

01 Sep 2019-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: Simulation results with the latest commercial CMOS process technologies for ULP designs demonstrate the effectiveness of the BB technique along with the TEI-aware voltage scaling method and TEI-aware frequency scaling method.

Abstract: Temperature effect inversion (TEI) phenomenon in ultralow power (ULP) very large scale integration circuits has been identified as an important effect by both academia and industry. Although a number of ULP methods that attempt to exploit the TEI phenomenon have been proposed, the small size of the design exploration space when applying these methods to ULP circuits hinders them from achieving their full potential. This is mainly due to the limited granularity of the supply voltage level control. Starting with an intuition that the body biasing (BB) technique is a key to overcome this limitation, this paper exploits the BB technique along with the TEI-aware voltage scaling (TEI-VS) method and TEI-aware frequency scaling (TEI-FS) method, so as to substantially increase the design spaces of these methods. Techniques for optimally combining the BB technique with TEI-VS and TEI-FS are introduced. Simulation results with the latest commercial CMOS process technologies for ULP designs demonstrate the effectiveness of the proposed methodology.

Proceedings Article•DOI•

kNN-CAM: A k-Nearest Neighbors-based Configurable Approximate Floating Point Multiplier

Ming Yan¹, Yuntao Song¹, Yiyu Feng¹, Ghasem Pasandi¹, Massoud Pedram¹, Shahin Nazarian¹•Institutions (1)

University of Southern California¹

06 Mar 2019

TL;DR: This paper presents the design of kNN-CAM, a k-Nearest Neighbors (kNN)-based Configurable Approximate floating point Multiplier that utilizes approximate computing opportunities to deliver significant area and energy savings.

Abstract: In many real computations, such as arithmetic operations in the hidden layers of a neural network, some amount of inaccuracy can be tolerated without degrading the final results (e.g., maintaining the same level of accuracy for image classification). This paper presents the design of kNN-CAM, a k-Nearest Neighbors (kNN)-based Configurable Approximate floating point Multiplier. kNN-CAM utilizes approximate computing opportunities to deliver significant area and energy savings. A kNN engine is trained on a sufficiently large set of input data to learn the quantity of bit truncation that can be performed in each floating point input with the goal of minimizing energy and area. Next, this trained engine is used to predict the level of approximation for unseen data. Experimental results show that kNN-CAM provides about 67% area saving and 19% speedup while losing only 4.86% accuracy when compared to a 100% accurate multiplier. Furthermore, the application of kNN-CAM in the implementation of handwritten digit recognition provides 47.2% area saving while the accuracy drops by only 0.3%.
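
The bit truncation mentioned in the abstract can be illustrated with a minimal Python sketch. It only models the truncation knob on float32 mantissas; the kNN engine that would pick the truncation level is not shown, and the names `truncate_mantissa`, `approx_multiply`, and `drop_bits` are hypothetical, not taken from the paper.

```python
import struct

def truncate_mantissa(x, drop_bits):
    """Zero the lowest `drop_bits` bits of the 23-bit float32 mantissa of x."""
    if not 0 <= drop_bits <= 23:
        raise ValueError("drop_bits must be between 0 and 23")
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = (0xFFFFFFFF << drop_bits) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

def approx_multiply(a, b, drop_bits):
    """Multiply two operands after truncating both mantissas (illustrative only)."""
    return truncate_mantissa(a, drop_bits) * truncate_mantissa(b, drop_bits)

if __name__ == "__main__":
    exact = 1.75 * 1.75
    approx = approx_multiply(1.75, 1.75, drop_bits=22)
    print(exact, approx, abs(exact - approx) / exact)
```

In this sketch a larger `drop_bits` leaves fewer mantissa bits to multiply, which is the kind of accuracy-for-cost trade a predictor could tune per input.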

Journal Article•DOI•

Design Space Exploration of Memory Controller Placement in Throughput Processors with Deep Learning

Ting-Ru Lin¹, Yunfan Li², Massoud Pedram¹, Lizhong Chen²•Institutions (2)

University of Southern California¹, Oregon State University²

01 Jan 2019-IEEE Computer Architecture Letters

TL;DR: A novel deep-learning based framework is presented that employs a genetic algorithm to efficiently guide exploration through the large design space while utilizing deep learning methods to provide fast performance prediction of design points instead of relying on slow full system simulations.

Abstract: As throughput-oriented processors incur a significant number of data accesses, the placement of memory controllers (MCs) has a critical impact on overall performance. However, due to the lack of a systematic way to explore the huge design space of MC placements, only a few ad-hoc placements have been proposed, leaving much of the opportunity unexploited. In this paper, we present a novel deep-learning based framework that explores this opportunity intelligently and automatically. The proposed framework employs a genetic algorithm to efficiently guide exploration through the large design space while utilizing deep learning methods to provide fast performance prediction of design points instead of relying on slow full system simulations. Evaluation shows that the proposed deep learning models achieve a speedup of 282X for the search process, and the MC placement found by our framework improves the average performance (IPC) of 18 benchmarks by 19.3 percent over the best-known placement found by human intuition.
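
The search loop described above (a genetic algorithm guided by a fast predictor instead of full simulation) might be sketched as follows. Everything here is an assumption for illustration: `surrogate_score` is a hand-written stand-in for the learned performance model (it simply rewards short Manhattan distances from tiles to the nearest MC), and the grid size, population size, and mutation rate are arbitrary.

```python
import random

def surrogate_score(placement, n=4):
    """Hypothetical stand-in for a learned predictor: higher is better.

    Returns the negative mean Manhattan distance from each tile of an
    n x n grid to its nearest memory controller.
    """
    total = 0
    for t in range(n * n):
        tr, tc = divmod(t, n)
        total += min(abs(tr - m // n) + abs(tc - m % n) for m in placement)
    return -total / (n * n)

def evolve(n=4, k=4, pop_size=20, generations=30, seed=0):
    """Evolve placements of k MCs on an n x n grid; returns (best, history)."""
    rng = random.Random(seed)
    tiles = range(n * n)
    pop = [tuple(sorted(rng.sample(tiles, k))) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda p: surrogate_score(p, n), reverse=True)
        history.append(surrogate_score(pop[0], n))
        elite = pop[: pop_size // 2]  # elitism: the best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            genes = sorted(set(a) | set(b))
            child = set(rng.sample(genes, k))  # crossover: mix parents' tiles
            if rng.random() < 0.3:  # mutation: move one MC to a free tile
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice([t for t in tiles if t not in child]))
            children.append(tuple(sorted(child)))
        pop = elite + children
    best = max(pop, key=lambda p: surrogate_score(p, n))
    history.append(surrogate_score(best, n))
    return best, history

if __name__ == "__main__":
    best, history = evolve()
    print(best, history[0], history[-1])
```

Because the elite half is carried over unchanged, the best predicted score never decreases from one generation to the next; in a real flow the scoring call is where a trained model would replace slow simulation.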

Journal Article•DOI•

Low-power data encoding/decoding for energy-efficient static random access memory design

Ghasem Pasandi, Kolsoom Mehrabi, Behzad Ebrahimi, Sied Mehdi Fakhraei, Ali Afzali-Kusha, Massoud Pedram

01 Nov 2019-IET Circuits, Devices & Systems

TL;DR: Simulation results in an industrial and a predictive CMOS technology show that the proposed design for SRAM reduces the energy consumption of read and write operations considerably for some standard test images as input data to the memory.

Abstract: This study presents a new energy-efficient design for static random access memory (SRAM) using low-power input data encoding and output data decoding stages. A data bit reordering algorithm is applied to the input data to increase the number of 0s that are going to be written into the SRAM array. SRAM cells that are more energy-efficient in writing a ‘0’ than a ‘1’ benefit from this, resulting in a reduction in the total power and energy consumption of the whole memory. The input data encoding is performed using a simple circuit, which is built of multiplexers and inverters. After the read operation, data is returned to its initial form using a low-power data decoding circuit. Simulation results in an industrial and a predictive CMOS technology show that the proposed design for SRAM reduces the energy consumption of read and write operations considerably for some standard test images as input data to the memory. For instance, in writing pixels of the Lenna test image into this SRAM and reading them back, 15 and 20% savings are observed for the energy consumption of write and read operations, respectively, compared with the normal write and read operations in standard SRAMs.
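
To make the encode/decode idea concrete, here is a small Python sketch of a zero-maximizing code. It is not the paper's bit reordering algorithm, whose details are not given in the abstract; it swaps in a simpler, well-known scheme (conditional inversion with one flag bit per word), and the names `encode` and `decode` are hypothetical.

```python
def encode(word, width=8):
    """Return (stored, flag): invert the word when it holds more 1s than 0s."""
    mask = (1 << width) - 1
    ones = bin(word & mask).count("1")
    if ones > width - ones:
        return (~word) & mask, 1  # flag=1 records that the word was inverted
    return word & mask, 0

def decode(stored, flag, width=8):
    """Undo encode(): re-invert the stored word when the flag is set."""
    mask = (1 << width) - 1
    return (~stored) & mask if flag else stored & mask

if __name__ == "__main__":
    stored, flag = encode(0b11110111)
    print(format(stored, "08b"), flag, format(decode(stored, flag), "08b"))
```

With this stand-in scheme no stored 8-bit word contains more than four 1s, at the cost of one extra flag bit per word.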

Posted Content•

Optimizing Routerless Network-on-Chip Designs: An Innovative Learning-Based Framework

Ting-Ru Lin, Drew Penney, Massoud Pedram, Lizhong Chen¹•Institutions (1)

Oregon State University¹

11 May 2019-arXiv: Hardware Architecture

TL;DR: A novel deep reinforcement framework is proposed, taking routerless networks-on-chip (NoC) as an evaluation case study, and successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions.

Abstract: Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel deep reinforcement framework, taking routerless networks-on-chip (NoC) as an evaluation case study. The new framework successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs with various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with a Monte Carlo search tree. Experimental results show that, compared with conventional mesh, the proposed deep reinforcement learning (DRL) routerless design achieves a 3.25x increase in throughput, 1.6x reduction in packet latency, and 5x reduction in power. Compared with the state-of-the-art routerless NoC, DRL achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, and 1.14x reduction in average hop count albeit with slightly more power overhead.

Proceedings Article•DOI•

CSrram: Area-Efficient Low-Power Ex-Situ Training Framework for Memristive Neuromorphic Circuits Based on Clustered Sparsity

Arash Fayyazi¹, Souvik Kundu¹, Shahin Nazarian¹, Peter A. Beerel¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

15 Jul 2019

TL;DR: CSrram is presented, an efficient ex-situ training framework for hybrid CMOS-memristive neuromorphic circuits that includes a pre-defined block diagonal clustered (BDC) sparsity algorithm to significantly reduce area and power consumption.

Abstract: Artificial Neural Networks (ANNs) play a key role in many machine learning (ML) applications but pose arduous challenges in terms of storage and computation of network parameters. Memristive crossbar arrays (MCAs) are capable of both computation and storage, making them promising for in-memory computing enabled neural network accelerators. At the same time, the presence of a significant amount of zero weights in ANNs has motivated research in a variety of parameter reduction techniques. However, for crossbar-based architectures, the study of efficient methods to take advantage of network sparsity is still at an early stage. This paper presents CSrram, an efficient ex-situ training framework for hybrid CMOS-memristive neuromorphic circuits. CSrram includes a pre-defined block diagonal clustered (BDC) sparsity algorithm to significantly reduce area and power consumption. The proposed framework is verified on a wide range of datasets including MNIST handwritten recognition, fashion MNIST, breast cancer prediction (BCW), IRIS, and mobile health monitoring. Compared to state-of-the-art fully connected memristive neuromorphic circuits, our CSrram, with only 25% density of weights in the first junction, provides a power and area efficiency of 1.5x and 2.6x (averaged over five datasets), respectively, without any significant test accuracy loss.
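
A pre-defined block-diagonal sparsity pattern can be pictured with a short Python sketch. This is only an illustration of what a block-diagonal mask looks like, assuming equally sized blocks; the actual BDC algorithm is not described in the abstract, and `block_diagonal_mask` and `density` are hypothetical helper names.

```python
def block_diagonal_mask(rows, cols, blocks):
    """Build a rows x cols 0/1 mask that keeps only `blocks` diagonal blocks."""
    if rows % blocks or cols % blocks:
        raise ValueError("rows and cols must be divisible by blocks")
    rb, cb = rows // blocks, cols // blocks
    # A weight at (r, c) survives only if its row block equals its column block.
    return [[1 if r // rb == c // cb else 0 for c in range(cols)]
            for r in range(rows)]

def density(mask):
    """Fraction of weights the mask keeps."""
    kept = sum(sum(row) for row in mask)
    return kept / (len(mask) * len(mask[0]))

if __name__ == "__main__":
    mask = block_diagonal_mask(8, 12, blocks=4)
    for row in mask:
        print("".join(str(v) for v in row))
    print(density(mask))
```

With four equal blocks the mask keeps one quarter of the weights, which corresponds to the 25% density figure quoted for the first junction; each block could then map onto its own smaller crossbar.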

Proceedings Article•DOI•

TIP: A Temperature Effect Inversion-Aware Ultra-Low Power System-on-Chip Platform

Kyuseung Han¹, Sukho Lee¹, Jae-Jin Lee¹, Woojoo Lee², Massoud Pedram³•Institutions (3)

Electronics and Telecommunications Research Institute¹, Chung-Ang University², University of Southern California³

29 Jul 2019

TL;DR: This paper presents a new TEI-inspired SoC platform (called TIP) that relies on a network-on-chip architecture (called µNoC) to realize system interconnects; the µNoC successfully reduces the total number and length of global wires.

Abstract: Researchers have been trying to exploit the temperature effect inversion (TEI) phenomenon to improve the energy efficiency of system-on-chip (SoC) designs without sacrificing their performance. However, TEI-aware low power methods have a critical limitation in that they can only be applied to components within the SoC that do not contain long (global) wires. This is because wire delays continue to increase with rising temperatures irrespective of the operating supply voltage level, which tends to cancel out positive effects of the TEI phenomenon in SoCs. To tackle this limitation and thoroughly utilize the TEI-aware methods, this paper presents a new TEI-inspired SoC platform (called TIP), which relies on a network-on-chip architecture (called µNoC) to realize system interconnects. The µNoC successfully reduces the total number and length of global wires. By fabricating a TIP prototyping chip in Samsung 28nm FD-SOI technology, we verify the effectiveness of TIP. Extensive post-fabrication measurements demonstrate that the chip, while continuing to operate at a target 50MHz clock frequency, can lower its supply voltage from 0.54V to 0.48V at 25°C and to 0.44V at 80°C, which results in up to 35% power saving.
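
As a rough sanity check on the voltage figures above, the sketch below applies the textbook first-order relation that dynamic power scales with the square of the supply voltage at a fixed clock frequency. This is an assumption for illustration, not the paper's measurement method: it ignores leakage and any temperature dependence, and `dynamic_power_saving` is a hypothetical helper.

```python
def dynamic_power_saving(v_old, v_new):
    """Fractional dynamic-power saving at fixed frequency, assuming P ~ V^2."""
    return 1.0 - (v_new / v_old) ** 2

if __name__ == "__main__":
    for v_new in (0.48, 0.44):
        saving = dynamic_power_saving(0.54, v_new)
        print(f"0.54 V -> {v_new:.2f} V: about {saving:.1%} dynamic power saving")
```

Under this simplification, lowering the supply from 0.54V to 0.48V gives roughly 21% and to 0.44V roughly 34%, which is of the same order as the reported saving of up to 35%; the measured number reflects total chip power, so the two are not expected to match exactly.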

Proceedings Article•DOI•

qEC: A Logical Equivalence Checking Framework Targeting SFQ Superconducting Circuits

Arash Fayyazi¹, Shahin Nazarian¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

01 Jul 2019

TL;DR: A framework for logical equivalence checking (LEC) of SFQ circuits called qEC is proposed, built on the ABC tool but with the added ability to check properties of SFQ superconducting circuits; its verification time is reported for the Sport lab SFQ logic circuit benchmark suite.

Abstract: Superconducting devices have emerged as one of the most promising beyond-CMOS technologies with a switching delay of 1ps and switching energy of $10^{-19}\mathrm{J}$ to achieve high performance, energy-efficient systems and make quantum computing a reality. Design and verification methodologies of single flux quantum (SFQ) logic fundamentally differ from those of the CMOS logic, due to key differences such as pulse signal type, ultra-deep (gate-level) pipelining, and path-balancing in SFQ circuits. In this paper, we propose a framework for logical equivalence checking (LEC) of SFQ circuits called qEC. qEC is built on the ABC tool but adds the ability to check properties of SFQ superconducting circuits. Several timing and structural checks are embedded in our framework. We benchmark the framework on post-synthesis netlists with an SFQ technology. Results report the verification time for the Sport lab SFQ logic circuit benchmark suite, including a 16-bit array multiplier, a 16-bit integer divider, and ISCAS'85 circuits, relative to that of the ABC tool for similar CMOS circuits.

Journal Article•DOI•

QoS guaranteed online management of battery swapping station under dynamic energy pricing

Luhao Wang¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

08 Feb 2019

TL;DR: The authors consider a realistic BSS framework in which EVs can arrive at the BSS with time-of-day-dependent rates and different battery states of charge, and investigate the battery charging scheduling problem in the BSS under dynamic energy pricing.

Abstract: Further popularisation of electric vehicles (EVs) is hindered by their relatively short driving distance and long battery charging time. To overcome these shortcomings, the battery swapping station (BSS) has been proposed as a means of satisfying the increasing demands for fast EV battery recharging. At a BSS, (partially) depleted batteries from EVs can be replaced with partially or fully charged ones almost instantaneously. Recharging scheduling and maintenance of batteries are done by the operator of the BSS, with the target of minimising electrical energy costs while satisfying customer demands. In this study, the authors consider a realistic BSS framework in which EVs can arrive at the BSS with time-of-day-dependent rates and different battery states of charge. They investigate the battery charging scheduling problem in the BSS under dynamic energy pricing. They solve (i) an online optimal BSS control problem to minimise the energy cost with a quality-of-service (QoS) guarantee, and (ii) an offline optimal BSS design problem to determine the optimal number of stored batteries so as to achieve a desirable tradeoff between flexibility in charging and amortised battery costs. The experimental results show that the total charging energy cost can be reduced significantly under different traffic scenarios.
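
The effect of dynamic pricing on charging cost can be seen in a deliberately tiny Python sketch. It is far simpler than the online control problem in the paper: it assumes a single battery, known hourly prices, and a fixed number of one-hour charging slots needed before a deadline, and `cheapest_slots` is a hypothetical helper.

```python
def cheapest_slots(prices, slots_needed, deadline):
    """Pick the cheapest `slots_needed` hours among hours 0 .. deadline-1.

    Returns (chosen_hours, total_cost). Ties are broken by the earlier hour.
    """
    if slots_needed > deadline or deadline > len(prices):
        raise ValueError("not enough hours available before the deadline")
    candidates = sorted(range(deadline), key=lambda t: (prices[t], t))
    chosen = sorted(candidates[:slots_needed])
    return chosen, sum(prices[t] for t in chosen)

if __name__ == "__main__":
    hourly_prices = [5, 1, 3, 2, 4]  # illustrative price per charging hour
    print(cheapest_slots(hourly_prices, slots_needed=2, deadline=4))
```

Keeping more charged batteries in stock effectively pushes deadlines later, which widens the set of cheap hours to choose from; this is the flexibility-versus-battery-cost tradeoff the abstract refers to.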

Proceedings Article•DOI•

A Statistical Static Timing Analysis Tool for Superconducting Single-Flux-Quantum Circuits

Bo Zhang, Fangzhou Wang, Sandeep Gupta, Massoud Pedram

01 Jul 2019

TL;DR: A bootstrap-based statistical static timing analysis tool called qSSTA that can reasonably estimate a minimum workable clock period by executing a large number of bootstrap iterations from the discrete sampling spaces of all gates under a certain correlation specification.

Abstract: As a beyond-CMOS technology, superconducting single-flux-quantum (SFQ) technology promises fast processing speed and excellent energy efficiency. With the increasing complexity of SFQ circuits, the accurate and fast estimation of the workable clock period under process variation becomes more urgent. However, the estimation of the minimum workable clock period is difficult due to the spatial correlation of physical parameters and the non-normal distribution of timing parameters (propagation delay, setup time, and hold time). Therefore, a good statistical static timing analysis (SSTA) tool for SFQ circuits is necessary. This paper presents a bootstrap-based statistical static timing analysis tool called qSSTA. qSSTA can reasonably estimate a minimum workable clock period by executing a large number of bootstrap iterations from the discrete sampling spaces of all gates under a certain correlation specification. By applying path pruning methods, qSSTA skips the calculations on unimportant paths and hence reduces run time and memory. Experimental results show that the number of important paths can be small. Among 19114 paths of the 16-bit integer divider, only 73 paths are important for estimating the minimum workable clock period. We only need 84.21 seconds to run 10,000 iterations.
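
The bootstrap-and-prune idea can be sketched in a few lines of Python. This is a heavily simplified illustration under stated assumptions: gate delays are resampled independently (no correlation specification), setup and hold times are ignored, and the clock period is taken as a high quantile of the worst path delay. The names `bootstrap_clock_period` and `prune_paths` are hypothetical and the sample data are made up.

```python
import random

def bootstrap_clock_period(gate_samples, paths, iterations=1000,
                           quantile=0.99, seed=0):
    """Estimate a workable clock period as a quantile of the worst path delay."""
    rng = random.Random(seed)
    worst = []
    for _ in range(iterations):
        # Resample one delay per gate from its discrete sample set.
        draw = {g: rng.choice(s) for g, s in gate_samples.items()}
        worst.append(max(sum(draw[g] for g in p) for p in paths))
    worst.sort()
    return worst[min(len(worst) - 1, int(quantile * len(worst)))]

def prune_paths(gate_samples, paths):
    """Drop paths that can never be the slowest under any resampling."""
    # Largest guaranteed (all-minimum) delay over all paths.
    floor = max(sum(min(gate_samples[g]) for g in p) for p in paths)
    return [p for p in paths
            if sum(max(gate_samples[g]) for g in p) >= floor]

if __name__ == "__main__":
    samples = {"a": [1.0, 1.2], "b": [2.0, 2.5], "c": [0.1, 0.2]}
    all_paths = [["a", "b"], ["c"]]
    important = prune_paths(samples, all_paths)
    print(important, bootstrap_clock_period(samples, important))
```

A path whose largest possible delay is below another path's smallest possible delay cannot set the clock period, so skipping it leaves the estimate unchanged while reducing the work per iteration.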

Posted Content•

Coarse2Fine: A Two-stage Training Method for Fine-grained Visual Classification

Amir Erfan Eshratifar¹, David Eigen, Michael Gormish, Massoud Pedram¹•Institutions (1)

University of Southern California¹

06 Sep 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a training method for visual attention networks, Coarse2Fine, which creates a differentiable path from the input space to the attended feature maps, which will guide the attention maps to better attend the fine-grained features.

Abstract: Small inter-class and large intra-class variations are the main challenges in fine-grained visual classification. Objects from different classes share visually similar structures, and objects in the same class can have different poses and viewpoints. Therefore, the proper extraction of discriminative local features (e.g., a bird's beak or a car's headlight) is crucial. Most of the recent successes on this problem are based upon attention models, which can localize and attend to the local discriminative object parts. In this work, we propose a training method for visual attention networks, Coarse2Fine, which creates a differentiable path from the input space to the attended feature maps. Coarse2Fine learns an inverse mapping function from the attended feature maps to the informative regions in the raw image, which will guide the attention maps to better attend to the fine-grained features. We show that Coarse2Fine and orthogonal initialization of the attention weights can surpass the state-of-the-art accuracies on common fine-grained classification tasks.

Proceedings Article•DOI•

A Hybrid Framework for Functional Verification using Reinforcement Learning and Deep Learning

Karunveer Singh¹, Rishabh Gupta¹, Vikram Gupta¹, Arash Fayyazi¹, Massoud Pedram¹, Shahin Nazarian¹•Institutions (1)

University of Southern California¹

13 May 2019

TL;DR: A novel hybrid verification framework (HVF) which uses Reinforcement Learning (RL) and Deep Neural Networks (DNNs) to accelerate the verification of complex systems.

Abstract: In this paper, we propose a novel hybrid verification framework (HVF) which uses Reinforcement Learning (RL) and Deep Neural Networks (DNNs) to accelerate the verification of complex systems. More precisely, our HVF incorporates RL to generate all possible sequences of vectors needed to approach a target state as well as the corresponding path to the target state which contains a potential design error. Furthermore, HVF utilizes DNNs to accelerate the verification of complex data paths in the target states. We have tested our framework on several circuits including multi-core designs as well as bus-arbiters and confirmed its significant verification speedup when compared to prior work. For example, HVF provides a total speedup of 4.5x for a quad-core MIPS processor verification.

Journal Article•DOI•

ACHILLES: Accuracy-Aware High-Level Synthesis Considering Online Quality Management

Shayan Tabatabaei-Nikkhah¹, Mahdi Zahedi¹, Mehdi Kamal¹, Ali Afzali-Kusha¹, Massoud Pedram²•Institutions (2)

University of Tehran¹, University of Southern California²

01 Aug 2019-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: An accuracy-aware design framework, which synthesizes a high-level description of an input application with the objective of minimizing the energy consumption of the synthesized circuit, is presented, and the results show that the relative coverage of large errors may be increased from 21% to 55% by employing the synthetic minority oversampling technique.

Abstract: In this paper, we present an accuracy-aware design framework [called accuracy-aware high-level synthesis (Achilles)], which synthesizes a high-level description of an input application with the objective of minimizing the energy consumption of the synthesized circuit. The proposed framework includes two main parts: Achilles and light-weight predictor selection. The framework leverages light-weight error predictors (i.e., machine learning-based classifiers) to achieve more energy reduction by dynamically managing the output quality level (exact or approximate) of the synthesized circuit. To synthesize the input application, first, we exploit a heuristic algorithm to determine the quality level required for each operation in the data flow graph (DFG) representation of the input application. Next, for synthesizing the input application, we propose an effective Achilles algorithm which utilizes the flexibility of the available multiquality arithmetic units in a high-level cell library to synthesize the datapath. To improve the efficiency, the process starts by iteratively reducing the number of functional units required for synthesizing the DFG. Then, a proper light-weight error predictor satisfying the user-expected quality is chosen from the available predictors in the framework. Based on the quality requirements, three different quality management modes are considered. The efficacy of the proposed framework is assessed for benchmarks from the image and signal processing as well as robotics domains. The study of these benchmarks indicates that Achilles may reduce the energy consumption by up to 51% (36% on average), up to 72% (51% on average), and up to 57% (33% on average) in threshold, average, and hybrid modes, respectively, for the studied cases. Moreover, the results show that the relative coverage of large errors may be increased from 21% to 55% by employing the synthetic minority oversampling technique.

Proceedings Article•DOI•

qCG: A Low-Power Multi-Domain SFQ Logic Design and Verification Framework

Shahin Nazarian¹, Arash Fayyazi¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

01 Nov 2019

TL;DR: qCG is a multi-domain design and verification framework that utilizes clock gating and frequency scaling to optimize dynamic power dissipation, not only for SFQ circuits but also for their clock networks and cooling systems.

Abstract: In this paper, we propose qCG, a multi-domain design and verification framework, which utilizes clock gating and frequency scaling to optimize dynamic power dissipation. SFQ circuits are ultra-deep pipelined at the logic level, resulting in large clock distribution networks which account for a considerable part of overall power dissipation. We have shown that qCG significantly increases power efficiency, not only for SFQ circuits, but also for their clock networks and, inherently, their cooling systems. The verification engine of qCG learns to increase the quality of results in terms of verification time and coverage. Datapath and coverage meters are embedded to verify the pulse integrity of clock signals, SFQ fanout, and path-balancing properties. Our experiments on several SFQ benchmark circuits show that qCG provides 3X power reductions for the chip. Results also confirm that when compared to a traditional random-based coverage-driven approach, qCG provides significant verification quality improvement including 2.33X verification speedup.
