Showing papers by "Peter A. Beerel" published in 2019

Open Access

Journal Article•DOI•

ColdFlux Superconducting EDA and TCAD Tools Project: Overview and Progress

Coenrad J. Fourie¹, Kyle Jackman¹, Matthys M. Botha¹, Sasan Razmkhah², Pascal Febvre², Christopher L. Ayala, Qiuyun Xu, Nobuyuki Yoshikawa, Erin Patrick³, Mark E. Law³, Yanzi Wang⁴, Murali Annavaram⁵, Peter A. Beerel⁵, Sandeep K. S. Gupta⁵, Shaheen Nazarian⁵, Massoud Pedram⁵•Institutions (5)

Stellenbosch University¹, Los Angeles Harbor College², University of Florida³, Northeastern University⁴, University of Southern California⁵

10 Jan 2019-IEEE Transactions on Applied Superconductivity

TL;DR: An overview of the current and planned activities related to the ColdFlux project is presented and the design assumptions and decisions that were made to allow the development of design tools for million-gate circuits are justified.

Abstract: The IARPA SuperTools program requires the development of superconducting electronic design automation (S-EDA) and superconducting technology computer-aided design (S-TCAD) tools aimed at enabling the reliable design of complex superconducting digital circuits with millions of Josephson junctions. Within the SuperTools program, the ColdFlux project addresses S-EDA and S-TCAD tool research and development in four areas: 1) RTL synthesis, architectures and verification; 2) analog design and layout synthesis; 3) physical design and test; and 4) device and process modeling/simulation and cell library design. Capabilities include, but are not limited to, the following: device-level modeling and simulation of Josephson junctions, modeling and simulation of superconducting manufacturing processes, powerful new electrical circuit simulation, parameterized schematic and layout libraries, optimization, compact SPICE-like model extraction, timing analysis, behavioral, register-transfer-level and logic synthesis, clock tree synthesis, placement and routing, layout-versus-schematic extraction, functional verification, and the evaluation of designs in the presence of magnetic fields and trapped flux. ColdFlux consists of six research groups from four continents. Here, we present an overview of the current and planned activities related to the project and justify the design assumptions and decisions that were made to allow the development of design tools for million-gate circuits.

54 citations

Journal Article•DOI•

Pre-Defined Sparse Neural Networks With Hardware Acceleration

Sourya Dey¹, Kuan-Wen Huang¹, Peter A. Beerel¹, Keith M. Chugg¹•Institutions (1)

University of Southern California¹

12 Apr 2019-IEEE Journal on Emerging and Selected Topics in Circuits and Systems

TL;DR: In this article, pre-defined sparsity is proposed to reduce the complexity during both training and inference, regardless of the implementation platform, together with an architecture for hardware acceleration that is compatible with pre-defined sparsity.

Abstract: Neural networks have proven to be extremely powerful tools for modern artificial intelligence applications, but computational and storage complexity remain limiting factors. This paper presents two compatible contributions towards reducing the time, energy, computational, and storage complexities associated with multilayer perceptrons. Pre-defined sparsity is proposed to reduce the complexity during both training and inference, regardless of the implementation platform. Our results show that storage and computational complexity can be reduced by factors greater than 5X without significant performance loss. The second contribution is an architecture for hardware acceleration that is compatible with pre-defined sparsity. This architecture supports both training and inference modes and is flexible in the sense that it is not tied to a specific number of neurons. For example, this flexibility implies that neural networks of various sizes can be supported on field-programmable gate arrays (FPGAs) of various sizes.
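
As a rough illustration of the pre-defined sparsity idea described in this abstract, the sketch below fixes a connectivity mask for one fully connected layer before any training and counts the weights that would need to be stored. The function names, the random fixed fan-in pattern, and the layer sizes are assumptions made for this example; they are not taken from the paper.

```python
import random


def predefined_mask(n_in, n_out, fan_in, seed=0):
    """Choose, before training, which `fan_in` inputs each output neuron may use."""
    rng = random.Random(seed)
    mask = []
    for _ in range(n_out):
        allowed = set(rng.sample(range(n_in), fan_in))
        mask.append([1 if i in allowed else 0 for i in range(n_in)])
    return mask


def sparse_forward(x, weights, mask):
    """Matrix-vector product that skips connections absent from the mask."""
    return [
        sum(w * xi for w, m, xi in zip(w_row, m_row, x) if m)
        for w_row, m_row in zip(weights, mask)
    ]


def stored_weights(mask):
    """Number of weights that actually exist under the mask."""
    return sum(sum(row) for row in mask)


# Hypothetical layer: 20 inputs, 10 outputs, 4 allowed inputs per output.
n_in, n_out, fan_in = 20, 10, 4
mask = predefined_mask(n_in, n_out, fan_in)
weights = [[1.0] * n_in for _ in range(n_out)]
outputs = sparse_forward([1.0] * n_in, weights, mask)

dense_count = n_in * n_out
sparse_count = stored_weights(mask)
reduction = dense_count / sparse_count  # 200 / 40 for these sizes
```

Because the mask never changes, the same reduced set of weights is used in both training and inference, which is the property the abstract emphasizes.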

23 citations

Proceedings Article•DOI•

System-Level Framework for Logic Obfuscation with Quantified Metrics for Evaluation

Vivek V. Menon¹, Gaurav Kolhe¹, Andrew G. Schmidt¹, Joshua S. Monson¹, Matthew French¹, Yinghua Hu¹, Peter A. Beerel¹, Pierluigi Nuzzo¹•Institutions (1)

University of Southern California¹

01 Sep 2019

TL;DR: MIRAGE, a system-level end-to-end framework for integrated circuit obfuscation, is presented and the effectiveness of the proposed framework is illustrated on benchmarks from ISCAS and cores from an open-source system-on-chip test platform.

Abstract: Logic obfuscation techniques are used to deter intellectual property piracy, reverse engineering, and counterfeiting threats in the design and manufacturing of integrated circuits. However, obfuscation can be reverse-engineered in a variety of ways, there has been little effort in measuring the effectiveness of different obfuscation techniques using uniform security and overhead metrics, and there have been only limited investigations into the effect of combining multiple methods on the same die. This paper presents MIRAGE, a system-level end-to-end framework for integrated circuit obfuscation. MIRAGE includes a front-end design space exploration tool that selects an appropriate combination of obfuscation techniques for a given circuit, a netlist manipulation application programming interface to apply the obfuscation, and a back-end analysis tool that evaluates the obfuscation strength in terms of attack resiliency time as well as area, power, and timing overhead. The effectiveness of the proposed framework is illustrated on benchmarks from ISCAS and cores from an open-source system-on-chip test platform, thus extending the analysis to practical circuits with more than 100k gates. MIRAGE utilizes 1,242 circuit configurations to evaluate 6 obfuscation methods in terms of the attack resiliency time and the obfuscation overheads such as area, power, and timing.

13 citations

Proceedings Article•DOI•

pSConv: A Pre-defined Sparse Kernel Based Convolution for Deep CNNs

Souvik Kundu¹, Saurav Prakash¹, Haleh Akrami¹, Peter A. Beerel¹, Keith M. Chugg¹•Institutions (1)

University of Southern California¹

01 Sep 2019

TL;DR: This paper proposed pSConv, a pre-defined sparse 2D kernel based convolution, which showed a parameter count reduction of up to 4.24× with modest degradation in classification accuracy relative to standard CNNs.

Abstract: The high demand for computational and storage resources severely impedes the deployment of deep convolutional neural networks (CNNs) in limited resource devices. Recent CNN architectures have proposed reduced complexity versions (e.g., ShuffleNet and MobileNet) but at the cost of modest decreases in accuracy. This paper proposes pSConv, a pre-defined sparse 2D kernel based convolution, which promises significant improvements in the trade-off between complexity and accuracy for both CNN training and inference. To explore the potential of this approach, we have experimented with two widely accepted datasets, CIFAR-10 and Tiny ImageNet, in sparse variants of both the ResNet18 and VGG16 architectures. Our approach shows a parameter count reduction of up to 4.24× with modest degradation in classification accuracy relative to that of standard CNNs. Our approach outperforms a popular variant of ShuffleNet using a variant of ResNet18 with pSConv having 3 × 3 kernels with only four of nine elements not fixed at zero. In particular, the parameter count is reduced by 1.7× for CIFAR-10 and 2.29× for Tiny ImageNet with an increased accuracy of ∼ 4%.
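
To make the "four of nine elements not fixed at zero" idea concrete, here is a minimal sketch of a 3 × 3 kernel whose non-zero positions are chosen up front. The specific positions, the helper names, and the tiny all-ones image are hypothetical; the abstract does not say which positions the authors keep.

```python
def make_kernel_mask(k, nonzero_positions):
    """k x k binary mask; only the listed (row, col) positions may hold weights."""
    return [[1 if (r, c) in nonzero_positions else 0 for c in range(k)]
            for r in range(k)]


def apply_mask(kernel, mask):
    """Force every kernel element outside the mask to zero."""
    return [[w * m for w, m in zip(k_row, m_row)]
            for k_row, m_row in zip(kernel, mask)]


def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding and stride 1."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [
        [sum(image[i + r][j + c] * kernel[r][c]
             for r in range(k) for c in range(k))
         for j in range(w - k + 1)]
        for i in range(h - k + 1)
    ]


# Hypothetical pattern: 4 of the 9 positions are allowed to be non-zero.
positions = {(0, 0), (0, 2), (1, 1), (2, 1)}
mask = make_kernel_mask(3, positions)
kernel = apply_mask([[1.0] * 3 for _ in range(3)], mask)

image = [[1.0] * 4 for _ in range(4)]
out = conv2d_valid(image, kernel)

kept = sum(sum(row) for row in mask)
per_kernel_reduction = 9 / kept  # parameters saved within this single kernel
```

The per-kernel ratio here covers only one masked kernel; the network-level reductions quoted in the abstract depend on which layers are made sparse.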

13 citations

Proceedings Article•DOI•

CSrram: Area-Efficient Low-Power Ex-Situ Training Framework for Memristive Neuromorphic Circuits Based on Clustered Sparsity

Arash Fayyazi¹, Souvik Kundu¹, Shahin Nazarian¹, Peter A. Beerel¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

15 Jul 2019

TL;DR: CSrram is presented, an efficient ex-situ training framework for hybrid CMOS-memristive neuromorphic circuits that includes a pre-defined block diagonal clustered (BDC) sparsity algorithm to significantly reduce area and power consumption.

Abstract: Artificial Neural Networks (ANNs) play a key role in many machine learning (ML) applications but pose arduous challenges in terms of storage and computation of network parameters. Memristive crossbar arrays (MCAs) are capable of both computation and storage, making them promising for in-memory computing enabled neural network accelerators. At the same time, the presence of a significant amount of zero weights in ANNs has motivated research in a variety of parameter reduction techniques. However, for crossbar-based architectures, the study of efficient methods to take advantage of network sparsity is still in the early stage. This paper presents CSrram, an efficient ex-situ training framework for hybrid CMOS-memristive neuromorphic circuits. CSrram includes a pre-defined block diagonal clustered (BDC) sparsity algorithm to significantly reduce area and power consumption. The proposed framework is verified on a wide range of datasets including MNIST handwritten recognition, fashion MNIST, breast cancer prediction (BCW), IRIS, and mobile health monitoring. Compared to state-of-the-art fully connected memristive neuromorphic circuits, our CSrram, with only 25% density of weights in the first junction, provides a power and area efficiency of 1.5x and 2.6x (averaged over five datasets), respectively, without any significant test accuracy loss.
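
The block diagonal structure mentioned in this abstract can be pictured with a small mask-building sketch. This only shows what a block-diagonal connectivity pattern looks like and how its density is computed; the function name, the equal-sized clusters, and the 8 × 8 junction are assumptions, and the actual BDC algorithm is not described in the abstract.

```python
def block_diagonal_mask(n_in, n_out, clusters):
    """Connect output r to input c only when both fall in the same cluster."""
    if n_in % clusters or n_out % clusters:
        raise ValueError("layer sizes must be divisible by the cluster count")
    in_block = n_in // clusters
    out_block = n_out // clusters
    return [
        [1 if (r // out_block) == (c // in_block) else 0 for c in range(n_in)]
        for r in range(n_out)
    ]


# Hypothetical junction: 8 inputs, 8 outputs, 4 clusters of size 2 x 2.
mask = block_diagonal_mask(8, 8, 4)
nonzero = sum(sum(row) for row in mask)
density = nonzero / (8 * 8)  # fraction of weights that exist
```

With four equal clusters the density comes out at 25%, the same figure the abstract quotes for the first junction, although the sizes used here are purely illustrative.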

7 citations

Posted Content•

A Pre-defined Sparse Kernel Based Convolution for Deep CNNs.

Souvik Kundu, Saurav Prakash, Haleh Akrami, Peter A. Beerel, Keith M. Chugg¹•Institutions (1)

University of Southern California¹

02 Oct 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This article proposed pSConv, a pre-defined sparse 2D kernel-based convolution, which improves the trade-off between complexity and accuracy for both CNN training and inference.

Abstract: The high demand for computational and storage resources severely impedes the deployment of deep convolutional neural networks (CNNs) in limited-resource devices. Recent CNN architectures have proposed reduced complexity versions (e.g., ShuffleNet and MobileNet) but at the cost of modest decreases in accuracy. This paper proposes pSConv, a pre-defined sparse 2D kernel-based convolution, which promises significant improvements in the trade-off between complexity and accuracy for both CNN training and inference. To explore the potential of this approach, we have experimented with two widely accepted datasets, CIFAR-10 and Tiny ImageNet, in sparse variants of both the ResNet18 and VGG16 architectures. Our approach shows a parameter count reduction of up to 4.24x with modest degradation in classification accuracy relative to that of standard CNNs. Our approach outperforms a popular variant of ShuffleNet using a variant of ResNet18 with pSConv having 3x3 kernels with only four of nine elements not fixed at zero. In particular, the parameter count is reduced by 1.7x for CIFAR-10 and 2.29x for Tiny ImageNet with an increased accuracy of ~4%.

6 citations

Journal Article•DOI•

Automatic Retiming of Two-Phase Latch-Based Resilient Circuits

Huimei Cheng¹, Hsiao-Lun Wang¹, Minghe Zhang², Dylan Hand¹, Peter A. Beerel¹•Institutions (2)

University of Southern California¹, Georgia Institute of Technology²

01 Jul 2019-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: This paper proposes two alternatives to reduce the overhead in two-phase latch-based resilient circuits: a new resiliency-aware graph-based approach to solve the retiming problem, and a virtual resynthesis library that enables commercial synthesis tools to recognize the EDL overhead and optimize total area during retiming.

Abstract: Timing resilient design has shown significant promise in mitigating the excess margins associated with rare worst-case data and increased process, voltage, and temperature variations. However, resilient circuits need error detecting sequential logic (EDL) to detect timing errors, which incurs area and power overhead. This paper proposes two alternatives to reduce the overhead in two-phase latch-based resilient circuits. The first is a new resiliency-aware graph-based approach to solve the retiming problem. The second uses a virtual resynthesis library to enable commercial synthesis tools to recognize the EDL overhead and optimize total area during retiming. We compare both approaches to a commercially standard retiming approach, which ignores the resiliency overheads, on a wide variety of benchmarks. Our experimental results show that our methods are computationally efficient and reduce the total circuit area by an average of up to 10%–15% when compared to traditional retiming.

4 citations

Proceedings Article•DOI•

qBSA: Logic Design of a 32-bit Block-Skewed RSFQ Arithmetic Logic Unit

Souvik Kundu¹, Gourav Datta¹, Peter A. Beerel¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

01 Jul 2019

TL;DR: This paper proposes to increase the throughput of SFQ pipelines by redesigning the datapath to accept and operate on least-significant bits (LSBs) clock cycles earlier than more significant bits, and develops a block-skewed MIPS-compatible 32-bit ALU.

Abstract: Single flux quantum (SFQ) circuits are an attractive beyond-CMOS technology because they promise two orders of magnitude lower power at clock frequencies exceeding 25 GHz. However, every SFQ gate is clocked, creating very deep gate-level pipelines that are difficult to keep full, particularly for sequences that include data-dependent operations. This paper proposes to increase the throughput of SFQ pipelines by redesigning the datapath to accept and operate on least-significant bits (LSBs) clock cycles earlier than more significant bits. This skewed datapath approach reduces the latency of the LSB side, which can be fed back earlier for use in subsequent data-dependent operations, increasing their throughput. In particular, we propose to group the bits into 4-bit blocks that are operated on concurrently and create block-skewed datapath units for 32-bit operation. This skewed approach allows a subsequent data-dependent operation to start evaluating as soon as the first 4-bit block completes. Using this general approach, we develop a block-skewed MIPS-compatible 32-bit ALU. Our gate-level Verilog design improves the throughput of 32-bit data-dependent operations by 2x and 1.5x compared to previously proposed 4-bit bit-slice and 32-bit Ladner-Fischer ALUs, respectively. We have quantified the benefit of this design on instructions per cycle (IPC) for various RISC-V benchmarks assuming a range of non-ALU operation latencies from one to ten cycles. Averaging across benchmarks, our experimental results show that, compared to the 32-bit Ladner-Fischer ALU, our proposed architecture provides IPC improvements ranging from 1.37x (assuming one-cycle non-ALU latency) to 1.2x (assuming ten-cycle non-ALU latency). Moreover, our average IPC improvements compared to a 32-bit ALU based on the 4-bit bit-slice range from 2.93x to 4x.
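
The block-skewed idea described above can be mimicked in software with a toy model of a 32-bit addition split into eight 4-bit blocks, where the least-significant block is produced first. The one-cycle-per-block timing, the generator interface, and the names below are assumptions for illustration only; they do not model the actual SFQ gate-level pipeline.

```python
BLOCK_BITS = 4
NUM_BLOCKS = 8  # 8 blocks x 4 bits = 32 bits


def block_skewed_add(a, b):
    """Yield (cycle, block_index, block_sum), least-significant block first.

    Each block is assumed to complete one cycle after the previous one,
    so block i is available at cycle i + 1 (a hypothetical timing model).
    """
    block_mask = (1 << BLOCK_BITS) - 1
    carry = 0
    for i in range(NUM_BLOCKS):
        x = (a >> (i * BLOCK_BITS)) & block_mask
        y = (b >> (i * BLOCK_BITS)) & block_mask
        s = x + y + carry
        carry = s >> BLOCK_BITS  # carry ripples into the next block
        yield i + 1, i, s & block_mask


def assemble(blocks):
    """Recombine the per-block results into one 32-bit value."""
    return sum(value << (i * BLOCK_BITS) for _, i, value in blocks)


blocks = list(block_skewed_add(0x89ABCDEF, 0x12345678))
total = assemble(blocks)
first_cycle, first_index, first_value = blocks[0]
```

In this model a dependent operation could consume `first_value` at cycle 1 instead of waiting for all eight blocks, which is the latency benefit the abstract attributes to skewing.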

4 citations

Posted Content•

Metastability-Resilient Synchronization FIFO for SFQ Logic

Gourav Datta, Haolin Cong, Souvik Kundu, Peter A. Beerel¹•Institutions (1)

University of Southern California¹

10 Oct 2019-arXiv: Emerging Technologies

TL;DR: This paper shows that naive SFQ clock domain crossing (CDC) first-in-first-out buffers (FIFOs) are vulnerable to delay increases, motivating the need for more robust CDC FIFOs, and proposes a novel 1-bit metastability-resilient SFQ CDC FIFO that delivers over a 1000× reduction in logical error rate at 30 GHz.

Abstract: Digital single-flux quantum (SFQ) technology promises to meet the demands of ultra-low-power and high-speed computing needed for future exascale supercomputing systems. The combination of ultra-high clock frequencies, gate-level pipelines, and numerous sources of variability in SFQ circuits, however, makes low-skew global clock distribution a challenge. This motivates the support of multiple independent clock domains and related clock domain crossing circuits that enable reliable communication across domains. Existing J-SIM simulation models indicate that setup violations can cause clock-to-Q increases of up to 100%. This paper first shows that naive SFQ clock domain crossing (CDC) first-in-first-out buffers (FIFOs) are vulnerable to these delay increases, motivating the need for more robust CDC FIFOs. Inspired by CMOS multi-flip-flop asynchronous FIFO synchronizers, we then propose a novel 1-bit metastability-resilient SFQ CDC FIFO that simulations show delivers over a 1000× reduction in logical error rate at 30 GHz. Moreover, for a 10-stage FIFO, the Josephson junction (JJ) area of our proposed design is only 7.5% larger than the non-resilient counterpart. Finally, we propose design guidelines that define the minimal FIFO depth subject to both throughput and burstiness constraints.

3 citations

Proceedings Article•DOI•

qCDC: Metastability-Resilient Synchronization FIFO for SFQ Logic

Gourav Datta¹, Haolin Cong¹, Souvik Kundu¹, Peter A. Beerel¹•Institutions (1)

University of Southern California¹

01 Jul 2019

3 citations

Posted Content•

Automatic Conversion from Flip-flop to 3-phase Latch-based Designs.

Huimei Cheng, Yichen Gu, Peter A. Beerel

25 Jun 2019-arXiv: Hardware Architecture

TL;DR: A novel automated design flow is presented that converts flip-flop to 3-phase latch-based designs and the resulting circuits have the same performance as the master-slave based designs but require significantly fewer latches.

Abstract: Latch-based designs have many benefits over their flip-flop-based counterparts but have limited use, partially because most RTL specifications are flop-centric and automatic conversion of flip-flop (FF) to latch-based designs is challenging. Conventional conversion algorithms target master-slave latch-based designs with two non-overlapping clocks. This paper presents a novel automated design flow that converts flip-flop to 3-phase latch-based designs. The resulting circuits have the same performance as the master-slave based designs but require significantly fewer latches. Our experimental results demonstrate the potential for savings in the number of latches (21.3%), area (5.8%), and power (16.3%) on a variety of ISCAS, CEP, and CPU benchmark circuits, compared to the master-slave conversions.

Posted Content•

Neural Network Training with Approximate Logarithmic Computations

Arnab Sanyal¹, Peter A. Beerel¹, Keith M. Chugg¹•Institutions (1)

University of Southern California¹

22 Oct 2019-arXiv: Learning

TL;DR: This paper proposed an end-to-end training and inference scheme that eliminates multiplications by using approximate operations in the log-domain, which has the potential to significantly reduce implementation complexity.

Abstract: The high computational complexity associated with training deep neural networks limits online and real-time training on edge devices. This paper proposes an end-to-end training and inference scheme that eliminates multiplications by using approximate operations in the log-domain, which has the potential to significantly reduce implementation complexity. We implement the entire training procedure in the log-domain, with fixed-point data representations. This training procedure is inspired by hardware-friendly approximations of log-domain addition which are based on look-up tables and bit-shifts. We show that our 16-bit log-based training can achieve classification accuracy within approximately 1% of the equivalent floating-point baselines for a number of commonly used datasets.
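
The log-domain arithmetic sketched in this abstract rests on two identities: a product becomes a sum of logarithms, and a sum becomes the larger logarithm plus a correction term log2(1 + 2^(-d)), where d is the gap between the two logarithms. The example below tabulates that correction in a small look-up table. The table step, table size, rounding rule, and restriction to positive values are assumptions of this sketch, not the approximations used in the paper.

```python
import math

LUT_STEP = 0.25   # hypothetical grid spacing for the difference d
LUT_SIZE = 64     # beyond this the correction is treated as zero

# Correction term log2(1 + 2^-d), precomputed on a coarse grid of d >= 0.
LUT = [math.log2(1.0 + 2.0 ** (-i * LUT_STEP)) for i in range(LUT_SIZE)]


def log_mul(la, lb):
    """Multiplication of two positive numbers is addition of their logs."""
    return la + lb


def log_add(la, lb):
    """Approximate log2(2**la + 2**lb) with a table look-up."""
    hi, lo = max(la, lb), min(la, lb)
    idx = round((hi - lo) / LUT_STEP)  # nearest tabulated difference
    correction = LUT[idx] if idx < LUT_SIZE else 0.0
    return hi + correction


x, y = 3.0, 5.0
lx, ly = math.log2(x), math.log2(y)

product = 2.0 ** log_mul(lx, ly)     # exact up to floating-point error
approx_sum = 2.0 ** log_add(lx, ly)  # close to 8, limited by the table grid
```

A finer `LUT_STEP` shrinks the error of `log_add` at the cost of a larger table, which is the kind of accuracy-versus-hardware trade-off the abstract alludes to.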

Proceedings Article•DOI•

A Clock Synthesis Algorithm for Hierarchical Chains of Homogeneous Clover-Leaves Clock Networks for Single Flux Quantum Logic Circuits

Soheil Nazar Shahsavani¹, Ramy N. Tadros¹, Peter A. Beerel¹, Massoud Pedram¹•Institutions (1)

University of Southern California¹

01 Jul 2019

TL;DR: A novel algorithm for the physical implementation of (HC)²LC networks is presented that models the network as a directed graph with multiple cycles representing the synchronizing feedback signals, and a novel mixed integer linear programming (MILP) based approach minimizes the maximum clock skew among the sinks of the clock network.

Abstract: Single Flux Quantum (SFQ) is a promising option for high performance and low power supercomputing platforms. Nevertheless, timing uncertainty represents an obstacle to the design of high-frequency clock distribution networks. The hierarchical chains of homogeneous clover-leaves clocking, $(\mathrm{HC})^{2}\mathrm{LC}$, was proposed as an innovative solution to this challenge. This paper presents a novel algorithm for the physical implementation of $(\mathrm{HC})^{2}\mathrm{LC}$ networks. The proposed method models the $(\mathrm{HC})^{2}\mathrm{LC}$ network as a directed graph with multiple cycles representing the synchronizing feedback signals. This graph is then transformed to a directed acyclic graph (DAG) by eliminating feedback edges. The physical location of the nodes in the generated DAG (such as splitters and C-junctions) in the Manhattan plane is calculated using a zero-skew clock embedding algorithm. Additionally, a novel mixed integer linear programming (MILP) based approach simultaneously minimizes the maximum clock skew among the sinks of the clock network and the sum of the delay of the edges in feedback loops. Experimental results show that using the proposed approach, the average clock skew for five benchmark circuits is 4.6 ps.
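
The step of turning a cyclic clock graph into a DAG by eliminating feedback edges can be sketched with a depth-first search that drops any edge pointing back to a node still on the current path. In the paper the feedback edges are presumably known from the network structure; detecting them by DFS, as done here, is a substitute technique chosen for illustration, and the node names are hypothetical.

```python
def remove_feedback_edges(graph, root):
    """Return (dag, removed): edges that close a cycle in a DFS from root are dropped.

    `graph` maps every node to a list of successors. Nodes unreachable
    from `root` are left with empty successor lists in the result.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    dag = {node: [] for node in graph}
    removed = []

    def visit(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY:
                removed.append((u, v))  # back edge: treat as feedback
            else:
                dag[u].append(v)
                if color[v] == WHITE:
                    visit(v)
        color[u] = BLACK

    visit(root)
    return dag, removed


# Hypothetical clock graph: one splitter, two C-junctions, one feedback edge.
clock_graph = {
    "source": ["splitter"],
    "splitter": ["cj1", "cj2"],
    "cj1": ["sinkA"],
    "cj2": ["sinkB", "splitter"],  # cj2 -> splitter is the feedback signal
    "sinkA": [],
    "sinkB": [],
}
dag, removed = remove_feedback_edges(clock_graph, "source")
```

The resulting `dag` is what a zero-skew embedding step would then place in the plane; the removed edges would be handled separately, as the abstract describes for the feedback loops.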

Journal Article•DOI•

Yield modelling and analysis of bundled data and ring-oscillator based designs

Yang Zhang, Ji Li, Huimei Cheng, Haipeng Zha, Jeffrey Draper, Peter A. Beerel

09 May 2019-IET Computers and Digital Techniques

TL;DR: This study mathematically analyses the resulting yield subject to a limit on shipped product quality providing a practical mechanism of optimising the test margins for ring-oscillator-based clocks and bundled-data circuits.

Abstract: The ill effects of process, voltage, and temperature variations are significantly reduced by ring-oscillator (RO)-based clocks and bundled-data (BD) designs. Such designs include delay lines that enable the addition of test margin that can either be set uniformly across all manufactured chips or tuned individually per chip. This study mathematically analyses the resulting yield subject to a limit on shipped product quality, providing a practical mechanism for optimising the test margins for these circuits. The model also provides a means of quantifying the benefits from the correlation in the delay line and combinational logic. In particular, using correlation values obtained from Monte Carlo analysis of a sample circuit in a 65 nm process, the model shows that BD and RO-based circuits can have an over 50% yield advantage over their synchronous counterparts.
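
The benefit of correlation between the delay line and the combinational logic can be illustrated with a small Monte Carlo experiment. This is only a toy model under stated assumptions: delays are normalized Gaussians with a hypothetical 10% sigma, a chip passes when its margined delay line is at least as slow as its logic, and the shipped-product-quality limit and per-chip tuning analysed in the study are not modelled.

```python
import math
import random


def estimate_yield(rho, margin, sigma=0.1, trials=20000, seed=1):
    """Fraction of simulated chips whose delay line covers the logic delay.

    rho    : assumed correlation between delay-line and logic variation
    margin : multiplicative test margin applied to the delay line
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        logic = 1.0 + sigma * z1
        # Delay-line variation shares a fraction rho of the logic variation.
        shared = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        line = margin * (1.0 + sigma * shared)
        if line >= logic:
            passed += 1
    return passed / trials


uncorrelated = estimate_yield(rho=0.0, margin=1.05)
correlated = estimate_yield(rho=0.9, margin=1.05)
```

Under these assumptions the same 5% margin passes noticeably more chips when the two delays track each other, which is the qualitative effect the abstract attributes to correlation.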