Showing papers by "Peter A. Beerel" published in 2019


Journal ArticleDOI
TL;DR: An overview of the current and planned activities related to the ColdFlux project is presented and the design assumptions and decisions that were made to allow the development of design tools for million-gate circuits are justified.
Abstract: The IARPA SuperTools program requires the development of superconducting electronic design automation (S-EDA) and superconducting technology computer-aided design (S-TCAD) tools aimed at enabling the reliable design of complex superconducting digital circuits with millions of Josephson junctions. Within the SuperTools program, the ColdFlux project addresses S-EDA and S-TCAD tool research and development in four areas: 1) RTL synthesis, architectures and verification; 2) analog design and layout synthesis; 3) physical design and test; and 4) device and process modeling/simulation and cell library design. Capabilities include, but are not limited to, the following: device-level modeling and simulation of Josephson junctions, modeling and simulation of superconducting manufacturing processes, powerful new electrical circuit simulation, parameterized schematic and layout libraries, optimization, compact SPICE-like model extraction, timing analysis, behavioral, register-transfer-level and logic synthesis, clock tree synthesis, placement and routing, layout-versus-schematic extraction, functional verification, and the evaluation of designs in the presence of magnetic fields and trapped flux. ColdFlux consists of six research groups from four continents. Here, we present an overview of the current and planned activities related to the project and justify the design assumptions and decisions that were made to allow the development of design tools for million-gate circuits.

54 citations


Journal ArticleDOI
TL;DR: In this article, pre-defined sparsity is proposed to reduce complexity during both training and inference, regardless of the implementation platform, together with a hardware acceleration architecture that is compatible with pre-defined sparsity.
Abstract: Neural networks have proven to be extremely powerful tools for modern artificial intelligence applications, but computational and storage complexity remain limiting factors. This paper presents two compatible contributions towards reducing the time, energy, computational, and storage complexities associated with multilayer perceptrons. Pre-defined sparsity is proposed to reduce the complexity during both training and inference, regardless of the implementation platform. Our results show that storage and computational complexity can be reduced by factors greater than 5X without significant performance loss. The second contribution is an architecture for hardware acceleration that is compatible with pre-defined sparsity. This architecture supports both training and inference modes and is flexible in the sense that it is not tied to a specific number of neurons. For example, this flexibility implies that neural networks of various sizes can be supported on field-programmable gate arrays (FPGAs) of various sizes.
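As a rough illustration of what "pre-defined sparsity" means for one fully connected layer, the sketch below fixes a connectivity mask before training and applies it in both the forward pass and the weight update. The function names, the random choice of connections, and the fixed fan-in per neuron are assumptions made for this example; the abstract does not specify how the pattern is chosen.

```python
import numpy as np

def make_fixed_mask(n_in, n_out, density, seed=0):
    """Pick, before training, which weights of an n_out x n_in layer may be non-zero.

    Assumption: every output neuron gets the same fan-in, chosen at random.
    """
    rng = np.random.default_rng(seed)
    fan_in = max(1, round(density * n_in))
    mask = np.zeros((n_out, n_in), dtype=bool)
    for row in mask:  # each row is a view, so this edits mask in place
        row[rng.choice(n_in, size=fan_in, replace=False)] = True
    return mask

def sparse_forward(x, weights, mask):
    """Forward pass that only uses the connections allowed by the mask."""
    return (weights * mask) @ x

def masked_sgd_step(weights, grad, mask, lr=0.1):
    """Gradient step that never touches the absent connections."""
    return weights - lr * grad * mask

mask = make_fixed_mask(n_in=20, n_out=10, density=0.2)
weights = np.random.default_rng(1).normal(size=mask.shape) * mask
grad = np.ones_like(weights)          # placeholder gradient for the demo
weights = masked_sgd_step(weights, grad, mask)
stored_fraction = mask.sum() / mask.size
```

Because the mask never changes, only the entries where it is true need to be stored or computed, which is where the storage and computation savings would come from in both training and inference.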

23 citations


Proceedings ArticleDOI
01 Sep 2019
TL;DR: MIRAGE, a system-level end-to-end framework for integrated circuit obfuscation, is presented and the effectiveness of the proposed framework is illustrated on benchmarks from ISCAS and cores from an open-source system-on-chip test platform.
Abstract: Logic obfuscation techniques are used to deter intellectual property piracy, reverse engineering, and counterfeiting threats in the design and manufacturing of integrated circuits. However, obfuscation can be reverse-engineered in a variety of ways, there has been little effort to measure the effectiveness of different obfuscation techniques using uniform security and overhead metrics, and there have been only limited investigations of the effect of combining multiple methods on the same die. This paper presents MIRAGE, a system-level end-to-end framework for integrated circuit obfuscation. MIRAGE includes a front-end design space exploration tool that selects an appropriate combination of obfuscation techniques for a given circuit, a netlist manipulation application programming interface to apply the obfuscation, and a back-end analysis tool that evaluates the obfuscation strength in terms of attack resiliency time as well as area, power, and timing overhead. The effectiveness of the proposed framework is illustrated on benchmarks from ISCAS and cores from an open-source system-on-chip test platform, thus extending the analysis to practical circuits with more than 100k gates. MIRAGE utilizes 1,242 circuit configurations to evaluate 6 obfuscation methods in terms of the attack resiliency time and the obfuscation overheads such as area, power, and timing.

13 citations


Proceedings ArticleDOI
01 Sep 2019
TL;DR: This paper proposes pSConv, a pre-defined sparse 2D kernel-based convolution, which shows a parameter count reduction of up to 4.24× with modest degradation in classification accuracy relative to standard CNNs.
Abstract: The high demand for computational and storage resources severely impedes the deployment of deep convolutional neural networks (CNNs) in resource-limited devices. Recent CNN architectures have proposed reduced-complexity versions (e.g., ShuffleNet and MobileNet) but at the cost of modest decreases in accuracy. This paper proposes pSConv, a pre-defined sparse 2D kernel-based convolution, which promises significant improvements in the trade-off between complexity and accuracy for both CNN training and inference. To explore the potential of this approach, we have experimented with two widely accepted datasets, CIFAR-10 and Tiny ImageNet, in sparse variants of both the ResNet18 and VGG16 architectures. Our approach shows a parameter count reduction of up to 4.24× with modest degradation in classification accuracy relative to that of standard CNNs. Our approach outperforms a popular variant of ShuffleNet using a variant of ResNet18 with pSConv having 3 × 3 kernels with only four of nine elements not fixed at zero. In particular, the parameter count is reduced by 1.7× for CIFAR-10 and 2.29× for Tiny ImageNet with an increased accuracy of ∼4%.
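To make the "four of nine elements not fixed at zero" case concrete, here is a minimal sketch that builds fixed masks for the 3 × 3 kernels of one convolutional layer and counts the remaining parameters. The helper name and the random placement of the non-zero positions are assumptions for illustration; the abstract does not say how the positions are chosen, and the per-layer ratio computed here is not one of the network-level figures reported above.

```python
import numpy as np

def make_kernel_masks(c_out, c_in, k=3, nonzero=4, seed=0):
    """Fix, ahead of training, which positions of each k x k kernel may be non-zero.

    Assumption: positions are drawn at random, independently per kernel.
    """
    rng = np.random.default_rng(seed)
    masks = np.zeros((c_out, c_in, k * k), dtype=bool)
    for o in range(c_out):
        for i in range(c_in):
            keep = rng.choice(k * k, size=nonzero, replace=False)
            masks[o, i, keep] = True
    return masks.reshape(c_out, c_in, k, k)

masks = make_kernel_masks(c_out=8, c_in=4)
dense_params = masks.size             # 8 * 4 * 9 weights in a standard layer
sparse_params = int(masks.sum())      # 8 * 4 * 4 weights that can be non-zero
layer_reduction = dense_params / sparse_params
```

For a single layer of this kind the reduction is simply 9/4; the overall reduction for a whole network depends on which layers are made sparse and on the parameters outside the convolutions.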

13 citations


Proceedings ArticleDOI
15 Jul 2019
TL;DR: CSrram is presented, an efficient ex-situ training framework for hybrid CMOS-memristive neuromorphic circuits that includes a pre-defined block diagonal clustered (BDC) sparsity algorithm to significantly reduce area and power consumption.
Abstract: Artificial Neural Networks (ANNs) play a key role in many machine learning (ML) applications but pose arduous challenges in terms of storage and computation of network parameters. Memristive crossbar arrays (MCAs) are capable of both computation and storage, making them promising for in-memory computing enabled neural network accelerators. At the same time, the presence of a significant number of zero weights in ANNs has motivated research in a variety of parameter reduction techniques. However, for crossbar-based architectures, the study of efficient methods to take advantage of network sparsity is still at an early stage. This paper presents CSrram, an efficient ex-situ training framework for hybrid CMOS-memristive neuromorphic circuits. CSrram includes a pre-defined block diagonal clustered (BDC) sparsity algorithm to significantly reduce area and power consumption. The proposed framework is verified on a wide range of datasets including MNIST handwritten recognition, fashion MNIST, breast cancer prediction (BCW), IRIS, and mobile health monitoring. Compared to state-of-the-art fully connected memristive neuromorphic circuits, our CSrram, with only 25% density of weights in the first junction, provides a power and area efficiency of 1.5x and 2.6x (averaged over five datasets), respectively, without any significant test accuracy loss.
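A block diagonal connectivity pattern can be pictured with a small mask, sketched below under simplifying assumptions: equal-sized blocks and layer dimensions divisible by the number of blocks. The function name is hypothetical, and the clustering step that decides which neurons share a block is not modeled here.

```python
import numpy as np

def block_diagonal_mask(n_out, n_in, n_blocks):
    """Boolean mask whose allowed weights form n_blocks equal blocks on the diagonal."""
    if n_out % n_blocks or n_in % n_blocks:
        raise ValueError("layer dimensions must be divisible by n_blocks")
    rows, cols = n_out // n_blocks, n_in // n_blocks
    mask = np.zeros((n_out, n_in), dtype=bool)
    for b in range(n_blocks):
        mask[b * rows:(b + 1) * rows, b * cols:(b + 1) * cols] = True
    return mask

mask = block_diagonal_mask(n_out=8, n_in=16, n_blocks=4)
density = mask.mean()   # fraction of weights that are allowed to be non-zero
```

With four equal blocks the density is 1/4, which corresponds to the 25% figure quoted for the first junction; each block could then be mapped to its own small crossbar instead of one large, mostly empty array.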

7 citations


Posted Content
TL;DR: This article proposes pSConv, a pre-defined sparse 2D kernel-based convolution, which improves the trade-off between complexity and accuracy for both CNN training and inference.
Abstract: The high demand for computational and storage resources severely impedes the deployment of deep convolutional neural networks (CNNs) in limited-resource devices. Recent CNN architectures have proposed reduced-complexity versions (e.g., ShuffleNet and MobileNet) but at the cost of modest decreases in accuracy. This paper proposes pSConv, a pre-defined sparse 2D kernel-based convolution, which promises significant improvements in the trade-off between complexity and accuracy for both CNN training and inference. To explore the potential of this approach, we have experimented with two widely accepted datasets, CIFAR-10 and Tiny ImageNet, in sparse variants of both the ResNet18 and VGG16 architectures. Our approach shows a parameter count reduction of up to 4.24x with modest degradation in classification accuracy relative to that of standard CNNs. Our approach outperforms a popular variant of ShuffleNet using a variant of ResNet18 with pSConv having 3x3 kernels with only four of nine elements not fixed at zero. In particular, the parameter count is reduced by 1.7x for CIFAR-10 and 2.29x for Tiny ImageNet with an increased accuracy of ~4%.

6 citations


Journal ArticleDOI
TL;DR: This paper proposes two alternatives to reduce the overhead in two-phase latch-based resilient circuits: a new resiliency-aware graph-based approach to solve the retiming problem, and a virtual resynthesis library that enables commercial synthesis tools to recognize the EDL overhead and optimize total area during retiming.
Abstract: Timing resilient design has shown significant promise in mitigating the excess margins associated with rare worst-case data and increased process, voltage, and temperature variations. However, resilient circuits need error detecting sequential logic (EDL) to detect timing errors, which incurs area and power overhead. This paper proposes two alternatives to reduce the overhead in two-phase latch-based resilient circuits. The first is a new resiliency-aware graph-based approach to solve the retiming problem. The second uses a virtual resynthesis library to enable commercial synthesis tools to recognize the EDL overhead and optimize total area during retiming. We compare both approaches to a standard commercial retiming approach, which ignores the resiliency overheads, on a wide variety of benchmarks. Our experimental results show that our methods are computationally efficient and reduce the total circuit area by an average of up to 10%–15% when compared to traditional retiming.

4 citations


Proceedings ArticleDOI
01 Jul 2019
TL;DR: This paper proposes to increase the throughput of SFQ pipelines by redesigning the datapath to accept and operate on least-significant bits (LSBs) clock cycles earlier than more significant bits, and develops a block-skewed MIPS-compatible 32-bit ALU.
Abstract: Single flux quantum (SFQ) circuits are an attractive beyond-CMOS technology because they promise two orders of magnitude lower power at clock frequencies exceeding 25 GHz. However, every SFQ gate is clocked, creating very deep gate-level pipelines that are difficult to keep full, particularly for sequences that include data-dependent operations. This paper proposes to increase the throughput of SFQ pipelines by redesigning the datapath to accept and operate on least-significant bits (LSBs) clock cycles earlier than more significant bits. This skewed datapath approach reduces the latency of the LSB side, which can be fed back earlier for use in subsequent data-dependent operations, increasing their throughput. In particular, we propose to group the bits into 4-bit blocks that are operated on concurrently and create block-skewed datapath units for 32-bit operation. This skewed approach allows a subsequent data-dependent operation to start evaluating as soon as the first 4-bit block completes. Using this general approach, we develop a block-skewed MIPS-compatible 32-bit ALU. Our gate-level Verilog design improves the throughput of 32-bit data-dependent operations by 2x and 1.5x compared to previously proposed 4-bit bit-slice and 32-bit Ladner-Fischer ALUs, respectively. We have quantified the benefit of this design on instructions per cycle (IPC) for various RISC-V benchmarks assuming a range of non-ALU operation latencies from one to ten cycles. Averaging across benchmarks, our experimental results show that, compared to the 32-bit Ladner-Fischer ALU, our proposed architecture provides IPC improvements ranging from 1.37x, assuming one-cycle non-ALU latency, to 1.2x, assuming ten-cycle non-ALU latency. Moreover, our average IPC improvements compared to a 32-bit ALU based on the 4-bit bit-slice range from 2.93x to 4x.
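The block-skewed idea can be illustrated with a purely functional model of a 32-bit addition carried out as eight 4-bit blocks, least significant block first. This is only a behavioral sketch with hypothetical function names; it models neither SFQ gates nor clock timing, and it is not the ALU design described in the abstract.

```python
def blockwise_add(a, b, width=32, block=4):
    """Yield (block_index, sum_block) pairs from least to most significant block.

    Each step consumes one 4-bit block of both operands plus the carry from the
    previous block, so low-order result bits are available before high-order ones.
    """
    block_mask = (1 << block) - 1
    carry = 0
    for i in range(width // block):
        shift = i * block
        total = ((a >> shift) & block_mask) + ((b >> shift) & block_mask) + carry
        carry = total >> block           # carry into the next, more significant block
        yield i, total & block_mask      # carry out of the top block is dropped

def assemble(blocks, block=4):
    """Recombine the streamed blocks into one integer."""
    result = 0
    for i, value in blocks:
        result |= value << (i * block)
    return result

first_block = next(blockwise_add(0x0000000F, 0x00000001))
full_sum = assemble(blockwise_add(0x12345678, 0x0FEDCBA8))
```

A dependent operation that only needs the low block of `full_sum` could start as soon as `first_block` is produced, which is the effect the skewed datapath is meant to exploit.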

4 citations


Posted Content
TL;DR: This paper shows that naive SFQ clock domain crossing (CDC) first-in-first-out buffers (FIFOs) are vulnerable to delay increases, motivating the need for more robust CDC FIFOs, and proposes a novel 1-bit metastability-resilient SFQ CDC FIFO that delivers over a 1000-fold reduction in logical error rate at 30 GHz.
Abstract: Digital single-flux quantum (SFQ) technology promises to meet the demands of ultra-low-power and high-speed computing needed for future exascale supercomputing systems. The combination of ultra-high clock frequencies, gate-level pipelines, and numerous sources of variability in SFQ circuits, however, makes low-skew global clock distribution a challenge. This motivates the support of multiple independent clock domains and related clock domain crossing circuits that enable reliable communication across domains. Existing J-SIM simulation models indicate that setup violations can cause clock-to-Q increases of up to 100%. This paper first shows that naive SFQ clock domain crossing (CDC) first-in-first-out buffers (FIFOs) are vulnerable to these delay increases, motivating the need for more robust CDC FIFOs. Inspired by CMOS multi-flip-flop asynchronous FIFO synchronizers, we then propose a novel 1-bit metastability-resilient SFQ CDC FIFO that simulations show delivers over a 1000-fold reduction in logical error rate at 30 GHz. Moreover, for a 10-stage FIFO, the Josephson junction (JJ) area of our proposed design is only 7.5% larger than the non-resilient counterpart. Finally, we propose design guidelines that define the minimal FIFO depth subject to both throughput and burstiness constraints.

3 citations


Proceedings ArticleDOI
01 Jul 2019
TL;DR: This paper shows that naive SFQ clock domain crossing (CDC) first-in-first-out buffers (FIFOs) are vulnerable to delay increases, motivating the need for more robust CDC FIFOs, and proposes a novel 1-bit metastability-resilient SFQ CDC FIFO that delivers over a 1000-fold reduction in logical error rate at 30 GHz.
Abstract: Digital single-flux quantum (SFQ) technology promises to meet the demands of ultra-low-power and high-speed computing needed for future exascale supercomputing systems. The combination of ultra-high clock frequencies, gate-level pipelines, and numerous sources of variability in SFQ circuits, however, makes low-skew global clock distribution a challenge. This motivates the support of multiple independent clock domains and related clock domain crossing circuits that enable reliable communication across domains. Existing J-SIM simulation models indicate that setup violations can cause clock-to-Q increases of up to 100%. This paper first shows that naive SFQ clock domain crossing (CDC) first-in-first-out buffers (FIFOs) are vulnerable to these delay increases, motivating the need for more robust CDC FIFOs. Inspired by CMOS multi-flip-flop asynchronous FIFO synchronizers, we then propose a novel 1-bit metastability-resilient SFQ CDC FIFO that simulations show delivers over a 1000-fold reduction in logical error rate at 30 GHz. Moreover, for a 10-stage FIFO, the Josephson junction (JJ) area of our proposed design is only 7.5% larger than the non-resilient counterpart. Finally, we propose design guidelines that define the minimal FIFO depth subject to both throughput and burstiness constraints.

3 citations


Posted Content
TL;DR: A novel automated design flow is presented that converts flip-flop-based designs to 3-phase latch-based designs; the resulting circuits have the same performance as master-slave based designs but require significantly fewer latches.
Abstract: Latch-based designs have many benefits over their flip-flop (FF) based counterparts but have limited use, partially because most RTL specifications are flop-centric and automatic conversion of FF-based to latch-based designs is challenging. Conventional conversion algorithms target master-slave latch-based designs with two non-overlapping clocks. This paper presents a novel automated design flow that converts flip-flop-based designs to 3-phase latch-based designs. The resulting circuits have the same performance as the master-slave based designs but require significantly fewer latches. Our experimental results demonstrate the potential for savings in the number of latches (21.3%), area (5.8%), and power (16.3%) on a variety of ISCAS, CEP, and CPU benchmark circuits, compared to the master-slave conversions.

Posted Content
TL;DR: This paper proposes an end-to-end training and inference scheme that eliminates multiplications by using approximate operations in the log domain, which has the potential to significantly reduce implementation complexity.
Abstract: The high computational complexity associated with training deep neural networks limits online and real-time training on edge devices. This paper proposes an end-to-end training and inference scheme that eliminates multiplications by using approximate operations in the log domain, which has the potential to significantly reduce implementation complexity. We implement the entire training procedure in the log domain, with fixed-point data representations. This training procedure is inspired by hardware-friendly approximations of log-domain addition which are based on look-up tables and bit-shifts. We show that our 16-bit log-based training can achieve classification accuracy within approximately 1% of the equivalent floating-point baselines for a number of commonly used datasets.
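The reason multiplications disappear in the log domain, and why addition then needs an approximation, can be sketched as follows. The code stores positive values as their base-2 logarithms, multiplies by adding, and adds using the identity log2(2^a + 2^b) = max(a, b) + log2(1 + 2^-|a-b|) with the correction term read from a small look-up table. The table resolution `STEP`, the table size, the use of floats instead of fixed-point, and the omission of sign handling are all assumptions of this example rather than details from the abstract.

```python
import math

STEP = 1 / 8   # hypothetical LUT resolution, in units of log2
# Correction term log2(1 + 2^-d) sampled at d = 0, STEP, 2*STEP, ...
LUT = [math.log2(1 + 2 ** (-k * STEP)) for k in range(128)]

def log_mul(la, lb):
    """Product of two positive values given as base-2 logs: just an addition."""
    return la + lb

def log_add(la, lb):
    """Approximate log2 of the sum of two positive values given as base-2 logs."""
    hi, lo = max(la, lb), min(la, lb)
    k = round((hi - lo) / STEP)               # quantize the gap to a table index
    correction = LUT[k] if k < len(LUT) else 0.0  # large gap: smaller term is negligible
    return hi + correction

product = 2 ** log_mul(math.log2(3), math.log2(5))   # close to 15
approx_sum = 2 ** log_add(math.log2(3), math.log2(5))  # close to 8, within LUT error
```

The product is exact up to floating-point rounding, while the sum carries a small error set by the table resolution; a finer `STEP` trades table size for accuracy.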

Proceedings ArticleDOI
01 Jul 2019
TL;DR: A novel algorithm is presented for the physical implementation of $(\mathrm{HC})^{2}\mathrm{LC}$ networks, which models the network as a directed graph with multiple cycles representing the synchronizing feedback signals, and a novel mixed integer linear programming (MILP) based approach minimizes the maximum clock skew among the sinks of the clock network.
Abstract: Single Flux Quantum (SFQ) is a promising option for high-performance and low-power supercomputing platforms. Nevertheless, timing uncertainty represents an obstacle to the design of high-frequency clock distribution networks. The hierarchical chains of homogeneous clover-leaves clocking, $(\mathrm{HC})^{2}\mathrm{LC}$, was proposed as an innovative solution to this challenge. This paper presents a novel algorithm for the physical implementation of $(\mathrm{HC})^{2}\mathrm{LC}$ networks. The proposed method models the $(\mathrm{HC})^{2}\mathrm{LC}$ network as a directed graph with multiple cycles representing the synchronizing feedback signals. This graph is then transformed to a directed acyclic graph (DAG) by eliminating feedback edges. The physical location of the nodes in the generated DAG (such as splitters and C-junctions) in the Manhattan plane is calculated using a zero-skew clock embedding algorithm. Additionally, a novel mixed integer linear programming (MILP) based approach simultaneously minimizes the maximum clock skew among the sinks of the clock network and the sum of the delays of the edges in feedback loops. Experimental results show that, using the proposed approach, the average clock skew for five benchmark circuits is 4.6 ps.
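The graph step, turning a cyclic network into a DAG by dropping feedback edges, can be sketched with a depth-first search that removes every edge pointing back to a node still on the current search path. This is a generic substitute technique chosen for illustration: in the clock network itself the feedback edges are presumably known from the construction, and the node names below are hypothetical.

```python
def remove_back_edges(graph):
    """Split a directed graph into a DAG and the list of removed back edges.

    graph maps each node to a list of successors; every node must appear as a key.
    """
    WHITE, GRAY, BLACK = 0, 1, 2           # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    dag = {node: [] for node in graph}
    removed = []

    def visit(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY:           # edge closes a cycle: treat as feedback
                removed.append((u, v))
            else:
                dag[u].append(v)
                if color[v] == WHITE:
                    visit(v)
        color[u] = BLACK

    for node in graph:
        if color[node] == WHITE:
            visit(node)
    return dag, removed

# Toy network: a source drives a splitter, two branches merge, and the merge
# point sends a synchronizing signal back to the splitter.
network = {
    "source": ["splitter"],
    "splitter": ["branch_a", "branch_b"],
    "branch_a": ["merge"],
    "branch_b": ["merge"],
    "merge": ["splitter"],
}
dag, feedback = remove_back_edges(network)
```

Once the feedback edges are set aside, the remaining DAG has a well-defined source-to-sink ordering, which is what a subsequent embedding step would need.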

Journal ArticleDOI
TL;DR: This study mathematically analyses the resulting yield subject to a limit on shipped product quality, providing a practical mechanism for optimising the test margins for ring-oscillator-based clocks and bundled-data circuits.
Abstract: The ill effects of process, voltage, and temperature variations are significantly reduced by ring-oscillator (RO)-based clocks and bundled-data (BD) designs. Such designs include delay lines that enable the addition of test margin that can either be set uniformly across all manufactured chips or tuned individually per chip. This study mathematically analyses the resulting yield subject to a limit on shipped product quality, providing a practical mechanism for optimising the test margins for these circuits. The model also provides a means of quantifying the benefits from the correlation between the delay line and the combinational logic. In particular, using correlation values obtained from Monte Carlo analysis of a sample circuit in a 65 nm process, the model shows that BD and RO-based circuits can have a yield advantage of over 50% over their synchronous counterparts.