
Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2019"


Journal ArticleDOI
TL;DR: In this article, the authors propose a methodology to map the desired quantum functionality to a realization which satisfies all constraints given by the architecture and, at the same time, keeps the overhead in terms of additionally required quantum gates minimal.
Abstract: In the past years, quantum computers have increasingly evolved from an academic idea to an upcoming reality. IBM’s project IBM Q can be seen as evidence of this progress. Launched in March 2017 with the goal of providing access to quantum computers for a broad audience, it has allowed users to conduct quantum experiments on a 5-qubit and, since June 2017, also on a 16-qubit quantum computer (called IBM QX2 and IBM QX3, respectively). Revised versions of these 5- and 16-qubit quantum computers (named IBM QX4 and IBM QX5, respectively) have been available since September 2017. In order to use these, the desired quantum functionality (e.g., provided in terms of a quantum circuit) has to be properly mapped so that the underlying physical constraints are satisfied—a complex task. This demands solutions to automatically and efficiently conduct this mapping process. In this paper, we propose a methodology which addresses this problem, i.e., maps the given quantum functionality to a realization which satisfies all constraints given by the architecture and, at the same time, keeps the overhead in terms of additionally required quantum gates minimal. The proposed methodology is generic, can easily be configured for similar future architectures, and is fully integrated into IBM’s SDK. Experimental evaluations show that the proposed approach clearly outperforms IBM’s own mapping solution. In fact, for many quantum circuits, the proposed approach determines a mapping to the IBM architecture within minutes, while IBM’s solution suffers from long runtimes and runs into a timeout of 1 h in several cases. As an additional benefit, the proposed approach yields mapped circuits with smaller costs (i.e., fewer additional gates are required). All implementations of the proposed methodology are publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping .
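To make the mapping constraint concrete, here is a minimal, illustrative sketch (not the authors' algorithm, which minimizes gate overhead far more carefully): logical CNOTs are made to respect a device coupling map by inserting SWAPs along a shortest path. The linear five-qubit coupling map and the gate list are assumptions for illustration only.

```python
from collections import deque

# Linear 5-qubit coupling (an assumption for illustration, not a specific IBM chip).
COUPLING = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def shortest_path(src, dst):
    """BFS over the coupling graph."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        q = queue.popleft()
        if q == dst:
            break
        for nb in COUPLING[q]:
            if nb not in prev:
                prev[nb] = q
                queue.append(nb)
    path = [dst]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def map_circuit(cnots):
    """cnots: list of (control, target) logical qubits; identity initial layout."""
    layout = {q: q for q in COUPLING}            # logical -> physical
    mapped = []
    for c, t in cnots:
        path = shortest_path(layout[c], layout[t])
        for a, b in zip(path[:-2], path[1:-1]):  # walk the control toward the target
            mapped.append(("SWAP", a, b))
            inv = {p: l for l, p in layout.items()}
            layout[inv[a]], layout[inv[b]] = b, a
        mapped.append(("CX", layout[c], layout[t]))
    return mapped

print(map_circuit([(0, 4), (1, 3)]))
```

Each SWAP costs extra gates on the real device, which is why the paper's heuristics for choosing the initial layout and the SWAP sequence matter so much.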

240 citations


Journal ArticleDOI
TL;DR: This paper designs and implements Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN and FCN on FPGAs, and integrates it into the industry-standard software deep learning framework Caffe.
Abstract: With the recent advancement of multilayer convolutional neural networks (CNNs) and fully connected networks (FCNs), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of the computation-demanding CNN, the FPGA-based acceleration emerges as one of the most attractive alternatives. In this paper, we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN and FCN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both computation-bound convolutional layers and communication-bound FCN layers. Based on this representation, we optimize the accelerator micro-architecture and maximize the underlying FPGA computing and bandwidth resource utilization based on a revised roofline model. Moreover, we design an automation flow to directly compile high-level network definitions to the final FPGA accelerator. As a case study, we integrate Caffeine into the industry-standard software deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 1460 giga fixed point operations per second on a medium-sized Xilinx KU060 FPGA board; to our knowledge, this is the best published result. It achieves more than 100× speed-up on FCN layers over prior FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 29× and 150× performance and energy gains over Caffe on a 12-core Xeon server, and 5.7× better energy efficiency over the GPU implementation. Performance projections for a system with a high-end FPGA (Virtex7 690t) show even higher gains.
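As a plain-software illustration of the "uniformed" representation the abstract refers to (this is the standard im2col trick, not Caffeine's HLS code; the shapes and random data below are assumptions), both a convolutional layer and an FC layer reduce to a single matrix multiplication:

```python
import numpy as np

def im2col(x, k, stride=1):
    """x: (C, H, W) input; returns (C*k*k, out_h*out_w) patch matrix."""
    C, H, W = x.shape
    out_h, out_w = (H - k) // stride + 1, (W - k) // stride + 1
    cols = np.empty((C * k * k, out_h * out_w))
    idx = 0
    for i in range(0, H - k + 1, stride):
        for j in range(0, W - k + 1, stride):
            cols[:, idx] = x[:, i:i + k, j:j + k].ravel()
            idx += 1
    return cols, out_h, out_w

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))        # one input feature map
w = rng.standard_normal((16, 3, 3, 3))    # 16 output channels, 3x3 kernels

cols, oh, ow = im2col(x, k=3)
conv_as_gemm = (w.reshape(16, -1) @ cols).reshape(16, oh, ow)                        # conv layer
fc_as_gemm = rng.standard_normal((10, 16 * oh * ow)) @ conv_as_gemm.reshape(-1, 1)   # FC layer
print(conv_as_gemm.shape, fc_as_gemm.shape)
```

Casting both layer types as one matrix-multiplication kernel is what lets a single accelerator micro-architecture serve the whole network.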

206 citations


Journal ArticleDOI
TL;DR: A circuit block is presented to enhance the security of existing logic locking techniques against the SAT attack and it is shown using a mathematical proof that the number of SAT attack iterations to reveal the correct key in a circuit comprising an Anti-SAT block is an exponential function of the key-size, thereby making the SAT attack computationally infeasible.
Abstract: Logic locking is a technique that is proposed to protect outsourced IC designs from piracy and counterfeiting by untrusted foundries. A locked IC preserves the correct functionality only when a correct key is provided. Recently, the security of logic locking is threatened by a new attack called SAT attack, which can decipher the correct key of most logic locking techniques within a few hours even for a reasonably large key-size. This attack iteratively solves SAT formulas which progressively eliminate the incorrect keys till the circuit is unlocked. In this paper, we present a circuit block (referred to as Anti-SAT block) to enhance the security of existing logic locking techniques against the SAT attack. We show using a mathematical proof that the number of SAT attack iterations to reveal the correct key in a circuit comprising an Anti-SAT block is an exponential function of the key-size thereby making the SAT attack computationally infeasible. Besides, we address the vulnerability of the Anti-SAT block to various removal attacks and investigate obfuscation techniques to prevent these removal attacks. More importantly, we provide a proof showing that these obfuscation techniques for making Anti-SAT un-removable would not weaken the Anti-SAT block’s resistance to SAT attack. Through our experiments, we illustrate the effectiveness of our approach to securing modern chips fabricated in untrusted foundries.
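A toy model of an Anti-SAT-style block (simplified from the paper's construction; the 4-bit width, the choice of g as an AND, and the example keys are assumptions) shows the property the proof builds on: a correct key never corrupts the protected wire, while a wrong key corrupts it on only a tiny fraction of inputs, which is what drags the SAT attack through many iterations.

```python
from itertools import product

N = 4  # illustrative block width

def g(bits):                    # g chosen here as an n-input AND
    return int(all(bits))

def anti_sat(x, k1, k2):
    a = [xi ^ ki for xi, ki in zip(x, k1)]
    b = [xi ^ ki for xi, ki in zip(x, k2)]
    return g(a) & (1 - g(b))    # this output is XORed onto a protected internal wire

correct = ([1, 0, 1, 1], [1, 0, 1, 1])   # matching key halves -> output always 0
wrong   = ([1, 0, 1, 1], [0, 0, 1, 1])

for name, (k1, k2) in (("correct", correct), ("wrong", wrong)):
    corrupting = sum(anti_sat(x, k1, k2) for x in product((0, 1), repeat=N))
    print(name, "key corrupts", corrupting, "of", 2 ** N, "input patterns")
```

Because each wrong key is ruled out by so few input patterns, every SAT iteration eliminates only a small slice of the key space.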

152 citations


Journal ArticleDOI
TL;DR: GraphH, a PIM architecture for graph processing on the hybrid memory cube array, is proposed to tackle all four problems mentioned above, including random access pattern causing local bandwidth degradation, poor locality leading to unpredictable global data access, heavy conflicts on updating the same vertex, and unbalanced workloads across processing units.
Abstract: Large-scale graph processing requires the high bandwidth of data access. However, as graph computing continues to scale, it becomes increasingly challenging to achieve a high bandwidth on generic computing architectures. The primary reasons include: the random access pattern causing local bandwidth degradation, the poor locality leading to unpredictable global data access, heavy conflicts on updating the same vertex, and unbalanced workloads across processing units. Processing-in-memory (PIM) has been explored as a promising solution to providing high bandwidth, yet open questions of graph processing on PIM devices remain in: 1) how to design hardware specializations and the interconnection scheme to fully utilize bandwidth of PIM devices and ensure locality and 2) how to allocate data and schedule processing flow to avoid conflicts and balance workloads. In this paper, we propose GraphH, a PIM architecture for graph processing on the hybrid memory cube array, to tackle all four problems mentioned above. From the architecture perspective, we integrate SRAM-based on-chip vertex buffers to eliminate local bandwidth degradation. We also introduce reconfigurable double-mesh connection to provide high global bandwidth. From the algorithm perspective, partitioning and scheduling methods like index mapping interval-block and round interval pair are introduced to GraphH, thus workloads are balanced and conflicts are avoided. Two optimization methods are further introduced to reduce synchronization overhead and reuse on-chip data. The experimental results on graphs with billions of edges demonstrate that GraphH outperforms DDR-based graph processing systems by up to two orders of magnitude and achieves a 5.12× speedup over the previous PIM design.

135 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel static task scheduling algorithm to simultaneously maximize SER and LTR for real-time homogeneous MPSoC systems under the constraints of deadline, energy budget, and task precedence and develops a new solution representation scheme and two evolutionary operators that are closely integrated with two popular multiobjective evolutionary optimization frameworks.
Abstract: Multiprocessor system-on-chip (MPSoC) has been widely used in many real-time embedded systems where both soft-error reliability (SER) and lifetime reliability (LTR) are key concerns. Many existing works have investigated them, but they focus either on handling one of the two reliability concerns or on improving one type of reliability under the constraint of the other. These techniques are thus not applicable to maximize SER and LTR simultaneously, which is highly desired in some real-world applications. In this paper, we study the joint optimization of SER and LTR for real-time MPSoCs. We propose a novel static task scheduling algorithm to simultaneously maximize SER and LTR for real-time homogeneous MPSoC systems under the constraints of deadline, energy budget, and task precedence. Specifically, we develop a new solution representation scheme and two evolutionary operators that are closely integrated with two popular multiobjective evolutionary optimization frameworks, namely NSGAII and SPEA2. Extensive experimental results on standard benchmarks and synthetic applications show the efficacy of our scheme. More specifically, our scheme can achieve significantly better solutions (i.e., LTR-SER tradeoff fronts) with remarkably higher hypervolume and can be dozens or even hundreds of times faster than the state-of-the-art algorithms. The results also demonstrate that our scheme can be applied to heterogeneous MPSoC systems and is effective in improving reliability for heterogeneous MPSoC systems.

121 citations


Journal ArticleDOI
TL;DR: A novel circuit for implementing a synapse based on a memristor and two MOSFET transistors and a fuzzy method for the adjustment of the learning rates of MNNs is developed, which increases the learning accuracy by 2%–3% compared with a constant learning rate.
Abstract: Back propagation (BP) based on stochastic gradient descent is the prevailing method to train multilayer neural networks (MNNs) with hidden layers. However, the physical separation between memory arrays and the arithmetic module makes it inefficient and ineffective to implement BP in conventional digital hardware. Although CMOS may alleviate some problems of the hardware implementation of MNNs, synapses based on CMOS cost too much power and area in very large scale integrated circuits. As a novel device, the memristor shows promise to overcome this shortcoming due to its ability to closely integrate processing and memory. This paper proposes a novel circuit for implementing a synapse based on a memristor and two MOSFET transistors (p-type and n-type). Compared with a CMOS-only circuit, the proposed one reduces the area consumption by 92%–98%. In addition, we develop a fuzzy method for the adjustment of the learning rates of MNNs, which increases the learning accuracy by 2%–3% compared with a constant learning rate. Meanwhile, the fuzzy adjustment method is robust and insensitive to parameter changes due to its approximate reasoning. Furthermore, the proposed methods can be extended to memristor-based multilayer convolutional neural networks for complex tasks. The resulting architecture mimics a human-like thinking process.
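The fuzzy learning-rate idea can be sketched as follows; this is only a generic Mamdani-style controller with assumed membership functions and rules, not the paper's rule base. The learning rate is scaled from the current training error E and its change ΔE instead of being held constant.

```python
def tri(x, a, b, c):
    """Triangular membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_lr(error, d_error, base_lr=0.05):
    # membership degrees (breakpoints are illustrative assumptions)
    e_small, e_large = tri(error, -0.1, 0.0, 0.5), tri(error, 0.2, 1.0, 2.0)
    improving, worsening = tri(d_error, -1.0, -0.3, 0.0), tri(d_error, 0.0, 0.3, 1.0)
    # rule firing strengths (min as AND) and their learning-rate scale factors
    rules = [
        (min(e_large, improving), 2.0),   # far from target, improving  -> larger steps
        (min(e_large, worsening), 1.0),   # far from target, diverging  -> keep base rate
        (min(e_small, improving), 1.0),   # converging                  -> keep base rate
        (min(e_small, worsening), 0.5),   # oscillating near optimum    -> smaller steps
    ]
    num = sum(strength * scale for strength, scale in rules)
    den = sum(strength for strength, _ in rules) or 1.0   # weighted-average defuzzification
    return base_lr * num / den

print(fuzzy_lr(error=0.8, d_error=-0.5), fuzzy_lr(error=0.05, d_error=0.2))
```

The approximate-reasoning step is why such a controller tolerates parameter changes better than a hand-tuned constant rate.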

109 citations


Journal ArticleDOI
TL;DR: RePlAce is the first work to achieve superior solution quality across all the ISPD-2005, ISPD-2006, MMS, DAC-2012, and ICCAD-2012 benchmark suites with a single global placement engine.
Abstract: The Nesterov’s method approach to analytic placement has recently demonstrated strong solution quality and scalability. We dissect the previous implementation strategy and show that solution quality can be significantly improved using two levers: 1) constraint-oriented local smoothing and 2) dynamic step size adaptation. We propose a new density function that comprehends local overflow of area resources; this enables a constraint-oriented local smoothing at per-bin granularity. Our improved dynamic step size adaptation automatically determines step size and effectively allocates optimization effort to significantly improve solution quality without undue runtime impact. Our resulting global placement tool, RePlAce, achieves an average of 2.00% half-perimeter wirelength (HPWL) reduction over all best known ISPD-2005 and ISPD-2006 benchmark results, and an average of 2.73% over all best known modern mixed-size (MMS) benchmark results, without any benchmark-specific code or tuning. We further extend our global placer to address routability, and achieve on average 8.50%–9.59% scaled HPWL reduction over previous leading academic placers for the DAC-2012 and ICCAD-2012 benchmark suites. To our knowledge, RePlAce is the first work to achieve superior solution quality across all the ISPD-2005, ISPD-2006, MMS, DAC-2012, and ICCAD-2012 benchmark suites with a single global placement engine.

100 citations


Journal ArticleDOI
TL;DR: In this article, the authors revisit the basics of quantum computation, investigate how corresponding quantum states and quantum operations can be represented even more compactly, and, eventually, simulated in a more efficient fashion.
Abstract: Quantum computation is a promising emerging technology which, compared to conventional computation, allows for substantial speed-ups, e.g., for integer factorization or database search. However, since physical realizations of quantum computers are in their infancy, a significant amount of research in this domain still relies on simulations of quantum computations on conventional machines. This causes a significant complexity which current state-of-the-art simulators try to tackle with a rather straightforward array-based representation and by applying massive hardware power. There also exist solutions based on decision diagrams (i.e., graph-based approaches) that try to tackle the exponential complexity by exploiting redundancies in quantum states and operations. However, these existing approaches do not fully exploit redundancies that are actually present. In this paper, we revisit the basics of quantum computation, investigate how corresponding quantum states and quantum operations can be represented even more compactly, and, eventually, simulated in a more efficient fashion. This leads to a new graph-based simulation approach which outperforms state-of-the-art simulators (array-based as well as graph-based). Experimental evaluations show that the proposed solution is capable of simulating quantum computations for more qubits than before, and in significantly less run-time (several orders of magnitude faster than previously proposed simulators). An implementation of the proposed simulator is publicly available online at http://iic.jku.at/eda/research/quantum_simulation .
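For contrast with the decision-diagram approach, the straightforward array-based simulation the paper uses as its baseline can be sketched in a few lines (the gate set and the 3-qubit example are illustrative; the proposed simulator exploits structure that this flat representation cannot):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])

def apply_1q(state, gate, target, n):
    """Apply a single-qubit gate to qubit `target` of an n-qubit state vector."""
    psi = state.reshape([2] * n)
    psi = np.tensordot(gate, psi, axes=([1], [target]))   # contract the target axis
    psi = np.moveaxis(psi, 0, target)                     # put that axis back in place
    return psi.reshape(-1)

n = 3
state = np.zeros(2 ** n, dtype=complex)
state[0] = 1.0                      # |000>
state = apply_1q(state, H, 0, n)    # superposition on qubit 0
state = apply_1q(state, X, 2, n)
print(np.round(state, 3))           # amplitudes of the 8 basis states
```

The vector has 2^n entries, so memory and run-time explode with the qubit count; the redundancy-sharing graph representation is what lets the proposed simulator go further.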

100 citations


Journal ArticleDOI
TL;DR: HLS is currently a viable option for fast prototyping and for designs with short time to market; to help close the QoR gap, the paper also presents a survey of literature focused on improving HLS.
Abstract: To increase productivity in designing digital hardware components, high-level synthesis (HLS) is seen as the next step in raising the design abstraction level. However, the quality of results (QoRs) of HLS tools has tended to be behind those of manual register-transfer level (RTL) flows. In this paper, we survey the scientific literature published since 2010 about the QoR and productivity differences between the HLS and RTL design flows. Altogether, our survey spans 46 papers and 118 associated applications. Our results show that on average, the QoR of RTL flow is still better than that of the state-of-the-art HLS tools. However, the average development time with HLS tools is only a third of that of the RTL flow, and a designer obtains over four times higher productivity with HLS. Based on our findings, we also present a model case study to sum up the best practices in comparative studies between HLS and RTL. The outcome of our case study is also in line with the survey results, as using an HLS tool is seen to increase the productivity by a factor of six. In addition, to help close the QoR gap, we present a survey of literature focused on improving HLS. Our results let us conclude that HLS is currently a viable option for fast prototyping and for designs with short time to market.

99 citations


Journal ArticleDOI
TL;DR: This paper explores mobility-aware network lifetime maximization for battery-powered IoT applications that perform approximate real-time computation under the quality-of-service (QoS) constraint and develops a performance-guaranteed and time-efficient QoS-adaptive heuristic based on cross-entropy method.
Abstract: In recent years, the Internet of Things (IoT) has promoted many battery-powered emerging applications, such as smart home, environmental monitoring, and human healthcare monitoring, where energy management is of particular importance. Meanwhile, there is an accelerated tendency toward mobility of IoT devices, either being transported by humans or being mobile by themselves. Existing energy management mechanisms for battery-powered IoT fail to consider the two significant characteristics of IoT: 1) the approximate real-time computation and 2) the mobility of IoT devices, resulting in unnecessary energy waste and network lifetime decay. In this paper, we explore mobility-aware network lifetime maximization for battery-powered IoT applications that perform approximate real-time computation under the quality-of-service (QoS) constraint. The proposed scheme is composed of offline and online stages. At the offline stage, an optimal mobility-aware task schedule that maximizes network lifetime is derived by using a mixed-integer linear programming technique. Redundant executions due to mobility-incurred overlapping of a single task on different IoT devices are avoided for energy savings. At the online stage, a performance-guaranteed and time-efficient QoS-adaptive heuristic based on the cross-entropy method is developed to adapt task execution to the fluctuating QoS requirements. Extensive simulations based on synthetic applications and real-life benchmarks have been implemented to validate the effectiveness of our proposed scheme. Experimental results demonstrate that the proposed technique can achieve up to 169.52% network lifetime improvement compared to benchmarking solutions.

91 citations


Journal ArticleDOI
TL;DR: A sparse SNN topology where noncritical connections are pruned to reduce the network size, and the remaining critical synapses are weight quantized to accommodate for limited conductance states is presented.
Abstract: Spiking neural networks (SNNs) with a large number of weights and varied weight distribution can be difficult to implement in emerging in-memory computing hardware due to the limitations on crossbar size (implementing dot product), the constrained number of conductance states in non-CMOS devices and the power budget. We present a sparse SNN topology where noncritical connections are pruned to reduce the network size, and the remaining critical synapses are weight quantized to accommodate for limited conductance states. Pruning is based on the power law weight-dependent spike timing dependent plasticity model; synapses between pre- and post-neuron with high spike correlation are retained, whereas synapses with low correlation or uncorrelated spiking activity are pruned. The weights of the retained connections are quantized to the available number of conductance states. The process of pruning noncritical connections and quantizing the weights of critical synapses is performed at regular intervals during training. We evaluated our sparse and quantized network on MNIST dataset and on a subset of images from Caltech-101 dataset. The compressed topology achieved a classification accuracy of 90.1% (91.6%) on the MNIST (Caltech-101) dataset with 3.1X (2.2X) and 4X (2.6X) improvement in energy and area, respectively. The compressed topology is energy and area efficient while maintaining the same classification accuracy of a 2-layer fully connected SNN topology.
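A condensed sketch of the prune-then-quantize step described above (weight magnitude is used here as a stand-in for the paper's STDP spike-correlation criterion, and the layer size, keep ratio, and number of conductance states are assumptions):

```python
import numpy as np

def prune_and_quantize(w, keep_ratio=0.3, n_states=8):
    flat = np.abs(w).ravel()
    thr = np.quantile(flat, 1.0 - keep_ratio)       # keep only the strongest synapses
    mask = np.abs(w) >= thr
    levels = np.linspace(w[mask].min(), w[mask].max(), n_states)
    q = levels[np.abs(w[..., None] - levels).argmin(axis=-1)]   # snap to nearest state
    return q * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(0, 1, size=(784, 100))               # input -> excitatory layer
wq, mask = prune_and_quantize(w)
print(f"kept {mask.mean():.0%} of synapses, {np.unique(wq[mask]).size} conductance states")
```

In the paper this compression is interleaved with training at regular intervals, so the network can recover accuracy after each pruning/quantization pass.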

Journal ArticleDOI
TL;DR: A deep learning framework for high performance and large scale hotspot detection that uses feature tensor generation to extract representative layout features that fit well with convolutional neural networks while keeping the spatial relationship of the original layout pattern with minimal information loss.
Abstract: Detecting layout hotspots is a key step in the physical verification flow. Although machine learning solutions show benefits over lithography simulation and pattern matching-based methods, it is still hard to select a proper model for large scale problems and inevitably, performance degradation occurs. To overcome these issues, in this paper, we develop a deep learning framework for high performance and large scale hotspot detection. First, we use feature tensor generation to extract representative layout features that fit well with convolutional neural networks while keeping the spatial relationship of the original layout pattern with minimal information loss. Second, we propose a biased learning (BL) algorithm to train the convolutional neural network to further improve detection accuracy with small false alarm penalties. In addition, to simplify the training procedure and seek a better tradeoff between accuracy and false alarms, we extend the original BL to a batch BL algorithm. Experimental results show that our framework outperforms previous machine learning-based hotspot detectors in both ICCAD 2012 Contest benchmarks and large scale industrial benchmarks. Source code and trained models are available at https://github.com/phdyang007/dlhsd .

Journal ArticleDOI
TL;DR: A quantitative security criterion is proposed for de-camouflaging complexity measurements and formally analyzed through the demonstration of the equivalence between the existing de-camouflaging strategy and the active learning scheme, and a provably secure camouflaging framework is developed combining these two techniques.
Abstract: The advancing of reverse engineering techniques has complicated the efforts in intellectual property protection. Proactive methods have been developed recently, among which layout-level integrated circuit camouflaging is the leading example. However, existing camouflaging methods are rarely supported by provably secure criteria, which further leads to an over-estimation of the security level when countering latest de-camouflaging attacks, e.g., the SAT-based attack. In this paper, a quantitative security criterion is proposed for de-camouflaging complexity measurements and formally analyzed through the demonstration of the equivalence between the existing de-camouflaging strategy and the active learning scheme. Supported by the new security criterion, two camouflaging techniques are proposed, including the low-overhead camouflaging cell generation strategy and the AND-tree camouflaging strategy, to help achieve exponentially increasing security levels at the cost of linearly increasing performance overhead on the circuit under protection. A provably secure camouflaging framework is then developed combining these two techniques. The experimental results using the security criterion show that camouflaged circuits with the proposed framework are of high resilience against different attack schemes with only negligible performance overhead.

Journal ArticleDOI
TL;DR: HEIF is presented, a highly efficient SC-based inference framework of the large-scale DCNNs, with broad applications including (but not limited to) LeNet-5 and AlexNet, that achieves high energy efficiency and low area/hardware cost.
Abstract: Deep convolutional neural networks (DCNNs) are one of the most promising deep learning techniques and have been recognized as the dominant approach for almost all recognition and detection tasks. The computation of DCNNs is memory intensive due to large feature maps and neuron connections, and the performance highly depends on the capability of hardware resources. With the recent trend of wearable devices and Internet of Things, it becomes desirable to integrate the DCNNs onto embedded and portable devices that require low power and energy consumptions and small hardware footprints. Recently, stochastic computing (SC)-DCNN demonstrated that SC as a low-cost substitute to binary-based computing radically simplifies the hardware implementation of arithmetic units and has the potential to satisfy the stringent power requirements in embedded devices. In SC, many arithmetic operations that are resource-consuming in binary designs can be implemented with very simple hardware logic, alleviating the extensive computational complexity. It offers a colossal design space for integration and optimization due to its reduced area and soft error resiliency. In this paper, we present HEIF, a highly efficient SC-based inference framework of the large-scale DCNNs, with broad applications including (but not limited to) LeNet-5 and AlexNet, that achieves high energy efficiency and low area/hardware cost. Compared to SC-DCNN, HEIF features: 1) the first (to the best of our knowledge) SC-based rectified linear unit activation function to catch up with the recent advances in software models and mitigate degradation in application-level accuracy; 2) the redesigned approximate parallel counter and optimized stochastic multiplication using transmission gates and inverse mirror adders; and 3) the new optimization of weight storage using clustering. Most importantly, to achieve maximum energy efficiency while maintaining acceptable accuracy, HEIF considers holistic optimizations on cascade connection of function blocks in DCNN, pipelining technique, and bit-stream length reduction. Experimental results show that in large-scale applications HEIF outperforms the previous SC-DCNN in throughput by 4.1×, in area efficiency by up to 6.5×, and achieves up to 5.6× energy improvement.
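The bitstream arithmetic HEIF builds on is easy to show in isolation (the approximate parallel counter, activation, and weight-storage optimizations of the paper are not modeled; the stream length and operand values are arbitrary): in bipolar stochastic computing, multiplication reduces to an XNOR per bit pair.

```python
import numpy as np

rng = np.random.default_rng(1)

def to_stream(v, length):
    """Encode v in [-1, 1] as a bitstream with P(1) = (v + 1) / 2."""
    return (rng.random(length) < (v + 1) / 2).astype(np.uint8)

def from_stream(s):
    return 2 * s.mean() - 1

L = 4096                               # bit-stream length trades accuracy for latency
a, b = 0.6, -0.4
sa, sb = to_stream(a, L), to_stream(b, L)
product = from_stream(1 - (sa ^ sb))   # XNOR of the two streams
print(f"exact {a * b:+.3f}  stochastic {product:+.3f}")
```

A single gate per bit is why SC multipliers are so much smaller than binary ones; the price is the long bitstream, which HEIF attacks with its bit-stream length reduction.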

Journal ArticleDOI
TL;DR: A training-in-memory based on RRAM (TIME) architecture and the peripheral circuit design to enable training NN on RRAM, which has the potential to boost the energy efficiency by two orders of magnitude compared with ASIC.
Abstract: The training of neural networks (NN) is usually time-consuming and resource intensive. The emerging metal-oxide resistive random-access memory (RRAM) device has shown potential for the computation of NN. The RRAM crossbar structure and multibit characteristics can perform the matrix-vector product with high energy efficiency, which is the most common operation of NN. Two challenges exist for realizing training NN based on RRAM. First, the current architectures based on RRAM only support the inference in training NN and cannot perform the backpropagation (BP) and the weight update of training NN. Second, training NN requires enormous iterations to constantly update the weights for reaching the convergence. However, this weight update leads to large energy consumption because of the nonideal factors of RRAM. In this paper, we propose a training-in-memory based on RRAM (TIME) architecture and the peripheral circuit design to enable training NN on RRAM. TIME supports the BP and the weight update while maximizing the re-usage of peripheral circuits of the inference operation on RRAM. Meanwhile, a set of optimization strategies focusing on the nonideal factors are designed to reduce the cost of tuning RRAM. We explore the performance of both supervised learning (SL) and deep reinforcement learning (DRL) on TIME. A specific mapping method of DRL is also introduced to further improve energy efficiency. Simulation results show that in SL, TIME can achieve 5.3× higher energy efficiency on average compared with DaDianNao, an application-specific integrated circuit (ASIC) in CMOS technology. In DRL, TIME achieves on average 126× higher energy efficiency than a GPU. If the cost of tuning RRAM can be further reduced, TIME has the potential to boost the energy efficiency by two orders of magnitude compared with ASIC.
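The core crossbar operation TIME accelerates can be sketched numerically as follows (peripheral circuits, BP, weight updates, and RRAM nonidealities are omitted; the conductance range and the differential weight mapping are illustrative assumptions): with conductances programmed at the crosspoints and voltages driven on the rows, the column currents compute a matrix-vector product in one step by Ohm's and Kirchhoff's laws.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(-1, 1, size=(4, 3))          # 4 inputs x 3 neurons
g_max, g_min = 100e-6, 1e-6                        # siemens, illustrative device range

# differential mapping: a positive and a negative column per weight column
g_pos = g_min + np.maximum(weights, 0) * (g_max - g_min)
g_neg = g_min + np.maximum(-weights, 0) * (g_max - g_min)

v = rng.uniform(0, 0.2, size=4)                    # read voltages on the rows
i_out = v @ g_pos - v @ g_neg                      # column current difference
print(np.allclose(i_out, (v @ weights) * (g_max - g_min)))   # True: analog dot product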

Journal ArticleDOI
TL;DR: This paper analyzes how the vulnerabilities in an FSM can be exploited by fault injection attacks, and proposes a security-aware FSM design flow for ASIC designs to mitigate them and prevent fault attacks on FSM.
Abstract: The security of a system-on-chip (SoC) can be compromised by exploiting the vulnerabilities of the finite state machines (FSMs) in the SoC controller modules through fault injection attacks. These vulnerabilities may be unintentionally introduced by traditional FSM design practices or by CAD tools during synthesis. In this paper, we first analyze how the vulnerabilities in an FSM can be exploited by fault injection attacks. Then, we propose a security-aware FSM design flow for ASIC designs to mitigate them and prevent fault attacks on FSM. Our proposed FSM design flow starts with a security-aware encoding scheme which makes the FSM resilient against fault attacks. However, the vulnerabilities introduced by the CAD tools cannot be addressed by encoding schemes alone. To analyze for such vulnerabilities, we develop a novel technique named analyzing vulnerabilities in FSM. If any vulnerability exists, we propose a secure FSM architecture to address the issue. In this paper, we mainly focus on setup-time violation-based fault attacks which pose a serious threat on FSMs; though our proposed flow works for advanced laser-based fault attacks as well. We compare our proposed secure FSM design flow with traditional FSM design practices in terms of cost, performance, and security. We show that our FSM design flow ensures security while having a negligible impact on cost and performance.

Journal ArticleDOI
TL;DR: The experiments demonstrate that neural networks are generally insensitive to the precision of the activation function, and prove that the proposed combinational circuit-based approach is very efficient in terms of speed and area, with negligible accuracy loss on the MNIST, CIFAR-10, and ImageNet benchmarks.
Abstract: The widespread application of artificial neural networks has prompted researchers to experiment with field-programmable gate array and customized ASIC designs to speed up their computation. These implementation efforts have generally focused on weight multiplication and signal summation operations, and less on activation functions used in these applications. Yet, efficient hardware implementations of nonlinear activation functions like exponential linear units (ELU), scaled ELU (SELU), and hyperbolic tangent (tanh) are central to designing effective neural network accelerators, since these functions require lots of resources. In this paper, we explore efficient hardware implementations of activation functions using purely combinational circuits, with a focus on two widely used nonlinear activation functions, i.e., SELU and tanh. Our experiments demonstrate that neural networks are generally insensitive to the precision of the activation function. The results also prove that the proposed combinational circuit-based approach is very efficient in terms of speed and area, with negligible accuracy loss on the MNIST, CIFAR-10, and ImageNet benchmarks. Synopsys Design Compiler synthesis results show that circuit designs for tanh and SELU can save 3.13×–7.69× and 4.45×–8.45× area compared to the look-up table/memory-based implementations, and can operate at 5.14 GHz and 4.52 GHz using the 28-nm SVT library, respectively. The implementation is available at: https://github.com/ThomasMrY/ActivationFunctionDemo .
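To illustrate why coarse approximations can suffice when networks are "generally insensitive to the precision of the activation function," here is a 3-segment piecewise-linear tanh (the breakpoints are arbitrary assumptions, not the synthesized design) together with its worst-case error:

```python
import numpy as np

def tanh_pwl(x):
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    y = np.where(ax >= 2.0, 1.0,            # saturation region
        np.where(ax <= 0.5, ax,             # near-linear region
                 (ax + 1.0) / 3.0))         # middle segment joining the two
    return np.sign(x) * y

xs = np.linspace(-4, 4, 10001)
err = np.max(np.abs(tanh_pwl(xs) - np.tanh(xs)))
print(f"max abs error of the 3-segment approximation: {err:.3f}")   # ~0.10
```

A hardware design would use more segments (or other purely combinational structures) to shrink this error, but the experiment above hints at why modest precision is often enough at the application level.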

Journal ArticleDOI
TL;DR: This paper presents a scan attack countermeasure based on the encryption of the data written to or read from the scan chains, which provides expected test/diagnostic and debug facilities as classical scan design with marginal impacts on area, test time and design flows.
Abstract: Scan attacks exploit facilities offered by scan chains to retrieve embedded secret data, in particular, secret keys used by the device for data encryption/decryption in mission mode. This paper presents a scan attack countermeasure based on the encryption of the data written to or read from the scan chains. The secret-key management system already embedded in the device is used to provide appropriate keys for encryption of data flowing on the scan chains. The goal of the proposed solution is to counteract the scan-related security threats while preserving test and diagnosis efficiency provided by conventional design-for-testability techniques, as well as to allow debugging capabilities in mission mode. The proposed solution can deal with both stuck-at and transition-faults test schemes as well as single and multiple scan chain configurations using test data compression schemes. We will show that the proposed scheme provides expected test/diagnostic and debug facilities as classical scan design with marginal impacts on area, test time and design flows, while successfully preventing control and observation of data flowing in the scan chains by unauthorized users.

Journal ArticleDOI
TL;DR: A run-time simulation framework of both PD and architecture that captures their interactions, achieves less than 1% deviation from SPICE for an entire PD system simulation, and investigates the impact of dynamic noise on system-level oxide breakdown reliability.
Abstract: With the reduced noise margin brought by relentless technology scaling, power integrity assurance has become more challenging than ever. On the other hand, traditional design methodologies typically focus on a single design layer without much cross-layer interaction, potentially introducing unnecessary guard-band and wasting significant design resources. Both issues imperatively call for a cross-layer framework for the co-exploration of power delivery (PD) and system architecture, especially in the early design stage with larger design and optimization freedom. Unfortunately, such a framework does not exist yet in the literature. As a step forward, this paper provides a run-time simulation framework of both PD and architecture and captures their interactions. Enabled by the proposed recursive run-time PD model, it can achieve smaller than 1% deviation from SPICE for an entire PD system simulation. Moreover, with seamless interactions among architecture, power and PD simulators, it can simulate actual benchmarks within reasonable time. The experimental results of running the PARSEC suite have demonstrated the framework’s capability to discover the co-effect of PD and architecture for early stage design optimization. Moreover, it also exposes multiple sources of over-pessimism in traditional PD methodologies. Finally, the framework is able to investigate the impact of dynamic noise on system level oxide breakdown reliability and shows 31%–92% lifetime estimation deviations from typical static analysis.

Journal ArticleDOI
TL;DR: The proposed memristor emulator circuit contains only one VDTA as the active element and a single grounded capacitor, which makes it well suited to integrated-circuit implementation; a MOS capacitance can also be utilized instead of the external capacitor.
Abstract: In this paper, we present a memristor emulator based on a voltage difference transconductance amplifier (VDTA). The proposed memristor emulator circuit contains only one VDTA as the active element and a single grounded capacitor, which makes it well suited to integrated-circuit implementation. Furthermore, a MOS capacitance can be utilized instead of the external capacitor in the circuit. The complete memristor emulator is laid out in the Cadence environment using TSMC 0.18 μm process parameters. It occupies an area of 35.7 μm × 29 μm. Simulation results are given to demonstrate the performance of the presented memristor emulator under different operating frequencies, process corners, and radical temperature changes. Moreover, a prototype circuit is implemented to confirm the theoretical analysis by employing a single LM13700 commercial device as the active element. Experimental results of the designed memristor emulator are given to investigate its behavior for different operating frequencies, capacitance and resistor values, and DC supply voltages. The experimental results are in accordance with the theoretical analyses and simulation results.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a new resist modeling framework for contact layers, utilizing existing data from old technology nodes and active selection of data in a target technology node, to reduce the amount of data required from the target lithography configuration.
Abstract: Lithography simulation is one of the key steps in physical verification, enabled by the substantial optical and resist models. A resist model bridges the aerial image simulation to printed patterns. While the effectiveness of learning-based solutions for resist modeling has been demonstrated, they are considerably data-demanding. Meanwhile, a set of manufactured data for a specific lithography configuration is only valid for the training of one single model, indicating low data efficiency. Due to the complexity of the manufacturing process, obtaining enough data for acceptable accuracy becomes very expensive in terms of both time and cost, especially during the evolution of technology generations when the design space is intensively explored. In this paper, we propose a new resist modeling framework for contact layers, utilizing existing data from old technology nodes and active selection of data in a target technology node, to reduce the amount of data required from the target lithography configuration. Our framework based on transfer learning and active learning techniques is effective within a competitive range of accuracy, i.e., a 3×–10× reduction in the amount of training data with comparable accuracy to the state-of-the-art learning approach.

Journal ArticleDOI
TL;DR: This paper combines the two types of affinities and designs a scheduling heuristic that assigns a task to the processor with the highest joint affinity, which can reduce the system makespan by up to 30.1% without violating the temperature and reliability constraints.
Abstract: With the advent of heterogeneous multiprocessor architectures, efficient scheduling for high performance has been of significant importance. However, joint considerations of reliability, temperature, and stochastic characteristics of precedence-constrained tasks for performance optimization make task scheduling particularly challenging. In this paper, we tackle this challenge by using an affinity (i.e., probability)-driven task allocation and scheduling approach that decouples schedule lengths and thermal profiles of processors. Specifically, we separately model the affinity of a task for processors with respect to schedule lengths and the affinity of a task for processors with regard to chip thermal profiles considering task reliability and stochastic characteristics of task execution time and intertask communication time. Subsequently, we combine the two types of affinities, and design a scheduling heuristic that assigns a task to the processor with the highest joint affinity. Extensive simulations based on randomly generated stochastic and real-world applications are performed to validate the effectiveness of the proposed approach. Experiment results show that the proposed scheme can reduce the system makespan by up to 30.1% without violating the temperature and reliability constraints compared to benchmarking methods.
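A stripped-down view of the joint-affinity decision (the paper derives the two affinities from stochastic execution/communication times, reliability, and thermal models; here they are random placeholders, and the product is just one simple way to combine them): each task goes to the processor maximizing the combination of its normalized makespan affinity and thermal affinity.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tasks, n_procs = 6, 3
aff_time = rng.random((n_tasks, n_procs))      # affinity w.r.t. schedule length
aff_heat = rng.random((n_tasks, n_procs))      # affinity w.r.t. thermal profile

joint = (aff_time / aff_time.sum(1, keepdims=True)) * \
        (aff_heat / aff_heat.sum(1, keepdims=True))
assignment = joint.argmax(axis=1)              # processor with the highest joint affinity
print(assignment)
```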

Journal ArticleDOI
TL;DR: The results show that, compared to neural computing without fault tolerance, the recognition accuracy for the CIFAR-10 dataset improves from 37% to 83% when using low-endurance RRAM cells, and from 63% to 76% when using RRAM cells with high endurance but a high percentage of initial faults.
Abstract: A resistive random-access memory (RRAM)-based computing system (RCS) is an attractive hardware platform for implementing neural computing algorithms. On-line training for RCS enables hardware-based learning for a given application and reduces the additional error caused by device parameter variations. However, a high occurrence rate of hard faults due to immature fabrication processes and limited write endurance restrict the applicability of on-line training for RCS. We propose a fault-tolerant on-line training method that alternates between a fault-detection phase and a fault-tolerant training phase. In the fault-detection phase, a quiescent-voltage comparison method is utilized. In the training phase, a threshold-training method and a remapping scheme are proposed. Our results show that, compared to neural computing without fault tolerance, the recognition accuracy for the CIFAR-10 dataset improves from 37% to 83% when using low-endurance RRAM cells, and from 63% to 76% when using RRAM cells with high endurance but a high percentage of initial faults.

Journal ArticleDOI
TL;DR: This paper targets the energy-efficient design of soft real-time and reliable applications on uniprocessor embedded systems, using dynamic voltage and frequency scaling (DVFS) to save energy while taking into account the impact of DVFS on reliability.
Abstract: Energy efficiency, reliability, and real-time are three key requirements of mission-critical embedded systems. Existing approaches overemphasize the worst-case design of real-time embedded systems, which leads to a serious waste of resources. In this paper, we aim at the energy-efficient design of soft real-time and reliable applications on uniprocessor embedded systems. We consider soft real-time tasks with stochastic execution durations regarding certain distributions. Thereby, we provide a real-time guarantee with probability consideration. We utilize dynamic voltage and frequency scaling (DVFS) for saving energy, and also take into account the impact of DVFS on reliability. Our objective is to minimize the expected energy consumption of the system subject to statistical reliability and deadline constraints. The design optimization problem is a typical multidimensional multiple-choice knapsack problem, which is NP-hard. We first propose a dynamic programming-based optimal algorithm to solve the problem. To reduce the time complexity, we then develop a (1+β)-approximation algorithm based on a binary search approach, where β is the approximation factor. The approximation algorithm can obtain a near-optimal solution with at most (1+β) times the optimal energy cost under given real-time and reliability constraints and has fully polynomial time complexity. Extensive experiments and a real-life synthetic application are conducted to evaluate the performance of the proposed techniques. Compared with existing approaches, the approximation approach can save much energy with low time overhead while guaranteeing the statistical deadline and reliability constraints.

Journal ArticleDOI
TL;DR: This paper presents the implementation of an ELM/OS-ELM in a customized system-on-a-chip field-programmable gate array-based architecture to ensure efficient hardware acceleration.
Abstract: Machine learning algorithms such as those for object classification in images, video content analysis, and human action recognition are used to extract meaningful information from data recorded by image sensors and cameras. Among the existing machine learning algorithms for such purposes, extreme learning machines (ELMs) and online sequential ELMs (OS-ELMs) are well known for their computational efficiency and performance when processing large datasets. The latter approach was derived from the ELM approach and optimized for real-time application. However, OS-ELM classifiers are computationally demanding, and the existing state-of-the-art computing platforms are not efficient enough for embedded systems, especially for applications with strict requirements in terms of low power consumption, high throughput, and low latency. This paper presents the implementation of an ELM/OS-ELM in a customized system-on-a-chip field-programmable gate array-based architecture to ensure efficient hardware acceleration. The acceleration process comprises parallel extraction, deep pipelining, and efficient shared memory communication.
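The reason ELM/OS-ELM training is cheap enough to accelerate this way is the closed-form solve for the output weights. A minimal software sketch (fixed-point details, the OS-ELM recursive update, and the FPGA pipeline are not modeled; the toy regression task is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_elm(x, y, hidden=64):
    w_in = rng.standard_normal((x.shape[1], hidden))
    b = rng.standard_normal(hidden)
    h = np.tanh(x @ w_in + b)                      # random, untrained hidden features
    beta = np.linalg.pinv(h) @ y                   # output weights in closed form
    return w_in, b, beta

def predict(x, w_in, b, beta):
    return np.tanh(x @ w_in + b) @ beta

# toy problem: learn y = sin(x0) + 0.5 * x1
x = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(x[:, 0]) + 0.5 * x[:, 1]
params = train_elm(x, y)
print("train RMSE:", np.sqrt(np.mean((predict(x, *params) - y) ** 2)))
```

Because inference is just two matrix products and an elementwise nonlinearity, the workload maps cleanly onto the parallel, deeply pipelined datapath the paper describes.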

Journal ArticleDOI
Shouyi Yin, Shibin Tang, Xinhan Lin, Peng Ouyang, Fengbin Tu, Leibo Liu, Shaojun Wei
TL;DR: An FPGA resource efficient mapping mechanism that improves the utilization of DSPs by integrating multiple small bit-width operations on one DSP and uses the LB-level spatial mapping to exploit the complementary features between different neural networks in the hybrid-NN.
Abstract: Deep learning has promoted the development of artificial intelligence and achieved remarkable successes in many intelligent fields. Convolution-based layers (CLs), fully connected layers (FLs) and recurrent layers (RLs) are three types of layers in classic neural networks. Most intelligent tasks are implemented by hybrid neural networks (hybrid-NNs), which are commonly composed of different layer-blocks (LBs) of CLs, FLs, and RLs. Because the CLs require the most computation in hybrid-NNs, many field-programmable gate array (FPGA)-based accelerators focus on CL acceleration and have demonstrated great performance. However, the CL accelerators lead to an underutilization of FPGA resources in the acceleration of the whole hybrid-NN. To fully exploit the logic resources and the memory bandwidth in the acceleration of CLs/FLs/RLs, we propose an FPGA resource efficient mapping mechanism for hybrid-NNs. The mechanism first improves the utilization of DSPs by integrating multiple small bit-width operations on one DSP. Then the LB-level spatial mapping is used to exploit the complementary features between different neural networks in the hybrid-NN. We evaluate the mapping mechanism by implementing four hybrid-NNs on a Xilinx Virtex7 690T FPGA. The proposed mechanism achieves a peak performance of 1805.8 giga operations per second (GOPs). With the analysis on resource utilization and throughput, the proposed method exploits more computing power in the FPGA and achieves up to 4.13× higher throughput than state-of-the-art accelerators.
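The DSP-packing idea of "integrating multiple small bit-width operations on one DSP" can be demonstrated with plain integers (actual DSP48 operand widths, signed operands, and the paper's mapping flow are not modeled): two narrow products that share one operand come out of a single wide multiplication.

```python
def packed_mul(a, b, c, width=8):
    """Return (a*c, b*c) for unsigned `width`-bit a, b, c using one multiply."""
    guard = 2 * width                      # b*c fits in 2*width bits, so no carry-over
    packed = (a << guard) | b              # pack a and b with guard bits in between
    product = packed * c                   # the single "DSP" multiplication
    return product >> guard, product & ((1 << guard) - 1)

a, b, c = 173, 41, 219
assert packed_mul(a, b, c) == (a * c, b * c)
print(packed_mul(a, b, c))
```

In a CNN this corresponds to multiplying two low-precision weights by one shared activation (or vice versa) per DSP slice, which is where the utilization gain comes from.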

Journal ArticleDOI
TL;DR: The synthesis framework is based on lookup-table (LUT) networks, which play a key role in conventional logic synthesis, and can advance over the state-of-the-art hierarchical reversible logic synthesis algorithms.
Abstract: We present a synthesis framework to map logic networks into quantum circuits for quantum computing. The synthesis framework is based on lookup-table (LUT) networks, which play a key role in conventional logic synthesis. Establishing a connection between LUTs in an LUT network and reversible single-target gates in a reversible network allows us to bridge conventional logic synthesis with logic synthesis for quantum computing, despite several fundamental differences. We call our synthesis framework LUT-based hierarchical reversible logic synthesis (LHRS). Input to LHRS is a classical logic network representing an arbitrary Boolean combinational operation; output is a quantum network (realized in terms of Clifford+T gates). The framework allows one to account for qubit count requirements imposed by the overlying quantum algorithm or target quantum computing hardware. In a fast first step, an initial network is derived that only consists of single-target gates and already completely determines the number of qubits in the final quantum network. Different methods are then used to map each single-target gate into Clifford+T gates, while aiming at optimally using available resources. We demonstrate the versatility of our method by conducting a design space exploration using different parameters on a set of large combinational benchmarks. On the same benchmarks, we show that our approach can advance over the state-of-the-art hierarchical reversible logic synthesis algorithms.

Journal ArticleDOI
TL;DR: An efficient test generation technique, which can be used to achieve full state and transition coverage in simulation-based verification for a wide variety of cache coherence protocols, and guarantees selection of important transitions by utilizing equivalence classes, and omits only similar transitions.
Abstract: Computing systems utilize multicore processors with complex cache coherence protocols to meet the increasing need for performance and energy improvement. It is a major challenge to verify the correctness of a cache coherence protocol since the number of reachable states grows exponentially with the number of cores. In this paper, we propose an efficient test generation technique, which can be used to achieve full state and transition coverage in simulation-based verification for a wide variety of cache coherence protocols. Based on effective analysis of the state space structure, our method can generate more efficient test sequences (50% shorter) on-the-fly compared with tests generated by BFS. While our on-the-fly method can reduce the number of required tests by half, it can still be impractical to verify all possible transitions in the presence of a large number of cores. We propose scalable on-the-fly test generation techniques using the quotient state space. The proposed approach guarantees selection of important transitions by utilizing equivalence classes, and omits only similar transitions. Our experimental results demonstrate that our proposed approaches can efficiently trade off between transition coverage and validation effort.

Journal ArticleDOI
TL;DR: TaintHLS is presented to automatically generate a microarchitecture to support baseline operations and a shadow microarchitecture for intrinsic DIFT support in hardware accelerators, while providing variable granularity of taint tags.
Abstract: Dynamic information flow tracking (DIFT) is a technique to track potential security vulnerabilities in software and hardware systems at run time. Untrusted data are marked with tags (tainted), which are propagated through the system, and their potential for unsafe use is analyzed in order to prevent it. DIFT is not supported in heterogeneous systems, especially hardware accelerators. Currently, DIFT is manually generated and integrated into the accelerators. This process is error-prone, potentially hurting the identification of security violations in heterogeneous systems. We present TaintHLS to automatically generate a microarchitecture to support baseline operations and a shadow microarchitecture for intrinsic DIFT support in hardware accelerators, while providing variable granularity of taint tags. TaintHLS offers a companion high-level synthesis (HLS) methodology to automatically generate such DIFT-enabled accelerators from a high-level specification. We extended a state-of-the-art HLS tool to generate DIFT-enhanced accelerators and demonstrated the approach on numerous benchmarks. The DIFT-enabled accelerators have negligible performance and no more than 30% hardware overhead.
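A software analogy of the shadow (taint) logic TaintHLS generates next to the datapath (this is not the tool's output; the class and the example values are illustrative): every value carries a tag, and each operation ORs its operands' tags into the result's tag, so untrusted inputs remain tracked all the way to the outputs.

```python
class Tainted:
    def __init__(self, value, taint=False):
        self.value, self.taint = value, taint

    def _op(self, other, fn):
        other = other if isinstance(other, Tainted) else Tainted(other)
        # shadow logic: result is tainted if either operand is tainted
        return Tainted(fn(self.value, other.value), self.taint or other.taint)

    def __add__(self, other): return self._op(other, lambda a, b: a + b)
    def __mul__(self, other): return self._op(other, lambda a, b: a * b)

trusted_coeff = Tainted(3)
untrusted_in = Tainted(7, taint=True)     # e.g., data arriving from an untrusted source
result = trusted_coeff * untrusted_in + Tainted(1)
print(result.value, "tainted:", result.taint)   # 22 tainted: True
```

In the generated hardware the tag bits travel through a parallel shadow datapath, so the baseline accelerator's timing is essentially untouched.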

Journal ArticleDOI
TL;DR: RLMap is proposed, a solution that formulates DFG mapping on CGRA as an agent in RL, which unifies placement, routing and processing element insertion by interchange actions of the agent.
Abstract: Coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their flexibility and energy efficiency. Data flow graphs (DFGs) are often mapped onto CGRAs for acceleration. The problem of DFG mapping is challenging due to the diverse structures of DFGs and the constrained hardware of CGRAs. Consequently, it is difficult to find a valid and high quality solution simultaneously. Inspired by the great progress in deep reinforcement learning (RL) for AI problems, we consider building methods that learn to map DFGs onto spatially programmed CGRAs directly from experience. We propose RLMap, a solution that formulates DFG mapping on a CGRA as an agent in RL, which unifies placement, routing and processing element insertion by interchange actions of the agent. Experimental results show that RLMap performs comparably to state-of-the-art heuristics in mapping quality, adapts to different architectures, and converges quickly.