
Showing papers presented at the Asia and South Pacific Design Automation Conference in 2019


Proceedings ArticleDOI
21 Jan 2019
TL;DR: This work analyzes the bottlenecks of existing compilers, provides a dedicated method for compiling circuits of this kind to IBM QX architectures, and shows that the proposed approach significantly outperforms IBM's own solution in terms of both fidelity of the compiled circuit and runtime.
Abstract: The Noisy Intermediate-Scale Quantum (NISQ) technology is currently investigated by major players in the field to build the first practically useful quantum computer. IBM QX architectures are the first ones that are already publicly available today. However, in order to use them, the respective quantum circuits have to be compiled for the target architecture in use. While first approaches have been proposed for this purpose, they are infeasible for a certain set of SU(4) quantum circuits which have recently been introduced to benchmark such compilers. In this work, we analyze the bottlenecks of existing compilers and provide a dedicated method for compiling circuits of this kind to IBM QX architectures. Our experimental evaluation (using tools provided by IBM) shows that the proposed approach significantly outperforms IBM's own solution in terms of both fidelity of the compiled circuit and runtime. Moreover, the solution proposed in this work was declared the winner of the IBM QISKit Developer Challenge. An implementation of the proposed methodology is publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping.

73 citations
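The compilation problem the abstract refers to is mapping logical qubits onto physical qubits so that every two-qubit gate acts on a connected pair of the device's coupling graph. The sketch below is not the authors' algorithm; it only illustrates the baseline idea of inserting SWAPs along a shortest path so a CNOT becomes executable. The coupling map, layout, and gate are made-up examples.

```python
# Illustrative sketch (not the paper's method): make a two-qubit gate executable
# on a device coupling graph by inserting SWAPs along a shortest path.
from collections import deque

def shortest_path(coupling, src, dst):
    """BFS shortest path between two physical qubits in an undirected coupling graph."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        q = queue.popleft()
        if q == dst:
            break
        for nb in coupling[q]:
            if nb not in prev:
                prev[nb] = q
                queue.append(nb)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def route_cnot(coupling, layout, ctrl, tgt):
    """Insert SWAPs so logical qubits ctrl/tgt become adjacent, then emit the CNOT.
    `layout` maps logical -> physical qubits and is updated in place."""
    ops = []
    path = shortest_path(coupling, layout[ctrl], layout[tgt])
    inverse = {phys: log for log, phys in layout.items()}
    for a, b in zip(path, path[1:-1]):          # move ctrl one hop at a time towards tgt
        ops.append(("SWAP", a, b))
        la, lb = inverse[a], inverse[b]
        layout[la], layout[lb] = layout[lb], layout[la]
        inverse[a], inverse[b] = lb, la
    ops.append(("CNOT", layout[ctrl], layout[tgt]))
    return ops

# Hypothetical 5-qubit line topology (0-1-2-3-4) and an initial identity layout.
coupling = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
layout = {i: i for i in range(5)}
print(route_cnot(coupling, layout, ctrl=0, tgt=3))
```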


Proceedings ArticleDOI
21 Jan 2019
TL;DR: A sparse NN mapping scheme based on element clustering for better ReRAM crossbar utilization and a crossbar-grained pruning algorithm that removes crossbars with low utilization are proposed.
Abstract: With its in-memory processing ability, ReRAM-based computing is becoming more and more attractive for accelerating neural networks (NNs). However, most ReRAM-based accelerators cannot support efficient mapping for sparse NNs: the whole dense matrix has to be mapped onto the ReRAM crossbar array to achieve O(1) computation complexity. In this paper, we propose a sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization. Further, we propose a crossbar-grained pruning algorithm to remove crossbars with low utilization. Finally, since most current ReRAM devices cannot achieve high precision, we analyze the effect of quantization precision for sparse NNs, propose to compose high precision in the analog domain, and design the related peripheral circuits. In our experiments, we discuss how the system performs with different crossbar sizes in order to choose an optimized design. Our results show that our mapping scheme for sparse NNs with the proposed pruning algorithm achieves 3--5X energy efficiency and a 2.5--6X speedup compared with accelerators for dense NNs. The accuracy experiments also show that our pruning method incurs almost no accuracy loss.

64 citations
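The crossbar-grained pruning idea can be illustrated with a small NumPy sketch: tile the (already sparse) weight matrix into crossbar-sized blocks, measure each block's utilization (fraction of nonzero cells), and drop blocks below a threshold. This is only a schematic of the concept; the paper's clustering-based mapping and its exact utilization criteria are more involved, and the matrix, tile size, and threshold here are made up.

```python
import numpy as np

def crossbar_grained_prune(weights, xbar=4, min_util=0.25):
    """Tile `weights` into xbar x xbar blocks and zero out blocks whose fraction of
    nonzero cells is below `min_util` (they would waste a whole crossbar)."""
    pruned = weights.copy()
    kept = []
    rows, cols = weights.shape
    for r in range(0, rows, xbar):
        for c in range(0, cols, xbar):
            block = pruned[r:r + xbar, c:c + xbar]      # view into `pruned`
            util = np.count_nonzero(block) / block.size
            if util < min_util:
                block[:] = 0.0                          # prune the whole crossbar tile
            else:
                kept.append((r, c, util))
    return pruned, kept

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)) * (rng.random((8, 8)) < 0.2)   # ~80% sparse toy matrix
pruned_w, kept_tiles = crossbar_grained_prune(w)
print(f"crossbars kept: {len(kept_tiles)} of {(8 // 4) * (8 // 4)}")
```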


Proceedings ArticleDOI
Yuanqi Shen1, You Li1, Amin Rezaei1, Shuyu Kong1, David Dlott1, Hai Zhou1 
21 Jan 2019
TL;DR: This paper proposes a behavioral SAT-based attack called BeSAT, which observes the behavior of the encrypted circuit on top of the structural analysis, so the stateful and oscillatory keys missed by CycSAT can still be blocked.
Abstract: Cyclic logic encryption is newly proposed in the area of hardware security. It introduces feedback cycles into the circuit to defeat existing logic decryption techniques. To ensure that the circuit is acyclic under the correct key, CycSAT was developed to add the acyclic condition as a CNF formula to the SAT-based attack. However, we found that it is impossible to capture all cycles in any graph with any set of feedback signals, as is done in the CycSAT algorithm. In this paper, we propose a behavioral SAT-based attack called BeSAT. BeSAT observes the behavior of the encrypted circuit on top of the structural analysis, so the stateful and oscillatory keys missed by CycSAT can still be blocked. The experimental results show that BeSAT successfully overcomes the drawback of CycSAT.

47 citations


Proceedings ArticleDOI
21 Jan 2019
TL;DR: XPPE, a neural network based cross-platform performance estimator, utilizes the resource utilization of an application on a specific FPGA to estimate its performance on other FPGAs, enabling developers to explore the design space without having to fully implement and map the application.
Abstract: The increasing heterogeneity of the applications to be processed has ended the reign of ASICs as the single most efficient processing platform. Hybrid processing platforms such as CPU+FPGA are emerging as powerful platforms that support efficient processing for a diverse range of applications. Hardware/software co-design enables designers to take advantage of these new hybrid platforms, such as Zynq. However, dividing an application into two parts, one running on the CPU and the other converted into a hardware accelerator implemented on the FPGA, makes platform selection difficult for developers, as an application's performance varies significantly across platforms. Developers are required to fully implement the design on each platform just to estimate its performance. This process is tedious when the number of available platforms is large. To address this challenge, we propose XPPE, a neural network based cross-platform performance estimator. XPPE utilizes the resource utilization of an application on a specific FPGA to estimate its performance on other FPGAs. The proposed estimation is performed for a wide range of applications and evaluated against a vast set of platforms. Moreover, XPPE enables developers to explore the design space without having to fully implement and map the application. Our evaluation results show that the correlation between the speedup estimated by XPPE and the actual speedup of applications on a hybrid platform over an ARM processor is more than 0.98.

43 citations
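XPPE's core idea, learning a regressor from resource-utilization features on one FPGA to the speedup observed on another, can be sketched with a generic scikit-learn pipeline. This is a stand-in illustration, not the paper's network or feature set; the feature names and synthetic data below are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical features: [LUT%, FF%, BRAM%, DSP%, clock_MHz] measured on a source FPGA;
# target label: measured speedup of the same kernel on a different target FPGA.
rng = np.random.default_rng(1)
X = rng.random((200, 5))
y = 3.0 * X[:, 3] + 1.5 * X[:, 0] + 0.2 * rng.normal(size=200)   # synthetic ground truth

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0))
model.fit(X[:150], y[:150])
pred = model.predict(X[150:])
corr = np.corrcoef(pred, y[150:])[0, 1]   # the paper reports correlation > 0.98 for its model
print(f"correlation on held-out kernels: {corr:.2f}")
```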


Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper proposes ScanSAT: an attack that transforms a scan obfuscated circuit to its logic-locked version and applies a variant of the Boolean satisfiability (SAT) based attack, thereby extracting the secret key.
Abstract: While financially advantageous, outsourcing key steps such as testing to potentially untrusted Outsourced Semiconductor Assembly and Test (OSAT) companies may pose a risk of compromising on-chip assets. Obfuscation of scan chains is a technique that hides the actual scan data from untrusted testers; logic inserted between the scan cells, driven by a secret key, hides the transformation functions between the scan-in stimulus (scan-out response) and the delivered scan pattern (captured response). In this paper, we propose ScanSAT: an attack that transforms a scan-obfuscated circuit into its logic-locked version and applies a variant of the Boolean satisfiability (SAT) based attack, thereby extracting the secret key. Our empirical results demonstrate that ScanSAT can easily break naive scan obfuscation techniques using only three or fewer attack iterations, even for large key sizes and in the presence of scan compression.

43 citations


Proceedings ArticleDOI
Toshinari Itoko1, Rudy Raymond1, Takashi Imamichi1, Atsushi Matsuo1, Andrew W. Cross1 
21 Jan 2019
TL;DR: This work proposes a formulation and two algorithms exploiting gate commutation rules to obtain a better circuit compiler for NISQCs.
Abstract: The use of noisy intermediate-scale quantum computers (NISQCs), which consist of dozens of noisy qubits with limited coupling constraints, has been increasing. A circuit compiler, which transforms an input circuit into an equivalent output circuit conforming to the coupling constraints with as few additional gates as possible, is essential for running applications on NISQCs. We propose a formulation and two algorithms exploiting gate commutation rules to obtain a better circuit compiler.

40 citations


Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper proposes an Anomaly Detection based Power Saving (ADEPOS) scheme that uses approximate computing throughout the lifetime of the machine, and shows on the NASA bearing dataset that ADEPOS needs 8.8X fewer neurons on average; based on post-layout results, the resultant energy savings are 6.4--6.65X.
Abstract: In Industry 4.0, predictive maintenance (PdM) is one of the most important applications pertaining to the Internet of Things (IoT). Machine learning is used to predict the possible failure of a machine before the actual event occurs. However, the main challenges in PdM are: (a) lack of enough data from failing machines, and (b) paucity of power and bandwidth to transmit sensor data to the cloud throughout the lifetime of the machine. Alternatively, edge computing approaches reduce data transmission and consume low energy. In this paper, we propose the Anomaly Detection based Power Saving (ADEPOS) scheme, which uses approximate computing throughout the lifetime of the machine. At the beginning of the machine's life, while the machine is healthy, low-accuracy computations are used. However, as anomalies are detected over time, the system is switched to higher-accuracy modes. We show using the NASA bearing dataset that with ADEPOS we need 8.8X fewer neurons on average, and based on post-layout results, the resultant energy savings are 6.4--6.65X.

38 citations
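The energy-saving mechanism described above, running a cheap low-accuracy detector while the machine looks healthy and escalating to more accurate (larger) models only once anomalies start appearing, can be caricatured as a simple mode controller. The thresholds, model sizes, and toy detectors below are placeholders, not the paper's actual design.

```python
class AdaptiveAnomalyMonitor:
    """Sketch of an accuracy-on-demand scheme: run the smallest detector while the
    machine looks healthy, escalate to larger detectors once anomalies are suspected."""

    def __init__(self, detectors, escalation_threshold=3):
        # `detectors` is ordered from cheapest/least accurate to costliest/most accurate.
        self.detectors = detectors
        self.level = 0
        self.suspicion = 0
        self.escalation_threshold = escalation_threshold

    def step(self, sample):
        is_anomaly = self.detectors[self.level](sample)
        if is_anomaly:
            self.suspicion += 1
            if (self.suspicion >= self.escalation_threshold
                    and self.level < len(self.detectors) - 1):
                self.level += 1        # switch to a higher-accuracy (more neurons) mode
                self.suspicion = 0
        else:
            self.suspicion = max(0, self.suspicion - 1)
        return is_anomaly, self.level


# Toy detectors: thresholds on a scalar "vibration energy"; tighter threshold = more accurate.
cheap  = lambda x: x > 2.0
costly = lambda x: x > 1.2
monitor = AdaptiveAnomalyMonitor([cheap, costly])
for reading in [0.5, 0.6, 2.5, 2.6, 2.7, 1.3]:
    print(monitor.step(reading))
```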


Proceedings ArticleDOI
21 Jan 2019
TL;DR: GraphSAR is presented, a sparsity-aware processing-in-memory large-scale graph processing accelerator on ReRAMs that achieves 4.43X energy reduction and 1.85X speedup over a previous graph processing architecture on ReRAMs.
Abstract: Large-scale graph processing has drawn great attention in recent years. The emerging metal-oxide resistive random access memory (ReRAM) and ReRAM crossbars have shown huge potential for accelerating graph processing. However, the sparsity of natural graphs hinders the performance of graph processing on ReRAMs. Previous work on graph processing with ReRAMs stored and computed edges separately, leading to high energy consumption and long data-transfer latency. In this paper, we present GraphSAR, a sparsity-aware processing-in-memory large-scale graph processing accelerator on ReRAMs. Computations over edges are performed in the memory, eliminating the overhead of transferring edges. Moreover, graphs are partitioned with sparsity in mind: subgraphs with low density are further divided into smaller ones to minimize the waste of memory space. According to our extensive experimental results, GraphSAR achieves 4.43X energy reduction and 1.85X speedup (8.19X lower energy-delay product, EDP) against a previous graph processing architecture on ReRAMs (GraphR [1]).

35 citations


Proceedings ArticleDOI
21 Jan 2019
TL;DR: An FPGA-based acceleration of HD (FACH) is proposed which significantly improves computation efficiency by removing the majority of multiplications during the reasoning task; FACH provides 5.9X energy efficiency improvement and 5.1X speedup compared to a baseline FPGA-based implementation, while ensuring the same quality of classification.
Abstract: Brain-inspired hyperdimensional (HD) computing explores computing with hypervectors for the emulation of cognition, as an alternative to computing with numbers. In HD, input symbols are mapped to hypervectors and an associative search is performed for reasoning and classification. An associative memory, which finds the closest match between a set of learned hypervectors and a query hypervector, uses the simple Hamming distance metric for similarity checks. However, we observe that, in order to provide acceptable classification accuracy, HD needs to store a non-binarized model in associative memory and use costly similarity metrics such as cosine to perform the reasoning task. This makes HD computationally expensive when it is used for realistic classification problems. In this paper, we propose an FPGA-based acceleration of HD (FACH) which significantly improves computation efficiency by removing the majority of multiplications during the reasoning task. FACH identifies representative values in each class hypervector using a clustering algorithm. It then creates a new HD model with hardware-friendly operations, and we accordingly propose an FPGA-based implementation to accelerate such tasks. Our evaluations on several classification problems show that FACH provides 5.9X energy efficiency improvement and 5.1X speedup compared to a baseline FPGA-based implementation, while ensuring the same quality of classification.

34 citations
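The associative-memory lookup at the heart of HD classification can be sketched in a few lines: encode inputs as binary hypervectors, build one class hypervector per class by majority vote, and classify a query by Hamming distance. The encoding and dimensions below are toy choices; FACH's clustering of class-hypervector values and its FPGA datapath are not shown.

```python
import numpy as np

D = 2048                                        # hypervector dimensionality (toy choice)
rng = np.random.default_rng(0)

def random_hv():
    return rng.integers(0, 2, D, dtype=np.uint8)

def bundle(hvs):
    """Majority vote across hypervectors -> a single binary class hypervector."""
    return (np.sum(hvs, axis=0) * 2 > len(hvs)).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

def noisy(hv, flips=200):
    """Flip a few random bits to simulate noisy encodings of the same symbol."""
    idx = rng.choice(D, flips, replace=False)
    out = hv.copy()
    out[idx] ^= 1
    return out

# Toy training: two classes, each learned from a few noisy copies of a prototype.
protos = {c: random_hv() for c in ("healthy", "faulty")}
class_hvs = {c: bundle([noisy(p) for _ in range(5)]) for c, p in protos.items()}

# Associative search: the class whose hypervector is closest in Hamming distance wins.
query = noisy(protos["faulty"])
pred = min(class_hvs, key=lambda c: hamming(class_hvs[c], query))
print("predicted:", pred)
```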


Proceedings ArticleDOI
21 Jan 2019
TL;DR: The proposed solution, GRAM, efficiently executes the vertex-centric model, which is widely used in large-scale parallel graph processing programs, in computational memory, maximizing computation parallelism while minimizing the number of data movements.
Abstract: The performance of graph processing for real-world graphs is limited by inefficient memory behaviour in traditional systems because of random memory access patterns. Offloading computations to the memory is a promising strategy to overcome such challenges. In this paper, we exploit resistive memory (ReRAM) based processing-in-memory (PIM) technology to accelerate graph applications. The proposed solution, GRAM, efficiently executes the vertex-centric model, which is widely used in large-scale parallel graph processing programs, in the computational memory. The hardware-software co-design used in GRAM maximizes computation parallelism while minimizing the number of data movements. Based on our experiments with three important graph kernels on seven real-world graphs, GRAM provides 122.5X and 11.1X speedups compared with an in-memory graph system and optimized multithreaded algorithms running on a multi-core CPU, respectively. Compared to a GPU-based graph acceleration library and a recently proposed PIM accelerator, GRAM improves performance by 7.1X and 3.8X, respectively.

33 citations


Proceedings ArticleDOI
21 Jan 2019
TL;DR: The experimental results demonstrate that the proposed framework is capable of recovering 99% of the accuracy loss introduced by stuck-at faults without requiring the neural network to be retrained.
Abstract: Matrix-vector multiplication is the dominating computational workload in the inference phase of neural networks. Memristor crossbar arrays (MCAs) can inherently execute matrix-vector multiplication with low latency and small power consumption. A key challenge is that the classification accuracy may be severely degraded by stuck-at-fault defects. Earlier studies have shown that the accuracy loss can be recovered by retraining each neural network or by utilizing additional hardware. In this paper, we propose to handle stuck-at faults using matrix transformations. A transformation T changes a weight matrix W into a weight matrix Ŵ = T(W) that is more robust to stuck-at faults. In particular, we propose a row flipping transformation, a permutation transformation, and a value range transformation. The row flipping transformation translates stuck-off (stuck-on) faults into stuck-on (stuck-off) faults. The permutation transformation maps small (large) weights to memristors stuck off (stuck on). The value range transformation reduces the magnitude of the smallest and largest elements in the matrix, so that each stuck-at fault introduces an error of smaller magnitude. The experimental results demonstrate that the proposed framework is capable of recovering 99% of the accuracy loss introduced by stuck-at faults without requiring the neural network to be retrained.
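One of the transformations described, permuting rows so that small-magnitude weights land on stuck-off cells and large ones on stuck-on cells, can be illustrated with a greedy NumPy sketch. This is a simplified illustration under assumed fault maps, not the authors' actual framework (which also includes the row flipping and value-range transformations).

```python
import numpy as np

def permute_rows_for_faults(W, stuck_off, stuck_on):
    """Greedy sketch: choose, for each physical crossbar row with known stuck cells,
    the remaining logical row of W whose weights incur the least error when those
    cells are forced to 0 (stuck off) or to the maximum conductance (stuck on)."""
    w_max = np.abs(W).max()

    def row_cost(row, cols_off, cols_on):
        off_err = np.abs(row[cols_off]).sum()           # value lost at stuck-off cells
        on_err = (w_max - np.abs(row[cols_on])).sum()   # value overstated at stuck-on cells
        return off_err + on_err

    remaining = list(range(W.shape[0]))
    order = []
    for r in range(W.shape[0]):
        cols_off = [c for (rr, c) in stuck_off if rr == r]
        cols_on = [c for (rr, c) in stuck_on if rr == r]
        best = min(remaining, key=lambda i: row_cost(W[i], cols_off, cols_on))
        order.append(best)
        remaining.remove(best)
    return np.array(order)      # W[order] is the row permutation mapped onto the faulty crossbar

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
order = permute_rows_for_faults(W, stuck_off=[(0, 1)], stuck_on=[(2, 3)])
print(order, "\n", W[order])
```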

Proceedings ArticleDOI
Wei Ye1, Yibo Lin1, Meng Li1, Qiang Liu1, David Z. Pan1 
21 Jan 2019
TL;DR: This work proposes the use of the area under the ROC curve (AUC), which provides a more holistic measure for imbalanced datasets compared with the previous methods, and proposes the surrogate loss functions for direct AUC maximization as a substitute for the conventional cross-entropy loss.
Abstract: As modern integrated circuits scale up with escalating complexity of layout design patterns, lithography hotspot detection, a key stage of physical verification to ensure layout finishing and design closure, has raised a higher demand on its efficiency and accuracy. Among all the hotspot detection approaches, machine learning distinguishes itself for achieving high accuracy while maintaining low false alarms. However, due to the class imbalance problem, the conventional practice which uses the accuracy and false alarm metrics to evaluate different machine learning models is becoming less effective. In this work, we propose the use of the area under the ROC curve (AUC), which provides a more holistic measure for imbalanced datasets compared with the previous methods. To systematically handle class imbalance, we further propose the surrogate loss functions for direct AUC maximization as a substitute for the conventional cross-entropy loss. Experimental results demonstrate that the new surrogate loss functions are promising to outperform the cross-entropy loss when applied to the state-of-the-art neural network model for hotspot detection.
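A common family of surrogate losses for direct AUC maximization is pairwise: penalize hotspot/non-hotspot score pairs that are ranked in the wrong order. The PyTorch snippet below shows one such surrogate (a squared pairwise hinge); it is an illustrative choice, not necessarily the exact loss formulated in the paper.

```python
import torch

def pairwise_auc_surrogate(scores, labels, margin=1.0):
    """Squared-hinge surrogate for AUC: for every (positive, negative) pair, penalize
    cases where the hotspot score does not exceed the non-hotspot score by `margin`."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)      # all positive-negative score differences
    return torch.clamp(margin - diff, min=0).pow(2).mean()

# Toy usage: scores from any hotspot-detection network, 1 = hotspot, 0 = non-hotspot.
scores = torch.tensor([2.1, 0.3, 1.5, -0.2], requires_grad=True)
labels = torch.tensor([1, 0, 1, 0])
loss = pairwise_auc_surrogate(scores, labels)
loss.backward()                                     # drop-in replacement for cross-entropy
print(float(loss))
```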

Proceedings ArticleDOI
21 Jan 2019
TL;DR: A novel hybrid parallel structure for accelerating the NMT with affordable resource overhead for the targeted FPGA is proposed and demonstrated on a Xilinx VCU118 with overall performance at 7.16 GFLOPS.
Abstract: Neural machine translation (NMT) is a popular topic in Natural Language Processing which uses deep neural networks (DNNs) for translation from source to target languages. With emerging technologies, such as bidirectional Gated Recurrent Units (GRU), attention mechanisms, and beam-search algorithms, NMT can deliver improved translation quality compared to conventional statistics-based methods, especially for translating long sentences. However, higher translation quality means more complicated models, higher computation/memory demands, and longer translation time, which causes difficulties for practical use. In this paper, we propose a design methodology for implementing the inference of a real-life NMT (with a problem size of 172 GFLOP) on FPGA for improved run-time latency and energy efficiency. We use High-Level Synthesis (HLS) to build high-performance parameterized IPs for handling the most basic operations (multiply-accumulations) and construct these IPs to accelerate the matrix-vector multiplication (MVM) kernels, which are frequently used in NMT. Also, we perform a design space exploration by considering both computation resources and memory access bandwidth when utilizing the hardware parallelism in the model, and generate the best parameter configurations of the proposed IPs. Accordingly, we propose a novel hybrid parallel structure for accelerating the NMT with affordable resource overhead for the targeted FPGA. Our design is demonstrated on a Xilinx VCU118 with an overall performance of 7.16 GFLOPS.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This work designs a novel slimmed architecture for realizing optical neural network considering both its software and hardware implementations and shows a more area-efficient architecture which uses a sparse tree network block, a single unitary block and a diagonal block for each neural network layer.
Abstract: Optical neural network (ONN) is a neuromorphic computing hardware based on optical components. Since its first on-chip experimental demonstration, it has attracted more and more research interests due to the advantages of ultra-high speed inference with low power consumption. In this work, we design a novel slimmed architecture for realizing optical neural network considering both its software and hardware implementations. Different from the originally proposed ONN architecture based on singular value decomposition which results in two implementation-expensive unitary matrices, we show a more area-efficient architecture which uses a sparse tree network block, a single unitary block and a diagonal block for each neural network layer. In the experiments, we demonstrate that by leveraging the training engine, we are able to find a comparable accuracy to that of the previous architecture, which brings about the flexibility of using the slimmed implementation. The area cost in terms of the Mach-Zehnder interferometers, the core optical components of ONN, is 15%-38% less for various sizes of optical neural networks.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization, which completely removes the energy-hungry step of accessing memory for obtaining model parameters.
Abstract: Deep neural networks have been successfully deployed in a wide variety of applications including computer vision and speech recognition. To cope with computational and storage complexity of these models, this paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization. The aforementioned realization completely removes the energy-hungry step of accessing memory for obtaining model parameters, consumes about two orders of magnitude fewer computing resources compared to realizations that use floating-point operations, and has a substantially lower latency.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: Experimental results demonstrate that the proposed supervised online dictionary learning algorithm for simultaneous feature extraction and dimensionality reduction not only boosts mask optimization quality in terms of edge placement error (EPE) and process variation (PV) band area, but also achieves some speed-up.
Abstract: In modern VLSI design flow, sub-resolution assist feature (SRAF) insertion is one of the resolution enhancement techniques (RETs) to improve chip manufacturing yield. With aggressive feature size continuously scaling down, layout feature learning becomes extremely critical. In this paper, for the first time, we enhance conventional manual feature construction, by proposing a supervised online dictionary learning algorithm for simultaneous feature extraction and dimensionality reduction. By taking advantage of label information, the proposed dictionary learning engine can discriminatively and accurately represent the input data. We further consider SRAF design rules in a global view, and design an integer linear programming model in the post-processing stage of SRAF insertion framework. Experimental results demonstrate that, compared with a state-of-the-art SRAF insertion tool, our framework not only boosts the mask optimization quality in terms of edge placement error (EPE) and process variation (PV) band area, but also achieves some speed-up.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: An adaptive squish representation for multilayer layouts is proposed, which is storage-efficient, lossless, and compatible with deep neural networks, and can achieve satisfactory hotspot detection accuracy when paired with a medium-sized convolutional neural network.
Abstract: Layout hotspot detection is one of the critical steps in the modern integrated circuit design flow. It aims to find potential weak points in layouts before feeding them into the manufacturing stage. The rapid development of machine learning has made it a preferable alternative to traditional hotspot detection solutions. Recent research ranges from layout feature extraction to learning model design. However, only single-layer layout hotspots are considered in state-of-the-art hotspot detectors, and certain defects such as metal-to-via failures are not naturally supported. In this paper, we propose an adaptive squish representation for multilayer layouts, which is storage-efficient, lossless, and compatible with deep neural networks. We conduct experiments on 14nm industrial designs with a metal layer and its two adjacent via layers that contain metal-to-via hotspots. Results show that the adaptive squish representation can achieve satisfactory hotspot detection accuracy when paired with a medium-sized convolutional neural network.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper proposes an approximate divider design that facilitates dynamic energy-quality scaling and makes an approximation to the reciprocal of the divisor in an incremental manner, thus the division speed and energy efficiency can be dynamically traded for accuracy by controlling the number of iterations.
Abstract: Approximate computing can significantly improve the energy efficiency of arithmetic operations in error-resilient applications. In this paper, we propose an approximate divider design that facilitates dynamic energy-quality scaling. Conventional approximate dividers lack runtime energy-quality scalability, which is the key to maximizing energy efficiency while meeting dynamically varying accuracy requirements. Our divider design, named SAADI, approximates the reciprocal of the divisor in an incremental manner, so division speed and energy efficiency can be dynamically traded for accuracy by controlling the number of iterations. For approximate 8-bit division within 32-bit/16-bit division, the average accuracy of SAADI can be adjusted between 92.5% and 99.0% by varying the latency by up to 7X. We evaluate the accuracy and energy consumption of SAADI for various design parameters and demonstrate its efficacy for low-power signal processing applications.
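The "incremental reciprocal" idea, spending more iterations to refine an approximation of 1/divisor and trading latency for accuracy, can be illustrated with a simple bit-by-bit refinement. This is only a software caricature of dynamic energy-quality scaling; it is not SAADI's hardware algorithm, and the normalization and bit widths are arbitrary.

```python
def approx_divide(numerator, divisor, iterations):
    """Approximate numerator/divisor by building up the reciprocal of the divisor one
    binary digit at a time; more iterations -> more accurate (and costlier) result.
    Assumes a positive integer divisor."""
    assert divisor >= 1
    # Normalize the divisor into [1, 2) so its reciprocal lies in (0.5, 1].
    shift = 0
    d = float(divisor)
    while d >= 2.0:
        d /= 2.0
        shift += 1
    recip = 0.0
    step = 0.5
    for _ in range(iterations):                 # greedy binary expansion of 1/d
        if (recip + step) * d <= 1.0:
            recip += step
        step /= 2.0
    return numerator * recip / (2 ** shift)

for iters in (2, 4, 8, 16):
    q = approx_divide(1000, 37, iters)
    print(f"{iters:2d} iterations -> {q:.3f}  (exact {1000 / 37:.3f})")
```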

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This survey presents various fault-tolerant techniques that were designed to tolerate different types of RRAM faults, and describes RRAM-based crossbars and training architectures in RCS.
Abstract: Resistive Random Access Memory (RRAM) and RRAM-based computing systems (RCS) provide energy-efficient technology options for neuromorphic computing. However, the applicability of RCS is limited by reliability problems that arise from the immature fabrication process. In order to take advantage of RCS in practical applications, fault-tolerant design is a key challenge. We present a survey of fault-tolerant designs for RRAM-based neuromorphic computing systems. We first describe RRAM-based crossbars and training architectures in RCS. Following this, we classify fault models into different categories and review post-fabrication testing methods. Subsequently, online testing methods are presented. Finally, we present various fault-tolerant techniques that were designed to tolerate different types of RRAM faults. The methods reviewed in this survey represent recent trends in fault-tolerant design of RCS and are expected to motivate further research in this field.

Proceedings ArticleDOI
Amin Rezaei1, You Li1, Yuanqi Shen1, Shuyu Kong1, Hai Zhou1 
21 Jan 2019
TL;DR: In this article, a new way of cyclic encryption by utilizing unreachable states to defeat CycSAT was proposed, and the attack complexity of the proposed scheme is discussed and its robustness is demonstrated.
Abstract: Logic encryption has attracted much attention due to increasing IC design costs and a growing number of untrusted foundries. Unreachable states in a design provide a space of flexibility for logic encryption to explore. However, because of the available access to the scan chain, traditional combinational encryption cannot leverage the benefit of such flexibility. Cyclic logic encryption inserts key-controlled feedbacks into the original circuit to prevent piracy and overproduction. Based on our discovery, cyclic logic encryption can utilize unreachable states to improve security. Even though cyclic encryption is vulnerable to a powerful attack called CycSAT, we develop a new way of cyclic encryption that utilizes unreachable states to defeat CycSAT. The attack complexity of the proposed scheme is discussed and its robustness is demonstrated.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: In this paper, the authors present an improved methodology for bitstream file format reversing and introduce a novel idea for Trojan insertion, which can be used to infiltrate FPGAs in a non-invasive manner after shipment.
Abstract: The threat of inserting hardware Trojans during the design, production, or in-field poses a danger for integrated circuits in real-world applications. A particular critical case of hardware Trojans is the malicious manipulation of third-party FPGA configurations. In addition to attack vectors during the design process, FPGAs can be infiltrated in a non-invasive manner after shipment through alterations of the bitstream. First, we present an improved methodology for bitstream file format reversing. Second, we introduce a novel idea for Trojan insertion.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper proposes a high performance machine learning-based mask printability evaluation framework for lithography-related applications, and applies it in a conventional mask optimization tool to verify its effectiveness.
Abstract: Continuous shrinking of VLSI technology nodes brings us powerful chips with lower power consumption, but it also introduces many issues in manufacturability. Lithography simulation process for new feature size suffers from large computational overhead. As a result, conventional mask optimization process has been drastically resource consuming in terms of both time and cost. In this paper, we propose a high performance machine learning-based mask printability evaluation framework for lithography-related applications, and apply it in a conventional mask optimization tool to verify its effectiveness.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: For the first time the error between the generated layout and the known drawn GDS will be compared quantitatively as a figure of merit (FOM) and from this layout a circuit graph of an ECC encryption and the partitioning in circuit blocks will be extracted.
Abstract: In view of potential risks of piracy and malicious manipulation of complex integrated circuits built in technologies of 45 nm and less, there is an increasing need for an effective and efficient process of reverse engineering. This paper provides an overview of the current process and details on a new tool for the acquisition and synthesis of large area images and the extraction of a layout. For the first time the error between the generated layout and the known drawn GDS will be compared quantitatively as a figure of merit (FOM). From this layout a circuit graph of an ECC encryption and the partitioning in circuit blocks will be extracted.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This work proposes Dr. CU, an efficient and effective detailed router, to tackle the challenges of a 3D detailed routing grid graph of enormous size; a set of two-level sparse data structures is designed for runtime and memory efficiency.
Abstract: Different from global routing, detailed routing takes care of many detailed design rules and is performed on a significantly larger routing grid graph. In advanced technology nodes, it becomes the most complicated and time-consuming stage. We propose Dr. CU, an efficient and effective detailed router, to tackle the challenges. To handle a 3D detailed routing grid graph of enormous size, a set of two-level sparse data structures is designed for runtime and memory efficiency. For handling the minimum-area constraint, an optimal correct-by-construction path search algorithm is proposed. Besides, an efficient bulk synchronous parallel scheme is adopted to further reduce the runtime. Compared with the first-place winner of the ISPD 2018 Contest, our router improves the routing quality by up to 65% and on average 39%, according to the contest metric. At the same time, it achieves 80--93% memory reduction and a 2.5--15X speed-up.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: The ParaPIM architecture is presented, which transforms current Spin Orbit Torque Magnetic Random Access Memory sub-arrays into massively parallel computational units capable of running inferences for Binary-Weight Deep Neural Networks (BWNNs).
Abstract: Recent algorithmic progress has brought competitive classification accuracy despite constraining neural networks to binary weights (+1/-1). These findings show remarkable optimization opportunities to eliminate the need for computationally-intensive multiplications, reducing memory access and storage. In this paper, we present the ParaPIM architecture, which transforms current Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) sub-arrays into massively parallel computational units capable of running inferences for Binary-Weight Deep Neural Networks (BWNNs). ParaPIM's in-situ computing architecture can be leveraged to greatly reduce the energy consumption of convolutional layers, accelerate BWNN inference, eliminate unnecessary off-chip accesses, and provide ultra-high internal bandwidth. The device-to-architecture co-simulation results indicate ~4X higher energy efficiency and 7.3X speedup over recent processing-in-DRAM acceleration, or roughly 5X higher energy efficiency and 20.5X speedup over recent ASIC approaches, while maintaining inference accuracy comparable to baseline designs.
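The arithmetic simplification BWNNs rely on, namely that a dot product with weights restricted to +1/-1 reduces to additions and subtractions only, is easy to show in isolation. The sketch below just demonstrates that equivalence in NumPy; the SOT-MRAM in-situ bulk bitwise operations the paper maps this onto are not modeled.

```python
import numpy as np

def binary_weight_dot(x, w_sign):
    """Multiplication-free dot product for binary weights w in {+1, -1}:
    add the activations where w = +1 and subtract them where w = -1."""
    assert set(np.unique(w_sign)) <= {-1, +1}
    return x[w_sign == 1].sum() - x[w_sign == -1].sum()

rng = np.random.default_rng(0)
x = rng.normal(size=16).astype(np.float32)          # activations (full precision)
w = rng.choice([-1, 1], size=16).astype(np.int8)    # binarized weights

# Same result as an ordinary dot product, without any multiplications.
assert np.isclose(binary_weight_dot(x, w), np.dot(x, w.astype(np.float32)))
print(binary_weight_dot(x, w))
```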

Proceedings ArticleDOI
21 Jan 2019
TL;DR: The proposed PUFs-and-PIM based Protection method for neural Models (P3M), can utilize unstable PUFs to protect the neural models in edge deep learning accelerators with negligible performance overhead.
Abstract: This work is oriented at the edge computing scenario that terminal deep learning accelerators use pre-trained neural network models distributed from third-party providers (e.g. from data center clouds) to process the private data instead of sending it to the cloud. In this scenario, the network model is exposed to the risk of being attacked in the unverified devices if the parameters and hyper-parameters are transmitted and processed in an unencrypted way. Our work tackles this security problem by using on-chip memory Physical Unclonable Functions (PUFs) and Processing-In-Memory (PIM). We allow the model execution only on authorized devices and protect the model from white-box attacks, black-box attacks and model tampering attacks. The proposed PUFs-and-PIM based Protection method for neural Models (P3M), can utilize unstable PUFs to protect the neural models in edge deep learning accelerators with negligible performance overhead. The experimental results show considerable performance improvement over two state-of-the-art solutions we evaluated.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: The impacts of the limited reliability of memristor devices are reviewed and the recent research progress in building reliable and efficient Memristor-based NCS development is summarized.
Abstract: Neuromorphic computing is a revolutionary approach of computation, which attempts to mimic the human brain's mechanism for extremely high implementation efficiency and intelligence. Latest research studies showed that the memristor technology has a great potential for realizing power- and area-efficient neuromorphic computing systems (NCS). On the other hand, the memristor device processing is still under development. Unreliable devices can severely degrade system performance, which arises as one of the major challenges in developing memristor-based NCS. In this paper, we first review the impacts of the limited reliability of memristor devices and summarize the recent research progress in building reliable and efficient memristor-based NCS. In the end, we discuss the main difficulties and the trend in memristor-based NCS development.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This work presents a design method which, for the first time, allows for the scalable design of FCN circuits that satisfy the dedicated constraints of these technologies.
Abstract: Field-coupled Nanocomputing (FCN) technologies are considered a solution to overcome the physical boundaries of conventional CMOS approaches. But despite groundbreaking advances regarding their physical implementation, e.g. as Quantum-dot Cellular Automata (QCA), Nanomagnet Logic (NML), and many more, there is an unsettling lack of methods for large-scale design automation of FCN circuits. In fact, design automation for this class of technologies is still in its infancy, relying heavily either on manual labor or on automatic methods that are applicable to rather small functionality only. This work presents a design method which, for the first time, allows for the scalable design of FCN circuits that satisfy the dedicated constraints of these technologies. The proposed scheme is capable of handling around 40000 gates within seconds, while the current state of the art takes hours to handle around 20 gates. This is confirmed by experimental results at the layout level for various established benchmark libraries.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: Re-Mining, a ReRAM-based processing-in-memory architecture for Blockchain mining, is presented; simulation results show that it significantly outperforms CPU-based and GPU-based implementations.
Abstract: Blockchain's decentralized consensus mechanism has attracted many applications, such as IoT devices. A Blockchain maintains a linked list of blocks and grows by mining new blocks. However, Blockchain mining consumes huge computation resources and energy, which is unacceptable for resource-limited embedded devices. This paper, for the first time, presents a ReRAM-based processing-in-memory architecture for Blockchain mining, called Re-Mining. Re-Mining includes a message schedule module and a SHA computation module. The modules are composed of several basic ReRAM-based logic operation units, such as ROR, RSF and XOR. Re-Mining further designs intra-transaction and inter-transaction parallel mechanisms to accelerate Blockchain mining. Simulation results show that the proposed Re-Mining architecture significantly outperforms CPU-based and GPU-based implementations.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: A semi-supervised hotspot detection approach with a self-paced multi-task learning paradigm is proposed, leveraging data samples both with and without labels to improve model accuracy and generality.
Abstract: Lithography simulation is computationally expensive for hotspot detection. Machine learning based hotspot detection is a promising technique to reduce the simulation overhead. However, most learning approaches rely on a large amount of training data to achieve good accuracy and generality. At the early stage of developing a new technology node, the amount of data with labeled hotspots or non-hotspots is very limited. In this paper, we propose a semi-supervised hotspot detection approach with a self-paced multi-task learning paradigm, leveraging data samples both with and without labels to improve model accuracy and generality. Experimental results demonstrate that our approach can achieve 2.9--4.5% better accuracy at the same false alarm levels than the state-of-the-art work while using 10%-50% of the training data. The source code and trained models are released at https://github.com/qwepi/SSL.
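The semi-supervised ingredient can be illustrated with a plain self-training loop: train on the small labeled set, pseudo-label the most confident unlabeled samples, and fold them in gradually, admitting "easier" samples first in the spirit of self-paced learning. This generic sketch is not the paper's multi-task formulation; the classifier, thresholds, and toy data are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_paced_self_training(X_lab, y_lab, X_unlab, rounds=5, start_conf=0.95, decay=0.02):
    """Iteratively add confidently pseudo-labeled samples, relaxing the confidence
    threshold each round so progressively 'harder' samples are admitted."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    remaining = X_unlab.copy()
    clf = LogisticRegression(max_iter=1000)
    for r in range(rounds):
        clf.fit(X_train, y_train)
        if len(remaining) == 0:
            break
        proba = clf.predict_proba(remaining)
        conf = proba.max(axis=1)
        pick = conf >= max(0.5, start_conf - r * decay)   # self-paced: threshold relaxes over rounds
        if not pick.any():
            continue
        pseudo = clf.classes_[proba[pick].argmax(axis=1)]
        X_train = np.vstack([X_train, remaining[pick]])
        y_train = np.concatenate([y_train, pseudo])
        remaining = remaining[~pick]
    return clf

# Toy data standing in for layout feature vectors (1 = hotspot, 0 = non-hotspot).
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(40, 8))
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unlab = rng.normal(size=(400, 8))
model = self_paced_self_training(X_lab, y_lab, X_unlab)
print(model.score(X_lab, y_lab))
```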