
Showing papers presented at the Asia and South Pacific Design Automation Conference in 2019


Proceedings ArticleDOI
21 Jan 2019
TL;DR: This work analyzes the bottlenecks of existing compilers, provides a dedicated method for compiling circuits of this kind to IBM QX architectures, and shows that the proposed approach significantly outperforms IBM's own solution in terms of both fidelity of the compiled circuit and runtime.
Abstract: The Noisy Intermediate-Scale Quantum (NISQ) technology is currently investigated by major players in the field to build the first practically useful quantum computer. IBM QX architectures are the first ones that are already publicly available today. However, in order to use them, the respective quantum circuits have to be compiled for the target architecture in use. While first approaches have been proposed for this purpose, they are infeasible for a certain set of SU(4) quantum circuits which have recently been introduced to benchmark such compilers. In this work, we analyze the bottlenecks of existing compilers and provide a dedicated method for compiling circuits of this kind to IBM QX architectures. Our experimental evaluation (using tools provided by IBM) shows that the proposed approach significantly outperforms IBM's own solution in terms of both fidelity of the compiled circuit and runtime. Moreover, the solution proposed in this work was declared the winner of the IBM QISKit Developer Challenge. An implementation of the proposed methodology is publicly available at http://iic.jku.at/eda/research/ibm_qx_mapping.

73 citations
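The compilation problem the abstract refers to is mapping logical qubits onto physical qubits so that every two-qubit gate acts on a connected pair of the device's coupling graph. The sketch below is not the authors' algorithm; it only illustrates the baseline idea of inserting SWAPs along a shortest path so a CNOT becomes executable. The coupling map, layout, and gate are made-up examples.

```python
# Illustrative sketch (not the paper's method): make a two-qubit gate executable
# on a device coupling graph by inserting SWAPs along a shortest path.
from collections import deque

def shortest_path(coupling, src, dst):
    """BFS shortest path between two physical qubits in an undirected coupling graph."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        q = queue.popleft()
        if q == dst:
            break
        for nb in coupling[q]:
            if nb not in prev:
                prev[nb] = q
                queue.append(nb)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def route_cnot(coupling, layout, ctrl, tgt):
    """Insert SWAPs so logical qubits ctrl/tgt become adjacent, then emit the CNOT.
    `layout` maps logical -> physical qubits and is updated in place."""
    ops = []
    path = shortest_path(coupling, layout[ctrl], layout[tgt])
    inverse = {phys: log for log, phys in layout.items()}
    for a, b in zip(path, path[1:-1]):          # move ctrl one hop at a time towards tgt
        ops.append(("SWAP", a, b))
        la, lb = inverse[a], inverse[b]
        layout[la], layout[lb] = layout[lb], layout[la]
        inverse[a], inverse[b] = lb, la
    ops.append(("CNOT", layout[ctrl], layout[tgt]))
    return ops

# Hypothetical 5-qubit line topology (0-1-2-3-4) and an initial identity layout.
coupling = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
layout = {i: i for i in range(5)}
print(route_cnot(coupling, layout, ctrl=0, tgt=3))
```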


Proceedings ArticleDOI
21 Jan 2019
TL;DR: A sparse NN mapping scheme based on element clustering for better ReRAM crossbar utilization and a crossbar-grained pruning algorithm that removes crossbars with low utilization are proposed.
Abstract: With its in-memory processing ability, ReRAM-based computing is becoming more and more attractive for accelerating neural networks (NNs). However, most ReRAM-based accelerators cannot support efficient mapping for sparse NNs: the whole dense matrix has to be mapped onto the ReRAM crossbar array to achieve O(1) computation complexity. In this paper, we propose a sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization. Further, we propose a crossbar-grained pruning algorithm to remove crossbars with low utilization. Finally, since most current ReRAM devices cannot achieve high precision, we analyze the effect of quantization precision for sparse NNs, propose to compose high precision in the analog domain, and design the related peripheral circuits. In our experiments, we discuss how the system performs with different crossbar sizes in order to choose an optimized design. Our results show that our mapping scheme for sparse NNs with the proposed pruning algorithm achieves 3--5X energy efficiency and a 2.5--6X speedup compared with accelerators for dense NNs. The accuracy experiments also show that our pruning method incurs almost no accuracy loss.

64 citations
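The crossbar-grained pruning idea can be illustrated with a small NumPy sketch: tile the (already sparse) weight matrix into crossbar-sized blocks, measure each block's utilization (fraction of nonzero cells), and drop blocks below a threshold. This is only a schematic of the concept; the paper's clustering-based mapping and its exact utilization criteria are more involved, and the matrix, tile size, and threshold here are made up.

```python
import numpy as np

def crossbar_grained_prune(weights, xbar=4, min_util=0.25):
    """Tile `weights` into xbar x xbar blocks and zero out blocks whose fraction of
    nonzero cells is below `min_util` (they would waste a whole crossbar)."""
    pruned = weights.copy()
    kept = []
    rows, cols = weights.shape
    for r in range(0, rows, xbar):
        for c in range(0, cols, xbar):
            block = pruned[r:r + xbar, c:c + xbar]      # view into `pruned`
            util = np.count_nonzero(block) / block.size
            if util < min_util:
                block[:] = 0.0                          # prune the whole crossbar tile
            else:
                kept.append((r, c, util))
    return pruned, kept

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)) * (rng.random((8, 8)) < 0.2)   # ~80% sparse toy matrix
pruned_w, kept_tiles = crossbar_grained_prune(w)
print(f"crossbars kept: {len(kept_tiles)} of {(8 // 4) * (8 // 4)}")
```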


Proceedings ArticleDOI
Yuanqi Shen1, You Li1, Amin Rezaei1, Shuyu Kong1, David Dlott1, Hai Zhou1 
21 Jan 2019
TL;DR: This paper proposes a behavioral SAT-based attack called BeSAT, which observes the behavior of the encrypted circuit on top of the structural analysis, so the stateful and oscillatory keys missed by CycSAT can still be blocked.
Abstract: Cyclic logic encryption is newly proposed in the area of hardware security. It introduces feedback cycles into the circuit to defeat existing logic decryption techniques. To ensure that the circuit is acyclic under the correct key, CycSAT was developed to add the acyclic condition as a CNF formula to the SAT-based attack. However, we found that it is impossible to capture all cycles in any graph with any set of feedback signals, as is done in the CycSAT algorithm. In this paper, we propose a behavioral SAT-based attack called BeSAT. BeSAT observes the behavior of the encrypted circuit on top of the structural analysis, so the stateful and oscillatory keys missed by CycSAT can still be blocked. The experimental results show that BeSAT successfully overcomes the drawback of CycSAT.

47 citations


Proceedings ArticleDOI
21 Jan 2019
TL;DR: XPPE, a neural network based cross-platform performance estimator, utilizes the resource utilization of an application on a specific FPGA to estimate its performance on other FPGAs, enabling developers to explore the design space without having to fully implement and map the application.
Abstract: The increasing heterogeneity of the applications to be processed has ended the reign of ASICs as the single most efficient processing platform. Hybrid processing platforms such as CPU+FPGA are emerging as powerful platforms that support efficient processing for a diverse range of applications. Hardware/software co-design enables designers to take advantage of these new hybrid platforms, such as Zynq. However, dividing an application into two parts, one running on the CPU and the other converted into a hardware accelerator implemented on the FPGA, makes platform selection difficult for developers, as an application's performance varies significantly across platforms. Developers are required to fully implement the design on each platform just to estimate its performance. This process is tedious when the number of available platforms is large. To address this challenge, we propose XPPE, a neural network based cross-platform performance estimator. XPPE utilizes the resource utilization of an application on a specific FPGA to estimate its performance on other FPGAs. The proposed estimation is performed for a wide range of applications and evaluated against a vast set of platforms. Moreover, XPPE enables developers to explore the design space without having to fully implement and map the application. Our evaluation results show that the correlation between the speedup estimated by XPPE and the actual speedup of applications on a hybrid platform over an ARM processor is more than 0.98.

43 citations
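XPPE's core idea, learning a regressor from resource-utilization features on one FPGA to the speedup observed on another, can be sketched with a generic scikit-learn pipeline. This is a stand-in illustration, not the paper's network or feature set; the feature names and synthetic data below are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical features: [LUT%, FF%, BRAM%, DSP%, clock_MHz] measured on a source FPGA;
# target label: measured speedup of the same kernel on a different target FPGA.
rng = np.random.default_rng(1)
X = rng.random((200, 5))
y = 3.0 * X[:, 3] + 1.5 * X[:, 0] + 0.2 * rng.normal(size=200)   # synthetic ground truth

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0))
model.fit(X[:150], y[:150])
pred = model.predict(X[150:])
corr = np.corrcoef(pred, y[150:])[0, 1]   # the paper reports correlation > 0.98 for its model
print(f"correlation on held-out kernels: {corr:.2f}")
```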


Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper proposes ScanSAT: an attack that transforms a scan obfuscated circuit to its logic-locked version and applies a variant of the Boolean satisfiability (SAT) based attack, thereby extracting the secret key.
Abstract: While financially advantageous, outsourcing key steps such as testing to potentially untrusted Outsourced Semiconductor Assembly and Test (OSAT) companies may pose a risk of compromising on-chip assets. Obfuscation of scan chains is a technique that hides the actual scan data from untrusted testers; logic inserted between the scan cells, driven by a secret key, hides the transformation functions between the scan-in stimulus (scan-out response) and the delivered scan pattern (captured response). In this paper, we propose ScanSAT: an attack that transforms a scan-obfuscated circuit into its logic-locked version and applies a variant of the Boolean satisfiability (SAT) based attack, thereby extracting the secret key. Our empirical results demonstrate that ScanSAT can easily break naive scan obfuscation techniques using only three or fewer attack iterations, even for large key sizes and in the presence of scan compression.

43 citations


Proceedings ArticleDOI
Toshinari Itoko1, Rudy Raymond1, Takashi Imamichi1, Atsushi Matsuo1, Andrew W. Cross1 
21 Jan 2019
TL;DR: This work proposes a formulation and two algorithms exploiting gate commutation rules to obtain a better circuit compiler for NISQCs.
Abstract: The use of noisy intermediate-scale quantum computers (NISQCs), which consist of dozens of noisy qubits with limited coupling constraints, has been increasing. A circuit compiler, which transforms an input circuit into an equivalent output circuit conforming to the coupling constraints with as few additional gates as possible, is essential for running applications on NISQCs. We propose a formulation and two algorithms exploiting gate commutation rules to obtain a better circuit compiler.

40 citations


Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper proposes an Anomaly Detection based Power Saving (ADEPOS) scheme that uses approximate computing throughout the lifetime of the machine, and shows on the NASA bearing dataset that ADEPOS needs 8.8X fewer neurons on average; based on post-layout results, the resultant energy savings are 6.4--6.65X.
Abstract: In Industry 4.0, predictive maintenance (PdM) is one of the most important applications pertaining to the Internet of Things (IoT). Machine learning is used to predict the possible failure of a machine before the actual event occurs. However, the main challenges in PdM are: (a) lack of enough data from failing machines, and (b) paucity of power and bandwidth to transmit sensor data to the cloud throughout the lifetime of the machine. Alternatively, edge computing approaches reduce data transmission and consume low energy. In this paper, we propose the Anomaly Detection based Power Saving (ADEPOS) scheme, which uses approximate computing throughout the lifetime of the machine. At the beginning of the machine's life, while the machine is healthy, low-accuracy computations are used. However, as anomalies are detected over time, the system is switched to higher-accuracy modes. We show using the NASA bearing dataset that with ADEPOS we need 8.8X fewer neurons on average, and based on post-layout results, the resultant energy savings are 6.4--6.65X.

38 citations
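The energy-saving mechanism described above, running a cheap low-accuracy detector while the machine looks healthy and escalating to more accurate (larger) models only once anomalies start appearing, can be caricatured as a simple mode controller. The thresholds, model sizes, and toy detectors below are placeholders, not the paper's actual design.

```python
class AdaptiveAnomalyMonitor:
    """Sketch of an accuracy-on-demand scheme: run the smallest detector while the
    machine looks healthy, escalate to larger detectors once anomalies are suspected."""

    def __init__(self, detectors, escalation_threshold=3):
        # `detectors` is ordered from cheapest/least accurate to costliest/most accurate.
        self.detectors = detectors
        self.level = 0
        self.suspicion = 0
        self.escalation_threshold = escalation_threshold

    def step(self, sample):
        is_anomaly = self.detectors[self.level](sample)
        if is_anomaly:
            self.suspicion += 1
            if (self.suspicion >= self.escalation_threshold
                    and self.level < len(self.detectors) - 1):
                self.level += 1        # switch to a higher-accuracy (more neurons) mode
                self.suspicion = 0
        else:
            self.suspicion = max(0, self.suspicion - 1)
        return is_anomaly, self.level


# Toy detectors: thresholds on a scalar "vibration energy"; tighter threshold = more accurate.
cheap  = lambda x: x > 2.0
costly = lambda x: x > 1.2
monitor = AdaptiveAnomalyMonitor([cheap, costly])
for reading in [0.5, 0.6, 2.5, 2.6, 2.7, 1.3]:
    print(monitor.step(reading))
```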


Proceedings ArticleDOI
21 Jan 2019
TL;DR: GraphSAR is presented, a sparsity-aware processing-in-memory large-scale graph processing accelerator on ReRAMs that achieves 4.43X energy reduction and 1.85X speedup over a previous graph processing architecture on ReRAMs.
Abstract: Large-scale graph processing has drawn great attention in recent years. The emerging metal-oxide resistive random access memory (ReRAM) and ReRAM crossbars have shown huge potential for accelerating graph processing. However, the sparsity of natural graphs hinders the performance of graph processing on ReRAMs. Previous work on graph processing with ReRAMs stored and computed edges separately, leading to high energy consumption and long data-transfer latency. In this paper, we present GraphSAR, a sparsity-aware processing-in-memory large-scale graph processing accelerator on ReRAMs. Computations over edges are performed in the memory, eliminating the overhead of transferring edges. Moreover, graphs are partitioned with sparsity in mind: subgraphs with low density are further divided into smaller ones to minimize the waste of memory space. According to our extensive experimental results, GraphSAR achieves 4.43X energy reduction and 1.85X speedup (8.19X lower energy-delay product, EDP) against a previous graph processing architecture on ReRAMs (GraphR [1]).

35 citations


Proceedings ArticleDOI
21 Jan 2019
TL;DR: An FPGA-based acceleration of HD (FACH) is proposed which significantly improves computation efficiency by removing the majority of multiplications during the reasoning task; FACH provides 5.9X energy efficiency improvement and 5.1X speedup compared to a baseline FPGA-based implementation, while ensuring the same quality of classification.
Abstract: Brain-inspired hyperdimensional (HD) computing explores computing with hypervectors for the emulation of cognition, as an alternative to computing with numbers. In HD, input symbols are mapped to hypervectors and an associative search is performed for reasoning and classification. An associative memory, which finds the closest match between a set of learned hypervectors and a query hypervector, uses the simple Hamming distance metric for similarity checks. However, we observe that, in order to provide acceptable classification accuracy, HD needs to store a non-binarized model in associative memory and use costly similarity metrics such as cosine to perform the reasoning task. This makes HD computationally expensive when it is used for realistic classification problems. In this paper, we propose an FPGA-based acceleration of HD (FACH) which significantly improves computation efficiency by removing the majority of multiplications during the reasoning task. FACH identifies representative values in each class hypervector using a clustering algorithm. It then creates a new HD model with hardware-friendly operations, and we accordingly propose an FPGA-based implementation to accelerate such tasks. Our evaluations on several classification problems show that FACH provides 5.9X energy efficiency improvement and 5.1X speedup compared to a baseline FPGA-based implementation, while ensuring the same quality of classification.

34 citations
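The associative-memory lookup at the heart of HD classification can be sketched in a few lines: encode inputs as binary hypervectors, build one class hypervector per class by majority vote, and classify a query by Hamming distance. The encoding and dimensions below are toy choices; FACH's clustering of class-hypervector values and its FPGA datapath are not shown.

```python
import numpy as np

D = 2048                                        # hypervector dimensionality (toy choice)
rng = np.random.default_rng(0)

def random_hv():
    return rng.integers(0, 2, D, dtype=np.uint8)

def bundle(hvs):
    """Majority vote across hypervectors -> a single binary class hypervector."""
    return (np.sum(hvs, axis=0) * 2 > len(hvs)).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

def noisy(hv, flips=200):
    """Flip a few random bits to simulate noisy encodings of the same symbol."""
    idx = rng.choice(D, flips, replace=False)
    out = hv.copy()
    out[idx] ^= 1
    return out

# Toy training: two classes, each learned from a few noisy copies of a prototype.
protos = {c: random_hv() for c in ("healthy", "faulty")}
class_hvs = {c: bundle([noisy(p) for _ in range(5)]) for c, p in protos.items()}

# Associative search: the class whose hypervector is closest in Hamming distance wins.
query = noisy(protos["faulty"])
pred = min(class_hvs, key=lambda c: hamming(class_hvs[c], query))
print("predicted:", pred)
```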


Proceedings ArticleDOI
21 Jan 2019
TL;DR: The proposed solution, GRAM, efficiently executes the vertex-centric model, which is widely used in large-scale parallel graph processing programs, in computational memory, maximizing computation parallelism while minimizing the number of data movements.
Abstract: The performance of graph processing for real-world graphs is limited by inefficient memory behaviour in traditional systems because of random memory access patterns. Offloading computations to the memory is a promising strategy to overcome such challenges. In this paper, we exploit resistive memory (ReRAM) based processing-in-memory (PIM) technology to accelerate graph applications. The proposed solution, GRAM, efficiently executes the vertex-centric model, which is widely used in large-scale parallel graph processing programs, in the computational memory. The hardware-software co-design used in GRAM maximizes computation parallelism while minimizing the number of data movements. Based on our experiments with three important graph kernels on seven real-world graphs, GRAM provides 122.5X and 11.1X speedups compared with an in-memory graph system and optimized multithreaded algorithms running on a multi-core CPU, respectively. Compared to a GPU-based graph acceleration library and a recently proposed PIM accelerator, GRAM improves performance by 7.1X and 3.8X, respectively.

33 citations


Proceedings ArticleDOI
21 Jan 2019
TL;DR: The experimental results demonstrate that the proposed framework is capable of recovering 99% of the accuracy loss introduced by stuck-at faults without requiring the neural network to be retrained.
Abstract: Matrix-vector multiplication is the dominating computational workload in the inference phase of neural networks. Memristor crossbar arrays (MCAs) can inherently execute matrix-vector multiplication with low latency and small power consumption. A key challenge is that the classification accuracy may be severely degraded by stuck-at-fault defects. Earlier studies have shown that the accuracy loss can be recovered by retraining each neural network or by utilizing additional hardware. In this paper, we propose to handle stuck-at faults using matrix transformations. A transformation T changes a weight matrix W into a weight matrix Ŵ = T(W) that is more robust to stuck-at faults. In particular, we propose a row flipping transformation, a permutation transformation, and a value range transformation. The row flipping transformation translates stuck-off (stuck-on) faults into stuck-on (stuck-off) faults. The permutation transformation maps small (large) weights to memristors stuck off (stuck on). The value range transformation reduces the magnitude of the smallest and largest elements in the matrix, so that each stuck-at fault introduces an error of smaller magnitude. The experimental results demonstrate that the proposed framework is capable of recovering 99% of the accuracy loss introduced by stuck-at faults without requiring the neural network to be retrained.
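One of the transformations described, permuting rows so that small-magnitude weights land on stuck-off cells and large ones on stuck-on cells, can be illustrated with a greedy NumPy sketch. This is a simplified illustration under assumed fault maps, not the authors' actual framework (which also includes the row flipping and value-range transformations).

```python
import numpy as np

def permute_rows_for_faults(W, stuck_off, stuck_on):
    """Greedy sketch: choose, for each physical crossbar row with known stuck cells,
    the remaining logical row of W whose weights incur the least error when those
    cells are forced to 0 (stuck off) or to the maximum conductance (stuck on)."""
    w_max = np.abs(W).max()

    def row_cost(row, cols_off, cols_on):
        off_err = np.abs(row[cols_off]).sum()           # value lost at stuck-off cells
        on_err = (w_max - np.abs(row[cols_on])).sum()   # value overstated at stuck-on cells
        return off_err + on_err

    remaining = list(range(W.shape[0]))
    order = []
    for r in range(W.shape[0]):
        cols_off = [c for (rr, c) in stuck_off if rr == r]
        cols_on = [c for (rr, c) in stuck_on if rr == r]
        best = min(remaining, key=lambda i: row_cost(W[i], cols_off, cols_on))
        order.append(best)
        remaining.remove(best)
    return np.array(order)      # W[order] is the row permutation mapped onto the faulty crossbar

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
order = permute_rows_for_faults(W, stuck_off=[(0, 1)], stuck_on=[(2, 3)])
print(order, "\n", W[order])
```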

Proceedings ArticleDOI
Wei Ye1, Yibo Lin1, Meng Li1, Qiang Liu1, David Z. Pan1 
21 Jan 2019
TL;DR: This work proposes the use of the area under the ROC curve (AUC), which provides a more holistic measure for imbalanced datasets compared with the previous methods, and proposes the surrogate loss functions for direct AUC maximization as a substitute for the conventional cross-entropy loss.
Abstract: As modern integrated circuits scale up with escalating complexity of layout design patterns, lithography hotspot detection, a key stage of physical verification to ensure layout finishing and design closure, has raised a higher demand on its efficiency and accuracy. Among all the hotspot detection approaches, machine learning distinguishes itself for achieving high accuracy while maintaining low false alarms. However, due to the class imbalance problem, the conventional practice which uses the accuracy and false alarm metrics to evaluate different machine learning models is becoming less effective. In this work, we propose the use of the area under the ROC curve (AUC), which provides a more holistic measure for imbalanced datasets compared with the previous methods. To systematically handle class imbalance, we further propose the surrogate loss functions for direct AUC maximization as a substitute for the conventional cross-entropy loss. Experimental results demonstrate that the new surrogate loss functions are promising to outperform the cross-entropy loss when applied to the state-of-the-art neural network model for hotspot detection.
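A common family of surrogate losses for direct AUC maximization is pairwise: penalize hotspot/non-hotspot score pairs that are ranked in the wrong order. The PyTorch snippet below shows one such surrogate (a squared pairwise hinge); it is an illustrative choice, not necessarily the exact loss formulated in the paper.

```python
import torch

def pairwise_auc_surrogate(scores, labels, margin=1.0):
    """Squared-hinge surrogate for AUC: for every (positive, negative) pair, penalize
    cases where the hotspot score does not exceed the non-hotspot score by `margin`."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)      # all positive-negative score differences
    return torch.clamp(margin - diff, min=0).pow(2).mean()

# Toy usage: scores from any hotspot-detection network, 1 = hotspot, 0 = non-hotspot.
scores = torch.tensor([2.1, 0.3, 1.5, -0.2], requires_grad=True)
labels = torch.tensor([1, 0, 1, 0])
loss = pairwise_auc_surrogate(scores, labels)
loss.backward()                                     # drop-in replacement for cross-entropy
print(float(loss))
```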

Proceedings ArticleDOI
21 Jan 2019
TL;DR: A novel hybrid parallel structure for accelerating the NMT with affordable resource overhead for the targeted FPGA is proposed and demonstrated on a Xilinx VCU118 with overall performance at 7.16 GFLOPS.
Abstract: Neural machine translation (NMT) is a popular topic in Natural Language Processing which uses deep neural networks (DNNs) for translation from source to target languages. With emerging technologies, such as bidirectional Gated Recurrent Units (GRU), attention mechanisms, and beam-search algorithms, NMT can deliver improved translation quality compared to conventional statistics-based methods, especially for translating long sentences. However, higher translation quality means more complicated models, higher computation/memory demands, and longer translation time, which causes difficulties for practical use. In this paper, we propose a design methodology for implementing the inference of a real-life NMT (with a problem size of 172 GFLOP) on FPGA for improved run-time latency and energy efficiency. We use High-Level Synthesis (HLS) to build high-performance parameterized IPs for handling the most basic operations (multiply-accumulations) and construct these IPs to accelerate the matrix-vector multiplication (MVM) kernels, which are frequently used in NMT. Also, we perform a design space exploration by considering both computation resources and memory access bandwidth when utilizing the hardware parallelism in the model, and generate the best parameter configurations of the proposed IPs. Accordingly, we propose a novel hybrid parallel structure for accelerating the NMT with affordable resource overhead for the targeted FPGA. Our design is demonstrated on a Xilinx VCU118 with an overall performance of 7.16 GFLOPS.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This work designs a novel slimmed architecture for realizing optical neural network considering both its software and hardware implementations and shows a more area-efficient architecture which uses a sparse tree network block, a single unitary block and a diagonal block for each neural network layer.
Abstract: Optical neural network (ONN) is a neuromorphic computing hardware based on optical components. Since its first on-chip experimental demonstration, it has attracted more and more research interests due to the advantages of ultra-high speed inference with low power consumption. In this work, we design a novel slimmed architecture for realizing optical neural network considering both its software and hardware implementations. Different from the originally proposed ONN architecture based on singular value decomposition which results in two implementation-expensive unitary matrices, we show a more area-efficient architecture which uses a sparse tree network block, a single unitary block and a diagonal block for each neural network layer. In the experiments, we demonstrate that by leveraging the training engine, we are able to find a comparable accuracy to that of the previous architecture, which brings about the flexibility of using the slimmed implementation. The area cost in terms of the Mach-Zehnder interferometers, the core optical components of ONN, is 15%-38% less for various sizes of optical neural networks.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization, which completely removes the energy-hungry step of accessing memory for obtaining model parameters.
Abstract: Deep neural networks have been successfully deployed in a wide variety of applications including computer vision and speech recognition. To cope with computational and storage complexity of these models, this paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization. The aforementioned realization completely removes the energy-hungry step of accessing memory for obtaining model parameters, consumes about two orders of magnitude fewer computing resources compared to realizations that use floating-point operations, and has a substantially lower latency.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: Experimental results demonstrate that the proposed supervised online dictionary learning algorithm for simultaneous feature extraction and dimensionality reduction not only boosts mask optimization quality in terms of edge placement error (EPE) and process variation (PV) band area, but also achieves some speed-up.
Abstract: In modern VLSI design flow, sub-resolution assist feature (SRAF) insertion is one of the resolution enhancement techniques (RETs) to improve chip manufacturing yield. With aggressive feature size continuously scaling down, layout feature learning becomes extremely critical. In this paper, for the first time, we enhance conventional manual feature construction, by proposing a supervised online dictionary learning algorithm for simultaneous feature extraction and dimensionality reduction. By taking advantage of label information, the proposed dictionary learning engine can discriminatively and accurately represent the input data. We further consider SRAF design rules in a global view, and design an integer linear programming model in the post-processing stage of SRAF insertion framework. Experimental results demonstrate that, compared with a state-of-the-art SRAF insertion tool, our framework not only boosts the mask optimization quality in terms of edge placement error (EPE) and process variation (PV) band area, but also achieves some speed-up.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: An adaptive squish representation for multilayer layouts is proposed, which is storage-efficient, lossless, and compatible with deep neural networks, and can achieve satisfactory hotspot detection accuracy when paired with a medium-sized convolutional neural network.
Abstract: Layout hotspot detection is one of the critical steps in the modern integrated circuit design flow. It aims to find potential weak points in layouts before feeding them into the manufacturing stage. The rapid development of machine learning has made it a preferable alternative to traditional hotspot detection solutions. Recent research ranges from layout feature extraction to learning model design. However, only single-layer layout hotspots are considered in state-of-the-art hotspot detectors, and certain defects such as metal-to-via failures are not naturally supported. In this paper, we propose an adaptive squish representation for multilayer layouts, which is storage-efficient, lossless, and compatible with deep neural networks. We conduct experiments on 14nm industrial designs with a metal layer and its two adjacent via layers that contain metal-to-via hotspots. Results show that the adaptive squish representation can achieve satisfactory hotspot detection accuracy when paired with a medium-sized convolutional neural network.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper proposes an approximate divider design that facilitates dynamic energy-quality scaling and makes an approximation to the reciprocal of the divisor in an incremental manner, thus the division speed and energy efficiency can be dynamically traded for accuracy by controlling the number of iterations.
Abstract: Approximate computing can significantly improve the energy efficiency of arithmetic operations in error-resilient applications. In this paper, we propose an approximate divider design that facilitates dynamic energy-quality scaling. Conventional approximate dividers lack runtime energy-quality scalability, which is the key to maximizing energy efficiency while meeting dynamically varying accuracy requirements. Our divider design, named SAADI, approximates the reciprocal of the divisor in an incremental manner, so division speed and energy efficiency can be dynamically traded for accuracy by controlling the number of iterations. For approximate 8-bit division within 32-bit/16-bit division, the average accuracy of SAADI can be adjusted between 92.5% and 99.0% by varying the latency by up to 7X. We evaluate the accuracy and energy consumption of SAADI for various design parameters and demonstrate its efficacy for low-power signal processing applications.
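The "incremental reciprocal" idea, spending more iterations to refine an approximation of 1/divisor and trading latency for accuracy, can be illustrated with a simple bit-by-bit refinement. This is only a software caricature of dynamic energy-quality scaling; it is not SAADI's hardware algorithm, and the normalization and bit widths are arbitrary.

```python
def approx_divide(numerator, divisor, iterations):
    """Approximate numerator/divisor by building up the reciprocal of the divisor one
    binary digit at a time; more iterations -> more accurate (and costlier) result.
    Assumes a positive integer divisor."""
    assert divisor >= 1
    # Normalize the divisor into [1, 2) so its reciprocal lies in (0.5, 1].
    shift = 0
    d = float(divisor)
    while d >= 2.0:
        d /= 2.0
        shift += 1
    recip = 0.0
    step = 0.5
    for _ in range(iterations):                 # greedy binary expansion of 1/d
        if (recip + step) * d <= 1.0:
            recip += step
        step /= 2.0
    return numerator * recip / (2 ** shift)

for iters in (2, 4, 8, 16):
    q = approx_divide(1000, 37, iters)
    print(f"{iters:2d} iterations -> {q:.3f}  (exact {1000 / 37:.3f})")
```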

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This survey presents various fault-tolerant techniques that were designed to tolerate different types of RRAM faults, and describes RRAM-based crossbars and training architectures in RCS.
Abstract: Resistive Random Access Memory (RRAM) and RRAM-based computing systems (RCS) provide energy-efficient technology options for neuromorphic computing. However, the applicability of RCS is limited by reliability problems that arise from the immature fabrication process. In order to take advantage of RCS in practical applications, fault-tolerant design is a key challenge. We present a survey of fault-tolerant designs for RRAM-based neuromorphic computing systems. We first describe RRAM-based crossbars and training architectures in RCS. Following this, we classify fault models into different categories and review post-fabrication testing methods. Subsequently, online testing methods are presented. Finally, we present various fault-tolerant techniques that were designed to tolerate different types of RRAM faults. The methods reviewed in this survey represent recent trends in fault-tolerant design of RCS and are expected to motivate further research in this field.

Proceedings ArticleDOI
Amin Rezaei1, You Li1, Yuanqi Shen1, Shuyu Kong1, Hai Zhou1 
21 Jan 2019
TL;DR: In this article, a new way of cyclic encryption by utilizing unreachable states to defeat CycSAT was proposed, and the attack complexity of the proposed scheme is discussed and its robustness is demonstrated.
Abstract: Logic encryption has attracted much attention due to increasing IC design costs and a growing number of untrusted foundries. Unreachable states in a design provide a space of flexibility for logic encryption to explore. However, because of the available access to the scan chain, traditional combinational encryption cannot leverage the benefit of such flexibility. Cyclic logic encryption inserts key-controlled feedbacks into the original circuit to prevent piracy and overproduction. Based on our discovery, cyclic logic encryption can utilize unreachable states to improve security. Even though cyclic encryption is vulnerable to a powerful attack called CycSAT, we develop a new way of cyclic encryption that utilizes unreachable states to defeat CycSAT. The attack complexity of the proposed scheme is discussed and its robustness is demonstrated.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: In this paper, the authors present an improved methodology for bitstream file format reversing and introduce a novel idea for Trojan insertion, which can be used to infiltrate FPGAs in a non-invasive manner after shipment.
Abstract: The threat of inserting hardware Trojans during the design, production, or in-field poses a danger for integrated circuits in real-world applications. A particular critical case of hardware Trojans is the malicious manipulation of third-party FPGA configurations. In addition to attack vectors during the design process, FPGAs can be infiltrated in a non-invasive manner after shipment through alterations of the bitstream. First, we present an improved methodology for bitstream file format reversing. Second, we introduce a novel idea for Trojan insertion.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This paper proposes a high performance machine learning-based mask printability evaluation framework for lithography-related applications, and applies it in a conventional mask optimization tool to verify its effectiveness.
Abstract: Continuous shrinking of VLSI technology nodes brings us powerful chips with lower power consumption, but it also introduces many issues in manufacturability. Lithography simulation process for new feature size suffers from large computational overhead. As a result, conventional mask optimization process has been drastically resource consuming in terms of both time and cost. In this paper, we propose a high performance machine learning-based mask printability evaluation framework for lithography-related applications, and apply it in a conventional mask optimization tool to verify its effectiveness.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: For the first time the error between the generated layout and the known drawn GDS will be compared quantitatively as a figure of merit (FOM) and from this layout a circuit graph of an ECC encryption and the partitioning in circuit blocks will be extracted.
Abstract: In view of potential risks of piracy and malicious manipulation of complex integrated circuits built in technologies of 45 nm and less, there is an increasing need for an effective and efficient process of reverse engineering. This paper provides an overview of the current process and details on a new tool for the acquisition and synthesis of large area images and the extraction of a layout. For the first time the error between the generated layout and the known drawn GDS will be compared quantitatively as a figure of merit (FOM). From this layout a circuit graph of an ECC encryption and the partitioning in circuit blocks will be extracted.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This work proposes Dr. CU, an efficient and effective detailed router, to tackle the challenges of a 3D detailed routing grid graph of enormous size; a set of two-level sparse data structures is designed for runtime and memory efficiency.
Abstract: Different from global routing, detailed routing takes care of many detailed design rules and is performed on a significantly larger routing grid graph. In advanced technology nodes, it becomes the most complicated and time-consuming stage. We propose Dr. CU, an efficient and effective detailed router, to tackle the challenges. To handle a 3D detailed routing grid graph of enormous size, a set of two-level sparse data structures is designed for runtime and memory efficiency. For handling the minimum-area constraint, an optimal correct-by-construction path search algorithm is proposed. Besides, an efficient bulk synchronous parallel scheme is adopted to further reduce the runtime. Compared with the first-place winner of the ISPD 2018 Contest, our router improves the routing quality by up to 65% and on average 39%, according to the contest metric. At the same time, it achieves 80--93% memory reduction and a 2.5--15X speed-up.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: The ParaPIM architecture is presented, which transforms current Spin Orbit Torque Magnetic Random Access Memory sub-arrays into massively parallel computational units capable of running inferences for Binary-Weight Deep Neural Networks (BWNNs).
Abstract: Recent algorithmic progress has brought competitive classification accuracy despite constraining neural networks to binary weights (+1/-1). These findings show remarkable optimization opportunities to eliminate the need for computationally-intensive multiplications, reducing memory access and storage. In this paper, we present the ParaPIM architecture, which transforms current Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) sub-arrays into massively parallel computational units capable of running inferences for Binary-Weight Deep Neural Networks (BWNNs). ParaPIM's in-situ computing architecture can be leveraged to greatly reduce the energy consumption of convolutional layers, accelerate BWNN inference, eliminate unnecessary off-chip accesses, and provide ultra-high internal bandwidth. The device-to-architecture co-simulation results indicate ~4X higher energy efficiency and 7.3X speedup over recent processing-in-DRAM acceleration, or roughly 5X higher energy efficiency and 20.5X speedup over recent ASIC approaches, while maintaining inference accuracy comparable to baseline designs.
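The arithmetic simplification BWNNs rely on, namely that a dot product with weights restricted to +1/-1 reduces to additions and subtractions only, is easy to show in isolation. The sketch below just demonstrates that equivalence in NumPy; the SOT-MRAM in-situ bulk bitwise operations the paper maps this onto are not modeled.

```python
import numpy as np

def binary_weight_dot(x, w_sign):
    """Multiplication-free dot product for binary weights w in {+1, -1}:
    add the activations where w = +1 and subtract them where w = -1."""
    assert set(np.unique(w_sign)) <= {-1, +1}
    return x[w_sign == 1].sum() - x[w_sign == -1].sum()

rng = np.random.default_rng(0)
x = rng.normal(size=16).astype(np.float32)          # activations (full precision)
w = rng.choice([-1, 1], size=16).astype(np.int8)    # binarized weights

# Same result as an ordinary dot product, without any multiplications.
assert np.isclose(binary_weight_dot(x, w), np.dot(x, w.astype(np.float32)))
print(binary_weight_dot(x, w))
```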

Proceedings ArticleDOI
21 Jan 2019
TL;DR: The proposed PUFs-and-PIM based Protection method for neural Models (P3M), can utilize unstable PUFs to protect the neural models in edge deep learning accelerators with negligible performance overhead.
Abstract: This work is oriented at the edge computing scenario that terminal deep learning accelerators use pre-trained neural network models distributed from third-party providers (e.g. from data center clouds) to process the private data instead of sending it to the cloud. In this scenario, the network model is exposed to the risk of being attacked in the unverified devices if the parameters and hyper-parameters are transmitted and processed in an unencrypted way. Our work tackles this security problem by using on-chip memory Physical Unclonable Functions (PUFs) and Processing-In-Memory (PIM). We allow the model execution only on authorized devices and protect the model from white-box attacks, black-box attacks and model tampering attacks. The proposed PUFs-and-PIM based Protection method for neural Models (P3M), can utilize unstable PUFs to protect the neural models in edge deep learning accelerators with negligible performance overhead. The experimental results show considerable performance improvement over two state-of-the-art solutions we evaluated.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: The impacts of the limited reliability of memristor devices are reviewed and the recent research progress in building reliable and efficient Memristor-based NCS development is summarized.
Abstract: Neuromorphic computing is a revolutionary approach of computation, which attempts to mimic the human brain's mechanism for extremely high implementation efficiency and intelligence. Latest research studies showed that the memristor technology has a great potential for realizing power- and area-efficient neuromorphic computing systems (NCS). On the other hand, the memristor device processing is still under development. Unreliable devices can severely degrade system performance, which arises as one of the major challenges in developing memristor-based NCS. In this paper, we first review the impacts of the limited reliability of memristor devices and summarize the recent research progress in building reliable and efficient memristor-based NCS. In the end, we discuss the main difficulties and the trend in memristor-based NCS development.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: This work presents a design method which, for the first time, allows for the scalable design of FCN circuits that satisfy the dedicated constraints of these technologies.
Abstract: Field-coupled Nanocomputing (FCN) technologies are considered a solution to overcome the physical boundaries of conventional CMOS approaches. But despite groundbreaking advances regarding their physical implementation, e.g. as Quantum-dot Cellular Automata (QCA), Nanomagnet Logic (NML), and many more, there is an unsettling lack of methods for large-scale design automation of FCN circuits. In fact, design automation for this class of technologies is still in its infancy, relying heavily either on manual labor or on automatic methods that are applicable to rather small functionality only. This work presents a design method which, for the first time, allows for the scalable design of FCN circuits that satisfy the dedicated constraints of these technologies. The proposed scheme is capable of handling around 40000 gates within seconds, while the current state of the art takes hours to handle around 20 gates. This is confirmed by experimental results at the layout level for various established benchmark libraries.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: Re-Mining, a ReRAM-based processing-in-memory architecture for Blockchain mining, is presented; simulation results show that it significantly outperforms CPU-based and GPU-based implementations.
Abstract: Blockchain's decentralized consensus mechanism has attracted many applications, such as IoT devices. A Blockchain maintains a linked list of blocks and grows by mining new blocks. However, Blockchain mining consumes huge computation resources and energy, which is unacceptable for resource-limited embedded devices. This paper, for the first time, presents a ReRAM-based processing-in-memory architecture for Blockchain mining, called Re-Mining. Re-Mining includes a message schedule module and a SHA computation module. The modules are composed of several basic ReRAM-based logic operation units, such as ROR, RSF and XOR. Re-Mining further designs intra-transaction and inter-transaction parallel mechanisms to accelerate Blockchain mining. Simulation results show that the proposed Re-Mining architecture significantly outperforms CPU-based and GPU-based implementations.

Proceedings ArticleDOI
21 Jan 2019
TL;DR: A semi-supervised hotspot detection approach with a self-paced multi-task learning paradigm is proposed, leveraging data samples both with and without labels to improve model accuracy and generality.
Abstract: Lithography simulation is computationally expensive for hotspot detection. Machine learning based hotspot detection is a promising technique to reduce the simulation overhead. However, most learning approaches rely on a large amount of training data to achieve good accuracy and generality. At the early stage of developing a new technology node, the amount of data with labeled hotspots or non-hotspots is very limited. In this paper, we propose a semi-supervised hotspot detection approach with a self-paced multi-task learning paradigm, leveraging data samples both with and without labels to improve model accuracy and generality. Experimental results demonstrate that our approach can achieve 2.9--4.5% better accuracy at the same false alarm levels than the state-of-the-art work while using 10%-50% of the training data. The source code and trained models are released at https://github.com/qwepi/SSL.
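The semi-supervised ingredient can be illustrated with a plain self-training loop: train on the small labeled set, pseudo-label the most confident unlabeled samples, and fold them in gradually, admitting "easier" samples first in the spirit of self-paced learning. This generic sketch is not the paper's multi-task formulation; the classifier, thresholds, and toy data are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_paced_self_training(X_lab, y_lab, X_unlab, rounds=5, start_conf=0.95, decay=0.02):
    """Iteratively add confidently pseudo-labeled samples, relaxing the confidence
    threshold each round so progressively 'harder' samples are admitted."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    remaining = X_unlab.copy()
    clf = LogisticRegression(max_iter=1000)
    for r in range(rounds):
        clf.fit(X_train, y_train)
        if len(remaining) == 0:
            break
        proba = clf.predict_proba(remaining)
        conf = proba.max(axis=1)
        pick = conf >= max(0.5, start_conf - r * decay)   # self-paced: threshold relaxes over rounds
        if not pick.any():
            continue
        pseudo = clf.classes_[proba[pick].argmax(axis=1)]
        X_train = np.vstack([X_train, remaining[pick]])
        y_train = np.concatenate([y_train, pseudo])
        remaining = remaining[~pick]
    return clf

# Toy data standing in for layout feature vectors (1 = hotspot, 0 = non-hotspot).
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(40, 8))
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unlab = rng.normal(size=(400, 8))
model = self_paced_self_training(X_lab, y_lab, X_unlab)
print(model.score(X_lab, y_lab))
```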