
Showing papers presented at the "Asia and South Pacific Design Automation Conference" in 2020


Proceedings ArticleDOI
01 Jan 2020
TL;DR: A Digital Annealer is a dedicated architecture for high-speed solving of combinatorial optimization problems mapped to an Ising model with fully coupled bit connectivity and high coupling resolution.
Abstract: A Digital Annealer (DA) is a dedicated architecture for high-speed solving of combinatorial optimization problems mapped to an Ising model. With fully coupled bit connectivity and high coupling resolution as a major feature, it can be used to express a wide variety of combinatorial optimization problems. The DA uses Markov Chain Monte Carlo as a basic search mechanism, accelerated by the hardware implementation of multiple speed-enhancement techniques such as parallel search, escape from a local solution, and replica exchange. It is currently being offered as a cloud service using a second-generation chip operating on a scale of 8,192 bits. This paper presents an overview of the DA, its performance against benchmarks, and application examples.
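
For intuition, here is a minimal software sketch of the DA's basic search mechanism: Metropolis-style MCMC on a fully coupled Ising model with a simple annealing schedule. The instance, schedule, and constants are all made up for illustration; the actual hardware parallelizes the search and adds escape and replica-exchange techniques.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random fully coupled Ising instance: E(s) = -1/2 s^T J s - h^T s
n = 64
J = rng.normal(size=(n, n)); J = np.triu(J, 1); J = J + J.T   # symmetric, zero diagonal
h = rng.normal(size=n)
s = rng.choice([-1, 1], size=n)

def energy(s):
    return -0.5 * s @ J @ s - h @ s

T = 2.0
for step in range(20000):
    i = rng.integers(n)
    dE = 2 * s[i] * (J[i] @ s + h[i])      # energy change of flipping spin i
    if dE <= 0 or rng.random() < np.exp(-dE / T):
        s[i] = -s[i]                       # Metropolis accept
    T = max(0.05, T * 0.9997)              # toy annealing schedule

print("final energy:", energy(s))
```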

76 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper proposes an area-efficient ONN architecture based on structured neural networks, leveraging optical fast Fourier transform for efficient computation, and proposes a two-phase software training flow with structured pruning to further reduce the optical component utilization.
Abstract: As a promising neuromorphic framework, the optical neural network (ONN) demonstrates ultra-high inference speed with low energy consumption. However, previous ONN architectures have a high area overhead, which limits their practicality. In this paper, we propose an area-efficient ONN architecture based on structured neural networks, leveraging the optical fast Fourier transform for efficient computation. A two-phase software training flow with structured pruning is proposed to further reduce optical component utilization. Experimental results demonstrate that the proposed architecture achieves a 2.2∼3.7× area cost improvement over the previous singular-value-decomposition-based architecture with comparable inference accuracy.
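
As a software analogy for why FFT structure saves hardware: a circulant weight matrix can be applied with an FFT, elementwise products, and an inverse FFT instead of a dense matrix multiply. This is only a numerical illustration of the structured-transform idea, not the paper's OFFT architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
w = rng.normal(size=n)        # one parameter row defines the whole circulant matrix
x = rng.normal(size=n)

# Dense circulant matrix-vector product: n^2 multiplies
idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
C = w[idx]                    # C[i, k] = w[(i - k) % n]
y_dense = C @ x

# Same product via FFT butterflies: O(n log n) operations
y_fft = np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real

assert np.allclose(y_dense, y_fft)
print("max abs diff:", np.abs(y_dense - y_fft).max())
```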

42 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: A novel framework is proposed to co-explore the NAS space and the NoC Design Search (NDS) space with the objective of maximizing accuracy and throughput, and a multi-phase manager is developed to guide NANDS to gradually converge to solutions with the best accuracy-throughput tradeoff.
Abstract: Hardware-aware Neural Architecture Search (NAS), which automatically finds an architecture that works best on a given hardware design, has prevailed in response to the ever-growing demand for real-time Artificial Intelligence (AI). However, in many situations, the underlying hardware is not pre-determined. We argue that simply assuming an arbitrary yet fixed hardware design will lead to inferior solutions, and that it is best to co-explore the neural architecture space and the hardware design space for the best pair of neural architecture and hardware design. To demonstrate this, we employ Network-on-Chip (NoC) as the infrastructure and propose a novel framework, namely NANDS, to co-explore the NAS space and the NoC Design Search (NDS) space with the objective of maximizing accuracy and throughput. Since the two metrics are tightly coupled, we develop a multi-phase manager to guide NANDS to gradually converge to solutions with the best accuracy-throughput tradeoff. On top of this, we propose techniques to detect and alleviate timing performance bottlenecks, which allows better and more efficient exploration of the NDS space. Experimental results on common datasets, CIFAR-10, CIFAR-100 and STL-10, show that compared with state-of-the-art hardware-aware NAS, NANDS can achieve 42.99% higher throughput along with a 1.58% accuracy improvement. There are cases where hardware-aware NAS cannot find any feasible solution while NANDS can.

38 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: This work proposes a novel approach to predict the routing congestion map for large-scale FPGA designs at the placement stage; incorporated into the placement engine, the congestion prediction yields up to a 7% reduction in routed wirelength for the most congested design in the ISPD 2016 benchmark.
Abstract: To speed up FPGA placement and routing closure, we propose a novel approach to predict the routing congestion map for large-scale FPGA designs at the placement stage. After reformulating the problem into an image translation task, our proposed approach leverages recent advances in generative adversarial learning to address the task. In particular, state-of-the-art generative adversarial networks for high-resolution image translation are used along with well-engineered features extracted at the placement stage. Unlike available approaches, our novel framework demonstrates the capability to handle large-scale FPGA designs. With its superior accuracy, our proposed approach can be incorporated into the placement engine to provide congestion prediction, resulting in up to a 7% reduction in routed wirelength for the most congested design in the ISPD 2016 benchmark.

34 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper proposes a memristor-based DNN framework that combines structured weight pruning and quantization by incorporating the ADMM algorithm for better pruning and quantization performance.
Abstract: The memristor crossbar array has emerged as an intrinsically suitable matrix-computation and low-power acceleration framework for DNN applications. Many techniques such as memristor-based weight pruning and memristor-based quantization have been studied. However, a high-accuracy solution for these techniques remains elusive. In this paper, we propose a memristor-based DNN framework which combines both structured weight pruning and quantization by incorporating the ADMM algorithm for better pruning and quantization performance. We also discover the non-optimality of the ADMM solution in weight pruning and the unused data paths in a structured pruned model. We design a software-hardware co-optimization framework which contains the first proposed Network Purification and Unused Path Removal algorithms, targeting the post-processing of a structured pruned model after the ADMM steps. By taking memristor hardware constraints into our whole framework, we achieve an extremely high compression rate with minimal accuracy loss. When quantizing the structured pruned model, our framework achieves nearly no accuracy loss after quantizing weights to an 8-bit memristor weight representation. We share our models at the anonymous link https://bit.ly/2VnMUy0.
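
A minimal sketch of the ADMM-based structured pruning loop, using a toy quadratic loss in place of real network training. The projection step keeps the k rows with the largest L2 norms, which is the Euclidean projection onto the row-sparse constraint set:

```python
import numpy as np

def project_rows(W, k):
    """Euclidean projection onto {W : at most k nonzero rows} (structured sparsity)."""
    norms = np.linalg.norm(W, axis=1)
    keep = np.argsort(norms)[-k:]
    Z = np.zeros_like(W)
    Z[keep] = W[keep]
    return Z

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))       # stand-in for one layer's weights
Z = project_rows(W, k=4)           # ADMM auxiliary (structured-sparse) variable
U = np.zeros_like(W)               # scaled dual variable
rho, lr = 1e-2, 1e-1

for it in range(200):
    # Stand-in "task gradient" (a real flow would backprop the network loss);
    # the rho-term pulls W toward the structured-sparse Z.
    grad_task = 2 * (W - 1.0)      # toy quadratic loss sum((W - 1)^2)
    W -= lr * (grad_task + rho * (W - Z + U))
    Z = project_rows(W + U, k=4)   # projection (proximal) step
    U += W - Z                     # dual update

print("nonzero rows after hard prune:", np.count_nonzero(np.linalg.norm(Z, axis=1)))
```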

34 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: In this article, an Advantage Actor Critic (A2C) agent is trained to minimize area subject to a timing constraint, so that designs can be optimized autonomously with no human in the loop.
Abstract: Logic synthesis requires extensive tuning of the synthesis optimization flow, where the quality of results (QoR) depends on the sequence of optimizations used. Efficient design space exploration is challenging due to the exponential number of possible optimization permutations. Therefore, automating the optimization process is necessary. In this work, we propose a novel reinforcement learning-based methodology that navigates the optimization space without human intervention. We demonstrate the training of an Advantage Actor Critic (A2C) agent that seeks to minimize area subject to a timing constraint. Using the proposed methodology, designs can be optimized autonomously with no human in the loop. Evaluation on the comprehensive EPFL benchmark suite shows that the agent outperforms existing exploration methodologies and improves QoR by an average of 13%.
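
To make the setup concrete, here is a toy policy-gradient sketch (a simple REINFORCE-style update with a moving-average baseline, not the paper's full A2C) that learns a sequence of synthesis transforms against a made-up QoR model standing in for the real tool:

```python
import numpy as np

rng = np.random.default_rng(0)
TRANSFORMS = ["balance", "rewrite", "refactor", "resub"]  # abstracted ABC-style passes
SEQ_LEN, N_EPISODES = 5, 2000

# Toy QoR model: each (step, transform) pair has a hidden area gain, standing in
# for running the real synthesis flow and measuring area under a timing constraint.
true_gain = rng.uniform(0, 1, size=(SEQ_LEN, len(TRANSFORMS)))

theta = np.zeros((SEQ_LEN, len(TRANSFORMS)))   # per-step softmax policy logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

baseline = 0.0
for ep in range(N_EPISODES):
    probs = [softmax(theta[t]) for t in range(SEQ_LEN)]
    acts = [rng.choice(len(TRANSFORMS), p=p) for p in probs]
    reward = sum(true_gain[t, a] for t, a in enumerate(acts)) + rng.normal(0, 0.1)
    baseline = 0.99 * baseline + 0.01 * reward       # moving-average critic stand-in
    for t, a in enumerate(acts):                     # policy-gradient update
        grad = -probs[t]
        grad[a] += 1.0
        theta[t] += 0.05 * (reward - baseline) * grad

print("learned flow:", [TRANSFORMS[int(np.argmax(theta[t]))] for t in range(SEQ_LEN)])
```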

33 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: In this paper, the authors investigate the impact of single event upset (SEU)-induced parameter perturbation (SIPP) on neural networks and propose two remedy solutions to protect DNNs from SIPPs.
Abstract: Deep Neural Networks have proved their potential in various perception tasks and have hence become an appealing option for interpretation and data processing in security-sensitive systems. However, security-sensitive systems demand not only high perception performance but also design robustness under various circumstances. Unlike prior works that study network robustness at the software level, we investigate, from a hardware perspective, the impact of Single Event Upset (SEU)-induced parameter perturbation (SIPP) on neural networks. We systematically define the fault models of SEUs and then provide a definition of sensitivity to SIPP as the robustness measure for a network. We are then able to analytically explore the weaknesses of a network and summarize the key findings for the impact of SIPP on different types of bits in a floating-point parameter, layer-wise robustness within the same network, and the impact of network depth. Based on those findings, we propose two remedy solutions to protect DNNs from SIPPs, which can mitigate accuracy degradation from 28% to 0.27% for ResNet with merely 0.24-bit SRAM area overhead per parameter.
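
The fault model is easy to reproduce in software: flip a single bit of a float32 parameter and observe how strongly the damage depends on the bit position. A minimal sketch; the bit positions follow the IEEE-754 single-precision layout:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = mantissa LSB, 23..30 = exponent, 31 = sign) of a float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped

w = 0.15625  # exactly representable example weight
for bit in (31, 30, 23, 0):   # sign, exponent MSB, exponent LSB, mantissa LSB
    print(f"bit {bit:2d}: {w} -> {flip_bit(w, bit)}")
# The exponent-MSB flip blows the weight up by dozens of orders of magnitude,
# matching the intuition that not all bits are equally sensitive.
```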

28 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper proposes a methodology that employs a library of predesigned, stitchable templates, and uses machine learning to rapidly build a PDN with region-wise uniform pitches based on these templates, applicable at both the floorplan and placement stages of physical implementation.
Abstract: Designing an optimal power delivery network (PDN) is a time-intensive task that involves many iterations. This paper proposes a methodology that employs a library of predesigned, stitchable templates and uses machine learning (ML) to rapidly build a PDN with region-wise uniform pitches based on these templates. Our methodology is applicable at both the floorplan and placement stages of physical implementation. (i) At the floorplan stage, we synthesize an optimized PDN based on early estimates of current and congestion, using a simple multilayer perceptron classifier. (ii) At the placement stage, we incrementally optimize an existing PDN based on more detailed congestion and current distributions, using a convolutional neural network. At each stage, the neural network builds a safe-by-construction PDN that meets IR drop and electromigration (EM) specifications. On average, the optimization of the PDN frees up an extra 3% of routing resources, which corresponds to thousands of routing tracks in congestion-critical regions, when compared to a globally uniform PDN, while staying within the IR drop and EM limits.
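
A toy sketch of the floorplan-stage idea: a small classifier mapping per-region estimates to a PDN template id. The two features and the labeling rule are invented purely for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Per-region features (stand-ins): [peak current density, routing congestion].
X = rng.uniform(0, 1, size=(400, 2))
# Synthetic rule: heavier current/congestion calls for a denser PDN template (id 0..2).
y = np.minimum(2, (2.5 * X[:, 0] + 0.5 * X[:, 1]).astype(int))

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
region = np.array([[0.9, 0.7]])      # a hot, congested region
print("chosen PDN template:", clf.predict(region)[0])
```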

26 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: This work proposes an improved DD-based equivalence checking approach which addresses the shortcomings of existing solutions: it utilizes decision diagrams and exploits the fact that quantum operations are inherently reversible, allowing for dedicated strategies that keep the overhead moderate in many cases.
Abstract: Quantum computing is gaining considerable momentum through the recent progress in physical realizations of quantum computers. This has led to rather sophisticated design flows in which the originally specified quantum functionality is compiled through different abstractions. This increasingly raises the question of whether the resulting quantum circuits indeed realize the originally intended function. Accordingly, efficient methods for equivalence checking are gaining importance. However, existing solutions still suffer from significant shortcomings such as their exponential worst-case performance and an increased effort to obtain counterexamples in the case of non-equivalence. In this work, we propose an improved DD-based equivalence checking approach which addresses these shortcomings. To this end, we utilize decision diagrams and exploit the fact that quantum operations are inherently reversible, allowing for dedicated strategies that keep the overhead moderate in many cases. Experimental results confirm that the proposed strategies lead to substantial speed-ups, allowing equivalence checking of quantum circuits to be performed factors or even orders of magnitude faster than the state of the art.
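
The reversibility trick is easy to state with dense matrices: circuits U1 and U2 are equivalent exactly when U2†·U1 is the identity up to a global phase. A numpy sketch (dense matrices rather than the paper's decision diagrams):

```python
import numpy as np

# Two single-qubit circuits: H.Z.H versus X (equivalent up to global phase)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Z = np.diag([1, -1]).astype(complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)

U1 = H @ Z @ H
U2 = X

# Instead of comparing U1 and U2 directly, check that U2^dagger @ U1
# collapses to the identity (up to a global phase).
M = U2.conj().T @ U1
phase = M[0, 0]
equivalent = np.isclose(abs(phase), 1.0) and np.allclose(M, phase * np.eye(2))
print("equivalent:", equivalent)
```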

26 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: In this article, a machine learning-based automatic parameter tuning methodology is proposed to find the best design quality with a limited number of trials, instead of merely plugging in machine learning engines.
Abstract: Design flow parameters are of utmost importance to chip design quality and require a painfully long time to evaluate their effects. In reality, flow parameter tuning is usually performed manually, based on designers' experience, in an ad hoc manner. In this work, we introduce a machine learning-based automatic parameter tuning methodology that aims to find the best design quality with a limited number of trials. Instead of merely plugging in machine learning engines, we develop clustering and approximate sampling techniques to improve tuning efficiency. The feature extraction in this method can reuse knowledge from prior designs. Furthermore, we leverage a state-of-the-art XGBoost model and propose a novel dynamic tree technique to overcome overfitting. Experimental results on benchmark circuits show that our approach achieves a 25% improvement in design quality or a 37% reduction in sampling cost compared to the random forest method, which is the kernel of a highly cited previous work. Our approach is further validated on two industrial designs. By sampling less than 0.02% of the possible parameter sets, it reduces area by 1.83% and 1.43% compared to the best solutions hand-tuned by experienced designers.
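
A minimal sketch of model-guided parameter sampling, using scikit-learn's gradient boosting as a stand-in for the paper's XGBoost-based model and a toy QoR function in place of a real flow run:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Toy "design space": 3 flow parameters, each with 4 discrete settings (64 points).
space = np.array(np.meshgrid(*[np.arange(4)] * 3)).reshape(3, -1).T

def run_flow(p):
    """Stand-in for a full synthesis/P&R run returning a QoR score (lower is better)."""
    return (p[0] - 2) ** 2 + 0.5 * (p[1] - 1) ** 2 + 0.2 * p[2] + rng.normal(0, 0.05)

tried = {tuple(space[i]) for i in rng.choice(len(space), 8, replace=False)}
X = np.array(sorted(tried)); y = np.array([run_flow(p) for p in X])

for it in range(5):                    # model-guided sampling iterations
    model = GradientBoostingRegressor().fit(X, y)
    cand = [p for p in space if tuple(p) not in tried]
    best = min(cand, key=lambda p: model.predict([p])[0])  # most promising untried point
    tried.add(tuple(best))
    X = np.vstack([X, best]); y = np.append(y, run_flow(best))

print("best parameters found:", X[np.argmin(y)], "QoR:", round(float(y.min()), 3))
```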

25 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper proposes a novel method of detecting system level symmetry constraints for analog circuits with graph similarity using spectral graph analysis and graph centrality, which can be applied to circuits and systems of large scale and different architectures.
Abstract: Symmetry and matching between critical building blocks have a significant impact on analog system performance. However, there is limited research on generating system-level symmetry constraints. In this paper, we propose a novel method of detecting system symmetry constraints for analog circuits using graph similarity. Leveraging spectral graph analysis and graph centrality, the proposed algorithm can be applied to circuits and systems of large scale and different architectures. To the best of our knowledge, this is the first work on detecting system-level symmetry constraints for analog and mixed-signal (AMS) circuits. Experimental results show that the proposed method achieves a high accuracy of 88.3% with a low false-alarm rate of less than 1.1% on large-scale AMS designs.
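
A toy illustration of the spectral-similarity idea: building blocks with matching topology are cospectral under the graph Laplacian (the real method combines this with graph centrality and circuit-specific modeling):

```python
import numpy as np
import networkx as nx

# Three building blocks of a system, abstracted as small circuit graphs.
g1 = nx.path_graph(5)
g2 = nx.relabel_nodes(nx.path_graph(5), {i: chr(ord("a") + i) for i in range(5)})
g3 = nx.star_graph(4)   # 5 nodes, different topology

def spectrum(g):
    # Sorted Laplacian eigenvalues; matching blocks are (near-)cospectral.
    L = nx.laplacian_matrix(g).toarray().astype(float)
    return np.sort(np.linalg.eigvalsh(L))

print("g1 vs g2 match:", np.allclose(spectrum(g1), spectrum(g2)))  # True: same topology
print("g1 vs g3 match:", np.allclose(spectrum(g1), spectrum(g3)))  # False: different blocks
```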

Proceedings ArticleDOI
01 Jan 2020
TL;DR: This work proposes an automated and scalable mechanism to generate directed tests using a combination of symbolic execution and concrete simulation of RTL models, and shows that the directed tests are able to activate assertions non-vacuously.
Abstract: A major challenge in assertion-based validation is how to activate the assertions to ensure that they are valid. While existing test generation using model checking is promising, it cannot generate directed tests for large designs due to state space explosion. We propose an automated and scalable mechanism to generate directed tests using a combination of symbolic execution and concrete simulation of RTL models. Experimental results show that the directed tests are able to activate assertions non-vacuously.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: It is proved that the trigger activation problem can be mapped to the clique cover problem, and an efficient test generation algorithm is proposed to activate trigger conditions by repeated maximal clique sampling.
Abstract: Hardware Trojans are a serious threat to the security and reliability of computing systems. It is hard to detect these malicious implants using traditional validation methods since an adversary is likely to hide them under rare trigger conditions. While existing statistical test generation methods are promising for Trojan detection, they are not suitable for activating the extremely rare trigger conditions in stealthy Trojans. To address the fundamental challenge of activating rare triggers, we propose a new test generation paradigm that maps the trigger activation problem to the clique cover problem. The basic idea is to utilize a satisfiability solver to construct a test corresponding to each maximal clique. This paper makes two fundamental contributions: 1) it proves that the trigger activation problem can be mapped to the clique cover problem, and 2) it proposes an efficient test generation algorithm to activate trigger conditions by repeated maximal clique sampling. Experimental results demonstrate that our approach is scalable and that it outperforms state-of-the-art approaches by several orders of magnitude in detecting stealthy Trojans.
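
A small sketch of the clique-cover view: rare trigger conditions are nodes, edges mark pairs that are jointly satisfiable, and each maximal clique yields one test (constructed with a SAT solver in the paper; here only the greedy cover is shown):

```python
import networkx as nx

# Nodes = rare signal conditions; an edge means the two conditions can be
# satisfied simultaneously (established with a SAT solver in the paper).
G = nx.Graph()
G.add_edges_from([("r1", "r2"), ("r1", "r3"), ("r2", "r3"),  # clique {r1, r2, r3}
                  ("r3", "r4"), ("r4", "r5")])

covered, tests = set(), []
# Greedy clique cover: repeatedly take a maximal clique that covers new conditions.
for clique in sorted(nx.find_cliques(G), key=len, reverse=True):
    if set(clique) - covered:
        tests.append(clique)   # one SAT-constructed test would target this clique
        covered |= set(clique)

print("tests needed:", len(tests), "->", tests)
```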

Proceedings ArticleDOI
01 Jan 2020
TL;DR: A targeted optimization strategy and implementation of a processor for Kyber on FPGAs that is at least 3x more efficient than the state-of-the-art module for NTT with a similar security level is proposed.
Abstract: Kyber is a promising candidate in the post-quantum cryptography standardization process. In this paper, we propose a targeted optimization strategy and implement a processor for Kyber on FPGAs. By merging operations, we cut clock cycles by 29.4% for Kyber512 and 33.3% for Kyber1024 compared with the textbook implementations. We utilize the Gentleman-Sande (GS) butterfly to optimize the Number-Theoretic Transform (NTT) implementation. The memory-access bottleneck is overcome by taking advantage of a dual-column sequential scheme. We further propose a pipelined architecture for better performance. These optimizations help the processor achieve 31,684 NTT operations per second using only 477 LUTs, 237 FFs and 1 DSP. Our strategy is at least 3x more efficient than state-of-the-art NTT modules with a similar security level.
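
For reference, a compact software model of an iterative Gentleman-Sande (decimation-in-frequency) NTT, verified against the naive transform. Toy parameters (n = 8, q = 17) are used rather than Kyber's q = 3329 and incomplete 7-layer NTT:

```python
def ntt_gs(a, q, root):
    """Iterative Gentleman-Sande (DIF) NTT: natural-order input, bit-reversed output."""
    a = list(a)
    n, m = len(a), len(a)
    while m > 1:
        half = m // 2
        w_m = pow(root, n // m, q)             # primitive m-th root of unity
        for start in range(0, n, m):
            w = 1
            for j in range(start, start + half):
                u, v = a[j], a[j + half]
                a[j] = (u + v) % q
                a[j + half] = (u - v) * w % q  # GS butterfly: multiply after subtract
                w = w * w_m % q
        m = half
    return a

def bitrev(i, bits):
    return int(format(i, f"0{bits}b")[::-1], 2)

q, n, root = 17, 8, 2   # 2 is a primitive 8th root of unity mod 17
a = [1, 2, 3, 4, 5, 6, 7, 8]
out = ntt_gs(a, q, root)
naive = [sum(a[j] * pow(root, j * k, q) for j in range(n)) % q for k in range(n)]
assert all(out[bitrev(k, 3)] == naive[k] for k in range(n))
print("NTT(a) =", [out[bitrev(k, 3)] for k in range(n)])
```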

Proceedings ArticleDOI
01 Jan 2020
TL;DR: The Adaptive Incremental Router (AIR) is described, a high-performance timing-driven FPGA router that dynamically adapts to the routing problem, which it solves ‘lazily’ to minimize work.
Abstract: Routing is a key step in the FPGA design process, which significantly impacts design implementation quality. Routing is also very time-consuming and can scale poorly to very large designs. This paper describes the Adaptive Incremental Router (AIR), a high-performance timing-driven FPGA router. AIR dynamically adapts to the routing problem, which it solves ‘lazily’ to minimize work. Compared to the widely used VPR 7 router, AIR significantly reduces route time (7.1× faster) while also improving quality (15% wirelength and 18% critical path delay reductions). We also show how these techniques enable efficient incremental improvement of existing routing.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: Compared to both software tools and state-of-the-art accelerators, PARC shows significant improvement in alignment throughput and energy efficiency, thanks to its in-situ computation capability and optimized data mapping.
Abstract: Technological advances in long-read sequencing have greatly facilitated the development of genomics. However, managing and analyzing raw genomic data that outpaces Moore's Law requires extremely high computational efficiency. On the one hand, existing software solutions can take hundreds of CPU hours to complete a human genome alignment. On the other hand, recently proposed hardware platforms achieve low processing throughput with significant overhead. In this paper, we propose PARC, a Processing-in-Memory architecture for long-read pairwise alignment that leverages emerging resistive CAM (content-addressable memory) to accelerate the bottleneck chaining step in DNA alignment. Chaining takes 2-tuple anchors as inputs and identifies a set of correlated anchors as potential alignment candidates. Unlike traditional main memory, which organizes relational data structures in a linear address space, PARC stores tuples in two neighboring crossbar arrays with a shared row decoder such that column-wise in-memory computational operations and row-wise memory accesses can be performed in situ in a symmetric crossbar structure. Compared to both software tools and state-of-the-art accelerators, PARC shows significant improvement in alignment throughput and energy efficiency, thanks to its in-situ computation capability and optimized data mapping.
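
A minimal software model of the chaining step that PARC accelerates: a quadratic LIS-style dynamic program over (read, reference) anchor tuples, with scoring simplified to chain length:

```python
# Anchors are (read_pos, ref_pos) 2-tuples; a chain must increase in both
# coordinates. This quadratic DP is a stand-in for the in-memory version.
anchors = sorted([(1, 3), (2, 5), (4, 6), (5, 2), (7, 9), (8, 4)])

score = [1] * len(anchors)
parent = [-1] * len(anchors)
for i, (xi, yi) in enumerate(anchors):
    for j in range(i):
        xj, yj = anchors[j]
        if xj < xi and yj < yi and score[j] + 1 > score[i]:
            score[i], parent[i] = score[j] + 1, j

# Backtrack the best chain of colinear anchors.
i = max(range(len(anchors)), key=score.__getitem__)
chain = []
while i != -1:
    chain.append(anchors[i])
    i = parent[i]
print("best chain:", chain[::-1])   # [(1, 3), (2, 5), (4, 6), (7, 9)]
```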

Proceedings ArticleDOI
01 Jan 2020
TL;DR: An energy-efficient quantized and regularized training framework for PIM accelerators is proposed, consisting of a PIM-based non-uniform activation quantization scheme and an energy-aware weight regularization method, which improves the energy efficiency of PIM architectures by reducing the ADC resolution requirements and training low-energy CNN models for PIM, with little accuracy loss.
Abstract: Convolutional Neural Networks (CNNs) have made breakthroughs in various fields, but their energy consumption has become enormous. Processing-In-Memory (PIM) architectures based on emerging non-volatile memory (e.g., Resistive Random Access Memory, RRAM) have demonstrated great potential in improving the energy efficiency of CNN computing. However, there is still much room for improvement in the energy efficiency of existing PIM architectures. On the one hand, current work shows that high-resolution Analog-to-Digital Converters (ADCs) are required to maintain computing accuracy, but they account for more than 60% of the energy consumption of the entire system, eroding the energy efficiency benefits of PIM. On the other hand, because PIM accelerators compute in the analog domain, the computing energy consumption depends on the specific input and weight values; yet, as far as we know, no existing work exploits this characteristic for energy efficiency optimization. To solve these problems, in this paper, we propose an energy-efficient quantized and regularized training framework for PIM accelerators, which consists of a PIM-based non-uniform activation quantization scheme and an energy-aware weight regularization method. The proposed framework can improve the energy efficiency of PIM architectures by reducing the ADC resolution requirements and training low-energy CNN models for PIM, with little accuracy loss. The experimental results show that the proposed training framework can reduce the resolution of ADCs by 2 bits and the computing energy consumption in the analog domain by 35%. The energy efficiency can therefore be enhanced by 3.4× with our proposed training framework.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: This work proposes a novel approach to real-time estimation of full-chip transient heatmaps for commercial processors based on machine learning that supplements the temperature data sensed from the existing on-chip sensors, allowing for the development of more robust runtime power and thermal control schemes that can take advantage of the additional thermal information otherwise not available.
Abstract: Runtime power and thermal control is crucial in any modern processor. However, these control schemes require accurate real-time temperature information, ideally of the entire die area, in order to be effective. On-chip temperature sensors alone cannot provide full-chip temperature information, since the number of sensors typically available is very limited due to their high area and power overheads. Furthermore, as we demonstrate, the peak locations within hot-spots are not stationary and are very workload-dependent, making it difficult to rely on fixed temperature sensors alone. Therefore, we propose a novel approach to real-time estimation of full-chip transient heatmaps for commercial processors based on machine learning. The model derived in this work supplements the temperature data sensed from the existing on-chip sensors, allowing for the development of more robust runtime power and thermal control schemes that can take advantage of additional thermal information that is otherwise not available. The new approach involves offline acquisition of accurate spatial and temporal heatmaps using an infrared thermal imaging setup while nominal working conditions are maintained on the chip. To build the dynamic thermal model, we apply Long Short-Term Memory (LSTM) neural networks with system-level variables such as chip frequency, instruction counts, and other performance metrics as inputs. To reduce the dimensionality of the model, a 2D spatial discrete cosine transformation (DCT) is first performed on the heatmaps so that they can be expressed with just their dominant DCT frequencies. Our study shows that only 6×6 DCT coefficients are required to maintain sufficient accuracy across a variety of workloads. Experimental results show that the proposed approach can estimate full-chip heatmaps with less than 1.4°C root-mean-square error and takes only ~19 ms per inference, which is well suited to real-time use.
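
The DCT-based dimensionality reduction can be reproduced in a few lines: keep only the 6×6 lowest-frequency coefficients of a heatmap and reconstruct. The heatmap below is synthetic; in the paper the coefficients are predicted by the LSTM:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Synthetic 64x64 "heatmap" with one smooth hot-spot (stand-in for an IR image).
y, x = np.mgrid[0:64, 0:64]
heat = 50 + 15 * np.exp(-((x - 20) ** 2 + (y - 40) ** 2) / 200.0)

# Keep only the 6x6 lowest-frequency DCT coefficients, as in the paper's model.
coef = dctn(heat, norm="ortho")
compact = np.zeros_like(coef)
compact[:6, :6] = coef[:6, :6]     # 36 numbers instead of 4096
recon = idctn(compact, norm="ortho")

rmse = np.sqrt(np.mean((recon - heat) ** 2))
print(f"RMSE of 6x6-coefficient reconstruction: {rmse:.3f} C")
```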

Proceedings ArticleDOI
01 Jan 2020
TL;DR: Experimental results obtained from Vivado, Artificial Neural Network (ANN) and image processing applications indicate the superiority of the proposed multiplier over accurate and state-of-the-art approximate counterparts; in particular, LeAp outperforms the 32x32 accurate multiplier.
Abstract: Approximate multipliers are ubiquitously used in diverse applications by exploiting circuit simplification, mainly specialized for Application-Specific Integrated Circuit (ASIC) platforms. However, the intrinsic architectural specifications of Field-Programmable Gate Arrays (FPGAs) prohibit comparable resource gains when these techniques are applied directly. LeAp is an area-, throughput-, and energy-efficient approximate multiplier for FPGAs which efficiently utilizes 6-input Look-up Tables (6-LUTs) and fast carry chains in its novel approximate log calculator to implement Mitchell's algorithm. Moreover, three novel error-refinement schemes, with negligible area overhead and independent of multiplier size, boost accuracy to >99%. Experimental results obtained from Vivado, Artificial Neural Network (ANN) and image processing applications indicate the superiority of the proposed multiplier over accurate and state-of-the-art approximate counterparts. In particular, LeAp outperforms the 32x32 accurate multiplier by achieving 69.7%, 14.7%, 42.1%, and 37.1% improvements in area, throughput, power, and energy, respectively. The library of RTL and behavioral implementations will be open-sourced at https://cfaed.tu-dresden.de/pd-downloads.
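
For background, Mitchell's algorithm approximates log2 of each operand by its leading-one position plus the remaining fraction, adds the two logs, and inverts the same piecewise-linear approximation. A behavioral sketch for unsigned integers (without LeAp's LUT/carry-chain mapping or its error-refinement schemes):

```python
def mitchell_mul(a: int, b: int) -> int:
    """Approximate a*b via Mitchell's logarithm approximation (unsigned ints)."""
    if a == 0 or b == 0:
        return 0
    k1, k2 = a.bit_length() - 1, b.bit_length() - 1   # leading-one positions
    f1 = a / (1 << k1) - 1.0                          # fractional parts in [0, 1)
    f2 = b / (1 << k2) - 1.0
    # log2(a*b) ~ k1 + k2 + f1 + f2; invert the piecewise-linear approximation
    if f1 + f2 < 1.0:
        return round((1 << (k1 + k2)) * (1.0 + f1 + f2))
    return round((1 << (k1 + k2 + 1)) * (f1 + f2))

for a, b in [(100, 200), (255, 255), (93, 171)]:
    approx, exact = mitchell_mul(a, b), a * b
    print(f"{a}x{b}: approx={approx} exact={exact} "
          f"err={100 * (exact - approx) / exact:.2f}%")
```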

Proceedings ArticleDOI
01 Jan 2020
TL;DR: In this paper, a multi-objective fault-tolerant neural architecture search framework is proposed to automatically discover convolutional neural network (CNN) architectures that are resilient to the various faults present in today's edge devices.
Abstract: With the rapid evolution of deep-learning-specific embedded computing systems, applications powered by deep learning are moving from the cloud to the edge. When deploying NNs onto edge devices in complex environments, there are various types of possible faults: soft errors caused by atmospheric neutrons and radioactive impurities, voltage instability, aging, temperature variations, and malicious attackers. The safety risk of deploying neural networks on edge computing devices in safety-critical applications is thus drawing much attention. In this paper, we implement random bit-flip, Gaussian, and salt-and-pepper fault models and establish a multi-objective fault-tolerant neural architecture search framework. On top of the NAS framework, we propose Fault-Tolerant Neural Architecture Search (FT-NAS) to automatically discover convolutional neural network (CNN) architectures that are resilient to the various faults present in today's edge devices. We then incorporate fault-tolerant training (FTT) in the search process to achieve better results, an approach we call FTT-NAS. Experiments show that the discovered architectures FT-NAS-Net and FTT-NAS-Net outperform other hand-designed baseline architectures (58.1%/86.6% vs. 10.0%/52.2%), with comparable FLOPs and fewer parameters. Moreover, architectures trained under a single fault model can also defend against other faults. By inspecting the discovered architectures, we find that redundant connections are learned to protect the sensitive paths. This insight can guide future fault-tolerant neural architecture design, and we verify it with a modification of ResNet-20 (ResNet-M).
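
The three fault models are straightforward to emulate on a weight tensor. A hedged sketch with an invented fault rate and an invented 8-bit quantization step for the bit-flip case:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject(w, mode, rate=0.01):
    """Apply one of the three weight fault models to a parameter tensor."""
    w = w.copy()
    mask = rng.random(w.shape) < rate
    if mode == "gaussian":
        w[mask] += rng.normal(0, w.std(), size=mask.sum())
    elif mode == "salt_and_pepper":
        w[mask] = rng.choice([w.min(), w.max()], size=mask.sum())
    elif mode == "bit_flip":   # flip one bit of an 8-bit quantized value
        scale = w.std() / 32
        q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
        q[mask] ^= np.int8(1 << 6)
        w = q.astype(np.float32) * scale   # unmasked entries keep only quantization error
    return w

w = rng.normal(size=(4, 4)).astype(np.float32)
for mode in ("gaussian", "salt_and_pepper", "bit_flip"):
    print(mode, "max |delta| =", float(np.abs(inject(w, mode, 0.25) - w).max()))
```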

Proceedings ArticleDOI
01 Jan 2020
TL;DR: In this paper, the authors developed a fast dynamic IR drop estimation technique, named PowerNet, based on a convolutional neural network (CNN), which can handle both vector-based and vectorless IR analyses.
Abstract: IR drop is a fundamental constraint required by almost all chip designs. However, its evaluation usually takes a long time, which hinders mitigation techniques for fixing its violations. In this work, we develop a fast dynamic IR drop estimation technique, named PowerNet, based on a convolutional neural network (CNN). It can handle both vector-based and vectorless IR analyses. Moreover, the proposed CNN model is general and transferable to different designs. This is in contrast to most existing machine learning (ML) approaches, where a model is applicable only to a specific design. Experimental results show that PowerNet outperforms the latest ML method by 9% in accuracy for the challenging case of vectorless IR drop and achieves a 30× speedup compared to an accurate commercial IR drop tool. Further, a mitigation tool guided by PowerNet reduces IR drop hotspots by 26% and 31% on two industrial designs, respectively, with very limited modifications to their power grids.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: The proposed SP&R utilizes Optimization Modulo Theories (OMT), an extension of Satisfiability Modulo Theories (SMT), to obtain optimal standard cell layouts by virtue of SAT (Boolean Satisfiability)-based fast reasoning.
Abstract: Standard cell synthesis requires careful engineering to ensure routability across various digital IC designs, since physical design (PD) for sub-7nm technology nodes demands holistic efforts to address urgent and nontrivial design challenges. The smaller number of routing tracks and more complex design rules arising from sophisticated multi-patterning technology make place-and-route (P&R) for standard cell design extremely hard and time-consuming. Many conventional approaches have been suggested for improving transistor-level P&R and pin accessibility, but they remain insufficient because of their heuristic, divide-and-conquer nature. In this paper, we propose a novel framework, SP&R, which simultaneously solves P&R for standard cell layout, without any sequential procedures between the place and route steps, by using dynamic pin allocation-based cell synthesis. The proposed SP&R utilizes Optimization Modulo Theories (OMT), an extension of Satisfiability Modulo Theories (SMT), to obtain optimal standard cell layouts by virtue of SAT (Boolean Satisfiability)-based fast reasoning. We validate that our SP&R framework achieves a 10.5% reduction in metal length on average compared to the sequential approach, on practical standard cell designs targeting sub-7nm technology nodes.
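
To illustrate OMT itself: z3's Optimize engine accepts hard constraints plus an objective. A toy 1-D placement with a wirelength objective, far simpler than the SP&R formulation, assuming the z3-solver Python package is installed:

```python
from z3 import Int, If, Optimize, sat

def dist(a, b):
    d = a - b
    return If(d >= 0, d, -d)   # |a - b| for integer terms

# Toy 1-D cell-internal placement: three devices on integer sites 0..7,
# net1 connects t0-t1 and net2 connects t1-t2.
t = [Int(f"t{i}") for i in range(3)]
opt = Optimize()
for v in t:
    opt.add(v >= 0, v <= 7)
for i in range(3):
    for j in range(i + 1, 3):
        opt.add(t[i] != t[j])              # no two devices on the same site

opt.minimize(dist(t[0], t[1]) + dist(t[1], t[2]))   # the OMT objective

if opt.check() == sat:
    m = opt.model()
    print("sites:", [m[v].as_long() for v in t])
```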

Proceedings ArticleDOI
01 Jan 2020
TL;DR: This work presents a reconfigurable approximate multiplier to support multiplications at various precisions, i.e., bit-widths, and proposes a new block-based approximate adder as part of the multiplier to ensure energy efficient operation with awareness of bit-wise correlation.
Abstract: Quantized CNNs, featuring different bit-widths at different layers, have been widely deployed in mobile and embedded applications. An implementation of a quantized CNN may have multiple multipliers at different precisions with limited resource reuse, or one multiplier at a higher precision than needed, causing area overhead. It is therefore highly desirable to design a multiplier that accounts for the characteristics of quantized CNNs to ensure both flexibility and energy efficiency. In this work, we present a reconfigurable approximate multiplier to support multiplications at various precisions, i.e., bit-widths. Moreover, unlike prior works that assume a uniform distribution with bit-wise independence, a quantized CNN may have a centralized weight distribution and hence follow a Gaussian-like distribution with correlated adjacent bits. Thus, a new block-based approximate adder is also proposed as part of the multiplier to ensure energy-efficient operation with awareness of bit-wise correlation. Our experimental results show that the proposed adder reduces the error rate by 76-98% over a state-of-the-art approximate adder in such scenarios. Moreover, with the deployment of the proposed multiplier, which is 17% faster and 22% more power-efficient than a Xilinx multiplier IP at the same precision, a quantized CNN implemented on an FPGA achieves a 17% latency reduction and 15% power saving compared with a full-precision implementation.
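
The basic block-based approximation is simple to model: split the operands into fixed-width blocks and drop the carries between blocks. The paper's adder additionally exploits the bit-wise correlation of quantized weights; this sketch shows only the carry-cutting idea:

```python
def block_approx_add(a: int, b: int, width: int = 16, block: int = 4) -> int:
    """Block-based approximate add: carries never propagate across block borders."""
    result, mask = 0, (1 << block) - 1
    for lo in range(0, width, block):
        sa, sb = (a >> lo) & mask, (b >> lo) & mask
        result |= ((sa + sb) & mask) << lo   # drop the inter-block carry-out
    return result

a, b = 13030, 21931
print(f"exact={a + b}, approx={block_approx_add(a, b)}")
```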

Proceedings ArticleDOI
01 Jan 2020
TL;DR: While the embedded checksums deliver low-cost detection of computation errors upon violations of the trained equilibrium during network inference, the regularization term enables the network to generalize better during training by preventing overfitting, thus leading to significantly higher network accuracy.
Abstract: The abundant usage of deep neural networks in safety-critical domains such as autonomous driving raises concerns regarding the impact of hardware-level faults on deep neural network computations. As a failure can prove to be disastrous, low-cost safety mechanisms are needed to check the integrity of the deep neural network computations. We embed safety checksums into deep neural networks by introducing a custom regularization term in the network training. We partition the outputs of each network layer into two groups and guide the network to balance the summation of these groups through an additional penalty term in the cost function. The proposed approach delivers twin benefits. While the embedded checksums deliver low-cost detection of computation errors upon violations of the trained equilibrium during network inference, the regularization term enables the network to generalize better during training by preventing overfitting, thus leading to significantly higher network accuracy.
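
A numerical sketch of the twin mechanism: during training, a penalty drives the two output groups toward equal sums, and at inference a cheap imbalance check flags faults. The partition, the weight lam, and the tolerance are invented here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=16)             # one layer's outputs for one input
g1, g2 = z[::2], z[1::2]            # fixed partition into two groups

# Training-time penalty added to the task loss: drive sum(g1) toward sum(g2).
lam = 0.1
penalty = lam * (g1.sum() - g2.sum()) ** 2

# Inference-time safety check: a hardware fault corrupting one output breaks
# the trained equilibrium and is flagged without any recomputation.
tol = 1e-1
imbalance = abs(g1.sum() - g2.sum())
print("penalty term:", penalty, "| fault suspected:", imbalance > tol)
```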

Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper proposes a Chip-Package Co-Design flow for implementing 2.5D systems using existing commercial chip design tools that enables design efficiency and flexibility with accurate cross-boundary parasitic extraction and design verification.
Abstract: Chiplet integration using 2.5D packaging is gaining popularity, as it enables several interesting features such as heterogeneous integration and the drop-in design method. In the traditional die-by-die approach to designing a 2.5D system, each chiplet is designed independently without any knowledge of the package RDLs. In this paper, we propose a chip-package co-design flow for implementing 2.5D systems using existing commercial chip design tools. Our flow encompasses 2.5D-aware partitioning suitable for SoC design, chip-package floorplanning, and post-design analysis and verification of the entire 2.5D system. We also designed our own package planners to route the RDL layers on top of the chiplet layers. We use an ARM Cortex-M0 SoC system to illustrate our flow and compare analysis results with a monolithic 2D implementation of the same system. We also compare two different 2.5D implementations of the same SoC system following the drop-in approach. Alongside the traditional die-by-die approach, our holistic flow enables design efficiency and flexibility with accurate cross-boundary parasitic extraction and design verification.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper introduces a nonvolatile ferroelectric field-effect transistors (FeFET)-based sequential multiplier with the ability to do continued calculation after a power outage, thus achieving zero backup overhead.
Abstract: Energy-harvesting Internet-of-Things devices must deal with unstable power input. Nonvolatile processors (NVPs) offer an effective solution. Compact, low-energy arithmetic circuits that can efficiently switch between computation and backup operations are highly desirable for NVP design. This paper introduces a nonvolatile ferroelectric field-effect transistor (FeFET)-based sequential multiplier with the ability to continue a calculation after a power outage, thus achieving zero backup overhead. We exploit the unique characteristics of FeFETs to construct the key components of a sequential multiplier. The multiplier relies on a FeFET-based adder and a new FeFET-based latch to achieve compact area and low operating energy. Moreover, it uses the hysteretic characteristic of FeFETs to realize storage capability, and is hence able to store, at no extra cost, the intermediate data of an operation in a nonvolatile manner. This property supports continued computation when the power supply is intermittent. Simulation results show that, assuming the same technology node, the proposed FeFET-based multiplier saves up to 21% and 19% in area compared with conventional 4-bit and 8-bit CMOS-based sequential multipliers, respectively. It also uses 32% and 73% less area than CMOS-based array multipliers. Furthermore, the proposed design offers up to 32%/23% energy savings per operation compared with a 4/8-bit CMOS-based sequential multiplier.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: Key milestones in the evolution of Ferroelectric Field Effect Transistors (FeFETs) and the emergence of a versatile ferroelectronic platform are presented, with ferroelectrics emerging as a highly promising platform for a range of applications.
Abstract: The discovery of ferroelectricity in doped hafnium dioxide thin films has ignited tremendous activity in the exploration of ferroelectric FETs for a range of applications, from low-power logic to embedded non-volatile memory to in-memory compute kernels. In this paper, key milestones in the evolution of Ferroelectric Field Effect Transistors (FeFETs) and the emergence of a versatile ferroelectronic platform are presented. The FeFET exhibits superior energy efficiency and high performance as embedded nonvolatile memory. When embedded into logic elements such as SRAMs or D flip-flops, nonvolatile processors can be designed, which is critical for intermittent computing with unreliable power. Partial polarization switching in multi-domain ferroelectrics can be harnessed to develop analog synaptic weight cells for deep learning accelerators. To further improve the energy efficiency of computation, ferroelectric in-memory computing hardware primitives are designed, a prominent example being the ferroelectric TCAM. Utilizing ferroelectric switching dynamics, a ferroelectric neuron with intrinsic homeostasis can be realized, enabling a unified ferroelectric platform for spiking neural networks. From all these developments, ferroelectrics emerge as a highly promising platform for a variety of exciting applications.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper presents a novel methodology for designing approximate multipliers on FPGA-based fabrics; the resulting designs have a better accuracy-hardware tradeoff than other designs with comparable accuracy.
Abstract: Approximate multiplier design is an effective technique to improve hardware performance at the cost of accuracy loss. Current approximate multipliers are mostly ASIC-based and are dedicated to one particular application. In contrast, the FPGA has been an attractive choice for many applications because of its high performance, reconfigurability, and fast development. This paper presents a novel methodology for designing approximate multipliers on FPGA-based fabrics. Area and latency are significantly reduced by cutting the carry propagation path in the multiplier. Moreover, we explore higher-order multipliers in the architectural space by using our proposed small-size approximate multipliers as elementary modules. For different accuracy requirements, eight configurations of an approximate 8 × 8 multiplier are discussed. In terms of mean relative error distance (MRED), the accuracy loss of the proposed 8 × 8 multiplier is as low as 0.17%. Compared with the exact multiplier, our proposed design reduces area by 43.66% and power by 20.36%. The critical path latency reduction is up to 27.66%. The proposed multiplier design has a better accuracy-hardware tradeoff than other designs with comparable accuracy.

Proceedings ArticleDOI
01 Jan 2020
TL;DR: Experimental results prove the feasibility of the proposed side-channel leakage evaluation method at the pre-silicon stage; remedies and suggested security design rules are also discussed.
Abstract: Electromagnetic (EM) side-channel attacks aim at extracting secret information from cryptographic hardware implementations. Countermeasures have been proposed at the device level, register-transfer level (RTL), and layout level; though efficient, they still call for a quantitative assessment of a hardware implementation's resistance against EM side-channel attacks. In this paper, we propose an EM side-channel security evaluation and optimization framework based on t-test evaluation results derived from RTL hardware implementations. Different implementations of the same cryptographic algorithm are evaluated under different hypothetical leakage models that consider the driving capabilities of logic components, and the evaluation results are validated with side-channel attacks on an FPGA platform. Experimental results prove the feasibility of the proposed side-channel leakage evaluation method at the pre-silicon stage. Remedies and suggested security design rules are also discussed.
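
The underlying t-test evaluation can be illustrated with the standard Welch's t-test on fixed-versus-random trace sets, using the conventional |t| > 4.5 leakage threshold (the traces below are simulated):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Simulated side-channel measurements: the fixed-input set carries a small
# data-dependent component; the random-input set does not.
n = 5000
random_set = rng.normal(0.0, 1.0, size=n)
fixed_set = rng.normal(0.05, 1.0, size=n)    # tiny leakage offset

t, _ = ttest_ind(fixed_set, random_set, equal_var=False)   # Welch's t-test
print(f"|t| = {abs(t):.2f} ->",
      "LEAKY" if abs(t) > 4.5 else "no evidence of leakage")
```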

Proceedings ArticleDOI
01 Jan 2020
TL;DR: This paper proposes LanCe, a comprehensive and lightweight CNN defense methodology against different physical adversarial attacks, based on the finding that non-semantic adversarial perturbations activate the CNN with significantly abnormal activations that can even overwhelm the activations of semantic input patterns.
Abstract: Recently, adversarial attacks have been applied in the physical world, causing practical issues for various applications powered by Convolutional Neural Networks (CNNs). Most existing physical adversarial attack defenses focus only on eliminating explicit perturbation patterns from inputs, without interpreting the CNN's intrinsic vulnerability. Therefore, they lack versatility across different attacks and incur considerable data-processing costs. In this paper, we propose LanCe, a comprehensive and lightweight CNN defense methodology against different physical adversarial attacks. By interpreting the CNN's vulnerability, we find that non-semantic adversarial perturbations can activate the CNN with significantly abnormal activations that can even overwhelm the activations of semantic input patterns. We improve the CNN recognition process by adding a self-verification stage to detect potential adversarial inputs at the cost of only one CNN inference. Based on the detection result, we further propose a data recovery methodology to defend against physical adversarial attacks. We apply this defense methodology to both image and audio CNN recognition scenarios and analyze the computational complexity of each. Experiments show that our methodology achieves an average 91% attack detection success rate and 89% accuracy recovery. Moreover, it is at most 3× faster than state-of-the-art defense methods, making it feasible for resource-constrained embedded systems such as mobile devices.
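
A toy version of the self-verification idea: calibrate the range of peak activations on clean inputs, then flag any input whose peak activation is abnormally high, at the cost of one extra comparison per inference. Thresholds are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Activations of the last hidden layer for clean calibration inputs (ReLU-like).
calib = rng.normal(0.0, 1.0, size=(1000, 64)).clip(0)
mu, sigma = calib.max(axis=1).mean(), calib.max(axis=1).std()

def suspicious(act, k=4.0):
    """Flag inputs whose peak activation is abnormally high vs. calibration,
    one cheap check per inference with no second forward pass."""
    return act.max() > mu + k * sigma

clean = rng.normal(0.0, 1.0, size=64).clip(0)
patched = clean.copy()
patched[7] = 25.0    # adversarial-patch-like activation spike
print("clean flagged:", suspicious(clean), "| patched flagged:", suspicious(patched))
```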