
Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2020"


Journal ArticleDOI
Onur Mutlu, Jeremie S. Kim
TL;DR: This retrospective comprehensively surveys the scientific literature on RowHammer-based attacks as well as mitigation techniques to prevent RowHammer, and discusses what other related vulnerabilities may be lurking in DRAM and other types of memories, e.g., NAND flash memory or phase change memory, that can potentially threaten the foundations of secure systems.
Abstract: This retrospective paper describes the RowHammer problem in dynamic random access memory (DRAM), which was initially introduced by Kim et al. at the ISCA 2014 Conference. RowHammer is a prime (and perhaps the first) example of how a circuit-level failure mechanism can cause a practical and widespread system security vulnerability. It is the phenomenon that repeatedly accessing a row in a modern DRAM chip causes bit flips in physically adjacent rows at consistently predictable bit locations. RowHammer is caused by a hardware failure mechanism called DRAM disturbance errors , which is a manifestation of circuit-level cell-to-cell interference in a scaled memory technology. Researchers from Google Project Zero demonstrated in 2015 that this hardware failure mechanism can be effectively exploited by user-level programs to gain kernel privileges on real systems. Many other follow-up works demonstrated other practical attacks exploiting RowHammer. In this paper, we comprehensively survey the scientific literature on RowHammer-based attacks as well as mitigation techniques to prevent RowHammer. We also discuss what other related vulnerabilities may be lurking in DRAM and other types of memories, e.g., NAND flash memory or phase change memory, that can potentially threaten the foundations of secure systems, as the memory technologies scale to higher densities. We conclude by describing and advocating a principled approach to memory reliability and security research that can enable us to better anticipate and prevent such vulnerabilities.

153 citations
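
For readers unfamiliar with the mechanism, the following is a minimal, illustrative sketch (not from the paper) of how disturbance accumulation can be modeled in software: each activation of a DRAM row disturbs its physical neighbors, and a victim becomes vulnerable once the count between refreshes exceeds a threshold. The threshold, the single-neighbor coupling, and the class names are assumptions chosen for demonstration only.

```python
# Illustrative model of DRAM disturbance errors underlying RowHammer.
# HAMMER_THRESHOLD and the neighbor-only coupling are demonstration-only
# assumptions; real chips have chip-specific, typically larger values.
from collections import defaultdict

HAMMER_THRESHOLD = 50_000       # assumed activations-per-refresh-window limit

class DramBank:
    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.disturb = defaultdict(int)   # victim row -> accumulated activations

    def activate(self, row):
        # Every activation of 'row' disturbs its physically adjacent rows.
        for victim in (row - 1, row + 1):
            if 0 <= victim < self.num_rows:
                self.disturb[victim] += 1

    def refresh(self):
        # A refresh restores cell charge and clears accumulated disturbance.
        self.disturb.clear()

    def vulnerable_rows(self):
        # Rows whose neighbors were hammered past the assumed threshold.
        return [r for r, n in self.disturb.items() if n >= HAMMER_THRESHOLD]

bank = DramBank(num_rows=65536)
for _ in range(60_000):          # double-sided hammering of row 1000
    bank.activate(999)
    bank.activate(1001)
print(bank.vulnerable_rows())    # -> [998, 1000, 1002]
```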


Journal ArticleDOI
TL;DR: It is demonstrated that the co-exploration framework can effectively expand the search space to incorporate models with high accuracy, and it is theoretically shown that the proposed two-level optimization can efficiently prune inferior solutions to better explore the search space.
Abstract: We propose a novel hardware and software co-exploration framework for efficient neural architecture search (NAS). Different from existing hardware-aware NAS, which assumes a fixed hardware design and explores only the NAS space, our framework simultaneously explores both the architecture search space and the hardware design space to identify the best neural architecture and hardware pairs that maximize both test accuracy and hardware efficiency. Such a practice greatly opens up the design freedom and pushes forward the Pareto frontier between hardware efficiency and test accuracy for better design tradeoffs. The framework iteratively performs a two-level (fast and slow) exploration. Without lengthy training, the fast exploration can effectively fine-tune hyperparameters and prune inferior architectures in terms of hardware specifications, which significantly accelerates the NAS process. Then, the slow exploration trains candidates on a validation set and updates a controller using reinforcement learning to maximize the expected accuracy together with the hardware efficiency. In this article, we demonstrate that the co-exploration framework can effectively expand the search space to incorporate models with high accuracy, and we theoretically show that the proposed two-level optimization can efficiently prune inferior solutions to better explore the search space. The experimental results on ImageNet show that the co-exploration NAS can find solutions with the same accuracy but 35.24% higher throughput and 54.05% higher energy efficiency compared with hardware-aware NAS.

116 citations
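
A minimal sketch of the fast/slow two-level idea under stated assumptions: the fast level prunes (architecture, hardware) pairs with a cheap analytical latency proxy, and the slow level scores the survivors with a reward mixing accuracy and hardware efficiency. All encodings, models, and numbers below are invented placeholders, not the paper's controller or trained accuracies.

```python
# Sketch of the fast/slow two-level co-exploration idea (illustrative only).
# The architecture encoding, latency model, and reward weights are assumptions.
import random

def sample_pair():
    """Randomly sample a (neural architecture, hardware design) pair."""
    arch = {"depth": random.choice([8, 12, 16]),
            "width": random.choice([32, 64, 128])}
    hw = {"pe_count": random.choice([64, 128, 256]),
          "buffer_kb": random.choice([128, 256, 512])}
    return arch, hw

def estimated_latency_ms(arch, hw):
    # Cheap analytical proxy used by the *fast* level (no training needed).
    macs = arch["depth"] * arch["width"] ** 2
    return macs / (hw["pe_count"] * 1e3)

def fast_level(candidates, latency_budget_ms):
    # Prune pairs that violate the hardware specification.
    return [(a, h) for a, h in candidates
            if estimated_latency_ms(a, h) <= latency_budget_ms]

def slow_level(survivors):
    # Placeholder for training + controller update: here we only score pairs
    # with a made-up reward mixing (fake) accuracy and hardware efficiency.
    def reward(arch, hw):
        fake_accuracy = 0.5 + 0.03 * arch["depth"] / 16 + 0.2 * arch["width"] / 128
        efficiency = 1.0 / (1.0 + estimated_latency_ms(arch, hw))
        return 0.7 * fake_accuracy + 0.3 * efficiency
    return max(survivors, key=lambda p: reward(*p))

candidates = [sample_pair() for _ in range(100)]
survivors = fast_level(candidates, latency_budget_ms=3.0)
if survivors:
    best_arch, best_hw = slow_level(survivors)
    print("best pair:", best_arch, best_hw)
```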


Journal ArticleDOI
TL;DR: A general framework for ML attacks on strong PUFs is proposed, and two novel modeling attacks, named logical approximation and global approximation, that use artificial neural networks (ANNs) to characterize the nonlinear structure of the MPUF, rMPUF, cMPUF, and XOR Arbiter PUF are presented.
Abstract: Physical unclonable function (PUF) is a promising lightweight hardware security primitive for resource-constrained systems. It can generate a large number of challenge-response pairs (CRPs) for device authentication based on process variations. However, attackers can collect the CRPs to build a machine learning (ML) model with high prediction accuracy for the PUF. Recently, a lot of ML-resistant PUF structures have been proposed, e.g., a multiplexer-based PUF (MPUF) was introduced to resist ML attacks and its two variants (rMPUF and cMPUF) were further proposed to resist reliability-based and cryptanalysis modeling attacks, respectively. In this article, we propose a general framework for ML attacks on strong PUFs, then based on the framework, we present two novel modeling attacks, named logical approximation and global approximation, that use artificial neural network (ANN) to characterize the nonlinear structure of MPUF, rMPUF, cMPUF, and XOR Arbiter PUF. The logical approximation method uses linear functions to approximate logical operations and builds a precise soft model based on the combination of logical gates in the PUF. The global approximation method uses the function sinc with filtering characteristics to fit the mapping relationship between the challenge and response. The experimental results show that the proposed two approximation attacks can successfully model the ( $n$ , $k$ )-MPUF ( $k= 3, 4$ ), ( $n$ , $k$ )-rMPUF ( $k = 2, 3$ ), cMPUF ( $k = 4, 5$ ), and $l$ -XOR Arbiter PUF ( $l= 3, 4, 5$ ) ( $n = 32, 64$ ) with the average accuracies of 96.85%, 95.33%, 94.52%, and 96.26%, respectively.

85 citations
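
As background for such modeling attacks, here is a hedged sketch of the generic setup they build on: simulate challenge-response pairs of a toy arbiter PUF with the standard linear delay model, then fit a small neural network on the collected CRPs. This is not the paper's logical or global approximation method; it assumes numpy and scikit-learn are available.

```python
# Generic ANN modeling attack on a simulated 32-stage arbiter PUF (a toy
# linear delay model), illustrating why CRP collection enables ML attacks.
# This is NOT the paper's logical/global approximation; it is the baseline
# setup such attacks start from. Requires numpy and scikit-learn.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
N_STAGES, N_CRPS = 32, 20000

# Secret per-stage delay differences (process variation).
weights = rng.normal(size=N_STAGES + 1)

def challenge_to_features(challenges):
    # Standard parity-feature transform for arbiter PUFs:
    # phi_i = prod_{j>=i} (1 - 2*c_j), plus a constant bias feature.
    signs = 1 - 2 * challenges                      # {0,1} -> {+1,-1}
    phi = np.cumprod(signs[:, ::-1], axis=1)[:, ::-1]
    return np.hstack([phi, np.ones((challenges.shape[0], 1))])

challenges = rng.integers(0, 2, size=(N_CRPS, N_STAGES))
features = challenge_to_features(challenges)
responses = (features @ weights > 0).astype(int)    # ideal (noise-free) responses

# Attacker trains an ANN on 90% of the collected CRPs.
split = int(0.9 * N_CRPS)
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
model.fit(features[:split], responses[:split])
print("prediction accuracy:", model.score(features[split:], responses[split:]))
```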


Journal ArticleDOI
TL;DR: The evolution of logic locking over the last decade is surveyed, and the various “cat-and-mouse” games involved in logic locking are introduced along with its novel applications, including processor pipelines, graphics processing units (GPUs), and analog circuits.
Abstract: The fabless business model has given rise to many security threats, including piracy of intellectual property (IP), overproduction, counterfeiting, reverse engineering (RE), and hardware Trojans (HT). Such threats severely undermine the benefits of the fabless model. Among the countermeasures developed to thwart piracy and RE attacks, logic locking has emerged as a promising and versatile solution that is being adopted by both academia and industry. The idea behind logic locking is to lock the design using a “keying” mechanism; only the rightful owner has control over the locked design. Therefore, the design remains nonfunctional without the knowledge of the key. In this article, we survey the evolution of logic locking over the last decade. We introduce the various “cat-and-mouse” games involved in logic locking along with its novel applications, including processor pipelines, graphics processing units (GPUs), and analog circuits. We intend this article to be a primer for researchers interested in developing new logic-locking techniques and employing logic locking in different application domains.

79 citations
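
A toy sketch of the keying idea on an assumed trivially small circuit: an XOR key gate inserted on an internal wire leaves the function unchanged under the correct key bit and corrupts it otherwise. Real flows insert key gates into synthesized netlists and resynthesize.

```python
# Toy illustration of XOR-based logic locking on a tiny combinational function.
# The function and key-gate placement are simplified assumptions.

def original_circuit(a, b, c):
    # Example function: out = (a AND b) OR c
    return (a & b) | c

def locked_circuit(a, b, c, key):
    # A key gate (XOR) is inserted on the internal wire (a AND b).
    # With the correct key bit the XOR is transparent; otherwise it inverts
    # the wire and the circuit produces wrong outputs for some inputs.
    w = (a & b) ^ key[0]
    return w | c

correct_key = (0,)      # key bit 0 makes the XOR transparent
wrong_key = (1,)

mismatches = 0
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert locked_circuit(a, b, c, correct_key) == original_circuit(a, b, c)
            mismatches += locked_circuit(a, b, c, wrong_key) != original_circuit(a, b, c)
print("input patterns corrupted by the wrong key:", mismatches)  # -> 4
```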


Journal ArticleDOI
TL;DR: This paper presents practical case studies to demonstrate MRIMA’s acceleration for binary-weight and low bit-width convolutional neural networks (CNNs) as well as data encryption, and shows ~77% and 21% lower energy consumption compared to CMOS-ASIC and recent domain-wall-based design, respectively.
Abstract: In this paper, we propose MRIMA, as a novel magnetic RAM (MRAM)-based in-memory accelerator for nonvolatile, flexible, and efficient in-memory computing. MRIMA transforms current spin transfer torque magnetic random access memory (STT-MRAM) arrays to massively parallel computational units capable of working as both nonvolatile memory and in-memory logic. Instead of integrating complex logic units in cost-sensitive memory, MRIMA exploits hardware-friendly bit-line computing methods to implement complete Boolean logic functions between operands within a memory array in a single clock cycle, overcoming the multicycle logic issue in contemporary processing-in-memory (PIM) platforms. We present practical case studies to demonstrate MRIMA’s acceleration for binary-weight and low bit-width convolutional neural networks (CNNs) as well as data encryption. Our device-to-architecture co-simulation results on CNN acceleration demonstrate that MRIMA can obtain $1.7 {\times }$ better energy-efficiency and $11.2{\times }$ speed-up compared to ASICs, and $1.8 {\times }$ better energy-efficiency and $2.4 {\times }$ speed-up over the best DRAM-based PIM solutions. As an advanced encryption standard (AES) in-memory encryption engine, MRIMA shows ~77% and 21% lower energy consumption compared to CMOS-ASIC and recent domain-wall-based design, respectively.

76 citations


Journal ArticleDOI
TL;DR: Pipe-it develops a performance-prediction model that utilizes only the convolutional layer descriptors to predict the execution time of each layer individually on all permitted core configurations (type and count), and exploits the predictions to create a balanced pipeline using an efficient design space exploration algorithm.
Abstract: Internet of Things edge intelligence requires convolutional neural network (CNN) inference to take place on the edge devices themselves. The ARM big.LITTLE architecture is at the heart of prevalent commercial edge devices. It comprises single-ISA heterogeneous cores grouped into multiple homogeneous clusters that enable power and performance tradeoffs. All cores are expected to be simultaneously employed in inference to attain maximal throughput. However, the high communication overhead involved in parallelizing the computations of convolution kernels across clusters is detrimental to throughput. We present an alternative framework called Pipe-it that employs a pipelined design to split convolutional layers across clusters while limiting parallelization of their respective kernels to the assigned cluster. We develop a performance-prediction model that utilizes only the convolutional layer descriptors to predict the execution time of each layer individually on all permitted core configurations (type and count). Pipe-it then exploits the predictions to create a balanced pipeline using an efficient design space exploration algorithm. Pipe-it on average results in 39% higher throughput than the highest antecedent throughput.

73 citations
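
A minimal sketch of the pipeline-balancing step, assuming invented per-layer time predictions for a two-cluster big.LITTLE device: contiguous layer ranges are assigned to clusters so that the slowest stage, which bounds throughput, is minimized.

```python
# Sketch of balancing a two-stage layer pipeline across a big and a LITTLE
# cluster, given per-layer execution-time predictions. The predicted times
# below are invented; Pipe-it obtains them from a layer-descriptor model.

predicted_ms = {
    # layer: (time on big cluster, time on LITTLE cluster)
    "conv1": (1.2, 3.0),
    "conv2": (2.5, 6.1),
    "conv3": (2.2, 5.4),
    "conv4": (1.8, 4.3),
    "conv5": (1.0, 2.4),
}
layers = list(predicted_ms)

def stage_time(assigned, cluster_index):
    # A stage's time is the sum of its layers' predicted times on that cluster.
    return sum(predicted_ms[l][cluster_index] for l in assigned)

best = None
for split in range(1, len(layers)):            # contiguous split point
    big_stage, little_stage = layers[:split], layers[split:]
    # Pipeline throughput is limited by the slower of the two stages.
    bottleneck = max(stage_time(big_stage, 0), stage_time(little_stage, 1))
    if best is None or bottleneck < best[0]:
        best = (bottleneck, split)

bottleneck, split = best
print(f"assign {layers[:split]} to big, {layers[split:]} to LITTLE")
print(f"pipeline interval: {bottleneck:.1f} ms per inference")
```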


Journal ArticleDOI
TL;DR: This paper presents an overview of several research efforts that propose to use machine learning techniques for power and thermal management on single-core and multicore processors; such techniques can potentially adapt to varying system conditions and workloads.
Abstract: Due to the high integration density and the roadblock of voltage scaling, modern multicore processors experience higher power densities than previous technology scaling nodes. When unattended, this issue might lead to temperature hot spots that, in turn, may cause nonuniform aging, accelerate chip failure, impair reliability, and reduce the performance of the system. This paper presents an overview of several research efforts that propose to use machine learning (ML) techniques for power and thermal management on single-core and multicore processors. Traditional power and thermal management techniques rely on a priori knowledge of the chip’s thermal model, as well as information about the workloads/applications to be executed (e.g., transient and average power consumption). Nevertheless, this a priori information is not always available, and even if it is, it cannot reflect the spatial and temporal uncertainties and variations that come from the environment, the hardware, or from the workloads/applications. In contrast, techniques based on ML can potentially adapt to varying system conditions and workloads, learning from past events in order to improve themselves as the environment changes, resulting in improved management decisions.

70 citations


Journal ArticleDOI
TL;DR: This article presents a survey of the different modern high-level synthesis (HLS) design space exploration (DSE) techniques proposed so far, addresses the critical issues that remain unresolved, and identifies new opportunities in this field.
Abstract: This article presents a survey of the different modern high-level synthesis (HLS) design space exploration (DSE) techniques that have been proposed so far to automatically generate hardware accelerators of different tradeoffs. HLS has multiple advantages compared to traditional RT-level-based hardware design. One key advantage is that a variety of microarchitectures with unique tradeoffs can be obtained from the same untimed behavioral description by setting different synthesis options. Out of all the possible microarchitectures, the ones that designers are most interested in are the Pareto-optimal ones. The main problem is that the search space grows superlinearly with the number of synthesis options, and hence, heuristics have been proposed to search the space efficiently. This article summarizes the main techniques proposed, addresses the critical issues that remain unresolved, and identifies new opportunities in this field. It also serves as a guide for anyone wanting to create their own HLS DSE.

69 citations
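
The basic step every such DSE heuristic relies on is extracting the Pareto-optimal designs from evaluated (latency, area) points; a minimal sketch with invented design points follows.

```python
# Extracting Pareto-optimal designs from evaluated (latency, area) points,
# the basic step every HLS DSE heuristic builds on. Points are invented.

designs = {
    # knob setting            (latency_cycles, area_LUTs)
    "unroll=1,pipeline=off":  (1200, 1500),
    "unroll=2,pipeline=off":  ( 700, 2600),
    "unroll=2,pipeline=on":   ( 400, 3900),
    "unroll=4,pipeline=on":   ( 250, 7200),
    "unroll=4,pipeline=off":  ( 650, 5200),   # dominated by the pipelined unroll=2 point
}

def dominates(p, q):
    # p dominates q if it is no worse in both objectives and better in one.
    return p[0] <= q[0] and p[1] <= q[1] and p != q

def pareto_front(points):
    return {name: p for name, p in points.items()
            if not any(dominates(q, p) for q in points.values())}

for name, (lat, area) in sorted(pareto_front(designs).items(), key=lambda x: x[1]):
    print(f"{name:26s} latency={lat:5d} cycles  area={area:5d} LUTs")
```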


Journal ArticleDOI
TL;DR: A novel automatic framework for efficient implementation of arbitrary combinational logic functions within a memristive memory is presented, built on synthesis and in-memory mapping of logic execution in a single row (SIMPLER), a tool that optimizes the execution of in-memory logic operations in terms of throughput and area.
Abstract: In-memory processing can dramatically improve the latency and energy consumption of computing systems by minimizing the data transfer between the memory and the processor. Efficient execution of processing operations within the memory is therefore, a highly motivated objective in modern computer architecture. This article presents a novel automatic framework for efficient implementation of arbitrary combinational logic functions within a memristive memory. Using tools from logic design, graph theory and compiler register allocation technology, we developed synthesis and in-memory mapping of logic execution in a single row (SIMPLER), a tool that optimizes the execution of in-memory logic operations in terms of throughput and area. Given a logical function, SIMPLER automatically generates a sequence of atomic memristor-aided logic (MAGIC) NOR operations and efficiently locates them within a single size-limited memory row, reusing cells to save area when needed. This approach fully exploits the parallelism offered by the MAGIC NOR gates. It allows multiple instances of the logic function to be performed concurrently, each compressed into a single row of the memory. This virtue makes SIMPLER an attractive candidate for designing in-memory single instruction, multiple data (SIMD) operations. Compared to the previous work (that optimizes latency rather than throughput for a single function), SIMPLER achieves an average throughput improvement of $435\times $ . When the previous tools are parallelized similarly to SIMPLER, SIMPLER achieves higher throughput of at least $5\times $ , with $23\times $ improvement in area and $20\times $ improvement in area efficiency. These improvements more than fully compensate for the increase (up to 17% on average) in latency.

57 citations
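
A functional sketch of the single-row mapping with cell reuse, assuming a toy NOR-only netlist and row size: a cell is recycled as soon as the signal it holds is past its last use, in the spirit of register allocation. The MAGIC NOR operation itself is only modeled functionally here, not at the device level.

```python
# Functional sketch of mapping a NOR-only netlist onto one size-limited
# memory row with cell reuse (register-allocation style). The netlist is a
# toy example; in SIMPLER each operation would be a MAGIC NOR in the array.

# gate name -> (operand_a, operand_b); primary inputs are 'i0', 'i1', 'i2'.
netlist = {
    "n1": ("i0", "i1"),
    "n2": ("i1", "i2"),
    "n3": ("n1", "n2"),
    "out": ("n3", "n3"),   # NOR(x, x) = NOT(x)
}
inputs = {"i0": 0, "i1": 1, "i2": 0}
ROW_CELLS = 6              # assumed row size limit (including input cells)

# Liveness: record the last gate that reads each signal so its cell can be recycled.
last_use = {}
for gate, (a, b) in netlist.items():
    last_use[a] = gate
    last_use[b] = gate

free_cells = list(range(ROW_CELLS))
location, value = {}, {}
for name, bit in inputs.items():           # inputs preloaded into the row
    location[name] = free_cells.pop(0)
    value[name] = bit

for gate, (a, b) in netlist.items():       # topological order assumed
    cell = free_cells.pop(0)               # allocate a cell for the result
    value[gate] = 1 - (value[a] | value[b])  # NOR
    location[gate] = cell
    for operand in {a, b}:
        if last_use[operand] == gate:      # operand dead: recycle its cell
            free_cells.append(location.pop(operand))
    print(f"{gate} = NOR({a},{b}) -> cell {cell}, value {value[gate]}")

print("out value:", value["out"], "| cells still occupied:", sorted(location.values()))
```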


Journal ArticleDOI
TL;DR: A new noise-aware DVFS sequence optimization technique is proposed that formulates a mixed 0/1 program to solve the clock-skipping sequence optimization problem, and the method is also extended to schedule extensive wake-up activities on different clock domains for the same purpose.
Abstract: Low power system-on-chips (SoCs) are now at the heart of Internet-of-Things (IoT) devices, which are well-known for their bursty workloads and limited energy storage—usually in the form of tiny batteries. To ensure battery lifetime, dynamic voltage frequency scaling (DVFS) has become an essential technique in such SoC chips. With continuously decreasing supply levels, noise margins in these devices are already being squeezed. During a DVFS transition, the large current that accompanies the clock speed change runs into or out of the clock network within a few clock cycles, inducing large $L\,di/dt$ noise and thereby stressing the power delivery system (PDS). Due to the limited area and cost target, adding additional decoupling capacitance to mitigate such noise is usually challenging. A common approach is to gradually introduce/remove additional clock cycles to increase/decrease the clock frequency in steps, also known as clock skipping. However, such a technique may increase the DVFS transition time and still cannot guarantee minimal noise. In this paper, we propose a new noise-aware DVFS sequence optimization technique that formulates a mixed 0/1 program to solve the clock-skipping sequence optimization problem. Moreover, the method is also extended to schedule extensive wake-up activities on different clock domains for the same purpose. The experiments show that the optimized sequence is able to significantly mitigate noise within the desired transition time, thereby saving both power and energy.

56 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a co-search framework that starts from a "hot" state based on a set of existing pretrained models to avoid lengthy training time, which can reduce the search time from 200 GPU hours to less than 3 GPU hours.
Abstract: Hardware and neural architecture co-search, which automatically generates artificial intelligence (AI) solutions from a given dataset, is promising for promoting AI democratization; however, the amount of time required by current co-search frameworks is on the order of hundreds of GPU hours for a single target hardware platform. This inhibits the use of such frameworks on commodity hardware. The root cause of the low efficiency in existing co-search frameworks is the fact that they start from a “cold” state (i.e., search from scratch). In this article, we propose a novel framework, namely, HotNAS, that starts from a “hot” state based on a set of existing pretrained models (also known as a model zoo) to avoid lengthy training time. As such, the search time can be reduced from 200 GPU hours to less than 3 GPU hours. In HotNAS, in addition to the hardware design space and the neural architecture search space, we further integrate a compression space to conduct model compression during the co-search, which creates new opportunities to reduce latency but also brings challenges. One of the key challenges is that all of the above search spaces are coupled with each other, e.g., compression may not work without hardware design support. To tackle this issue, HotNAS builds a chain of tools to design hardware to support compression, based on which a global optimizer is developed to automatically co-search all the involved search spaces. Experiments on the ImageNet dataset and a Xilinx FPGA show that, within the timing constraint of 5 ms, neural architectures generated by HotNAS can achieve up to 5.79% Top-1 and 3.97% Top-5 accuracy gains, compared with the existing ones.
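
A hedged sketch of the hot-start co-search loop, with an invented model zoo, compression space, and hardware space: pick the combination that meets the latency budget with the highest proxy accuracy. The real framework couples these spaces through actual hardware and compression tools rather than lookup tables.

```python
# Sketch of hot-start co-search over a model zoo, a compression space, and a
# hardware space under a latency constraint. All numbers are invented proxies.
import itertools

model_zoo = {            # pretrained model -> (baseline top-1 %, GMACs)
    "mobilenet_v2": (71.8, 0.30),
    "resnet18":     (69.8, 1.80),
    "resnet50":     (76.1, 4.10),
}
compression = {          # pruning ratio -> (accuracy drop %, MAC scaling)
    0.0: (0.0, 1.00),
    0.3: (0.6, 0.70),
    0.5: (1.5, 0.50),
}
hardware = {             # FPGA config -> effective GMAC/s
    "small": 200.0,
    "large": 600.0,
}

LATENCY_BUDGET_MS = 5.0

def evaluate(model, ratio, hw):
    acc, gmacs = model_zoo[model]
    drop, scale = compression[ratio]
    latency_ms = 1000.0 * gmacs * scale / hardware[hw]
    return acc - drop, latency_ms

best = None
for model, ratio, hw in itertools.product(model_zoo, compression, hardware):
    acc, latency = evaluate(model, ratio, hw)
    if latency <= LATENCY_BUDGET_MS and (best is None or acc > best[0]):
        best = (acc, model, ratio, hw, latency)

acc, model, ratio, hw, latency = best
print(f"pick {model} pruned {ratio:.0%} on {hw}: {acc:.1f}% top-1, {latency:.2f} ms")
```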

Journal ArticleDOI
TL;DR: A honeycomb-based RTSV architecture that utilizes area and delay more efficiently while maintaining high yield is proposed, and simulation results show that the proposed architecture has a 99.84% repair rate for uniform faults and an 81.42% repair rate for highly clustered faults.
Abstract: Due to the winding level of the thinned wafers and the surface roughness of silicon dies, the quality of through-silicon vias (TSVs) varies during the fabrication and bonding process. If one TSV exhibits a defect during its manufacturing process, the probability of multiple defects occurring in the TSVs neighboring the faulty TSV increases, i.e., the TSV defects tend to be clustered, which significantly reduces the yield of 3-D integrated circuit. To resolve the clustered TSV faults, router-based, ring-based, group-based, and cellular-based redundant TSV (RTSV) architectures were proposed. However, the repair rate is low and the hardware overhead as well as delay overhead is high. In this article, we propose a honeycomb-based RTSV architecture to utilize the area and delay more efficiently as well as to maintain high yield. The simulation results show that the proposed architecture has a 99.84% repair rate for uniform faults and an 81.42% repair rate for highly clustered faults. The proposed design achieves a 51.66% reduction of hardware overhead compared with the router-based design and a 20.69%, 46.93%, 34.17%, and 11.15% reduction of total delay compared with ring-based, router-based, group-based, and cellular-based methods, respectively.

Journal ArticleDOI
TL;DR: A novel architecture for implementing fast algorithms on FPGAs is presented that effectively pipelines the Winograd/FFT processing element (PE) engine and initiates multiple PEs through parallelization, together with an analytical model to predict resource usage and performance.
Abstract: In recent years, convolutional neural networks (CNNs) have become widely adopted for computer vision tasks. Field-programmable gate arrays (FPGAs) have been adequately explored as a promising hardware accelerator for CNNs due to their high performance, energy efficiency, and reconfigurability. However, prior FPGA solutions based on the conventional convolutional algorithm are often bounded by the computational capability of FPGAs (e.g., the number of DSPs). To address this problem, the feature maps are transformed to a special domain using fast algorithms to reduce the arithmetic complexity. Winograd and fast Fourier transformation (FFT), as fast algorithm representatives, first transform the input data and filters to the Winograd or frequency domain, then perform element-wise multiplication, and finally apply an inverse transformation to get the final output. In this paper, we propose a novel architecture for implementing fast algorithms on FPGAs. Our design employs a line buffer structure to effectively reuse the feature map data among different tiles. We also effectively pipeline the Winograd/FFT processing element (PE) engine and initiate multiple PEs through parallelization. Meanwhile, there exists a complex design space to explore. We propose an analytical model to predict the resource usage and the performance. Then, we use the model to guide a fast design space exploration. Experiments using state-of-the-art CNNs demonstrate the best performance and energy efficiency on FPGAs. We achieve 854.6 and 2479.6 GOP/s for AlexNet and VGG16 on the Xilinx ZCU102 platform using Winograd. We achieve 130.4 GOP/s for ResNet using Winograd and 201.1 GOP/s for YOLO using FFT on the Xilinx ZC706 platform.
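
As a worked example of the fast algorithm being accelerated, the 1-D Winograd F(2,3) variant below transforms a 4-sample input tile and a 3-tap filter, multiplies element-wise, and inverse-transforms to produce two outputs with 4 multiplications instead of 6. The transform matrices are the standard published F(2,3) matrices; the accelerator itself is not modeled.

```python
# 1-D Winograd F(2,3): two convolution outputs from a 4-sample tile using
# 4 element-wise multiplications (direct convolution would need 6).
import numpy as np

# Standard F(2,3) transform matrices.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(tile, filt):
    """Compute [y0, y1] where y_i = sum_k tile[i+k] * filt[k]."""
    U = G @ filt          # transform the filter (can be precomputed)
    V = B_T @ tile        # transform the input tile
    M = U * V             # element-wise multiplication (the 4 multiplies)
    return A_T @ M        # inverse transform -> 2 outputs

tile = np.array([1.0, 2.0, 3.0, 4.0])
filt = np.array([1.0, 0.0, -1.0])
print(winograd_f23(tile, filt))                       # -> [-2. -2.]
print([np.dot(tile[i:i+3], filt) for i in range(2)])  # direct check
```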

Journal ArticleDOI
TL;DR: A revised version of SFLL, namely SFLL-rem, is presented that retains all security properties of SFLL, delivering resilience not only to all the state-of-the-art attacks SFLL can thwart but also to the latest removal attacks that broke some SFLL instances.
Abstract: Logic locking is a holistic solution to counter manufacturing threats, such as intellectual property (IP) piracy and overbuilding, at the hardware level. However, years of research have exposed various flaws in locking, including a Boolean satisfiability (SAT)-based attack. Consequently, several SAT-resilient locking techniques, such as SARLock, Anti-SAT, and SFLL, have been proposed, although certain instances of them have also been broken by a class of attacks called removal attacks. In this article, we approach logic locking by leveraging well-known principles from very large-scale integration (VLSI) testing and elicit logic locking properties that dictate the resilience of a locking technique against different attacks. We present a revised version of SFLL, namely SFLL-rem, that retains all security properties of SFLL, delivering resilience not only to all the state-of-the-art attacks SFLL can thwart but also to the latest removal attacks that broke some SFLL instances. Further, we develop a security-aware CAD framework integrated with industry tools that incurs only −1.5%, 0%, and 4.13% overhead for power, performance, and area, respectively. We demonstrate a silicon implementation of SFLL-rem on an ARM Cortex-M0 microprocessor in 65 nm. Moreover, we provide a framework for an SoC designer to customize logic locking based on the SoC blocks and their threat models; this is illustrated by locking a multimillion-gate SoC provided by DARPA and taking the SoC all the way to GDSII layout.

Journal ArticleDOI
TL;DR: SearcHD, a fully binarized HD computing algorithm with fully binary training, is proposed; it generates multiple binary hypervectors for each class and uses the analog characteristics of nonvolatile memories to perform all encoding, training, and inference computations in memory.
Abstract: Brain-inspired hyperdimensional (HD) computing emulates cognitive tasks by computing with long binary vectors—also known as hypervectors—as opposed to computing with numbers. However, we observed that in order to provide acceptable classification accuracy on practical applications, HD algorithms need to be trained and tested on nonbinary hypervectors. In this article, we propose SearcHD, a fully binarized HD computing algorithm with fully binary training. SearcHD maps every data point to a high-dimensional space with binary elements. Instead of training an HD model with nonbinary elements, SearcHD implements a fully binary training method which generates multiple binary hypervectors for each class. We also use the analog characteristics of nonvolatile memories (NVMs) to perform all encoding, training, and inference computations in memory. We evaluate the efficiency and accuracy of SearcHD on a wide range of classification applications. Our evaluation shows that SearcHD can provide on average $31.1\times$ higher energy efficiency and $12.8\times$ faster training as compared to the state-of-the-art HD computing algorithms.
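
A minimal software sketch of the binary HD pipeline SearcHD builds on, under simplifying assumptions (one class hypervector per class, synthetic data, no NVM modeling): random-projection binary encoding followed by Hamming-distance inference.

```python
# Minimal binary hyperdimensional (HD) classification: binary encoding plus
# Hamming-distance inference. SearcHD additionally keeps several binary
# hypervectors per class and executes these steps inside analog NVM arrays.
import numpy as np

rng = np.random.default_rng(1)
DIM, FEATURES = 4096, 16

projection = rng.standard_normal((FEATURES, DIM))  # fixed random base vectors

def encode(x):
    # Random projection followed by sign thresholding -> binary hypervector.
    return (x @ projection > 0).astype(np.uint8)

# Two synthetic classes (Gaussian blobs) for demonstration.
train0 = rng.standard_normal((100, FEATURES)) + 1.0
train1 = rng.standard_normal((100, FEATURES)) - 1.0

def class_hypervector(samples):
    # Majority vote over the encoded training samples of a class.
    return (np.mean([encode(s) for s in samples], axis=0) > 0.5).astype(np.uint8)

classes = [class_hypervector(train0), class_hypervector(train1)]

def classify(x):
    h = encode(x)
    distances = [np.count_nonzero(h ^ c) for c in classes]  # Hamming distance
    return int(np.argmin(distances))

test = np.vstack([rng.standard_normal((50, FEATURES)) + 1.0,
                  rng.standard_normal((50, FEATURES)) - 1.0])
labels = np.array([0] * 50 + [1] * 50)
accuracy = np.mean(np.array([classify(x) for x in test]) == labels)
print("toy accuracy:", accuracy)
```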

Journal ArticleDOI
Li Yaping, Yong Wang, Li Yusong, Ranran Zhou, Zhaojun Lin
TL;DR: A new analog circuit optimization system for automated sizing of analog integrated circuits is presented that consists of a genetic algorithm (GA)-based global optimization engine and an artificial neural network (ANN)-based local optimization engine, so that the local minimum search (LMS) can run at a much higher speed.
Abstract: This article presents a new analog circuit optimization system for automated sizing of analog integrated circuits. It consists of a genetic algorithm (GA)-based global optimization engine and an artificial neural network (ANN)-based local optimization engine. The key new idea is to use parallel computation to train ANN models for design space neighborhoods so that the local minimum search (LMS) can run at a much higher speed. For the GA-based global optimization, circuit performances are calculated by parallel SPICE simulations. For the LMS, circuit performance data are derived from ANN model predictions instead of SPICE simulations. Since most of the time in an ANN-based LMS is spent on SPICE calls, which can be run in parallel, the LMS process can also exploit the multicore configuration of a modern computational server in addition to the GA global search. The fully parallelized optimization system is deployed to design a two-stage rail-to-rail operational amplifier and a fifth-order active-RC Chebyshev complex band-pass filter. The experimental results show that the proposed method provides about a four-times speedup with comparable results, compared with traditional approaches employing the same parallel global optimization but sequential SPICE calls during the LMS.
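
A hedged sketch of the surrogate-assisted local search idea, with an analytic objective standing in for parallel SPICE evaluations: fit a small ANN on neighborhood samples, then use its cheap predictions to pick the next step. It assumes numpy and scikit-learn; it is not the paper's exact flow.

```python
# Surrogate-assisted local minimum search (LMS) sketch: fit an ANN to samples
# of a neighborhood and step toward the surrogate's best prediction. The
# analytic objective below stands in for parallel SPICE evaluations.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def spice_like_objective(x):
    # Stand-in for a circuit figure of merit returned by SPICE (lower = better).
    return np.sum((x - 0.3) ** 2, axis=-1) + 0.05 * np.sin(10 * x).sum(axis=-1)

center = np.array([0.8, 0.8])           # best point from the GA global search
for step in range(5):
    # "Parallel SPICE": evaluate a batch of neighborhood samples at once.
    samples = center + rng.uniform(-0.1, 0.1, size=(200, 2))
    targets = spice_like_objective(samples)

    # Train the local ANN surrogate on the neighborhood data.
    surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                             random_state=0).fit(samples, targets)

    # Use cheap surrogate predictions (no SPICE) to pick the next center.
    probes = center + rng.uniform(-0.1, 0.1, size=(5000, 2))
    center = probes[np.argmin(surrogate.predict(probes))]
    print(f"step {step}: center={center.round(3)}, "
          f"true objective={spice_like_objective(center):.4f}")
```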

Journal ArticleDOI
TL;DR: In this paper, the authors propose a full-stack compiler deep neural network virtual machine (DNNVM), which is an integration of optimizers for graphs, loops and data layouts, an assembler, a runtime supporter, and a validation environment.
Abstract: The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. Field-programmable gate array-based accelerators driven by custom instruction set architecture (ISA) achieve a balance between generality and efficiency, but there is much on them left to be optimized. We propose the full-stack compiler deep neural network virtual machine (DNNVM), which is an integration of optimizers for graphs, loops and data layouts, an assembler, a runtime supporter, and a validation environment. The DNNVM works in the context of deep learning frameworks and transforms CNN models into the directed acyclic graph: XGraph. Based on XGraph, we transform the optimization challenges for both data layout and pipeline into graph-level problems. DNNVM enumerates all potentially profitable fusion opportunities by a heuristic subgraph isomorphism algorithm to leverage pipeline and data layout optimizations, and searches for the best choice of execution strategies of the whole computing graph. On the Xilinx ZU2@330 MHz and ZU9@330 MHz, we achieve equivalently state-of-the-art performance on our benchmarks by naive implementations without optimizations, and the throughput is further improved up to $1.26\times $ by leveraging heterogeneous optimizations in DNNVM. Finally, with ZU9@330 MHz, we achieve state-of-the-art performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38 TOPs/s for ResNet50 and 1.41 TOPs/s for GoogleNet.

Journal ArticleDOI
TL;DR: QuantHD enables HD computing to work with a low-cost quantized model (binary or ternary) while providing accuracy similar to the floating point model, and an FPGA implementation is proposed that accelerates HD computing in both the training and inference phases.
Abstract: Brain-inspired hyperdimensional (HD) computing models cognition by exploiting properties of high-dimensional statistics—high-dimensional vectors, instead of working with the numeric values used in contemporary processors. A fundamental weakness of existing HD computing algorithms is that they require floating point models in order to provide acceptable accuracy on realistic classification problems. However, working with floating point values significantly increases the HD computation cost. To address this issue, we propose QuantHD, a novel framework for quantization of the HD computing model during training. QuantHD enables HD computing to work with a low-cost quantized model (binary or ternary) while providing accuracy similar to the floating point model. We accordingly propose an FPGA implementation which accelerates HD computing in both the training and inference phases. We evaluate the accuracy and efficiency of QuantHD on various real-world applications, and observe that QuantHD can achieve on average 17.2% accuracy improvement as compared to the existing binarized HD computing algorithms which provide a similar computation cost. In terms of efficiency, the QuantHD FPGA implementation can achieve on average $42.3\times$ and $4.7\times$ ($34.1\times$ and $4.1\times$) energy efficiency improvement and speedup during inference (training) as compared to the state-of-the-art HD computing algorithms.

Journal ArticleDOI
TL;DR: In this paper, the authors propose an efficient method to minimize the number of added gates by first choosing, via simulated annealing, an initial mapping that fits well with the input circuit and then, with the help of a heuristic cost function, stepwise applying the best-selected SWAP gates until all quantum gates in the circuit can be executed.
Abstract: Quantum algorithm design usually assumes access to a perfect quantum computer with ideal properties like full connectivity, noise-freedom, and arbitrarily long coherence time. In noisy intermediate-scale quantum (NISQ) devices, however, the number of qubits is highly limited and quantum operation error and qubit decoherence are not negligible. Besides, the connectivity of physical qubits in a quantum processing unit (QPU) is also strictly constrained. Thereby, additional operations like SWAP gates have to be inserted to satisfy this constraint while preserving the functionality of the original circuit. This process is known as quantum circuit transformation. Adding additional gates will increase both the size and depth of a quantum circuit and, therefore, cause further decay of the performance of a quantum circuit. Thus, it is crucial to minimize the number of added gates. In this article, we propose an efficient method to solve this problem. We first choose, via simulated annealing, an initial mapping that fits well with the input circuit and then, with the help of a heuristic cost function, stepwise apply the best-selected SWAP gates until all quantum gates in the circuit can be executed. Our algorithm runs in time polynomial in all parameters, including the size and the qubit number of the input circuit, and the qubit number in the QPU. Its space complexity is quadratic in the number of edges in the QPU. The experimental results on extensive realistic circuits confirm that the proposed method is efficient and that the number of added gates of our algorithm is, on average, only 57% of that of state-of-the-art algorithms on IBM Q20 (Tokyo), the most recent IBM quantum device.
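
A simplified sketch of the SWAP-insertion step (the simulated-annealing initial mapping is omitted): on a toy coupling graph, candidate SWAPs that bring the front gate's qubits closer are ranked by a heuristic cost, here the summed distance of all pending two-qubit gates. It assumes networkx is available and is not the paper's exact cost function or search.

```python
# Greedy SWAP insertion guided by a distance-based heuristic cost, the core
# step of quantum circuit transformation. Coupling graph and circuit are toy
# examples. Requires networkx.
import networkx as nx

coupling = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 4)])      # linear 5-qubit QPU
dist = dict(nx.all_pairs_shortest_path_length(coupling))

circuit = [("cx", 0, 4), ("cx", 1, 3), ("cx", 0, 2)]        # logical 2-qubit gates
mapping = {q: q for q in range(5)}                          # logical -> physical

def total_cost(m, gates):
    # Heuristic: summed coupling-graph distance of all pending 2-qubit gates.
    return sum(dist[m[a]][m[b]] for _, a, b in gates)

def apply_swap(m, p, q):
    # Swap the logical qubits sitting on physical qubits p and q.
    inv = {phys: log for log, phys in m.items()}
    new = dict(m)
    new[inv[p]], new[inv[q]] = q, p
    return new

schedule, n_swaps, pending = [], 0, list(circuit)
while pending:
    _, a, b = pending[0]
    if dist[mapping[a]][mapping[b]] == 1:        # adjacent: gate is executable
        schedule.append(pending.pop(0))
        continue
    # Candidate SWAPs: edges touching either qubit of the front gate that
    # strictly shorten its distance (guarantees progress and termination).
    candidates = []
    for endpoint, other in ((mapping[a], mapping[b]), (mapping[b], mapping[a])):
        for nbr in coupling.neighbors(endpoint):
            if dist[nbr][other] < dist[endpoint][other]:
                candidates.append((endpoint, nbr))
    # Among them, pick the SWAP minimizing the heuristic cost of all pending gates.
    p, q = min(candidates, key=lambda e: total_cost(apply_swap(mapping, *e), pending))
    mapping = apply_swap(mapping, p, q)
    schedule.append(("swap", p, q))
    n_swaps += 1

print("inserted SWAPs:", n_swaps)
print(schedule)
```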

Journal ArticleDOI
TL;DR: An RRAM crossbar-based low bit-width CNN (LB-CNN) accelerator is proposed, including the matrix splitting strategies to enhance the scalability, and the pipelined implementation based on line buffers to accelerate the inference.
Abstract: The emerging resistive random-access memory (RRAM) has been widely applied in accelerating the computing of deep neural networks. However, it is challenging to achieve high-precision computations based on RRAM due to the limits of the resistance levels and the interfaces. Low bit-width convolutional neural networks (CNNs) provide promising solutions to introduce low bit-width RRAM devices and low bit-width interfaces in an RRAM-based computing system (RCS). However, open questions still remain regarding: 1) how to split a matrix when a single crossbar is not large enough to hold all parameters of one weight matrix; 2) how to design a pipeline to accelerate the inference based on a line buffer structure; and 3) how to reduce the accuracy drop due to parameter splitting and data quantization. In this paper, we propose an RRAM crossbar-based low bit-width CNN (LB-CNN) accelerator. We discuss the system design in detail, including the matrix splitting strategies to enhance scalability, and the pipelined implementation based on line buffers to accelerate the inference. In addition, we propose a splitting-and-quantizing-while-training method to incorporate the actual hardware constraints into the training. In our experiments, low bit-width LeNet-5 on RRAM shows much better robustness than multibit models under device variation. The pipeline strategy achieves an approximately $6.0\times$ speedup in processing each image on ResNet-18. For low-bit VGG-8 on CIFAR-10, the proposed accelerator saves 54.9% of the energy consumption and 48.3% of the area compared with the multibit VGG-8 structure.
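
A minimal sketch of the matrix splitting question in software terms, with toy sizes and no device non-idealities or quantization: a weight matrix larger than one crossbar is tiled into crossbar-sized blocks, each block computes a partial matrix-vector product, and the partial sums along the input dimension are accumulated.

```python
# Splitting a weight matrix across fixed-size crossbars: each tile computes a
# partial matrix-vector product, and partial sums along the input dimension
# are accumulated. Sizes are toy values; quantization/device effects ignored.
import numpy as np

XBAR_ROWS, XBAR_COLS = 64, 64          # assumed crossbar dimensions

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 200))    # weight matrix (outputs x inputs)
x = rng.standard_normal(200)           # input activation vector

def crossbar_matvec(W, x):
    out = np.zeros(W.shape[0])
    # Tile the matrix into crossbar-sized blocks: rows of W (outputs) map to
    # crossbar columns, columns of W (inputs) map to crossbar rows.
    for r0 in range(0, W.shape[0], XBAR_COLS):          # output tiles
        for c0 in range(0, W.shape[1], XBAR_ROWS):      # input tiles
            tile = W[r0:r0 + XBAR_COLS, c0:c0 + XBAR_ROWS]
            # Each crossbar produces a partial sum for its output slice;
            # partial sums over input tiles are accumulated digitally.
            out[r0:r0 + XBAR_COLS] += tile @ x[c0:c0 + XBAR_ROWS]
    return out

print(np.allclose(crossbar_matvec(W, x), W @ x))   # -> True
```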

Journal ArticleDOI
TL;DR: In this article, the authors proposed FSpiNN, an optimization framework for obtaining memory-efficient and energy-efficient SNNs for training and inference processing, with unsupervised learning capability while maintaining accuracy.
Abstract: Spiking neural networks (SNNs) are gaining interest due to their event-driven processing which potentially consumes low-power/energy computations in hardware platforms while offering unsupervised learning capability due to the spike-timing-dependent plasticity (STDP) rule. However, state-of-the-art SNNs require a large memory footprint to achieve high accuracy, thereby making them difficult to be deployed on embedded systems, for instance, on battery-powered mobile devices and IoT Edge nodes. Toward this, we propose FSpiNN, an optimization framework for obtaining memory-efficient and energy-efficient SNNs for training and inference processing, with unsupervised learning capability while maintaining accuracy. It is achieved by: 1) reducing the computational requirements of neuronal and STDP operations; 2) improving the accuracy of STDP-based learning; 3) compressing the SNN through a fixed-point quantization; and 4) incorporating the memory and energy requirements in the optimization process. FSpiNN reduces the computational requirements by reducing the number of neuronal operations, the STDP-based synaptic weight updates, and the STDP complexity. To improve the accuracy of learning, FSpiNN employs timestep-based synaptic weight updates and adaptively determines the STDP potentiation factor and the effective inhibition strength. The experimental results show that as compared to the state-of-the-art work, FSpiNN achieves $7.5\times $ memory saving, and improves the energy efficiency by $3.5\times $ on average for training and by $1.8\times $ on average for inference, across MNIST and Fashion MNIST datasets, with no accuracy loss for a network with 4900 excitatory neurons, thereby enabling energy-efficient SNNs for edge devices/embedded systems.

Journal ArticleDOI
TL;DR: This article proposes a renewable-adaptive computation offloading approach for QoS optimization of real-time applications in fog computing systems equipped with reusable end devices and powered by hybrid energy of renewable generations and grid electricity.
Abstract: Fog computing is an emerging architectural paradigm for the implementation of the Internet of Things, where computation moves from cloud servers to network edges. Fog computing systems have three characteristics: 1) low latency; 2) a strong presence of real-time applications; and 3) reusability of end devices. Most existing designs of fog computing systems concentrate on reducing application processing latency but neglect the real-time requirements of applications and the reusability of end devices, which may drastically degrade both the functionality and the quality-of-service (QoS) of applications. In this article, we investigate QoS optimization of real-time applications in fog computing systems equipped with reusable end devices and powered by hybrid energy of renewable generation and grid electricity. We propose a renewable-adaptive computation offloading approach. At the end device layer, local energy allocation schemes are designed at the application level and component level, where techniques of the cooperative game and mixed-integer linear programming (MILP) are leveraged, respectively. At the fog layer, the local energy allocation method is augmented to a local-remote scheduling solution by judiciously judging whether or not the computation offloading of an application needs to be triggered. The experimental results demonstrate that compared to benchmarking algorithms, our approach improves the overall and individual application QoS by up to 101.93% and 59.30%, respectively.

Journal ArticleDOI
TL;DR: A performance model is described to estimate the performance and resource utilization of an FPGA implementation and it is shown that the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase.
Abstract: The recently reported successes of convolutional neural networks (CNNs) in many areas have generated wide interest in the development of field-programmable gate array (FPGA)-based accelerators. To achieve high performance and energy efficiency, an FPGA-based accelerator must fully utilize the limited computation resources and minimize the data communication and memory access, both of which are impacted and constrained by a variety of design parameters, e.g., the degree and dimension of parallelism, the size of on-chip buffers, the bandwidth of the external memory, and many more. The large design space of the accelerator makes it impractical to search for the optimal design in the implementation phase. To address this problem, a performance model is described to estimate the performance and resource utilization of an FPGA implementation. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase. The proposed performance model is validated using a variety of CNN algorithms comparing the results with on-board test results on two different FPGAs.
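
A sketch of the kind of estimate such a model produces, using an illustrative roofline-style bound per layer: attainable throughput is the minimum of the compute roof set by the parallelism degree and the memory roof set by external bandwidth. All device numbers and the traffic model are assumptions, not the paper's model.

```python
# Roofline-style layer estimate: attainable throughput is bounded by either
# the configured compute parallelism or the external memory bandwidth.
# All device numbers and the layer shapes below are illustrative.

FREQ_HZ = 200e6          # clock frequency
PE_MACS_PER_CYCLE = 512  # parallelism degree (MACs issued per cycle)
DRAM_BW_GBPS = 12.8      # external memory bandwidth
BYTES_PER_VALUE = 2      # 16-bit activations/weights

def conv_layer_estimate(h, w, cin, cout, k):
    macs = h * w * cin * cout * k * k
    # Traffic model: read the input feature map and weights, write the output
    # map once (assumes on-chip buffers large enough to avoid re-fetching).
    traffic_bytes = BYTES_PER_VALUE * (h * w * cin + k * k * cin * cout + h * w * cout)

    compute_roof_gops = 2 * PE_MACS_PER_CYCLE * FREQ_HZ / 1e9
    arithmetic_intensity = 2 * macs / traffic_bytes            # ops per byte
    memory_roof_gops = DRAM_BW_GBPS * arithmetic_intensity

    attainable = min(compute_roof_gops, memory_roof_gops)
    bound = "compute" if compute_roof_gops <= memory_roof_gops else "memory"
    runtime_ms = 2 * macs / (attainable * 1e9) * 1e3
    return attainable, bound, runtime_ms

for name, shape in {"conv3_1 (VGG-like)": (56, 56, 256, 256, 3),
                    "fc6-like 1x1 conv":  (1, 1, 4096, 4096, 1)}.items():
    gops, bound, ms = conv_layer_estimate(*shape)
    print(f"{name}: {gops:7.1f} GOP/s attainable ({bound}-bound), {ms:.2f} ms")
```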

Journal ArticleDOI
TL;DR: This paper presents a register-transfer-level (RTL) CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable fast high-level prototyping of CNNs from software to FPGAs while still keeping the benefits of low-level hardware optimization.
Abstract: A broad range of applications are increasingly benefiting from the rapid and flourishing development of convolutional neural networks (CNNs). The FPGA-based CNN inference accelerator is gaining popularity due to its high performance and low power as well as the FPGA’s conventional advantages of reconfigurability and flexibility. Without a general compiler to automate the implementation, however, significant efforts and expertise are still required to customize the design for each CNN model. In this paper, we present a register-transfer-level (RTL) CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable fast high-level prototyping of CNNs from software to FPGA while still keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model different operations at each layer. The integration and dataflow of the physical modules are predefined in the top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of the layer-by-layer sequential computation is managed by the proposed execution schedule so that even highly irregular and complex network topologies, e.g., GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g., NiN, VGG, GoogLeNet, and ResNet, on two standalone Intel FPGAs, Arria 10 and Stratix 10, achieving end-to-end inference throughputs of 969 GOPS and 1604 GOPS, respectively, with a batch size of one.

Journal ArticleDOI
TL;DR: A generative adversarial network (GAN) model is developed that can create quasi-optimal masks for given target circuit patterns, so that fewer normal OPC steps are required to generate high-quality masks at convergence.
Abstract: Mask optimization has been a critical problem in the VLSI design flow due to the mismatch between the lithography system and the continuously shrinking feature sizes. Optical proximity correction (OPC) is one of the prevailing resolution enhancement techniques (RETs) that can significantly improve mask printability. However, in advanced technology nodes, the mask optimization process consumes more and more computational resources. In this article, we develop a generative adversarial network (GAN) model to achieve better mask optimization performance. We first develop an OPC-oriented GAN flow that can learn target-mask mapping from the improved architecture and objectives, which leads to satisfactory mask optimization results. To facilitate the training process and ensure better convergence, we propose a pretraining scheme that jointly trains the neural network with inverse lithography technique (ILT). We also propose an enhanced generator design with a U-Net architecture and a subpixel super-resolution structure that promise a better convergence and a better mask quality, respectively. At convergence, the generative network is able to create quasi-optimal masks for given target circuit patterns and fewer normal OPC steps are required to generate high quality masks. The experimental results show that our flow can facilitate the mask optimization process as well as ensure a better printability.

Journal ArticleDOI
TL;DR: This work is the first to investigate the thermal challenges that NPUs bring, revealing how MAC arrays, which form the heart of any NPU, impose serious thermal bottlenecks on on-chip systems due to their excessive power densities.
Abstract: Neural processing units (NPUs) are becoming an integral part of all modern computing systems due to their substantial role in accelerating neural networks (NNs). The significant improvements in cost-energy-performance stem from the massive array of multiply accumulate (MAC) units that remarkably boosts the throughput of NN inference. In this work, we are the first to investigate the thermal challenges that NPUs bring, revealing how MAC arrays, which form the heart of any NPU, impose serious thermal bottlenecks on on-chip systems due to their excessive power densities. For the first time, we explore: 1) the effectiveness of precision scaling and frequency scaling (FS) in temperature reductions and 2) how advanced on-chip cooling using superlattice thin-film thermoelectrics (TEs) opens doors for new tradeoffs between temperature, throughput, cooling cost, and inference accuracy in NPU chips. Our work unveils that hybrid thermal management, which combines different means of reducing the NPU temperature, is key. To achieve that, we propose and implement the PFS-TE technique, which couples precision and frequency scaling together with superlattice TE cooling for effective NPU thermal management. Using commercial signoff tools, we obtain accurate power and timing analysis of MAC arrays after a full-chip design is performed based on 14-nm Intel FinFET technology. Then, multiphysics simulations using finite-element methods are carried out for accurate heat simulations in the presence and absence of on-chip cooling. Afterward, a comprehensive design-space exploration is presented to demonstrate the Pareto frontier and the existing tradeoffs between temperature reductions, power overheads due to cooling, throughput, and inference accuracy. Using a wide range of NNs trained for image classification, experimental results demonstrate that our novel NPU thermal management increases the inference efficiency (TOPS/Joule) by $1.33\times$, $1.87\times$, and $2\times$ under different temperature constraints (105 °C, 85 °C, and 70 °C, respectively), while the average accuracy drops merely from 89.0% to 85.5%.
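
A toy sketch of the design-space exploration flavor described, with every throughput, power, and temperature number invented as a proxy: enumerate (precision, frequency, cooling) operating points and keep the highest-throughput one that satisfies the temperature constraint.

```python
# Toy design-space exploration over precision scaling, frequency scaling, and
# optional TE cooling under a temperature constraint. Every number below is an
# invented proxy, not data from the paper.
import itertools

precisions = {8: (1.00, 89.0), 6: (1.30, 87.5), 4: (1.70, 85.5)}  # bits -> (throughput x, top-1 %)
frequencies = {1.0: 1.00, 0.8: 0.80, 0.6: 0.60}                   # GHz -> throughput scaling
TE_COOLING = {False: (0.0, 0.0), True: (18.0, 1.5)}               # -> (temp drop C, cooling W)

BASE_TEMP_C, BASE_POWER_W, TEMP_LIMIT_C = 108.0, 10.0, 85.0

def operating_point(bits, ghz, cooled):
    tput_x, accuracy = precisions[bits]
    tput = tput_x * frequencies[ghz]
    power = BASE_POWER_W * frequencies[ghz] * (bits / 8) ** 1.5   # crude power proxy
    temp_drop, cool_w = TE_COOLING[cooled]
    temp = BASE_TEMP_C - (BASE_POWER_W - power) * 3.0 - temp_drop  # crude thermal proxy
    return {"bits": bits, "GHz": ghz, "TE": cooled, "throughput": round(tput, 2),
            "accuracy": accuracy, "power_W": round(power + cool_w, 2),
            "temp_C": round(temp, 1)}

feasible = []
for b, f, c in itertools.product(precisions, frequencies, TE_COOLING):
    point = operating_point(b, f, c)
    if point["temp_C"] <= TEMP_LIMIT_C:
        feasible.append(point)

best = max(feasible, key=lambda p: p["throughput"])
print(best)
```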

Journal ArticleDOI
TL;DR: A real-time and lightweight DDoS attack detection technique for NoC-based SoCs is presented that monitors packets to detect any violations and is capable of localizing the malicious IPs using the latency data in the NoC routers.
Abstract: Network-on-chip (NoC) is widely employed by multicore system-on-chip (SoC) architectures to cater to their communication requirements. Increasing NoC complexity coupled with its widespread usage has made it a focal point of potential security attacks. Distributed denial-of-service (DDoS) is one such attack that is caused by malicious intellectual property (IP) cores flooding the network with unnecessary packets causing significant performance degradation through NoC congestion. In this article, we propose an efficient framework for real-time detection and localization of DDoS attacks. This article makes three important contributions. We propose a real-time and lightweight DDoS attack detection technique for NoC-based SoCs by monitoring packets to detect any violations. Once a potential attack has been flagged, our approach is also capable of localizing the malicious IPs using the latency data in the NoC routers. The applications are statically profiled during design time to determine communication patterns. These patterns are then used for real-time detection and localization of DDoS attacks. We have evaluated the effectiveness of our approach against different NoC topologies and architecture models using both real benchmarks and synthetic traffic patterns. Our experimental results demonstrate that our proposed approach is capable of real-time detection and localization of DDoS attacks originating from multiple malicious IPs in NoC-based SoCs.
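
A minimal sketch of threshold-based detection at a router, under invented windows, bounds, and traffic: per-source packet counts in a monitoring window are compared against statically profiled upper bounds, and a source exceeding its bound is flagged. The real technique also localizes sources from router latency data, which is not modeled here.

```python
# Threshold-based DDoS detection at a NoC router: compare per-source packet
# counts in a monitoring window against statically profiled upper bounds.
# Windows, bounds, and the synthetic traffic trace are invented values.
from collections import Counter
import random

WINDOW_CYCLES = 1000
# Upper bound on packets per window per source IP, from design-time profiling.
profiled_bound = {"cpu0": 50, "cpu1": 50, "dsp": 50, "accel": 60}

def monitor(trace):
    """trace: list of (cycle, source_ip), sorted by cycle. Yields violations."""
    window_counts, window_id = Counter(), 0
    for cycle, src in trace:
        if cycle // WINDOW_CYCLES != window_id:        # window rolled over
            window_counts.clear()
            window_id = cycle // WINDOW_CYCLES
        window_counts[src] += 1
        if window_counts[src] > profiled_bound.get(src, 0):
            yield window_id, src                       # flag potential DDoS source

random.seed(0)
trace = sorted(
    [(random.randrange(5000), random.choice(["cpu0", "cpu1", "dsp"]))
     for _ in range(300)] +                            # benign background traffic
    [(random.randrange(2000, 3000), "accel") for _ in range(400)]  # flooding IP
)
violations = set(monitor(trace))
print("flagged:", sorted(violations))                  # -> [(2, 'accel')]
```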

Journal ArticleDOI
TL;DR: This article presents a concurrent detailed placement framework, ABCDPlace, exploiting multithreading and graphics processing unit (GPU) acceleration, and proposes batch-based concurrent algorithms for widely adopted sequential detailed placement techniques, such as independent set matching, global swap, and local reordering.
Abstract: Placement is an important step in modern very-large-scale integrated (VLSI) designs. Detailed placement is a placement refining procedure intensively called throughout the design flow, thus its efficiency has a vital impact on design closure. However, since most detailed placement techniques are inherently greedy and sequential, they are generally difficult to parallelize. In this article, we present a concurrent detailed placement framework, ABCDPlace, exploiting multithreading and graphic processing unit (GPU) acceleration. We propose batch-based concurrent algorithms for widely adopted sequential detailed placement techniques, such as independent set matching, global swap, and local reordering. The experimental results demonstrate that ABCDPlace can achieve $2\times $ – $5\times $ faster runtime than sequential implementations with multithreaded CPU and over $10\times $ with GPU on ISPD 2005 contest benchmarks without quality degradation. On larger industrial benchmarks, we show more than $16\times $ speedup with GPU over the state-of-the-art sequential detailed placer. ABCDPlace finishes the detailed placement of a 10-million-cell industrial design in 1 min.

Journal ArticleDOI
TL;DR: The performance of the proposed meminductor is verified using post-layout simulation in the Cadence Virtuoso tool and experimentally using off-the-shelf components, such as an operational transconductance amplifier (OTA), for the implementation of the VDTA with passive components.
Abstract: This research article presents a grounded as well as a floating meminductor using the voltage difference transconductance amplifier (VDTA). The proposed meminductor models each have only two active VDTAs and grounded capacitors. The performance of the proposed meminductor is verified using post-layout simulation in the Cadence Virtuoso tool and experimentally using off-the-shelf components, such as an operational transconductance amplifier (OTA), for the implementation of the VDTA with passive components. Moreover, a neuromorphic circuit implementation is included as an application of the proposed meminductor.

Journal ArticleDOI
TL;DR: This article proposes a framework that handles stuck-at-faults using matrix transformations, which is capable of recovering 99% of the accuracy loss on both the MNIST and CIFAR-10 datasets without utilizing hardware aware training.
Abstract: Matrix-vector multiplication is the dominating computational workload in the inference phase of deep neural networks (DNNs). Memristor crossbar arrays (MCAs) can efficiently perform matrix-vector multiplication in the analog domain. A key challenge is that memristor devices may suffer from stuck-at-fault defects, which can severely degrade the classification accuracy. Earlier studies have shown that the accuracy loss can be recovered by utilizing additional hardware or hardware-aware training. In this article, we propose a framework that handles stuck-at faults using matrix transformations, which is called the MT framework. The framework is based on introducing a cost metric that captures the negative impact of the stuck-at-fault defects. Next, the cost metric is minimized by applying matrix transformations $T$. A transformation $T$ changes a weight matrix $W$ into a new weight matrix $\widetilde{W} = T(W)$. In particular, a row flipping transformation, a permutation transformation, and a value range transformation are proposed. The row flipping transformation translates stuck-off (stuck-on) faults into stuck-on (stuck-off) faults. The permutation transformation maps small (large) weights to memristors stuck-off (stuck-on). The value range transformation is based on reducing the magnitude of the smallest and largest elements in the weight matrices, which results in the stuck-at faults introducing smaller errors. The experimental results demonstrate that the MT framework is capable of recovering 99% of the accuracy loss on both the MNIST and CIFAR-10 datasets without utilizing hardware-aware training. The accuracy improvements come at the expense of an $8.19\times$ and $9.23\times$ overhead in power and area, respectively. Nevertheless, the overhead can be reduced by up to 50% by leveraging hardware-aware training.
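
A hedged sketch of the permutation transformation on a toy fault map and a simplified cost metric: row orders are searched so that stuck-off cells coincide with small weights and stuck-on cells with large weights. The matrix size, fault model, and exhaustive search below are illustrative choices, not the paper's exact algorithm.

```python
# Permutation transformation sketch: reorder weight-matrix rows so that
# stuck-off (stuck-on) memristors coincide with small (large) weights,
# minimizing a simple cost metric. Fault map, cost, and size are toy choices.
import itertools
import numpy as np

rng = np.random.default_rng(3)
W = rng.uniform(-1, 1, size=(6, 6))          # weight matrix mapped to a crossbar

# Fault map: value the defective cell is stuck at (all other cells healthy).
# Stuck-off is modeled as 0, stuck-on as the maximum programmable weight (1.0).
faults = {(0, 2): 0.0, (3, 3): 1.0, (4, 1): 0.0, (5, 5): 1.0}

def cost(row_order):
    # Error introduced by the defects for a given row permutation T(W).
    permuted = W[list(row_order), :]
    return sum(abs(permuted[i, j] - stuck) for (i, j), stuck in faults.items())

identity = tuple(range(W.shape[0]))
best_order = min(itertools.permutations(identity), key=cost)

print(f"cost without transformation: {cost(identity):.3f}")
print(f"cost with best row permutation {best_order}: {cost(best_order):.3f}")
# At inference time the outputs are un-permuted: y = P^T (T(W) x) recovers W x.
```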