
Showing papers in "ACM Transactions on Design Automation of Electronic Systems in 2022"


Journal ArticleDOI
TL;DR: This article provides an overview of efficient deep learning methods, systems, and applications, introducing popular model compression methods including pruning, factorization, quantization, and compact model design.
Abstract: Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial intelligence (AI), including computer vision, natural language processing, and speech recognition. However, their superior performance comes at the considerable cost of computational complexity, which greatly hinders their applications in many resource-constrained devices, such as mobile phones and Internet of Things (IoT) devices. Therefore, methods and techniques that are able to lift the efficiency bottleneck while preserving the high accuracy of DNNs are in great demand to enable numerous edge AI applications. This article provides an overview of efficient deep learning methods, systems, and applications. We start by introducing popular model compression methods, including pruning, factorization, quantization, as well as compact model design. To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization. We then cover efficient on-device training to enable user customization based on the local data on mobile devices. Apart from general acceleration techniques, we also showcase several task-specific accelerations for point cloud, video, and natural language processing by exploiting their spatial sparsity and temporal/token redundancy. Finally, to support all these algorithmic advancements, we introduce the efficient deep learning system design from both software and hardware perspectives.
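As a rough illustration of the pruning and quantization methods surveyed in this overview, the following sketch (hypothetical numbers and helper names, plain NumPy, not code from the article) applies magnitude-based pruning and symmetric 8-bit quantization to a random weight matrix.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights until `sparsity` fraction is zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def uniform_quantize(weights, num_bits=8):
    """Symmetric uniform quantization to signed `num_bits` levels, then dequantize.
    In a real deployment the int8 values and the scale would be stored instead."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    quantized = np.clip(np.round(weights / scale), -qmax, qmax)
    return quantized * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_compressed = uniform_quantize(magnitude_prune(w, sparsity=0.7))
print("nonzero fraction:", np.count_nonzero(w_compressed) / w_compressed.size)
```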

42 citations


Journal ArticleDOI
TL;DR: An automated DSE framework, AutoDSE, is proposed that leverages a bottleneck-guided coordinate optimizer to systematically find better design points: it detects the bottleneck of the design at each step and focuses on high-impact parameters to overcome it.
Abstract: Adopting FPGA as an accelerator in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient accelerators, the unpredictability of modern HLS tools becomes a major obstacle for them to maintain high accuracy. To address this problem, we propose an automated DSE framework—AutoDSE—that leverages a bottleneck-guided coordinate optimizer to systematically find a better design point. AutoDSE detects the bottleneck of the design in each step and focuses on high-impact parameters to overcome it. The experimental results show that AutoDSE is able to identify the design point that achieves, on the geometric mean, 19.9× speedup over one CPU core for MachSuite and Rodinia benchmarks. Compared to the manually optimized HLS vision kernels in Xilinx Vitis libraries, AutoDSE can reduce their optimization pragmas by 26.38× while achieving similar performance. With less than one optimization pragma per design on average, we are making progress towards democratizing customizable computing by enabling software programmers to design efficient FPGA accelerators.
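The sketch below illustrates the general idea of a bottleneck-guided coordinate search on a toy pragma space; the cost model, parameter names, and stopping rule are invented for illustration and do not reproduce AutoDSE's actual optimizer.

```python
def bottleneck_guided_search(params, candidates, evaluate, steps=20):
    """Greedy coordinate search: at each step, ask the evaluator which parameter is
    the current bottleneck and try that parameter's alternative values first.
    `evaluate(cfg)` is assumed to return (cost, bottleneck_parameter_name)."""
    best_cfg = dict(params)
    best_cost, bottleneck = evaluate(best_cfg)
    for _ in range(steps):
        improved = False
        for value in candidates[bottleneck]:        # focus on the high-impact parameter
            trial = dict(best_cfg, **{bottleneck: value})
            cost, new_bottleneck = evaluate(trial)
            if cost < best_cost:
                best_cfg, best_cost, bottleneck = trial, cost, new_bottleneck
                improved = True
                break
        if not improved:                            # current bottleneck cannot be improved
            break
    return best_cfg, best_cost

# Toy usage with a made-up cost model over two pragma-like knobs.
def evaluate(cfg):
    compute, memory = 1000 / cfg["unroll"], 50 * cfg["tile"]
    return compute + memory, "unroll" if compute > memory else "tile"

cfg, cost = bottleneck_guided_search(
    {"unroll": 1, "tile": 8},
    {"unroll": [2, 4, 8, 16], "tile": [1, 2, 4, 8]},
    evaluate)
print(cfg, round(cost, 1))
```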

11 citations


Journal ArticleDOI
TL;DR: EF-Train, an efficient DNN training accelerator with a unified channel-level-parallelism-based convolution kernel, is designed to achieve end-to-end training on resource-limited, low-power edge-level FPGAs, together with a data reshaping approach featuring intra-tile continuous memory allocation and weight reuse.
Abstract: Conventionally, DNN models are trained once in the cloud and deployed in edge devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time inference. However, there are many cases that require the models to adapt to new environments, domains, or users. In order to realize such domain adaptation or personalization, the models on devices need to be continuously trained on the device. In this work, we design EF-Train, an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel that can achieve end-to-end training on resource-limited low-power edge-level FPGAs. It is challenging to implement on-device training on resource-limited FPGAs due to the low efficiency caused by different memory access patterns among forward and backward propagation and weight update. Therefore, we developed a data reshaping approach with intra-tile continuous memory allocation and weight reuse. An analytical model is established to automatically schedule computation and memory resources to achieve high energy efficiency on edge FPGAs. The experimental results show that our design achieves 46.99 GFLOPS and 6.09 GFLOPS/W in terms of throughput and energy efficiency, respectively.

8 citations


Journal ArticleDOI
TL;DR: This study presents attacks based on micro-architectural hardware vulnerabilities and the side effects they produce in the system, along with security mechanisms that can be implemented to address some of these attacks.
Abstract: With the advances in the field of the Internet of Things (IoT) and Industrial IoT (IIoT), these devices are increasingly used in daily life or industry. To reduce costs related to the time required...

7 citations


Journal ArticleDOI
TL;DR: This work proposes an automatic synthesis option tuner dedicated to FPGAs; it uses a perceptual-hashing technique that allows the proposed explorer to recognize similar program structures in a new description to be explored and match them with structures in the authors' database of prior DSE results.
Abstract: The quest to democratize the use of Field-Programmable Gate Arrays (FPGAs) has given High-Level Synthesis (HLS) the final push to be widely accepted with FPGA vendors strongly supporting this VLSI design methodology to expand the FPGA user base. HLS takes as input an untimed behavioral description and generates efficient RTL (Verilog or VHDL). One major advantage of HLS is that it allows us to generate a variety of different micro-architectures from the same behavioral description by simply specifying different combinations of synthesis options. In particular, commercial HLS tools make extensive use of synthesis directives in the form of pragmas. This strength is also a weakness as it forces HLS users to fully understand how these synthesis options work and how they interact to efficiently set them to get a hardware implementation with the desired characteristics. Luckily, this process can be automated. Unfortunately, the search space grows supra-linearly with the number of synthesis options. To address this, this work proposes an automatic synthesis option tuner dedicated to FPGAs. We have explored a large number of behavioral descriptions targeting ASICs and FPGAs and found out that due to the internal structure of the FPGA a large number of synthesis option combinations never lead to a Pareto-optimal design and, hence, the search space can be drastically reduced. Moreover, we make use of a large database of DSE results that we have generated since we started working in this field to further accelerate the exploration process. For this, we use a technique based on perceptual hashing that allows our proposed explorer to recognize similar program structures in the new description to be explored and match them with structures in our database. This allows us to directly retrieve the pragma settings that lead to Pareto-optimal configurations. Experimental results show that the exploration can be accelerated substantially while still finding most of the Pareto-optimal designs.
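The following toy sketch conveys the flavor of matching a new behavioral description against a database of previously explored kernels via a structural fingerprint; the token-count signature and Jaccard-style similarity are simplifications standing in for the perceptual hashing used in the paper, and the kernels and pragma settings are hypothetical.

```python
import re
from collections import Counter

def structure_signature(src: str) -> Counter:
    """Crude structural fingerprint of a behavioral description: token-class counts.
    A stand-in for the perceptual hash used in the paper."""
    classes = {
        "loop": r"\bfor\b|\bwhile\b",
        "branch": r"\bif\b|\belse\b",
        "array": r"\w+\s*\[",
        "mul": r"\*",
        "add": r"\+",
    }
    return Counter({k: len(re.findall(p, src)) for k, p in classes.items()})

def similarity(a: Counter, b: Counter) -> float:
    """Jaccard-like overlap of the two structural count vectors."""
    keys = set(a) | set(b)
    num = sum(min(a[k], b[k]) for k in keys)
    den = sum(max(a[k], b[k]) for k in keys) or 1
    return num / den

# Hypothetical database: previously explored kernels -> known good pragma settings.
database = {
    "matmul": ("for (i) for (j) for (k) c[i][j] += a[i][k] * b[k][j];", {"unroll": 8, "pipeline": True}),
    "fir":    ("for (n) for (t) y[n] += h[t] * x[n - t];", {"unroll": 4, "pipeline": True}),
}

new_kernel = "for (r) for (c) for (k) out[r][c] += w[r][k] * in_[k][c];"
sig = structure_signature(new_kernel)
best = max(database.items(), key=lambda kv: similarity(sig, structure_signature(kv[1][0])))
print("closest match:", best[0], "-> reuse pragmas:", best[1][1])
```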

6 citations


Journal ArticleDOI
TL;DR: Sherlock is a DSE framework that handles multiple conflicting optimization objectives and aggressively focuses on finding Pareto-optimal solutions; it is found to converge toward the set of optimal designs faster than similar frameworks.
Abstract: Design space exploration (DSE) provides intelligent methods to tune the large number of optimization parameters present in modern FPGA high-level synthesis tools. High-level synthesis parameter tuning is a time-consuming process due to lengthy hardware compilation times—synthesizing an FPGA design can take tens of hours. DSE helps find an optimal solution faster than brute-force methods without relying on designer intuition to achieve high-quality results. Sherlock is a DSE framework that can handle multiple conflicting optimization objectives and aggressively focuses on finding Pareto-optimal solutions. Sherlock integrates a model selection process to choose the regression model that helps reach the optimal solution faster. Sherlock designs a strategy based around the multi-armed bandit problem, opting to balance exploration and exploitation based on the learned and expected results. Sherlock can decrease the importance of models that do not provide correct estimates, reaching the optimal design faster. Sherlock is capable of tailoring its choice of regression models to the problem at hand, leading to a model that best reflects the application design space. We have tested the framework on a large dataset of FPGA design problems and found that Sherlock converges toward the set of optimal designs faster than similar frameworks.
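A minimal sketch of the bandit idea, assuming a UCB-style score over candidate regression models rewarded by low prediction error; the model names and synthetic errors are hypothetical and this does not reproduce Sherlock's actual strategy.

```python
import math
import random

class ModelBandit:
    """UCB-style bandit over candidate regression models: models whose predictions
    turn out to be accurate get selected more often (a simplified stand-in for
    Sherlock's exploration/exploitation balancing)."""
    def __init__(self, model_names):
        self.names = model_names
        self.pulls = {m: 0 for m in model_names}
        self.reward = {m: 0.0 for m in model_names}

    def choose(self):
        total = sum(self.pulls.values()) + 1
        def ucb(m):
            if self.pulls[m] == 0:
                return float("inf")                  # try every model at least once
            mean = self.reward[m] / self.pulls[m]
            return mean + math.sqrt(2 * math.log(total) / self.pulls[m])
        return max(self.names, key=ucb)

    def update(self, model, prediction_error):
        self.pulls[model] += 1
        self.reward[model] += 1.0 / (1.0 + prediction_error)  # low error -> high reward

# Toy usage with synthetic per-model prediction errors.
random.seed(0)
true_error = {"random_forest": 0.1, "gp": 0.3, "linear": 0.8}
bandit = ModelBandit(list(true_error))
for _ in range(50):
    m = bandit.choose()
    bandit.update(m, abs(random.gauss(true_error[m], 0.05)))
print("most trusted model:", max(bandit.pulls, key=bandit.pulls.get))
```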

5 citations


Journal ArticleDOI
TL;DR: An external monitoring printed circuit board (PCB) named RASC is introduced to provide remote access to side-channels; its power-analysis capability is comparable to an oscilloscope/probe setup while being lightweight and two orders of magnitude cheaper.
Abstract: The Internet of Things (IoT) and smart devices are currently being deployed in systems such as autonomous vehicles and medical monitoring devices. The introduction of IoT devices into these systems enables network connectivity for data transfer, cloud support, and more, but can also lead to malware injection. Since many IoT devices operate in remote environments, it is also difficult to protect them from physical tampering. Conventional protection approaches rely on software. However, these can be circumvented by the moving target nature of malware or through hardware attacks. Alternatively, insertion of internal monitoring circuits into IoT chips requires a design trade-off, balancing the requirements of the monitoring circuit and the main circuit. A very promising approach to detecting anomalous behavior in the IoT and other embedded systems is side-channel analysis. To date, however, this can be performed only before deployment due to the cost and size of side-channel setups (e.g., oscilloscopes, probes) or by internal performance counters. Here, we introduce an external monitoring printed circuit board (PCB) named RASC to provide remote access to side-channels. RASC reduces the complete side-channel analysis system into two small PCBs (2 × 2 cm), providing the ability to monitor power and electromagnetic (EM) traces of the target device. Additionally, RASC can transmit data and/or alerts of anomalous activities detected to a remote host through Bluetooth. To demonstrate RASC's capabilities, we extract keys from encryption modules such as AES implemented on Arduino and FPGA boards. To illustrate RASC's defensive capabilities, we also use it to perform malware detection. RASC's success in power analysis is comparable to an oscilloscope/probe setup but is lightweight and two orders of magnitude cheaper.

5 citations


Journal ArticleDOI
TL;DR: A comprehensive review of the existing works linking the EDA flow for chip design and Graph Neural Networks is presented, mapping those works to a design pipeline by defining graphs, tasks, and model types, and summarizing the challenges faced when applying GNNs within the EDA design flow.
Abstract: Driven by Moore’s law, the chip design complexity is steadily increasing. Electronic Design Automation (EDA) has been able to cope with the challenging very large-scale integration process, assuring scalability, reliability, and proper time-to-market. However, EDA approaches are time and resource demanding, and they often do not guarantee optimal solutions. To alleviate these, Machine Learning (ML) has been incorporated into many stages of the design flow, such as in placement and routing. Many solutions employ Euclidean data and ML techniques without considering that many EDA objects are represented naturally as graphs. The trending Graph Neural Networks (GNNs) are an opportunity to solve EDA problems directly using graph structures for circuits, intermediate Register Transfer Levels, and netlists. In this article, we present a comprehensive review of the existing works linking the EDA flow for chip design and GNNs. We map those works to a design pipeline by defining graphs, tasks, and model types. Furthermore, we analyze their practical implications and outcomes. We conclude by summarizing challenges faced when applying GNNs within the EDA design flow.

5 citations


Journal ArticleDOI
TL;DR: The Magneto-Electric FET (MEFET) is a recently developed post-CMOS FET that offers intriguing characteristics for high-speed and low-power design in both logic and memory applications.
Abstract: Magneto-Electric FET (MEFET) is a recently developed post-CMOS FET, which offers intriguing characteristics for high-speed and low-power design in both logic and memory applications. In this articl...

5 citations


Journal ArticleDOI
TL;DR: As discussed in this paper, secure authentication of any Internet-of-Things (IoT) device becomes the utmost necessity due to the lack of specifically designed IoT standards and intrinsic vulnerabilities with limited resources.
Abstract: Secure authentication of any Internet-of-Things (IoT) device becomes the utmost necessity due to the lack of specifically designed IoT standards and intrinsic vulnerabilities with limited resources...

5 citations


Journal ArticleDOI
TL;DR: Data-Flow Integrity (DFI) is a well-known approach to effectively detecting a wide range of software attacks as discussed by the authors, however, its real-world application has been quite limited so far because of the prohib...
Abstract: Data-Flow Integrity (DFI) is a well-known approach to effectively detecting a wide range of software attacks. However, its real-world application has been quite limited so far because of the prohib...

Journal ArticleDOI
TL;DR: A reinforcement learning-based neural framework is proposed to learn heuristics for the application mapping problem; results show that the communication cost and runtime of RL-MAP improve considerably in comparison with heuristic algorithms.
Abstract: Application mapping is one of the early stage design processes aimed to improve the performance of Network-on-Chip. Mapping is an NP-hard problem. A massive amount of high-quality supervised data is required to solve the application mapping problem using traditional neural networks. In this article, a reinforcement learning–based neural framework is proposed to learn the heuristics of the application mapping problem. The proposed reinforcement learning–based mapping algorithm (RL-MAP) has actor and critic networks. The actor is a policy network, which provides mapping sequences. The critic network estimates the communication cost of these mapping sequences. The actor network updates the policy distribution in the direction suggested by the critic. The proposed RL-MAP is trained with unsupervised data to predict the permutations of the cores to minimize the overall communication cost. Further, the solutions are improved using the 2-opt local search algorithm. The performance of RL-MAP is compared with a few well-known heuristic algorithms, the Neural Mapping Algorithm (NMA) and message-passing neural network-pointer network-based genetic algorithm (MPN-GA). Results show that the communication cost and runtime of the RL-MAP improved considerably in comparison with the heuristic algorithms. The communication cost of the solutions generated by RL-MAP is nearly equal to MPN-GA and improved by 4.2% over NMA, while consuming less runtime.
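The 2-opt refinement step mentioned in the abstract can be sketched as follows for a toy 4-core application on a 2x2 mesh; the traffic matrix and initial mapping are made up, and the actor/critic networks themselves are not modeled here.

```python
import numpy as np

def manhattan(p, q, mesh_w):
    """Hop distance between two tile indices on a mesh_w-wide 2D mesh NoC."""
    return abs(p // mesh_w - q // mesh_w) + abs(p % mesh_w - q % mesh_w)

def comm_cost(mapping, traffic, mesh_w):
    """Total communication cost: traffic volume times hop distance, for a mapping
    that places core i onto tile mapping[i]."""
    n = len(mapping)
    return sum(traffic[i][j] * manhattan(mapping[i], mapping[j], mesh_w)
               for i in range(n) for j in range(n) if traffic[i][j])

def two_opt(mapping, traffic, mesh_w):
    """2-opt refinement: keep swapping pairs of tile assignments while the swap
    lowers the communication cost."""
    best = list(mapping)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            for j in range(i + 1, len(best)):
                cand = list(best)
                cand[i], cand[j] = cand[j], cand[i]
                if comm_cost(cand, traffic, mesh_w) < comm_cost(best, traffic, mesh_w):
                    best, improved = cand, True
    return best

# Toy 4-core application on a 2x2 mesh (hypothetical traffic volumes).
traffic = np.array([[0, 9, 0, 1], [9, 0, 2, 0], [0, 2, 0, 7], [1, 0, 7, 0]])
initial = [0, 3, 1, 2]              # e.g., a permutation proposed by the actor network
print(comm_cost(initial, traffic, 2), "->", comm_cost(two_opt(initial, traffic, 2), traffic, 2))
```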

Journal ArticleDOI
TL;DR: Since the memristor emerged as a programmable analog storage device, it has stimulated research on the design of analog/mixed-signal circuits with the Memristor as the enabler of in-memory computations.
Abstract: Since the memristor emerged as a programmable analog storage device, it has stimulated research on the design of analog/mixed-signal circuits with the memristor as the enabler of in-memory computat...

Journal ArticleDOI
TL;DR: In this article, a hierarchical composition graph based on functional blocks is proposed to automatically synthesize the structure and initial sizing of single-output, fully differential, and complementary operational amplifiers.
Abstract: This article presents a method to automatically synthesize the structure and initial sizing of an operational amplifier. It is positioned between approaches with fixed design plans and a small search space of structures and approaches with generic structural production rules and a large search space with technically impractical structures. The presented approach develops a hierarchical composition graph based on functional blocks that spans a search space of thousands of technically meaningful structure variants for single-output, fully differential, and complementary operational amplifiers. The search algorithm is a combined heuristic and enumerative process. The evaluation is based on circuit sizing with a library of behavioral equations of functional blocks. Formalizing the knowledge of functional blocks in op-amps for structural synthesis and sizing inherently reduces the search space and lessens the number of created topologies not fulfilling the specifications. Experimental results for the three op-amp classes are presented. An outlook on how this method can be extended to multi-stage op-amps is given.

Journal ArticleDOI
TL;DR: The energy consumption of manycore architectures is dominated by data movement, which calls for energy-efficient and high-bandwidth interconnects as discussed by the authors, which is a challenge of many architectures.
Abstract: The energy consumption of manycore architectures is dominated by data movement, which calls for energy-efficient and high-bandwidth interconnects. To overcome the bandwidth limitation of electrical...

Journal ArticleDOI
TL;DR: A non-linear Gaussian process (GP) model is proposed to capture the relationships among the different FPGA design stages, and, for the first time, the GP model is enhanced into a correlated GP (CGP) by considering the correlations between the multiple design objectives, to find better Pareto designs.
Abstract: High-level synthesis (HLS) tools have gained great attention in recent years because they emancipate engineers from complicated and heavy hardware description language writing and facilitate the implementation of modern applications (e.g., deep learning models) on Field-Programmable Gate Arrays (FPGAs), by using high-level languages and HLS directives. However, finding good HLS directives is challenging, due to the time-consuming design processes, the balances among different design objectives, and the diverse fidelities (accuracies of data) of the performance values between the consecutive FPGA design stages. To find good HLS directives, a novel automatic optimization algorithm is proposed to explore the Pareto designs of the multiple objectives while making full use of the data with different fidelities from different FPGA design stages. Firstly, a non-linear Gaussian process (GP) is proposed to model the relationships among the different FPGA design stages. Secondly, for the first time, the GP model is enhanced as a correlated GP (CGP) by considering the correlations between the multiple design objectives, to find better Pareto designs. Furthermore, we extend our model to a deep version, deep CGP (DCGP), by using deep neural networks to improve the kernel functions in Gaussian process models, to improve the characterization capability of the models and learn better feature representations. We test our design method on some public benchmarks (including general matrix multiplication and sparse matrix-vector multiplication) and the deep learning-based object detection model iSmart2 on FPGA. Experimental results show that our methods outperform the baselines significantly and facilitate the deep learning designs on FPGA.
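As a simplified sketch of surrogate-based Pareto exploration, the code below fits one independent Gaussian process per objective and filters predicted points for Pareto optimality; the directive configurations and measurements are invented, and the cross-objective and cross-fidelity correlations that define CGP/DCGP are deliberately omitted.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def pareto_front(points):
    """Indices of non-dominated points (all objectives minimized)."""
    pts = np.asarray(points)
    keep = []
    for i, p in enumerate(pts):
        dominated = np.any(np.all(pts <= p, axis=1) & np.any(pts < p, axis=1))
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical directive configurations (e.g., unroll factor, array-partition factor)
# with measured latency and LUT usage from earlier synthesis runs.
X = np.array([[1, 1], [2, 2], [4, 2], [4, 4], [8, 4], [8, 8]], dtype=float)
latency = np.array([100, 60, 40, 35, 25, 22], dtype=float)
luts = np.array([10, 18, 30, 45, 70, 120], dtype=float)

# One independent GP per objective; the paper's CGP/DCGP additionally models
# cross-objective and cross-fidelity correlations, which is not done here.
gp_lat = GaussianProcessRegressor(normalize_y=True).fit(X, latency)
gp_lut = GaussianProcessRegressor(normalize_y=True).fit(X, luts)

candidates = np.array([[2, 4], [4, 8], [16, 8]], dtype=float)
pred = np.column_stack([gp_lat.predict(candidates), gp_lut.predict(candidates)])
print("predicted (latency, LUTs):", pred.round(1))
print("Pareto-optimal candidates:", [candidates[i].tolist() for i in pareto_front(pred)])
```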

Journal ArticleDOI
TL;DR: The results indicate that the first complete High-Level Synthesis (HLS) implementation for an HEVC intra encoder on FPGA provides previously unseen design scalability with competitive performance over existing FPGA and ASIC encoder implementations.
Abstract: High Efficiency Video Coding (HEVC) is the key enabling technology for numerous modern media applications. Overcoming its computational complexity and customizing its rich features for real-time HEVC encoder implementations calls for automated design methodologies. This article introduces the first complete High-Level Synthesis (HLS) implementation for an HEVC intra encoder on FPGA. The C source code of our open-source Kvazaar HEVC encoder is used as a design entry point for HLS that is applied throughout the whole encoder design process, from data-intensive coding tools like intra prediction and discrete transforms to more control-oriented tools such as context-adaptive binary arithmetic coding (CABAC). Our prototype is run on a Nokia AirFrame Cloud Server equipped with 2.4 GHz dual 14-core Intel Xeon processors and two Intel Arria 10 PCIe FPGA accelerator cards with 40 Gigabit Ethernet. This proof-of-concept system is designed for hardware-accelerated HEVC encoding and it achieves real-time 4K coding speed up to 120 fps. The coding performance can be easily scaled up by adding practically any number of network-connected FPGA cards to the system. These results indicate that our HLS proposal not only boosts development time, but also provides previously unseen design scalability with competitive performance over the existing FPGA and ASIC encoder implementations.

Journal ArticleDOI
TL;DR: This work proposes a novel LUT architecture where the security provided by the proposed primitive is superior to that of the traditional LUT-based obfuscation, and leverages the security-driven design flow, which uses off-the-shelf industrial EDA tools to mitigate the design overheads further while being non-disruptive to the current industrial physical design flow.
Abstract: Logic locking and Integrated Circuit (IC) camouflaging are the most prevalent protection schemes that can thwart most hardware security threats. However, the state-of-the-art attacks, including Boolean Satisfiability (SAT) and approximation-based attacks, question the efficacy of the existing defense schemes. Recent obfuscation schemes have employed reconfigurable logic to secure designs against various hardware security threats. However, they have focused on specific design elements such as SAT hardness. Despite meeting the focused criterion such as security, obfuscation incurs additional overheads, which are not evaluated in the present works. This work provides an extensive analysis of Look-up-table (LUT)–based obfuscation by exploring several factors such as LUT technology, size, number of LUTs, and replacement strategy as they have a substantial influence on Power-Performance-Area (PPA) and Security (PPA/S) of the design. We show that using large LUTs makes LUT-based obfuscation resilient to hardware security threats. However, it also results in enormous design overheads beyond practical limits. To make the reconfigurable logic obfuscation efficient in terms of design overheads, this work proposes a novel LUT architecture where the security provided by the proposed primitive is superior to that of the traditional LUT-based obfuscation. Moreover, we leverage the security-driven design flow, which uses off-the-shelf industrial EDA tools to mitigate the design overheads further while being non-disruptive to the current industrial physical design flow. We empirically evaluate the security of the LUTs against state-of-the-art obfuscation techniques in terms of design overheads and SAT-attack resiliency. Our findings show that the proposed primitive significantly reduces both area and power by a factor of 8× and 2×, respectively, without compromising security.

Journal ArticleDOI
TL;DR: A state flip-flop (sFF) set definition for a human-readable state machine is presented, a systematic categorization of sFF sets is derived based on properties of single sFFs and their sets, and four post-processing methods are developed.
Abstract: The target of sequential reverse engineering is to extract the state machine of a design. Sequential reverse engineering of a gate-level netlist consists of the identification of so-called state flip-flops (sFFs), as well as the extraction of the state machine. The second step can be solved with an exact approach if the correct sFFs and the correct reset state are provided. For the first step, several more or less heuristic approaches exist. This work investigates sequential reverse engineering with the objective of a human-readable state machine extraction. A human-readable state machine reflects the original state machine and is not overloaded by additional design information. For this purpose, the work derives a systematic categorization of sFF sets, based on properties of single sFFs and their sets. These properties are determined by analyzing the degrees of freedom in describing state machines as the well-known Moore and Mealy machines. Based on the systematic categorization, this work presents an sFF set definition for a human-readable state machine, categorizes existing sFF identification strategies, and develops four post-processing methods. The results show that post-processing predominantly improves the outcome of several existing sFF identification algorithms.

Journal ArticleDOI
TL;DR: Approximate computing has emerged as an alternative way to further reduce the power consumption of integrated circuits (ICs) by trading off errors at the output with simpler, more efficient logic as discussed by the authors.
Abstract: Approximate Computing has emerged as an alternative way to further reduce the power consumption of integrated circuits (ICs) by trading off errors at the output with simpler, more efficient logic. ...

Journal ArticleDOI
TL;DR: Unmanned Aerial Vehicles have rapidly become popular for monitoring, delivery, and actuation in many application domains such as environmental management, disaster mitigation, homeland security, and more.
Abstract: Unmanned Aerial Vehicles (UAVs) have rapidly become popular for monitoring, delivery, and actuation in many application domains such as environmental management, disaster mitigation, homeland secur...

Journal ArticleDOI
TL;DR: Experimental results on ICCAD 2012 and ICCAD 2019 Contest benchmarks show that the architectures designed by the proposed scheme achieve higher performance on the hotspot detection task compared with state-of-the-art manually designed neural networks.
Abstract: Layout hotspot detection is of great importance in the physical verification flow. Deep neural network models have been applied to hotspot detection and achieved great success. Despite their success, high-performance neural networks are still quite difficult to design. In this article, we propose a Bayesian optimization-based neural architecture search scheme to automatically do this time-consuming and fiddly job. Experimental results on ICCAD 2012 and ICCAD 2019 Contest benchmarks show that the architectures designed by our proposed scheme achieve higher performance on the hotspot detection task compared with state-of-the-art manually designed neural networks.
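A minimal Bayesian-optimization loop over a tiny architecture space might look like the sketch below; the search space, the synthetic validation-error function, and the lower-confidence-bound acquisition are illustrative assumptions, not the scheme evaluated in the article.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Tiny architecture search space: (number of conv blocks, channels per block).
space = np.array([[d, w] for d in (2, 3, 4, 5) for w in (16, 32, 64, 128)], dtype=float)

def validation_error(arch):
    """Stand-in for training a hotspot detector and measuring its error;
    purely synthetic so the loop runs instantly."""
    depth, width = arch
    return 0.2 - 0.02 * depth - 0.0005 * width + 0.00001 * depth * width ** 1.2

rng = np.random.default_rng(0)
tried = list(rng.choice(len(space), size=3, replace=False))
errors = [validation_error(space[i]) for i in tried]

for _ in range(8):                                # Bayesian-optimization loop
    gp = GaussianProcessRegressor(normalize_y=True).fit(space[tried], errors)
    mu, sigma = gp.predict(space, return_std=True)
    acq = mu - 1.5 * sigma                        # lower confidence bound (minimize error)
    acq[tried] = np.inf                           # do not re-evaluate known architectures
    nxt = int(np.argmin(acq))
    tried.append(nxt)
    errors.append(validation_error(space[nxt]))

best = tried[int(np.argmin(errors))]
print("best architecture (depth, width):", space[best], "error:", round(min(errors), 4))
```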

Journal ArticleDOI
TL;DR: This article presents a systematic review of DMFB security, focusing on both state-of-the-art attack and defense techniques, and classifies the defense techniques into three categories according to their respective purposes: confidentiality protection, integrity protection, and availability protection.
Abstract: As an emerging lab-on-a-chip technology platform, digital microfluidic biochips (DMFBs) have been widely used for executing various laboratory procedures in biochemistry and biomedicine such as gene sequencing and near-patient diagnosis, with the advantages of low reagent consumption, high precision, and miniaturization and integration. With the ongoing rapid deployment of DMFBs, however, these devices are now facing serious and complicated security challenges that not only damage their functional integrity but also affect their system reliability. In this article, we present a systematic review of DMFB security, focusing on both the state-of-the-art attack and defense techniques. First, the overall security situation, the working principle, and the corresponding fabrication technology of DMFBs are introduced. Afterwards, existing attack approaches are divided into several categories and discussed in detail, including denial of service, intellectual property piracy, bioassay tampering, layout modification, actuation sequence tampering, concentration altering, parameter modification, reading forgery, and information leakage. To prevent biochips from being damaged by these attack behaviors, a number of defense measures have been proposed in recent years. Accordingly, we further classify these techniques into three categories according to their respective defense purposes, including confidentiality protection, integrity protection, and availability protection. These measures, to varying degrees, can provide effective protection for DMFBs. Finally, key trends and directions for future research that are related to the security of DMFBs are discussed from several aspects, e.g., manufacturing materials, biochip structure, and usage environment, thus providing new ideas for future biochip protection.

Journal ArticleDOI
TL;DR: This paper proposes phase-aware CPU workload forecasting, a novel approach that applies long-term phase prediction to improve the accuracy of short-term workload forecasting, and introduces different multi-core settings differentiated by the number of cores they access and whether they produce specialized or global outputs per core.
Abstract: Predicting workload behavior during workload execution is essential for dynamic resource optimization in multi-processor systems. Recent studies have proposed advanced machine learning techniques for dynamic workload prediction. Workload prediction can be cast as a time series forecasting problem. However, traditional forecasting models struggle to predict abrupt workload changes. These changes occur because workloads are known to go through phases. Prior work has investigated machine-learning-based approaches for phase detection and prediction, but such approaches have not been studied in the context of dynamic workload forecasting. In this article, we propose phase-aware CPU workload forecasting as a novel approach that applies long-term phase prediction to improve the accuracy of short-term workload forecasting. Phase-aware forecasting requires machine learning models for phase classification, phase prediction, and phase-based forecasting that have not been explored in this combination before. Furthermore, existing prediction approaches have only been studied in single-core settings. This work explores phase-aware workload forecasting with multi-threaded workloads running on multi-core systems. We propose different multi-core settings differentiated by the number of cores they access and whether they produce specialized or global outputs per core. We study various advanced machine learning models for phase classification, phase prediction, and phase-based forecasting in isolation and different combinations for each setting. We apply our approach to forecasting of multi-threaded Parsec and SPEC workloads running on an eight-core Intel Core-i9 platform. Our results show that combining GMM clustering with LSTMs for phase prediction and phase-based forecasting yields the best phase-aware forecasting results. An approach that uses specialized models per core achieves an average error of 23% with up to 22% improvement in prediction accuracy compared to a phase-unaware setup.
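The sketch below walks through the three ingredients on a synthetic utilization trace: GMM-based phase classification (as in the paper), plus a trivial per-phase mean forecaster and a first-order phase-transition predictor that stand in for the paper's LSTM models; all numbers are made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic CPU utilization trace that alternates between two workload phases.
phase_pattern = np.repeat([0, 1, 0, 1, 0], 40)
trace = np.where(phase_pattern == 0,
                 rng.normal(0.25, 0.03, phase_pattern.size),   # low-utilization phase
                 rng.normal(0.80, 0.05, phase_pattern.size))   # high-utilization phase

# Step 1: phase classification by clustering utilization samples (GMM, as in the paper).
gmm = GaussianMixture(n_components=2, random_state=0).fit(trace.reshape(-1, 1))
phases = gmm.predict(trace.reshape(-1, 1))

# Step 2: phase-based forecasting. The paper trains LSTMs per phase; a per-phase mean
# predictor stands in here so the sketch stays short.
phase_mean = {p: trace[phases == p].mean() for p in np.unique(phases)}

# Step 3: long-term phase prediction. A first-order transition model stands in for the
# paper's learned phase predictor.
def next_phase(current):
    nxt = phases[1:][phases[:-1] == current]
    return np.bincount(nxt).argmax() if nxt.size else current

predicted_phase = next_phase(phases[-1])
print("forecast for next interval:", round(float(phase_mean[predicted_phase]), 3))
```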

Journal ArticleDOI
TL;DR: An I/O scheduling scheme that resorts to the degraded read mode and the read-modify-write mode is proposed to reduce the long-tail latency of I/O requests in RAID-enabled SSDs; it builds a queuing overhead assessment model on top of data redundancy and the currently blocked I/O traffic on SSD channels.
Abstract: RAID-enabled SSDs commonly have unbalanced I/O workloads on their components (e.g., SSD channels), as the data/parity chunks in the same stripe may have varied access frequency, which greatly impacts I/O responsiveness. This article proposes an I/O scheduling scheme that resorts to the degraded read mode and the read-modify-write mode to reduce the long-tail latency of I/O requests in RAID-enabled SSDs. The basic idea is to avoid scheduling read or update requests to the heavily congested but targeted RAID components. Such requests are satisfied by accessing other relevant RAID components through certain XOR computations (which we call the degraded modes). Specifically, we build a queuing overhead assessment model on top of data redundancy and the currently blocked I/O traffic on SSD channels to precisely decide whether incoming I/O requests should be fulfilled in the degraded mode or not. The trace-driven experiments illustrate that the proposed scheme can reduce the long-tail latency of read requests by 23.1% on average at the 99.99th percentile, in contrast to state-of-the-art scheduling methods.
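A minimal sketch of the dispatch decision, assuming illustrative per-channel queue depths and normalized latency constants rather than the paper's full queuing overhead assessment model:

```python
# Queue depths (pending flash operations) per SSD channel, and a RAID-5 stripe whose
# chunks are spread over those channels. Numbers are illustrative only.
queue_depth = {0: 12, 1: 2, 2: 3, 3: 1}
READ_LAT, XOR_LAT = 1.0, 0.2          # normalized per-operation costs (assumed)

def dispatch_read(target_chan, stripe_chans):
    """Serve a read normally, or in degraded mode by reconstructing the chunk from
    the rest of the stripe, whichever the queuing-overhead estimate favors."""
    normal_cost = queue_depth[target_chan] * READ_LAT + READ_LAT
    others = [c for c in stripe_chans if c != target_chan]
    # A degraded read waits for the slowest of the remaining chunks, then XORs them.
    degraded_cost = max(queue_depth[c] for c in others) * READ_LAT + READ_LAT + XOR_LAT
    return ("degraded", degraded_cost) if degraded_cost < normal_cost else ("normal", normal_cost)

mode, cost = dispatch_read(target_chan=0, stripe_chans=[0, 1, 2, 3])
print(mode, round(cost, 2))   # channel 0 is congested, so the degraded mode wins here
```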

Journal ArticleDOI
TL;DR: In this article, a decision diagram style data structure, called Tensor Decision Diagram (TDD), is proposed for more principled and convenient applications of tensor networks for quantum circuits.
Abstract: Tensor networks have been successfully applied in simulation of quantum physical systems for decades. Recently, they have also been employed in classical simulation of quantum computing, in particular, random quantum circuits. This article proposes a decision diagram style data structure, called Tensor Decision Diagram (TDD), for more principled and convenient applications of tensor networks. This new data structure provides a compact and canonical representation for quantum circuits. By exploiting circuit partition, the TDD of a quantum circuit can be computed efficiently. Furthermore, we show that the operations of tensor networks essential in their applications (e.g., addition and contraction) can also be implemented efficiently in TDDs. A proof-of-concept implementation of TDDs is presented and its efficiency is evaluated on a set of benchmark quantum circuits. It is expected that TDDs will play an important role in various design automation tasks related to quantum circuits, including but not limited to equivalence checking, error detection, synthesis, simulation, and verification.
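For readers unfamiliar with tensor-network simulation of circuits, the sketch below contracts the tensor network of a two-qubit Bell-state circuit with plain numpy einsum; it illustrates the contraction operation that TDDs represent compactly and canonically, not the TDD data structure itself.

```python
import numpy as np

# Single- and two-qubit gates as tensors (rank-2 and rank-4, respectively).
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.eye(4).reshape(2, 2, 2, 2)            # indices: (out_ctrl, out_tgt, in_ctrl, in_tgt)
CNOT[1, :, 1, :] = np.array([[0, 1], [1, 0]])   # flip the target when the control is 1

# |00> as two rank-1 tensors.
q0 = np.array([1.0, 0.0])
q1 = np.array([1.0, 0.0])

# Contract the network for the Bell-state circuit H(q0); CNOT(q0, q1).
# a, c: circuit inputs; b: wire between H and CNOT; d, e: circuit outputs.
state = np.einsum("a,c,ba,debc->de", q0, q1, H, CNOT).reshape(4)
print(state)   # amplitudes of |00>, |01>, |10>, |11>: [0.707, 0, 0, 0.707]
```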

Journal ArticleDOI
TL;DR: CeMux (Correlation-enhanced Multiplexer) is a stochastic weighted adder that is significantly more accurate and uses much less area than traditional weighted adders; compared to a conventional binary design, it achieves a 50% to 73% area reduction for similar power and latency, at a slightly higher level of error.
Abstract: Stochastic computing (SC) is a low-cost computational paradigm that has promising applications in digital filter design, image processing, and neural networks. Fundamental to these applications is the weighted addition operation, which is most often implemented by a multiplexer (mux) tree. Mux-based adders have very low area but typically require long bitstreams to reach practical accuracy thresholds when the number of summands is large. In this work, we first identify the main contributors to mux adder error. We then demonstrate with analysis and experiment that two new techniques, precise sampling and full correlation, can target and mitigate these error sources. Implementing these techniques in hardware leads to the design of CeMux (Correlation-enhanced Multiplexer), a stochastic mux adder that is significantly more accurate and uses much less area than traditional weighted adders. We compare CeMux to other SC and hybrid designs for an electrocardiogram filtering case study that employs a large digital filter. One major result is that CeMux is shown to be accurate even for large input sizes. CeMux's higher accuracy leads to a latency reduction of 4× to 16× over other designs. Furthermore, CeMux uses about 35% less area than existing designs, and we demonstrate that a small amount of accuracy can be traded for a further 50% reduction in area. Finally, we compare CeMux to a conventional binary design and we show that CeMux can achieve a 50% to 73% area reduction for similar power and latency as the conventional design but at a slightly higher level of error.
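The baseline that CeMux improves upon can be sketched as follows: a mux-based weighted adder over unipolar bitstreams, where the select signal is drawn according to the (normalized) weights. The bitstream length, inputs, and weights are arbitrary, and CeMux's precise sampling and full correlation techniques are not implemented here.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1024                                  # bitstream length

def to_stream(p, n=N):
    """Unipolar stochastic bitstream: each bit is 1 with probability p."""
    return rng.random(n) < p

def mux_weighted_add(streams, weights, n=N):
    """Mux-tree weighted adder: at each clock cycle the select signal picks input i
    with probability weights[i], so the output stream encodes sum_i w_i * x_i."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    select = rng.choice(len(streams), size=n, p=weights)
    stacked = np.vstack(streams)
    return stacked[select, np.arange(n)]

inputs = [0.8, 0.3, 0.5, 0.1]
weights = [4, 2, 1, 1]
out = mux_weighted_add([to_stream(x) for x in inputs], weights)
exact = np.dot(weights, inputs) / sum(weights)
print("stochastic estimate:", out.mean(), " exact:", round(exact, 4))
```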

Journal ArticleDOI
TL;DR: In this paper, a fine-grained structured pruning scheme and corresponding compiler optimizations are proposed to achieve high accuracy and hardware inference performance for real-time deep neural network (DNN) inference on mobile devices.
Abstract: Weight pruning is an effective model compression technique to tackle the challenges of achieving real-time deep neural network (DNN) inference on mobile devices. However, prior pruning schemes have limited application scenarios due to accuracy degradation, difficulty in leveraging hardware acceleration, and/or restriction on certain types of DNN layers. In this article, we propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations that are applicable to any type of DNN layer while achieving high accuracy and hardware inference performance. With the flexibility of applying different pruning schemes to different layers enabled by our compiler optimizations, we further probe into the new problem of determining the best-suited pruning scheme considering the different acceleration and accuracy performance of various pruning schemes. Two pruning scheme mapping methods, one search based and the other rule based, are proposed to automatically derive the best-suited pruning regularity and block size for each layer of any given DNN. Experimental results demonstrate that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework with up to 2.48× and 1.73× DNN inference acceleration on CIFAR-10 and ImageNet datasets without accuracy loss.
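A simplified stand-in for block-based structured pruning (one of the pruning regularities the mapping methods choose among) is sketched below; the block shape, sparsity target, and layer size are arbitrary.

```python
import numpy as np

def block_prune(weight, block=(1, 4), sparsity=0.5):
    """Fine-grained structured pruning: zero out whole `block`-shaped groups of weights
    with the smallest L2 norms, keeping a regular pattern that compilers and hardware
    can exploit (a simplified stand-in for the paper's per-layer schemes)."""
    rows, cols = weight.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0
    blocks = weight.reshape(rows // br, br, cols // bc, bc)
    norms = np.sqrt((blocks ** 2).sum(axis=(1, 3)))          # one norm per block
    threshold = np.quantile(norms, sparsity)
    mask = (norms >= threshold)[:, None, :, None]            # keep only high-norm blocks
    return (blocks * mask).reshape(rows, cols)

w = np.random.randn(8, 16)
pruned = block_prune(w, block=(1, 4), sparsity=0.75)
print("block sparsity:", 1 - np.count_nonzero(pruned) / pruned.size)
```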

Journal ArticleDOI
TL;DR: This paper significantly extends and combines two existing CNN memory reuse methodologies to offer efficient memory reuse for a wide range of CNN-based applications.
Abstract: Many modern applications require execution of Convolutional Neural Networks (CNNs) on edge devices, such as mobile phones or embedded platforms. This can be challenging, as the state-of-the-art CNNs are memory costly, whereas the memory budget of edge devices is highly limited. To address this challenge, a variety of CNN memory reduction methodologies have been proposed. Typically, the memory of a CNN is reduced using methodologies such as pruning and quantization. These methodologies reduce the number or precision of CNN parameters, thereby reducing the CNN memory cost. When more aggressive CNN memory reduction is required, the pruning and quantization methodologies can be combined with CNN memory reuse methodologies. The latter methodologies reuse device memory allocated for storage of CNN intermediate computational results, thereby further reducing the CNN memory cost. However, the existing memory reuse methodologies are unfit for CNN-based applications that exploit pipeline parallelism available within the CNNs or use multiple CNNs to perform their functionality. In this article, we therefore propose a novel CNN memory reuse methodology. In our methodology, we significantly extend and combine two existing CNN memory reuse methodologies to offer efficient memory reuse for a wide range of CNN-based applications.
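A back-of-the-envelope sketch of why buffer reuse helps for a purely sequential CNN (hypothetical tensor sizes; the article's methodology additionally handles pipelined and multi-CNN applications, which this does not model):

```python
# Sizes (in elements) of the input tensor and each layer's output tensor for a
# small, purely sequential CNN (hypothetical numbers).
tensor_sizes = [3 * 224 * 224, 32 * 112 * 112, 64 * 56 * 56, 128 * 28 * 28, 256 * 14 * 14, 1000]

# No reuse: every intermediate tensor keeps its own buffer for the whole run.
no_reuse = sum(tensor_sizes)

# Simple reuse for a sequential CNN: while layer k executes, only its input and
# output tensors must be live, so the peak footprint is the worst adjacent pair.
with_reuse = max(a + b for a, b in zip(tensor_sizes, tensor_sizes[1:]))

print(f"no reuse:   {no_reuse:,} elements")
print(f"with reuse: {with_reuse:,} elements ({with_reuse / no_reuse:.0%} of the original)")
```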

Journal ArticleDOI
TL;DR: A CNN acceleration architecture called MVP is proposed that efficiently processes both compute- and memory-intensive operations with a small area overhead on top of the baseline systolic-array-based architecture; a specialized vector unit tailored for processing DW-CONV is introduced to meet its high memory bandwidth requirement.
Abstract: Mobile and edge devices have become common platforms for inferring convolutional neural networks (CNNs) due to superior privacy and service quality. To reduce the computational costs of convolution (CONV), recent CNN models adopt depth-wise CONV (DW-CONV) and Squeeze-and-Excitation (SE). However, existing area-efficient CNN accelerators are sub-optimal for these latest CNN models because they were mainly optimized for compute-intensive standard CONV layers with abundant data reuse that can be pipelined with activation and normalization operations. In contrast, DW-CONV and SE are memory-intensive with limited data reuse. The latter also strongly depends on the nearby CONV layers, making an effective pipelining a daunting task. Therefore, DW-CONV and SE only occupy 10% of entire operations but become memory bandwidth bound, spending more than 60% of the processing time in systolic-array-based accelerators. We propose a CNN acceleration architecture called MVP, which efficiently processes both compute- and memory-intensive operations with a small area overhead on top of the baseline systolic-array-based architecture. We suggest a specialized vector unit tailored for processing DW-CONV, including multipliers, adder trees, and multi-banked buffers to meet the high memory bandwidth requirement. We augment the unified buffer with tiny processing elements to smoothly pipeline SE with the subsequent CONV, enabling concurrent processing of DW-CONV with standard CONV, thereby achieving the maximum utilization of arithmetic units. Our evaluation shows that MVP improves performance by 2.6× and reduces energy by 47% on average for EfficientNet-B0/B4/B7, MnasNet, and MobileNet-V1/V2 with only a 9% area overhead compared to the baseline.
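The memory-bound nature of DW-CONV described above can be seen from a quick operation count; the layer dimensions below are arbitrary and the data-movement model ignores tiling and on-chip reuse.

```python
def conv_stats(h, w, cin, cout, k, depthwise=False):
    """MAC count and a rough arithmetic intensity (MACs per element moved) for a
    standard vs. a depth-wise convolution layer; ignores tiling and on-chip reuse."""
    if depthwise:
        cout = cin                                  # one filter per input channel
        macs = h * w * cin * k * k
        weights = cin * k * k
    else:
        macs = h * w * cin * cout * k * k
        weights = cin * cout * k * k
    moved = weights + h * w * cin + h * w * cout    # weights + input + output feature maps
    return macs, macs / moved

std = conv_stats(56, 56, 128, 128, 3)
dw = conv_stats(56, 56, 128, 128, 3, depthwise=True)
print(f"standard CONV  : {std[0] / 1e6:6.1f} MMACs, {std[1]:6.1f} MACs/element")
print(f"depth-wise CONV: {dw[0] / 1e6:6.1f} MMACs, {dw[1]:6.1f} MACs/element")
```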