
Showing papers in "ACM Transactions on Design Automation of Electronic Systems in 2019"


Journal ArticleDOI
TL;DR: This article provides a comprehensive review of the state of the art with respect to locking/camouflaging techniques by defining a systematic threat model for these techniques and discussing how various real-world scenarios relate to each threat model.
Abstract: The globalization of the semiconductor supply chain introduces ever-increasing security and privacy risks. Two major concerns are IP theft through reverse engineering and malicious modification of the design; the latter in part relies on successful reverse engineering of the design as well. IC camouflaging and logic locking are two of the techniques under research that can thwart reverse engineering by end-users or foundries. However, developing low-overhead locking/camouflaging schemes that can resist the ever-evolving state-of-the-art attacks has been a challenge for several years. This article provides a comprehensive review of the state of the art in locking/camouflaging techniques. We start by defining a systematic threat model for these techniques and discussing how various real-world scenarios relate to each threat model. We then trace the evolution of generic algorithmic attacks under each threat model, leading up to the strongest existing attacks. The article then systematizes defenses and, along the way, discusses attacks that are specific to certain kinds of locking/camouflaging. We conclude by discussing open problems and future directions.

67 citations


Journal ArticleDOI
TL;DR: A blockchain-based certificate authority framework that can be used to manage critical chip information such as electronic chip identification, chip grade, and transaction time is proposed that can mitigate most threats of the electronics supply chain, such as recycling, remarking, cloning, and overproduction.
Abstract: Electronic systems are ubiquitous today, playing an irreplaceable role in our personal lives, as well as in critical infrastructures such as power grids, satellite communications, and public transportation. In the past few decades, the security of software running on these systems has received significant attention. However, hardware has been assumed to be trustworthy and reliable “by default” without really analyzing the vulnerabilities in the electronics supply chain. With the rapid globalization of the semiconductor industry, it has become challenging to ensure the integrity and security of hardware. In this article, we discuss the integrity concerns associated with a globalized electronics supply chain. More specifically, we divide the supply chain into six distinct entities: IP owner/foundry (OCM), distributor, assembler, integrator, end user, and electronics recycler, and analyze the vulnerabilities and threats associated with each stage. To address the concerns of the supply chain integrity, we propose a blockchain-based certificate authority framework that can be used to manage critical chip information such as electronic chip identification, chip grade, and transaction time. The decentralized nature of the proposed framework can mitigate most threats of the electronics supply chain, such as recycling, remarking, cloning, and overproduction.

51 citations


Journal ArticleDOI
TL;DR: The proposed solution automates hardware and software protocols using blockchain-powered Smart Contract that allows supply chain participants to authenticate, track, trace, analyze, and provision chips throughout their entire life cycle.
Abstract: Globalization of the IC supply chain has increased the risk of counterfeit, tampered, and re-packaged chips in the market. Counterfeit electronics pose a security risk in safety-critical applications like avionics, SCADA systems, and defense. They also affect the reputation of legitimate suppliers and cause financial losses. Hence, it becomes necessary to develop traceability solutions that ensure the integrity of the supply chain, from the time of fabrication to the end of product life, and allow a customer to verify the provenance of a device or a system. In this article, we present an IC traceability solution based on blockchain. A blockchain is a public immutable database that maintains a continuously growing list of data records secured from tampering and revision. Over the lifetime of an IC, all ownership transfer information is recorded and archived in a blockchain. This safe, verifiable method prevents any party from altering or challenging the legitimacy of the information being exchanged. However, a chain of sales records is not enough to ensure the provenance of an IC. There is a need for a clone-proof method of securely binding the identity of an IC to the blockchain information. In this article, we propose a method of IC supply chain traceability via a blockchain pegged to an embedded physically unclonable function (PUF). The blockchain provides the ownership transfer record, while the PUF provides unique identification for an IC, allowing it to be linked uniquely to a blockchain. Our proposed solution automates hardware and software protocols using blockchain-powered smart contracts that allow supply chain participants to authenticate, track, trace, analyze, and provision chips throughout their entire life cycle.
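The core binding idea can be sketched in a few lines: each ownership transfer is appended as a hash-linked record keyed by a chip identity derived from a PUF response. The sketch below is illustrative only; `puf_id` is a deterministic stand-in for real device-specific silicon behavior, and all names are hypothetical rather than taken from the article.

```python
import hashlib
import json
import time

def puf_id(challenge: bytes) -> str:
    # Stand-in for a real PUF response: in silicon this would come from
    # device-specific process variation, not a deterministic hash.
    return hashlib.sha256(b"device-secret" + challenge).hexdigest()[:16]

class TraceChain:
    """Hash-linked ledger of ownership transfers for one chip."""
    def __init__(self, chip_id: str):
        self.chip_id = chip_id
        self.blocks = []

    def transfer(self, new_owner: str) -> dict:
        prev = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        record = {"chip_id": self.chip_id, "owner": new_owner,
                  "time": time.time(), "prev": prev}
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.blocks.append(record)
        return record

    def verify(self) -> bool:
        # Recompute every hash and check each link to its predecessor.
        prev = "0" * 64
        for b in self.blocks:
            body = {k: v for k, v in b.items() if k != "hash"}
            if b["prev"] != prev or b["hash"] != hashlib.sha256(
                    json.dumps(body, sort_keys=True).encode()).hexdigest():
                return False
            prev = b["hash"]
        return True

chain = TraceChain(puf_id(b"challenge-0"))
chain.transfer("foundry")
chain.transfer("distributor")
assert chain.verify()
chain.blocks[0]["owner"] = "counterfeiter"   # tampering breaks the chain
assert not chain.verify()
```

Because the chip ID in every record comes from the PUF, a cloned chip cannot produce records that verify against the original device's chain.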

40 citations


Journal ArticleDOI
TL;DR: It is shown that all electronics CAD tools—high-level synthesis, logic synthesis, physical design, verification, test, and post-silicon validation—are potential threat vectors to different degrees.
Abstract: Fabless semiconductor companies design system-on-chips (SoC) by using third-party intellectual property (IP) cores and fabricate them in offshore, potentially untrustworthy foundries. Owing to the globally distributed electronics supply chain, security has emerged as a serious concern. In this article, we explore electronics computer-aided design (CAD) software as a threat vector that can be exploited to introduce vulnerabilities into the SoC. We show that all electronics CAD tools—high-level synthesis, logic synthesis, physical design, verification, test, and post-silicon validation—are potential threat vectors to different degrees. We have demonstrated CAD-based attacks on several benchmarks, including the commercial ARM Cortex M0 processor [1].

29 citations


Journal ArticleDOI
TL;DR: This article proposes an efficient cache reconfiguration framework for NoC-based many-core architectures that considers all significant components, including NoC, caches, and main memory, along with a machine learning-based framework that can reduce the exploration time by an order of magnitude with negligible loss in accuracy.
Abstract: Dynamic cache reconfiguration (DCR) is an effective technique to optimize energy consumption in many-core architectures. While early work on DCR has shown promising energy-saving opportunities, prior techniques are not suitable for many-core architectures since they do not consider the interactions and tight coupling between memory, caches, and network-on-chip (NoC) traffic. In this article, we propose an efficient cache reconfiguration framework for NoC-based many-core architectures. The proposed work makes three major contributions. First, we model a distributed directory-based many-core architecture similar to the Intel Xeon Phi architecture. Next, we propose an efficient cache reconfiguration framework that considers all significant components, including NoC, caches, and main memory. Finally, we propose a machine learning-based framework that can reduce the exploration time by an order of magnitude with negligible loss in accuracy. Our experimental results demonstrate 18.5% energy savings on average compared to the base cache configuration.

20 citations


Journal ArticleDOI
TL;DR: This review summarizes practical solutions that can mitigate the impact of nonideal device and circuit properties of resistive memory by static parameter and dynamic runtime co-optimization, and portrays a unified reconfigurable computational memory architecture.
Abstract: Emerging computational resistive memory is promising to overcome the challenges of scalability and energy efficiency that DRAM faces and also break through the memory wall bottleneck. However, cell-level and array-level nonideal properties of resistive memory significantly degrade the reliability, performance, accuracy, and energy efficiency during memory access and analog computation. Cell-level nonidealities include nonlinearity, asymmetry, and variability. Array-level nonidealities include interconnect resistance, parasitic capacitance, and sneak current. This review summarizes practical solutions that can mitigate the impact of nonideal device and circuit properties of resistive memory. First, we introduce several typical resistive memory devices with a focus on their switching modes and characteristics. Second, we review resistive memory cells and memory array structures, including 1T1R, 1R, 1S1R, 1TnR, and CMOL. We also survey three-dimensional (3D) cross-point arrays and their structural properties. Third, we analyze the impact of nonideal device and circuit properties during memory access and analog arithmetic operations, with a focus on dot-product and matrix-vector multiplication. Fourth, we discuss methods that can mitigate these nonideal properties by static parameter and dynamic runtime co-optimization from the viewpoint of device and circuit interaction. Here, dynamic runtime operation schemes include line connection, voltage bias, logical-to-physical mapping, read reference setting, and switching mode reconfiguration. Then, we highlight challenges for multilevel-cell cross-point arrays and 3D cross-point arrays during these operations. Finally, we investigate design considerations for memory array peripheral circuits. We also portray a unified reconfigurable computational memory architecture.
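To make one array-level nonideality concrete, the toy model below computes a crossbar column's analog dot product with a first-order accumulated IR-drop term from interconnect resistance. This is an illustrative sketch under simplified assumptions, not a circuit-accurate model, and the function name and parameters are invented for this example.

```python
def crossbar_dot(v, g, r_wire=0.0):
    """Model one crossbar column computing a dot product in the analog domain.

    v: input voltages applied on the rows; g: cell conductances (siemens).
    r_wire: per-segment wire resistance. With r_wire > 0, the accumulated
    current through earlier segments drops the effective voltage seen by
    later cells, attenuating the result. First-order illustration only.
    """
    current = 0.0
    drop = 0.0
    for vi, gi in zip(v, g):
        current += max(vi - drop, 0.0) * gi
        drop += current * r_wire   # IR drop accumulates along the column
    return current

v = [1.0, 1.0, 1.0, 1.0]
g = [1e-4] * 4          # four cells at 100 microsiemens each
ideal = crossbar_dot(v, g)               # no parasitics: sum(v_i * g_i)
lossy = crossbar_dot(v, g, r_wire=50.0)  # wire resistance degrades the sum
assert lossy < ideal
```

Mitigation schemes like the logical-to-physical mapping and read-reference adjustments discussed above exist precisely to compensate for this kind of position-dependent attenuation.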

19 citations


Journal ArticleDOI
TL;DR: This is the first effort to compare the existing hardware topologies in terms of flexibility and functionality and highlights key challenges and open problems in this domain.
Abstract: In a reconfigurable battery pack, the connections among cells can be changed during operation to form different configurations. This can turn a battery, a passive two-terminal device, into a smart battery that reconfigures itself according to requirements to enhance operational performance. Several hardware architectures with different levels of complexity have been proposed. Some researchers have used existing hardware and demonstrated improved performance on the basis of novel optimization and scheduling algorithms. The possibility of software techniques benefiting energy storage systems is exciting, and the time is right for such methods as the need for high-performance and long-lasting batteries is on the rise. This novel field requires new understanding, principles, and evaluation metrics for proposed schemes. In this article, we systematically discuss and critically review the state of the art. This is the first effort to compare the existing hardware topologies in terms of flexibility and functionality. We provide a comprehensive review that encompasses all existing research works, starting from the details of the individual battery, including modeling and properties, as well as fixed-topology traditional battery packs. To stimulate further research in this area, we highlight key challenges and open problems in this domain.

19 citations


Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed adaptive DSE strategies clearly outperform a state-of-the-art DSE approach known from the literature in terms of both the quality of the resulting implementations and exploration times.
Abstract: State-of-the-art system synthesis techniques employ meta-heuristic optimization techniques for Design Space Exploration (DSE) to tailor application execution, e.g., defined by a dataflow graph, to a given target platform. Unfortunately, the performance evaluation of each implementation candidate is computationally very expensive, in particular on recent multi-core platforms, as it involves compilation to and extensive evaluation on the target hardware. Applying heuristics for performance evaluation on the one hand allows for a reduction of the exploration time but on the other hand may deteriorate the convergence of the optimization technique toward performance-optimal solutions on the target platform. To address this problem, we propose DSE strategies that dynamically trade off between (i) approximating heuristics to guide the exploration and (ii) accurate performance evaluation, i.e., compilation of the application and subsequent performance measurement on the target platform. Technically, this is achieved by introducing a set of additional, but easily computable, guiding objective functions and adaptively varying the set of objective functions evaluated during the DSE. One major advantage of these guiding objectives is that they are generically applicable to dataflow models without requiring any configuration techniques to tailor their parameters to the specific use case. We show this for synthetic benchmarks as well as a real-world control application. Moreover, the experimental results demonstrate that our proposed adaptive DSE strategies clearly outperform a state-of-the-art DSE approach known from the literature in terms of both the quality of the resulting implementations and exploration times.
Among other results, we show a two-core case in which, after about 3 hours of exploration time, one of our proposed adaptive DSE strategies already obtains a 60% higher performance value than the state-of-the-art approach. Even when the state-of-the-art approach is given a total exploration time of more than 2 weeks to optimize this value, the proposed adaptive DSE strategy achieves a 20% higher performance value after a total exploration time of about 4 days.

17 citations


Journal ArticleDOI
TL;DR: This article presents a comprehensive survey of time-multiplexed (TM) FPGA overlays from the research literature, and focuses on CGRA-like overlays, with either an array of interconnected processor-based functional units or medium-grained arithmetic functional units.
Abstract: This article presents a comprehensive survey of time-multiplexed (TM) FPGA overlays from the research literature. These overlays are categorized by implementation into two groups: processor-based overlays, whose implementation follows that of conventional silicon-based microprocessors, and CGRA-like overlays, with either an array of interconnected processor-based functional units or medium-grained arithmetic functional units. Time-multiplexing the overlay allows it to change its behavior on a cycle-by-cycle basis while executing the application kernel, thus allowing better sharing of the limited FPGA hardware resources. However, most TM overlays suffer from large resource overheads, due either to the underlying processor-like architecture (for processor-based overlays) or to the routing array and instruction storage requirements (for CGRA-like overlays). Reducing the area overhead of CGRA-like overlays, specifically that of the routing network, and better utilizing the hard macros in the target FPGA are active areas of research.

17 citations


Journal ArticleDOI
TL;DR: A security-aware methodology for routing and scheduling for control applications in Ethernet networks is proposed to maximize the resilience of control applications within these networked control systems to malicious interference while guaranteeing the stability of all control plants, despite the stringent resource constraints in such cyber-physical systems.
Abstract: Today, it is common knowledge in the cyber-physical systems domain that the tight interaction between cyber and physical elements makes possible performance improvements that would otherwise be unattainable. On the downside, however, this tight interaction with cyber elements makes it easier for an adversary to compromise the safety of the system. This is particularly important since such systems are typically composed of several critical physical components, e.g., adaptive cruise control or engine control, that allow deep intervention in the driving of a vehicle. As a result, it is important to ensure not only the reliability of such systems, e.g., in terms of schedulability and stability of control plants, but also resilience to adversarial attacks. In this article, we propose a security-aware methodology for routing and scheduling of control applications in Ethernet networks. The goal is to maximize the resilience of control applications within these networked control systems to malicious interference while guaranteeing the stability of all control plants, despite the stringent resource constraints in such cyber-physical systems. Our experimental evaluations demonstrate that careful optimization of available resources can significantly improve the resilience of these networked control systems to attacks.

15 citations


Journal ArticleDOI
TL;DR: This article proposes a novel cache management policy that attempts to maximize write-coalescing in the on-chip SRAM last-level cache (LLC) for the sake of reducing the number of costly writes to the off-chip NVM, and shows that this proposal reduces the number of writebacks by 21%, on average, over the state-of-the-art method.
Abstract: Non-Volatile Memory (NVM) technology is a promising solution to fulfill the ever-growing need for higher capacity in the main memory of modern systems. Despite having many great features, however, NVM’s poor write performance remains a severe obstacle, preventing it from being used as a DRAM alternative in the main memory. Most of the prior work targeted optimizing writes at the main memory side and neglected the decisive role of upper-level cache management policies on reducing the number of writes. In this article, we propose a novel cache management policy that attempts to maximize write-coalescing in the on-chip SRAM last-level cache (LLC) for the sake of reducing the number of costly writes to the off-chip NVM. We decouple a few physical ways of the LLC to have a dedicated and exclusive storage for the dirty blocks after being evicted from the cache and before being sent to the off-chip memory. By displacing dirty blocks in exclusive storage, they are kept in the cache based on their rewrite distance and are evicted when they are unlikely to be reused shortly. To maximize the effectiveness of exclusive storage, we manage it as a Cuckoo Cache to offer associativity based on the various applications’ demands. Through detailed evaluations targeting various single- and multi-threaded applications, we show that our proposal reduces the number of writebacks by 21%, on average, over the state-of-the-art method and enhances both performance and energy efficiency.
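As a rough illustration of the write-coalescing idea, the toy model below parks evicted dirty blocks in a small buffer so that a rewrite to a parked block never reaches the NVM. The class and replacement policy here are a simplified stand-in, not the article's decoupled-way Cuckoo Cache organization.

```python
from collections import OrderedDict

class WriteCoalescingBuffer:
    """Toy model of holding evicted dirty blocks to coalesce NVM writes.

    Dirty blocks evicted from the LLC are parked here; a rewrite to a
    parked block is absorbed (coalesced) instead of reaching the NVM.
    This illustrates the general idea only, not the paper's decoupled-way
    Cuckoo organization.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = OrderedDict()      # block address -> data
        self.nvm_writes = 0

    def evict_dirty(self, addr, data):
        if addr in self.buf:
            self.buf.move_to_end(addr)    # coalesced: no NVM write
        elif len(self.buf) >= self.capacity:
            self.buf.popitem(last=False)  # oldest block goes to NVM
            self.nvm_writes += 1
        self.buf[addr] = data

    def flush(self):
        # Drain everything still parked to the NVM.
        self.nvm_writes += len(self.buf)
        self.buf.clear()

buf = WriteCoalescingBuffer(capacity=2)
for addr in [0x10, 0x20, 0x10, 0x10, 0x30, 0x20]:
    buf.evict_dirty(addr, b"...")
buf.flush()
print(buf.nvm_writes)   # 4 NVM writes instead of 6 without coalescing
```

The two rewrites of block 0x10 are absorbed in the buffer, so six dirty evictions cost only four NVM writes; the article's policy additionally keeps blocks ranked by rewrite distance so that rarely rewritten blocks are evicted first.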

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the subcomponent timing model provides sufficient features to achieve high detection accuracy with low false-positive rates using a one-class support vector machine, considering sophisticated mimicry malware.
Abstract: Malware is a serious threat to network-connected embedded systems, as evidenced by the continued and rapid growth of such devices, commonly referred to as the Internet of Things. Their ubiquitous use in critical applications requires robust protection to ensure user safety and privacy. That protection must be applied to all system aspects, extending beyond protecting the network and external interfaces. Anomaly detection is one of the last lines of defense against malware, and data-driven approaches that require the least domain knowledge are popular for it. However, embedded systems, particularly edge devices, face several challenges in applying data-driven anomaly detection, including the unpredictability of malware, limited tolerance to long data collection windows, and limited computing/energy resources. In this article, we use subcomponent timing information of software execution, including intrinsic software execution time, instruction cache misses, and data cache misses, as features to detect anomalies based on ranges, multi-dimensional Euclidean distance, and classification at runtime. Detection methods based on lumped timing ranges are also evaluated and compared. We design several hardware detectors implementing these data-driven detection methods, which non-intrusively measure the lumped/subcomponent timing of all system/function calls of the embedded application. We evaluate the area, power, and detection latency of the presented detector designs. Experimental results demonstrate that the subcomponent timing model provides sufficient features to achieve high detection accuracy with low false-positive rates using a one-class support vector machine, even considering sophisticated mimicry malware.
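The simplest of the detection methods mentioned above, range-based detection over timing features, can be sketched in a few lines. The class and feature names below are illustrative assumptions, not the article's hardware implementation; a real detector would observe these timings per system/function call in hardware.

```python
class TimingRangeDetector:
    """Range-based anomaly detector over per-call timing features.

    Trains on timings from clean runs only (the one-class setting) and
    flags any observation falling outside the learned [min, max] range
    of any feature. A software stand-in for the subcomponent-timing
    hardware detectors described in the article.
    """
    def __init__(self, n_features):
        self.lo = [float("inf")] * n_features
        self.hi = [float("-inf")] * n_features

    def train(self, samples):
        for s in samples:
            for i, x in enumerate(s):
                self.lo[i] = min(self.lo[i], x)
                self.hi[i] = max(self.hi[i], x)

    def is_anomalous(self, sample):
        return any(not (self.lo[i] <= x <= self.hi[i])
                   for i, x in enumerate(sample))

# Hypothetical features: (execution cycles, I-cache misses, D-cache misses)
det = TimingRangeDetector(3)
det.train([(100, 4, 7), (104, 5, 6), (98, 4, 8)])
assert not det.is_anomalous((101, 4, 7))
assert det.is_anomalous((150, 4, 7))   # timing outside the trained range
```

Mimicry malware that keeps each feature inside its trained range defeats this simple scheme, which is why the article also evaluates multi-dimensional distance and one-class SVM classification over the same features.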

Journal ArticleDOI
TL;DR: This article proposes a novel writeback-aware Last Level Cache (LLC) management scheme named WALL to reduce the number of LLC writebacks and consequently improve performance, energy efficiency, and lifetime of a PCM-based main memory system.
Abstract: With the increasing number of data-intensive applications in today's workloads, DRAM-based main memories are struggling to satisfy the growing demand for capacity. Phase Change Memory (PCM) is a type of non-volatile memory technology that has been explored as a promising alternative to DRAM-based main memories due to its better scalability and lower leakage energy. Despite its many advantages, PCM also has shortcomings such as long write latency, high write energy consumption, and limited write endurance, all related to write operations. In this article, we propose a novel writeback-aware Last Level Cache (LLC) management scheme named WALL to reduce the number of LLC writebacks and consequently improve the performance, energy efficiency, and lifetime of a PCM-based main memory system. First, we investigate the writeback behavior of LLC sets and show that writebacks are not uniformly distributed among sets; some sets observe much higher writeback rates than others. We then propose a writeback-aware set-balancing mechanism, which employs the underutilized LLC sets with few writebacks as auxiliary storage for the dirty lines evicted from sets with frequent writebacks. We also propose a simple and effective writeback-aware replacement policy to avoid the eviction of dirty blocks that are highly reused after being evicted from the cache. Our experimental results show that WALL achieves an average of 30.9% reduction in the total number of LLC writebacks, compared to the baseline scheme, which uses the LRU replacement policy. As a result, WALL can reduce the memory energy consumption by 23.1% and enhance PCM lifetime by 1.29×, on average, on an 8-core system with a 4GB PCM main memory running memory-intensive applications.

Journal ArticleDOI
TL;DR: This article proposes a new SMART-based NoC design called SHARP, which increases throughput by up to 19% and average link utilization by up to 24% by avoiding false negatives, and reduces the wiring and area overhead significantly.
Abstract: SMART-based NoC designs achieve ultra-low latencies by enabling flits to traverse multiple hops within a single clock cycle. Notwithstanding the clear performance benefits, SMART-based NoCs suffer from several shortcomings: each router must arbitrate among a quadratic number of requests, which leads to high costs; each router independently makes its own arbitration decisions, which leads to a problem called false negatives that causes throughput loss. In this article, we propose a new SMART-based NoC design called SHARP that overcomes these shortcomings. Our evaluation demonstrates that SHARP increases throughput by up to 19% and average link utilization by up to 24% by avoiding false negatives. By avoiding quadratic arbitration, our evaluation further demonstrates that SHARP reduces the wiring and area overhead significantly.

Journal ArticleDOI
TL;DR: Simulation results show that, at a minimal performance penalty, the proposed cache-based thermal management with an 8MB centralized multi-banked shared LLC yields around a 5°C reduction in peak and average chip temperature, comparable with a greedy DVFS policy.
Abstract: In the era of short channel lengths, Dynamic Thermal Management (DTM) has become a challenging task for the architects and designers engineering modern Chip Multi-Processors (CMPs). The ever-increasing demand for processing power, along with advanced integration technology, produces CMPs with high power density, which in turn increases effective chip temperature. This increased temperature raises reliability issues for the chip circuitry along with a significant increase in leakage power consumption. Recent DTM techniques apply DVFS or task migration to reduce temperature at the cores, the hottest on-chip components, but often ignore the hot on-chip caches. To meet the high data demand of these cores, most modern CMPs are equipped with large multi-level on-chip caches, of which the on-chip Last Level Caches (LLCs) occupy the largest on-chip area. These LLCs account for significant leakage power consumption and can also generate on-chip hotspots, similar to the cores. Since power consumption is the primary driver of heat dissipation, this work dynamically shrinks the cache size, primarily to reduce LLC leakage, while maintaining a performance constraint. The turned-off cache portions further serve as on-chip thermal buffers, reducing the average and peak temperature of the CMP without affecting the computation. Simulation results show that, at a minimal performance penalty, the proposed cache-based thermal management with an 8MB centralized multi-banked shared LLC yields around a 5°C reduction in peak and average chip temperature, comparable with a greedy DVFS policy.

Journal ArticleDOI
TL;DR: A hardware-efficient block matching algorithm with an efficient hardware design that is able to reduce the computational complexity of motion estimation while providing a sustained and steady coding performance for high-quality video encoding is presented.
Abstract: Variable block size motion estimation has contributed greatly to achieving an optimal interframe encoding, but involves high computational complexity and huge memory access, which is the most critical bottleneck in ultra-high-definition video encoding. This article presents a hardware-efficient block matching algorithm with an efficient hardware design that is able to reduce the computational complexity of motion estimation while providing a sustained and steady coding performance for high-quality video encoding. A three-level memory organization is proposed to reduce memory bandwidth requirement while supporting a predictive common search window. By applying multiple search strategies and early termination, the proposed design provides 1.8 to 3.7 times higher hardware efficiency than other works. Furthermore, on-chip memory has been reduced by 96.5% and off-chip bandwidth requirement has been reduced by 39.4% thanks to the proposed three-level memory organization. The corresponding power consumption is only 198mW at the highest working frequency of 500MHz. The proposed design is attractive for high-quality video encoding in real-time applications with low power consumption.
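The early-termination idea in such block-matching designs can be illustrated in software: stop the search as soon as a candidate's sum of absolute differences (SAD) falls at or below a threshold. The code below is a generic full-search sketch with this optimization, not the article's multi-strategy hardware algorithm, and all names and parameters are illustrative.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2D blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))

def block_match(cur, ref, bx, by, n, search, threshold=0):
    """Full search with early termination around (bx, by) in `ref`.

    Returns the motion vector minimizing SAD, stopping as soon as a
    candidate falls at or below `threshold`.
    """
    cur_blk = [row[bx:bx + n] for row in cur[by:by + n]]
    best, best_mv = float("inf"), (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if not (0 <= x <= len(ref[0]) - n and 0 <= y <= len(ref) - n):
                continue  # candidate block would fall outside the frame
            cand = [row[x:x + n] for row in ref[y:y + n]]
            cost = sad(cur_blk, cand)
            if cost < best:
                best, best_mv = cost, (dx, dy)
                if best <= threshold:     # early termination
                    return best_mv, best
    return best_mv, best

# Reference frame with a bright pair of pixels; in the current frame the
# content has moved right by one pixel.
ref = [[0] * 8 for _ in range(8)]
ref[2][2] = ref[2][3] = 100
cur = [[0] * 8 for _ in range(8)]
cur[2][3] = cur[2][4] = 100
mv, cost = block_match(cur, ref, bx=2, by=2, n=2, search=2)
# mv == (-1, 0): the block's content came from one pixel to the left.
```

Hardware designs like the one above gain their efficiency by pruning most candidates this way while streaming reference pixels through a multi-level on-chip memory hierarchy.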

Journal ArticleDOI
TL;DR: Real-time and embedded systems are shifting from single-core to multi-core processors, on which the software must be parallelized to fully utilize the computation capacity of the hardware.
Abstract: Real-time and embedded systems are shifting from single-core to multi-core processors, on which the software must be parallelized to fully utilize the computation capacity of the hardware. Recently, much work has been done on real-time scheduling of parallel tasks modeled as directed acyclic graphs (DAGs). However, most of these studies assume tasks to have implicit or constrained deadlines. Much less work has considered the general case of arbitrary deadlines (i.e., the relative deadline is allowed to be larger than the period), which is more difficult to analyze due to intra-task interference among jobs. In this article, we study the analysis of Global Earliest Deadline First (GEDF) scheduling for DAG parallel tasks with arbitrary deadlines. We develop new analysis techniques for GEDF scheduling of a single DAG task that guarantee a better capacity augmentation bound of 2.41 (the best previously known result is 2.5). Furthermore, the proposed analysis techniques are extended to the case of multiple DAG tasks under GEDF and federated scheduling. Finally, through empirical evaluation, we show that our schedulability tests generally outperform the state of the art.
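DAG schedulability analyses of this kind are built on two quantities of the task graph: its total work and its critical-path length (span). The sketch below computes both for a small invented DAG and evaluates the classic Graham bound on the makespan of any greedy m-core schedule; this is the textbook baseline that capacity-augmentation analyses refine, not the article's GEDF analysis.

```python
from functools import lru_cache

def dag_work_span(nodes, edges):
    """Work (total WCET) and span (critical-path length) of a DAG task.

    nodes: {name: wcet}; edges: list of (pred, succ) pairs.
    """
    succs = {n: [] for n in nodes}
    for u, v in edges:
        succs[u].append(v)

    @lru_cache(maxsize=None)
    def longest(n):
        # Longest WCET path starting at node n.
        return nodes[n] + max((longest(s) for s in succs[n]), default=0)

    work = sum(nodes.values())
    span = max(longest(n) for n in nodes)
    return work, span

# Diamond-shaped example task: a precedes b and c, which both precede d.
nodes = {"a": 2, "b": 4, "c": 3, "d": 1}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
work, span = dag_work_span(nodes, edges)   # work = 10, span = a->b->d = 7

# Graham's bound: any greedy schedule on m cores finishes within
# work/m + (1 - 1/m) * span.
m = 2
bound = work / m + (1 - 1 / m) * span      # 5 + 3.5 = 8.5
```

A deadline at or beyond this bound is trivially met by any work-conserving scheduler; the contribution of analyses like the article's is to certify much tighter deadlines, including arbitrary ones exceeding the period.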

Journal ArticleDOI
TL;DR: Experimental results show that the proposed hybrid STT-MRAM cache, combined with profiling-based and compiler-level analysis for data re-arrangement, reduces the write energy per access by 49.7% on average, while system performance improves by up to 8.1%.
Abstract: Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is a promising candidate for large on-chip memories as a zero-leakage, high-density, and non-volatile alternative to present SRAM technology. Since memories are the dominant component of a System-on-Chip, the overall performance of the system is highly dependent on them. Nevertheless, the high write energy and latency of emerging STT-MRAM are the most challenging design issues in a modern computing system. By relaxing the non-volatility of these devices, it is possible to reduce the write energy and latency costs at the expense of reduced retention time, which in turn may lead to loss of data. In this article, we propose a hybrid STT-MRAM cache design with different retention capabilities. Then, based on the application requirements (i.e., execution time and memory access rate), the program data layout is re-arranged at compilation time to achieve a fast and energy-efficient hybrid STT-MRAM on-chip memory design with no reliability degradation. The application requirements are defined at function granularity based on profiling and compiler-level analysis, which estimate the required retention time and memory access rate, respectively. Experimental results show that the proposed hybrid STT-MRAM cache, combined with profiling-based and compiler-level analysis for data re-arrangement, reduces the write energy per access by 49.7% on average. At the system level, the overall static and dynamic energy of the cache are reduced by 8.1% and 44%, respectively, while system performance improves by up to 8.1%.

Journal ArticleDOI
TL;DR: This work presents a heuristic for scheduling tasks with potentially imprecise computations, represented with directed acyclic graphs, on multiprocessor platforms, and presents a mixed integer linear program formulation of the same problem, which provides the optimal reference scheduling solutions.
Abstract: Imprecise computations allow scheduling algorithms developed for energy-constrained computing devices to trade off output quality against utilization of system resources. The goal of such scheduling algorithms is to utilize imprecise computations to find a feasible schedule for a given task graph while maximizing the quality of service (QoS) and satisfying a hard deadline and an energy bound. This work presents a heuristic for scheduling tasks with potentially imprecise computations, represented as directed acyclic graphs, on multiprocessor platforms. Furthermore, it presents a mixed integer linear program (MILP) formulation of the same problem, which provides the optimal reference scheduling solutions, enabling evaluation of the efficacy of the proposed heuristic. Both the heuristic and the mathematical program account for the effect of potentially imprecise task inputs on output quality. Furthermore, the presented heuristic is capable of finding feasible schedules even under tight energy budgets. Through extensive experiments, it is shown that in some cases the proposed heuristic finds the same QoS as the MILP. For those task graphs on which the MILP outperforms the proposed heuristic, the QoS values obtained with the heuristic are, on average, within 1.24% of the optimal solutions while improving the runtime by roughly a factor of 100. This clearly demonstrates the advantage of the proposed heuristic over the exact solution, especially for large task graphs, where solving the mathematical program is hampered by its lengthy runtime.

Journal ArticleDOI
TL;DR: A topology-agnostic test mechanism capable of diagnosing, on-line, coexistent channel-short and stuck-at faults in these special NoCs as well as in traditional mesh architectures is proposed, together with an efficient scheduling scheme to reduce test time without compromising resource utilization during testing.
Abstract: High-performance multiprocessor SoCs used in practice require a complex network-on-chip (NoC) as their communication architecture, and the channels therein often suffer from various manufacturing defects. Such physical defects cause a multitude of system-level failures and subsequent degradation of the reliability, yield, and performance of the computing platform. Most existing test approaches consider mesh-based NoC channels only and do not perform well, with regard to test time and overhead, for other regular topologies such as octagons or spidergons. This article proposes a topology-agnostic test mechanism that is capable of diagnosing, on-line, coexistent channel-short and stuck-at faults in these special NoCs as well as in traditional mesh architectures. We introduce a new test model called Damaru to decompose the network and present an efficient scheduling scheme to reduce test time without compromising resource utilization during testing. Additionally, the proposed scheduling scheme scales well with network size, channel width, and topological diversity. Simulation results show that the method achieves nearly 92% fault coverage and improves area overhead by almost 60% and test time by 98% compared to earlier approaches. As a sequel, packet latency and energy consumption are also improved, by 67.05% and 54.69%, respectively, and they improve further with increasing network size.
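The fault classes targeted here can be illustrated with classic channel test patterns. This sketch is not the paper's Damaru model: it only shows why walking-one/walking-zero words detect stuck-at wires and (wired-AND) shorts on an n-bit channel, with the fault models simplified and hypothetical.

```python
# Hedged illustration: walking-one / walking-zero patterns on an n-bit
# channel.  A wire stuck at 0/1, or a short between two wires (modeled
# here as wired-AND), corrupts at least one pattern, so comparing sent
# vs. received words detects the fault.

def walking_patterns(n):
    ones = [1 << i for i in range(n)]        # walking one
    mask = (1 << n) - 1
    zeros = [mask ^ p for p in ones]         # walking zero
    return ones + zeros

def transmit(word, n, stuck_at=None, short=None):
    """Apply a fault model: stuck_at=(bit, value); short=(i, j) wired-AND."""
    bits = [(word >> i) & 1 for i in range(n)]
    if stuck_at:
        bits[stuck_at[0]] = stuck_at[1]
    if short:
        i, j = short
        bits[i] = bits[j] = bits[i] & bits[j]
    return sum(b << i for i, b in enumerate(bits))

def detects(n, **fault):
    return any(transmit(p, n, **fault) != p for p in walking_patterns(n))
```

A fault-free channel passes every pattern; any single stuck-at or pairwise short fails at least one.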

Journal ArticleDOI
TL;DR: E3D-FNC is proposed, an enhanced three-dimensional (3D) floorplanning framework for neuromorphic computing systems, in which neuron clustering and layer assignment are considered interactively; it can achieve highly hardware-efficient designs compared to the state of the art.
Abstract: In recent years, neuromorphic computing systems (NCSs) based on memristive crossbars have provided a promising solution for accelerating neural networks. However, most neural networks used in realistic applications are sparse. If such a sparse neural network is directly implemented on a single memristive crossbar, it results in inefficient hardware realizations. In this work, we propose E3D-FNC, an enhanced three-dimensional (3D) floorplanning framework for neuromorphic computing systems, in which neuron clustering and layer assignment are considered interactively. First, in each iteration, hierarchical clustering partitions neurons into a set of clusters under the guidance of the proposed distance metric. The optimal number of clusters is determined by the L-method. Then matrix re-ordering is applied to re-arrange the columns of the weight matrix in each cluster. As a result, the reordered connection matrix can be easily mapped onto a set of crossbars with high utilization. Next, since the clustering results in turn affect the floorplan, we perform the floorplanning of neurons and crossbars again. All the proposed methodologies are embedded in an iterative framework to improve the quality of the NCS design. Finally, a 3D floorplan of the neuromorphic computing system is generated. Experimental results show that E3D-FNC achieves highly hardware-efficient designs compared to the state of the art.
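The matrix re-ordering idea can be sketched with a toy greedy pass. This is not the E3D-FNC algorithm (which uses hierarchical clustering, the L-method, and an iterative floorplan loop); it only illustrates why grouping columns that share nonzero rows raises crossbar utilization. The matrix and tile width are invented.

```python
# Illustrative sketch: greedily place each next column so that it shares as
# many nonzero rows as possible with the previous one, then map fixed-width
# column tiles to crossbars.  Grouping similar columns shrinks the row span
# each crossbar must cover, raising cell utilization.

def reorder(cols):
    """cols: list of frozensets of nonzero row indices; returns an ordering."""
    remaining = list(range(len(cols)))
    order = [remaining.pop(0)]
    while remaining:
        last = cols[order[-1]]
        nxt = max(remaining, key=lambda c: len(cols[c] & last))
        remaining.remove(nxt)
        order.append(nxt)
    return order

def utilization(cols, order, tile_width):
    """Fraction of allocated crossbar cells actually holding weights."""
    used = total = 0
    for t in range(0, len(order), tile_width):
        tile = [cols[c] for c in order[t:t + tile_width]]
        rows = set().union(*tile)                 # row span of this crossbar
        used += sum(len(c) for c in tile)
        total += len(rows) * tile_width
    return used / total

cols = [frozenset(s) for s in
        [{0, 1}, {4, 5}, {0, 2}, {4, 6}, {1, 2}, {5, 6}]]
before = utilization(cols, list(range(6)), tile_width=2)
after = utilization(cols, reorder(cols), tile_width=2)
```

On this toy matrix, the original column order interleaves two row groups, while re-ordering keeps each crossbar's rows compact.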

Journal ArticleDOI
TL;DR: This work demonstrates that electrostatic coupling significantly degrades the performance of M3D NoCs and, as a countermeasure, advocates an electrostatic coupling-aware M3D NoC design methodology; experimental results show that the coupling-aware M3D NoC reduces the performance penalty by significantly lowering the number of multi-tier routers.
Abstract: Monolithic 3D integration (M3D) improves the performance and energy efficiency of 3D ICs over conventional through-silicon-via-based counterparts. The small dimensions of monolithic inter-tier vias offer high-density integration and the flexibility of partitioning logic blocks across multiple tiers, while the significantly reduced total wire-length enables high performance and energy efficiency. However, the performance of M3D ICs degrades due to electrostatic coupling when the inter-layer-dielectric thickness between two adjacent tiers is less than 50nm. In this work, we evaluate the performance of an M3D-enabled network-on-chip (NoC) architecture in the presence of electrostatic coupling. Electrostatic coupling induces significant delay and energy overheads for the multi-tier NoC routers. This in turn results in considerable performance degradation if the NoC design methodology does not incorporate the effects of electrostatic coupling. We demonstrate that electrostatic coupling degrades the energy-delay-product of an M3D NoC by 18.1%, averaged over eight different applications from the SPLASH-2 and PARSEC benchmark suites. As a countermeasure, we advocate the adoption of an electrostatic coupling-aware M3D NoC design methodology. Experimental results show that the coupling-aware M3D NoC reduces the performance penalty by significantly lowering the number of multi-tier routers.

Journal ArticleDOI
Xu He, Deng Yu, Shizhe Zhou, Li Rui, Yao Wang, Yang Guo
TL;DR: A Fast Fourier Transform-based feature extraction method is applied that can compress a large-scale layout to a multi-dimensional representation of much smaller size while preserving the discriminative layout pattern information, thereby improving detection efficiency.
Abstract: With the increasing gap between transistor feature size and lithography manufacturing capability, the detection of lithography hotspots has become a key stage of the physical verification flow for enhancing manufacturing yield. Although machine learning approaches are distinguished by their high detection efficiency, they still suffer from problems such as large-scale layouts and class imbalance. In this article, we develop a high-performance hotspot detection model based on machine learning. In the proposed model, we first apply a Fast Fourier Transform (FFT)-based feature extraction method that can compress a large-scale layout to a multi-dimensional representation of much smaller size while preserving the discriminative layout pattern information, improving detection efficiency. Second, to address the class imbalance problem, we propose a new technique called imbalanced learning rate and embed it into a convolutional neural network model to further reduce false alarms without accuracy decay. Compared with the results of current state-of-the-art approaches on the ICCAD 2012 Contest benchmarks, our proposed model achieves better solutions in many evaluation metrics, including the official metrics.
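The FFT compression step can be sketched in a few lines. This is a hedged illustration of the general idea only: the clip size, the number of retained coefficients, and the use of raw low-frequency magnitudes are assumptions, and the paper's exact encoding may differ.

```python
# Hedged sketch of FFT-based layout feature compression: take the 2D DFT of
# a binary layout clip and keep only the k x k lowest-frequency magnitude
# coefficients as a compact feature vector.  Low frequencies capture the
# coarse pattern; dropping high frequencies shrinks the representation.
import numpy as np

def fft_features(clip, k=8):
    """clip: 2D 0/1 array; returns a k*k low-frequency feature vector."""
    spectrum = np.abs(np.fft.fft2(clip))
    return spectrum[:k, :k].flatten()        # low-frequency corner

rng = np.random.default_rng(0)
clip = (rng.random((256, 256)) < 0.3).astype(float)
feat = fft_features(clip, k=8)
```

A 256x256 clip (65,536 values) collapses to 64 features; the DC coefficient `feat[0]` equals the total pattern area, a sanity check on the transform.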

Journal ArticleDOI
TL;DR: A novel obfuscation-based approach to achieve strong resistance against both random and targeted pre-configuration tampering of critical functions in an FPGA design and a redundancy-based technique is proposed to thwart targeted, rule-based, and random tampering.
Abstract: Field Programmable Gate Arrays (FPGAs) have become an attractive choice for diverse applications due to their reconfigurability and unique security features. However, designs mapped to FPGAs are prone to malicious modifications or tampering of critical functions. Besides, targeted modifications have demonstrably compromised FPGA implementations of various cryptographic primitives. Existing security measures based on encryption and authentication can be bypassed through their side-channel vulnerabilities to execute bitstream tampering attacks. Furthermore, numerous resource-constrained applications are now equipped with low-end FPGAs, which may not support power-hungry cryptographic solutions. In this article, we propose a novel obfuscation-based approach to achieve strong resistance against both random and targeted pre-configuration tampering of critical functions in an FPGA design. Our solution first identifies the unique structural and functional features that separate the critical function from the rest of the design using a machine learning guided framework. The selected features are eliminated by applying appropriate obfuscation techniques, many of which take advantage of “FPGA dark silicon”—unused lookup table resources—to mask the critical functions. Furthermore, following the same obfuscation principle, a redundancy-based technique is proposed to thwart targeted, rule-based, and random tampering. We have developed a complete methodology and custom software toolflow that integrates with commercial tools. By applying the masking technique to a design containing AES, we show the effectiveness of the proposed framework in hiding the critical S-Box function. We implement the redundancy-integrated solution in various cryptographic designs to analyze the overhead. To protect a critical component comprising 16.2% of a design, the proposed approach incurs an average area overhead of only 2.4% over similar redundancy-based approaches, while achieving strong security.

Journal ArticleDOI
TL;DR: An algorithm for integrated timing-driven latch placement and cloning is presented, given a circuit placement, such that relocation and cloning are applied to some latches together with their neighbor logic gates.
Abstract: This article presents an algorithm for integrated timing-driven latch placement and cloning. Given a circuit placement, the proposed algorithm relocates some latches so that circuit timing is improved. Some latches are replicated to further improve the timing; the number of replicated latches, along with their locations, is determined automatically. After latch cloning, each replicated latch is set to drive a subset of the fanouts that had been driven by the original single latch. The proposed algorithm is then extended such that relocation and cloning are applied to some latches together with their neighboring logic gates. Experimental results demonstrate that the worst negative slack and the total negative slack are improved by 24% and 59%, respectively, on average across the test circuits. The negative impacts on circuit area and power consumption are both marginal, at 0.7% and 1.9%, respectively.
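The fanout-partitioning step after cloning can be sketched simply. This is an illustration only: the assignment rule shown (each fanout goes to the nearest clone by Manhattan distance) is a plausible simplification, not the paper's timing-driven formulation, and all names and coordinates are invented.

```python
# Simplified sketch of the post-cloning step: after a latch is replicated,
# each fanout pin is assigned to the clone closest to it, shortening the
# driven nets and hence helping timing on the critical paths.

def assign_fanouts(clones, fanouts):
    """clones/fanouts: {name: (x, y)} -> {fanout: clone} by Manhattan distance."""
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    return {f: min(clones, key=lambda c: dist(clones[c], p))
            for f, p in fanouts.items()}

assignment = assign_fanouts(
    clones={"L1": (0, 0), "L1_clone": (10, 0)},
    fanouts={"g1": (1, 1), "g2": (9, 2), "g3": (4, 0)})
```

Fanouts near the original latch stay on it; the far fanout moves to the clone, splitting the original net into two shorter ones.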

Journal ArticleDOI
TL;DR: A dilution PMD algorithm and its architectural mapping scheme for addressing the fluidic cells of such a device to perform dilution of a reagent fluid on-chip are proposed; simulation results show that the proposed DPMD scheme is comparable to the existing state-of-the-art dilution algorithm.
Abstract: The microfluidic lab-on-a-chip has emerged as a new technology for implementing biochemical protocols on small-sized portable devices targeting low-cost medical diagnostics. Among various efforts toward the fabrication of such chips, a relatively new technology is the programmable microfluidic device (PMD) for the implementation of flow-based lab-on-a-chip systems. A PMD chip is suitable for automation due to its symmetric nature. In order to implement a bioprotocol on such a reconfigurable device, it is crucial to automate sample preparation on-chip as well. In this article, we propose a dilution algorithm for PMDs (namely DPMD) and its architectural mapping scheme (namely the generalized architectural mapping algorithm, GAMA) for addressing the fluidic cells of such a device to perform dilution of a reagent fluid on-chip. We use an optimization function that first minimizes the number of mixing steps and then reduces waste generation and further reagent requirements. Simulation results show that the proposed DPMD scheme is comparable to the existing state-of-the-art dilution algorithm. The proposed design automation using the architectural mapping scheme reduces the required chip area and hence minimizes valve switching, which in turn increases the life span of the PMD chip.
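The mix-split style of dilution such algorithms build on can be sketched in a few lines. DPMD and GAMA are not reproduced here; this shows only the standard (1:1) bit-scanning dilution idea, where a target concentration k/2^d is reached in d mixing steps.

```python
# Hedged sketch of (1:1) mix-split dilution: a target concentration k/2**d
# is reached in d mix steps by scanning the binary representation of k
# from LSB to MSB.  Each step mixes the current droplet with pure reagent
# (bit 1, concentration 1) or buffer (bit 0, concentration 0) and halves it.

def dilution_steps(k, d):
    """Return the list of intermediate concentrations reaching k / 2**d."""
    conc = 0.0
    trace = []
    for i in range(d):                 # bits of k, LSB first
        bit = (k >> i) & 1
        conc = (conc + bit) / 2        # 1:1 mix with reagent (1) or buffer (0)
        trace.append(conc)
    return trace

trace = dilution_steps(3, 3)           # target 3/8 = 0.375
```

Three mix steps suffice for a precision of 2^-3; schemes like DPMD then optimize how such steps map onto the chip's fluidic cells.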

Journal ArticleDOI
TL;DR: This work proposes to “lock” biochemical assays by inserting dummy mix-split operations, experimentally evaluates the proposed locking mechanism, and shows how a high level of protection can be achieved even on bioassays with low complexity.
Abstract: It is expected that as digital microfluidic biochips (DMFBs) mature, the hardware design flow will begin to resemble current practice in the semiconductor industry: design teams send chip layouts to third-party foundries for fabrication. These foundries are untrusted and threaten to steal valuable intellectual property (IP). In a DMFB, the IP consists not only of hardware layouts but also of the biochemical assays (bioassays) that are intended to be executed on-chip. DMFB designers therefore must defend these protocols against theft. We propose to “lock” biochemical assays by inserting dummy mix-split operations. We experimentally evaluate the proposed locking mechanism and show how a high level of protection can be achieved even on bioassays with low complexity. We also demonstrate a new class of attacks that exploit side-channel information to launch sophisticated attacks on the locked bioassay.
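Why dummy mix-split operations can lock an assay without corrupting it can be shown with a toy model. This sketch is not the paper's full locking scheme: it only demonstrates that mixing two droplets of equal concentration and splitting them back yields the same two droplets, so such operations are functionally invisible yet pad the observable operation sequence.

```python
# Toy model of dummy mix-split locking: a dummy operation on two droplets
# of equal concentration leaves both unchanged, so the assay's outcome is
# preserved while the operation trace an attacker sees is inflated.

def mix_split(c1, c2):
    """1:1 mix of two unit droplets, then a balanced split."""
    m = (c1 + c2) / 2
    return m, m

def run(ops, droplets):
    """ops: list of (i, j, is_dummy) mix-split operations on droplet list."""
    real_ops = 0
    for i, j, is_dummy in ops:
        droplets[i], droplets[j] = mix_split(droplets[i], droplets[j])
        if not is_dummy:
            real_ops += 1
    return droplets, real_ops

# One real mix of reagent (1.0) with buffer (0.0), padded with dummy
# mix-splits on the now equal-concentration droplet pair.
droplets, real_ops = run(
    [(0, 1, False), (0, 1, True), (0, 1, True)], [1.0, 0.0])
```

The final concentrations are identical to the unlocked single-mix assay, while the trace contains three operations instead of one.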

Journal ArticleDOI
TL;DR: This work proposes a sensitivity analysis method for data and branches in a program to identify the data load and branch instructions that can be executed without any rollback in the pipeline and yet can ensure a certain user-specified quality of service of the application with a probabilistic reliability.
Abstract: Speculative execution is an optimization technique used in modern processors by which predicted instructions are executed in advance with the objective of overlapping the latencies of slow operations. Branch prediction and load value speculation are examples of speculative execution used in modern pipelined processors to avoid execution stalls. However, speculative execution incurs a performance penalty in the form of an execution rollback when there is a misprediction. In this work, we propose to aid speculative execution with approximate computing by relaxing the execution rollback penalty associated with a misprediction. We propose a sensitivity analysis method for data and branches in a program to identify the data load and branch instructions that can be executed without any rollback in the pipeline while still ensuring, with probabilistic reliability, a certain user-specified quality of service for the application. Our analysis is based on statistical methods, particularly hypothesis testing and Bayesian analysis. We perform an architectural simulation of the proposed approximate execution and report the benefits in terms of CPU cycles and energy utilization on selected applications from the AxBench, ACCEPT, and PARSEC 3.0 benchmark suites.
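The hypothesis-testing flavor of such a sensitivity analysis can be sketched as an exact binomial test. This is a hedged stand-in for the paper's procedure (which combines hypothesis testing with Bayesian analysis): the decision rule, thresholds, and counts below are all illustrative.

```python
# Hedged sketch of the statistical idea: from n profiled executions with
# x quality violations, approve skipping the rollback for an instruction
# only if we can reject H0: violation_rate >= p_max at significance alpha,
# using an exact one-sided binomial test.
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1))

def approve_approximation(violations, trials, p_max=0.05, alpha=0.01):
    """True iff observing so few violations is implausible under rate p_max."""
    return binom_cdf(violations, trials, p_max) < alpha

ok = approve_approximation(violations=2, trials=200, p_max=0.05)
bad = approve_approximation(violations=15, trials=200, p_max=0.05)
```

Two violations in 200 runs are strong evidence the true violation rate is below 5%, so the rollback can be skipped; fifteen violations are not, so the instruction keeps its rollback.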

Journal ArticleDOI
TL;DR: A functional model for the SADF MoC, as well as a set of abstract operations for simulating it, is introduced, implemented as an open source library in the functional framework ForSyDe.
Abstract: The tradeoff between analyzability and expressiveness is a key factor when choosing a suitable dataflow model of computation (MoC) for designing, modeling, and simulating applications on a formal basis. A large number of techniques and analysis tools exist for static dataflow models, such as synchronous dataflow. However, they cannot express the dynamic behavior required for more dynamic applications in signal streaming, or model runtime-reconfigurable systems. On the other hand, dynamic dataflow models like Kahn process networks sacrifice analyzability for expressiveness. Scenario-aware dataflow (SADF) is an excellent tradeoff, providing sufficient expressiveness for dynamic systems while still giving access to powerful analysis methods. In spite of an increasing interest in SADF methods, there is a lack of formally defined functional models for describing and simulating SADF systems. This article overcomes the current situation by introducing a functional model for the SADF MoC, as well as a set of abstract operations for simulating it. We present the first modeling and simulation tool for SADF so far, implemented as an open source library in the functional framework ForSyDe. We demonstrate the capabilities of the functional model through a comprehensive tutorial-style example of a RISC processor described as an SADF application, and a traditional streaming application in which we model an MPEG-4 simple profile decoder. We also present a couple of alternative approaches for functionally modeling SADF in different languages and paradigms. One of these approaches is used in a performance comparison with our functional model, using the MPEG-4 simple profile decoder as a test case. As a result, our proposed model presents a good tradeoff between execution time and implementation succinctness. Finally, we discuss the potential of our formal model as a frontend for formal system design flows regarding dynamic applications.
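The SADF idea itself can be sketched in a few lines. The paper's tool is implemented in the ForSyDe functional framework, not Python; this toy merely illustrates the MoC's key mechanism, with invented scenario names and rates: a detector's control token selects a scenario, and each scenario fixes the kernel's token rates and function, so per-scenario behavior stays statically analyzable while the detector provides the dynamism.

```python
# Minimal SADF-style kernel: the scenario chosen by a control token fixes
# both the number of data tokens consumed per firing and the function
# applied, mimicking how SADF confines dynamism to scenario switches.

SCENARIOS = {
    # scenario: (data tokens consumed per firing, function on that window)
    "low":  (1, lambda xs: xs[0]),
    "high": (2, lambda xs: sum(xs)),
}

def sadf_kernel(control, data):
    """Fire once per control token; rates and function depend on the scenario."""
    out = []
    data = list(data)
    for scenario in control:
        rate, fn = SCENARIOS[scenario]
        window, data = data[:rate], data[rate:]
        out.append(fn(window))
    return out

out = sadf_kernel(["low", "high", "low"], [1, 2, 3, 4])
```

Within one scenario the kernel behaves like a fixed-rate (synchronous) dataflow actor, which is what keeps SADF amenable to static analysis per scenario.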

Journal ArticleDOI
TL;DR: This work provides a framework that given a gate-level netlist of a design and a block diagram for the design, outputs a matching between the partitions of the circuit and blocks in the block diagram, which can be analyzed for malicious insertions with much reduced complexity.
Abstract: Contemporary integrated circuits (ICs) are increasingly being constructed using intellectual property blocks (IPs) obtained from third parties in a globalized supply chain. The increased vulnerability to adversarial changes during this untrusted supply chain raises concerns about the integrity of the end product. The difference in the levels of abstraction between the initial specification and the final available circuit design poses a challenge for analyzing the final circuit for malicious insertions.Reverse engineering presents one way to help reduce the difficulty of circuit analysis and inspection. In this work, we provide a framework that given (i) a gate-level netlist of a design and (ii) a block diagram for the design with relative sizes of the blocks, outputs a matching between the partitions of the circuit and blocks in the block diagram. We first compute a geometric embedding for each node in the circuit and then apply a clustering algorithm on the embedding features to obtain circuit partitions. Each partition is then mapped to the high-level blocks in the block diagram. These partitions can then be further analyzed for malicious insertions with much reduced complexity in comparison with the full chip. We tested our algorithm on different designs with varying sizes to evaluate the efficacy of algorithm, including the open-source processor OpenSparc T1, and showed that we can successfully match over 90% of gates to their corresponding blocks.