
Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2017"


Journal ArticleDOI
TL;DR: This paper designs deep learning accelerator unit (DLAU), which is a scalable accelerator architecture for large-scale deep learning networks using field-programmable gate array (FPGA) as the hardware prototype and employs three pipelined processing units to improve the throughput.
Abstract: As an emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. However, network sizes are becoming increasingly large due to the demands of practical applications, which poses a significant challenge to constructing high-performance implementations of deep learning neural networks. In order to improve performance while maintaining low power cost, in this paper we design the deep learning accelerator unit (DLAU), which is a scalable accelerator architecture for large-scale deep learning networks using a field-programmable gate array (FPGA) as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve throughput and utilizes tile techniques to exploit locality for deep learning applications. Experimental results on a state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator achieves up to $36.1 {\times }$ speedup compared with Intel Core2 processors, with a power consumption of 234 mW.
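The tile technique mentioned in the abstract can be illustrated with a plain software sketch. This is a software analogue only, not the DLAU hardware; the function name and tile size are illustrative:

```python
import numpy as np

def tiled_matvec(W, x, tile=4):
    """Matrix-vector product computed tile by tile.

    Processing a small block of rows and columns at a time bounds the
    working set, which is what lets an accelerator keep the active
    data in a small on-chip buffer and exploit locality.
    """
    n, m = W.shape
    y = np.zeros(n)
    for i in range(0, n, tile):          # one tile of output rows
        for j in range(0, m, tile):      # one tile of input columns
            y[i:i + tile] += W[i:i + tile, j:j + tile] @ x[j:j + tile]
    return y
```

The result matches an untiled product; only the access pattern changes.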

268 citations


Journal ArticleDOI
TL;DR: This paper identifies a combination of characteristics that define the challenges unique to the design automation of CPS, and presents selected promising advances in depth, focusing on four foundational directions: combining model-based and data-driven design methods; design for human-in-the-loop systems; component-based design with contracts; and design for security and privacy.
Abstract: A cyber-physical system (CPS) is an integration of computation with physical processes whose behavior is defined by both computational and physical parts of the system. In this paper, we present a view of the challenges and opportunities for design automation of CPS. We identify a combination of characteristics that define the challenges unique to the design automation of CPS. We then present selected promising advances in depth, focusing on four foundational directions: combining model-based and data-driven design methods; design for human-in-the-loop systems; component-based design with contracts; and design for security and privacy. These directions are illustrated with examples from two application domains: smart energy systems and next-generation automotive systems.

119 citations


Journal ArticleDOI
TL;DR: A statistically rigorous and novel methodology for building accurate run-time power models using performance monitoring counters (PMCs) for mobile and embedded devices, and how these models make more efficient use of limited training data and better adapt to unseen scenarios by uniquely considering stability is presented.
Abstract: Modern mobile and embedded devices are required to be increasingly energy-efficient while running more sophisticated tasks, causing the CPU design to become more complex and employ more energy-saving techniques. This has created a greater need for fast and accurate power estimation frameworks for both run-time CPU energy management and design-space exploration. We present a statistically rigorous and novel methodology for building accurate run-time power models using performance monitoring counters (PMCs) for mobile and embedded devices, and demonstrate how our models make more efficient use of limited training data and better adapt to unseen scenarios by uniquely considering stability. Our robust model formulation reduces multicollinearity, allows separation of static and dynamic power, and allows a $100{\times }$ reduction in experiment time while sacrificing only 0.6% accuracy. We present a statistically detailed evaluation of our model, highlighting and addressing the problem of heteroscedasticity in power modeling. We present software implementing our methodology and build power models for ARM Cortex-A7 and Cortex-A15 CPUs, with 3.8% and 2.8% average error, respectively. We model the behavior of the nonideal CPU voltage regulator under dynamic CPU activity to improve modeling accuracy by up to 5.5% in situations where the voltage cannot be measured. To address the lack of research utilizing PMC data from real mobile devices, we also present our data acquisition method and experimental platform software. We support this paper with online resources including software tools, documentation, raw data and further results.
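The regression-based PMC power modeling described above can be sketched in a few lines. The data here is synthetic and the event names and coefficients are invented for illustration; the fitted intercept plays the role of static power:

```python
import numpy as np

# Hypothetical PMC event rates per sample (e.g. cycles, cache misses,
# bus accesses) and the power they induce, in watts.
rng = np.random.default_rng(0)
pmc = rng.random((100, 3))
true_w = np.array([1.2, 0.4, 0.8])
power = 0.5 + pmc @ true_w            # 0.5 W static + dynamic part

# Fit power ~ b0 + b1*e1 + b2*e2 + b3*e3 by least squares;
# the intercept b0 separates static from dynamic power.
X = np.column_stack([np.ones(len(pmc)), pmc])
coef, *_ = np.linalg.lstsq(X, power, rcond=None)
static_power, dynamic_weights = coef[0], coef[1:]
```

With noiseless data the fit recovers the generating coefficients exactly; the paper's methodology additionally screens events for stability and multicollinearity before fitting.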

94 citations


Journal ArticleDOI
TL;DR: A Memristor-based dynamic (MD) synapse design with experiment-calibrated memristor models is proposed and a temporal pattern learning application was investigated to evaluate the use of MD synapses in spiking neural networks, under both spike-timing-dependent plasticity and remote supervised method learning rules.
Abstract: Recent advances in memristor technology lead to the feasibility of large-scale neuromorphic systems by leveraging the similarity between memristor devices and synapses. For instance, memristor cross-point arrays can realize a dense synapse network among hundreds of neuron circuits, which is not affordable for traditional implementations. However, little progress has been made in synapse designs that support both static and dynamic synaptic properties. In addition, many neuron circuits require signals in a specific pulse shape, limiting the scale of system implementation. Last but not least, a bottom-up study starting from realistic memristor devices is still missing in the current research on memristor-based neuromorphic systems. Here, we propose a memristor-based dynamic (MD) synapse design with experiment-calibrated memristor models. The structure obtains both static and dynamic synaptic properties by using one memristor for weight storage and the other as a selector. We overcame the device nonlinearities and demonstrated spike-timing-based recall, weight tunability, and spike-timing-based learning functions on the MD synapse. Furthermore, a temporal pattern learning application was investigated to evaluate the use of MD synapses in spiking neural networks, under both spike-timing-dependent plasticity and remote supervised method learning rules.
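The spike-timing-dependent plasticity rule referenced above has a standard pair-based form that a short sketch captures. The time constants and amplitudes are generic textbook values, not the paper's calibrated device parameters:

```python
import math

def stdp_update(w, dt, a_plus=0.1, a_minus=0.12, tau=20.0, w_max=1.0):
    """Pair-based STDP. dt = t_post - t_pre in milliseconds.

    Pre-before-post (dt > 0) potentiates the synapse, post-before-pre
    depresses it, and the magnitude decays exponentially with |dt|.
    The weight is clamped to [0, w_max].
    """
    if dt > 0:
        w += a_plus * math.exp(-dt / tau)
    else:
        w -= a_minus * math.exp(dt / tau)
    return min(max(w, 0.0), w_max)
```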

83 citations


Journal ArticleDOI
TL;DR: This paper expands previous work on using incremental SAT solving to reconstruct the logical function of a circuit with camouflaged components and shows that this technique, previously applied only to a particular style of gate camouflaging, is general and can be used to deobfuscate three different proposed styles of camouflaging.
Abstract: Layout-level gate or routing camouflaging techniques have attracted interest as countermeasures against reverse engineering of combinational logic. In order to minimize area overhead, typically only a subset of gate or routing components are camouflaged, and each camouflaged component layout can implement one of a few different functions or connections. The security of camouflaging relies on the difficulty of learning the overall combinational logic function without knowing the functions implemented by the individual camouflaged components of the circuit. In this paper, we expand our previous work on using incremental SAT solving to reconstruct the logical function of a circuit with camouflaged components. Our algorithm uses the standard attacker model in which an adversary knows only the noncamouflaged component functions, and has the ability to query the circuit to learn the correct output vector for any input vector. Our results demonstrate a $10.5\times$ speedup in average runtime over the best known existing deobfuscation algorithm prior to this technique. The results presented go beyond our previous work by showing that this technique, previously applied only to a particular style of gate camouflaging, is general and can be used to deobfuscate three different proposed styles of camouflaging. We give results to quantify the effectiveness of camouflaging techniques on a variety of ISCAS-85 benchmark circuits.
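The oracle-guided query loop underlying such deobfuscation attacks can be sketched without a SAT solver by replacing the solver with plain enumeration. This is a toy sketch: the real attack encodes the camouflaged netlist as incremental SAT constraints rather than listing candidate functions explicitly:

```python
from itertools import product

def deobfuscate(candidate_fns, oracle, n_inputs):
    """Eliminate candidate circuit functions that disagree with the oracle.

    Each query picks an input on which two surviving candidates differ
    (a "distinguishing input"), asks the oracle for the true output, and
    discards every candidate that got it wrong -- the same loop a
    SAT-based attack runs, with the solver replaced by enumeration.
    """
    survivors = list(candidate_fns)
    for x in product([0, 1], repeat=n_inputs):
        if len({f(*x) for f in survivors}) > 1:   # x distinguishes them
            truth = oracle(*x)
            survivors = [f for f in survivors if f(*x) == truth]
        if len(survivors) == 1:
            break
    return survivors
```

For example, if the camouflaged gate could implement AND, OR, or XOR, two queries suffice to identify XOR.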

66 citations


Journal ArticleDOI
TL;DR: A machine-learning-inspired predictive design methodology for energy-efficient and reliable many-core architectures enabled by 3-D integration and a computationally efficient spare-vertical link (sVL) allocation algorithm based on a state-space search formulation are proposed.
Abstract: A 3-D network-on-chip (NoC) enables the design of high performance and low power many-core chips. Existing 3-D NoCs are inadequate for meeting the ever-increasing performance requirements of many-core processors since they are simple extensions of regular 2-D architectures and they do not fully exploit the advantages provided by 3-D integration. Moreover, the anticipated performance gain of a 3-D NoC-enabled many-core chip may be compromised due to the potential failures of through-silicon-vias that are predominantly used as vertical interconnects in a 3-D IC. To address these problems, we propose a machine-learning-inspired predictive design methodology for energy-efficient and reliable many-core architectures enabled by 3-D integration. We demonstrate that a small-world network-based 3-D NoC (3-D SWNoC) performs significantly better than its 3-D MESH-based counterparts. On average, the 3-D SWNoC shows 35% energy-delay-product improvement over 3-D MESH for the PARSEC and SPLASH2 benchmarks considered in this paper. To improve the reliability of 3-D NoC, we propose a computationally efficient spare-vertical link (sVL) allocation algorithm based on a state-space search formulation. Our results show that the proposed sVL allocation algorithm can significantly improve the reliability as well as the lifetime of 3-D SWNoC.

65 citations


Journal ArticleDOI
TL;DR: A novel HLS methodology for constraint-driven, low-cost, hardware-Trojan-secured DMR schedule design for loop-based control data flow graphs (CDFGs), with experimental results over standard benchmarks indicating an average reduction in final cost of ~12% compared to a recent approach.
Abstract: Security against a hardware Trojan that is capable of changing the computational output value is accomplished by employing a dual modular redundant (DMR) schedule during high level synthesis (HLS). However, building a DMR for Trojan security is nontrivial and incurs extra delay and hardware. This paper proposes a novel HLS methodology for constraint-driven, low-cost, hardware-Trojan-secured DMR schedule design for loop-based control data flow graphs (CDFGs). The proposed approach simultaneously explores an optimal schedule and optimal loop unrolling factor (U) combination for a low-cost Trojan-security-aware DMR schedule. As a specific example, the proposed low-cost Trojan-secured HLS approach relies on a particle swarm optimization algorithm to explore an optimized Trojan-secured schedule with optimal unrolling that provides security against a specific Trojan (one causing a change in computational output) within user-provided area and delay constraints. The novel contributions of this paper are: first, an exploration of a low-cost Trojan-security-aware HLS solution for loop-based CDFGs; second, a proposed encoding scheme for representing a design solution comprising candidate schedule resources, candidate loop unrolling factor, and candidate vendor allocation information; third, a process for exploring a low-cost vendor assignment that provides Trojan security; finally, experimental results over standard benchmarks that indicate an average reduction in final cost of ~12% compared to a recent approach.

62 citations


Journal ArticleDOI
TL;DR: This work proposes effective algorithms for exact synthesis of Boolean logic networks using satisfiability modulo theories (SMTs) solvers and uses majority-inverter graphs (MIGs) as the underlying logic representation, as they are simple and expressive at the same time.
Abstract: We propose effective algorithms for exact synthesis of Boolean logic networks using satisfiability modulo theories (SMTs) solvers. Since exact synthesis is a difficult problem, it can only be applied efficiently to very small functions, having up to six variables. Key in our approach is to use majority-inverter graphs (MIGs) as underlying logic representation as they are simple (homogeneous logic representation) and expressive (contain AND/OR-inverter graphs) at the same time. This has a positive impact on the problem formulation: it simplifies the encoding as SMT constraints and also allows for various techniques to break symmetries in the search space due to the regular data structure. Our algorithm optimizes with respect to the MIG’s size or depth and uses different ways to encode the problem and several methods to improve solving time, with symmetry breaking techniques being the most effective ones. We discuss several applications of exact synthesis and motivate them by experiments on a set of large arithmetic benchmarks. Using the proposed techniques, we are able to improve both area and delay after lookup table (LUT)-based technology mapping beyond the current results achieved by state-of-the-art logic synthesis algorithms.
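The expressiveness claim for MIGs rests on a simple identity that a short check makes concrete: a majority gate with one input tied to a constant degenerates to AND or OR, so MIGs contain AND/OR-inverter graphs. This is illustrative code, not the paper's SMT encoding:

```python
def maj(a, b, c):
    """3-input majority gate: outputs 1 iff at least two inputs are 1."""
    return 1 if a + b + c >= 2 else 0

# The key identities behind MIG expressiveness:
#   maj(a, b, 0) == a AND b
#   maj(a, b, 1) == a OR b
```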

59 citations


Journal ArticleDOI
TL;DR: Results show that compared to the traditional static and centralized energy-management system (EMS), and the recent multiagent EMS using price-demand competition, the proposed uncertainty-aware MG-EMS can achieve up to $50\times $ and $145\times $ utilization rate improvements, respectively, as well as balanced energy allocation improvements.
Abstract: This paper presents a cyber-physical management of smart buildings based on a smart-gateway network with distributed and real-time energy data collection and analytics. We consider a building with multiple rooms supplied by one main electricity grid and one additional solar energy grid. Based on the smart-gateway network, energy signatures of rooms are first extracted with consideration of uncertainty and further classified as different types of agents. Then, a multiagent minority-game (MG)-based demand-response management is introduced to reduce peak demand on the main electricity grid and also to fairly allocate solar energy on the additional grid. Experiment results show that compared to the traditional static and centralized energy-management system (EMS), and the recent multiagent EMS using price-demand competition, the proposed uncertainty-aware MG-EMS can achieve up to $50\times $ and $145\times $ utilization rate improvements, respectively, with regard to the fairness of solar energy resource allocation. More importantly, the peak load from the main electricity grid is reduced by 38.50% in summer and 15.83% in winter based on benchmarked energy data of the building. Lastly, an average of 23% uncertainty can be reduced, with a corresponding 37% improvement in balanced energy allocation, compared to the MG-EMS without consideration of uncertainty.
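The minority-game mechanism referenced above reduces to one piece of logic: agents on the less-crowded side win. This is a toy sketch; the paper's agents additionally model uncertainty and room energy signatures:

```python
def minority_winners(choices):
    """Given each agent's binary grid choice, return the minority side.

    In a minority game, agents on the less-crowded side win -- here,
    fewer agents drawing from a grid means less contention on it.
    Returns None on a tie (no minority exists).
    """
    ones = sum(choices)
    zeros = len(choices) - ones
    if ones == zeros:
        return None
    return 1 if ones < zeros else 0
```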

55 citations


Journal ArticleDOI
TL;DR: This paper uses the OpenSPARC T2 SoC as a case study, implements it in a 28-nm fully depleted silicon on insulator foundry process, and demonstrates that it can achieve up to 12% and 8% power savings for a single block and SoC, respectively, when compared with their 2-D counterparts implemented using commercial tools.
Abstract: Monolithic 3-D (M3D) integrated circuits (ICs) are an emerging technology that offer much higher integration densities than previous 3-D IC approaches. In this paper, we present a complete netlist-to-layout design flow to design an M3D block, as well as to integrate 2-D and 3-D blocks into an M3D SoC. This design flow is based on commercial tools built for 2-D ICs, and enhanced with our 3-D specific methodologies. We use the OpenSPARC T2 SoC as a case study, implement it in a 28-nm fully depleted silicon on insulator foundry process, and demonstrate that we can achieve up to 12% and 8% power savings for a single block and SoC, respectively, when compared with their 2-D counterparts implemented using commercial tools.

51 citations


Journal ArticleDOI
TL;DR: This paper presents a debugging architecture, which automatically records key hardware signals, and relates them back to the original software source code, and allows designers to debug HLS circuits in-system, in the context of the original source code.
Abstract: High-level synthesis (HLS) promises to increase designer productivity in the face of increasing field-programmable gate array sizes, and broaden the market of use, allowing software designers to reap the benefits of hardware implementation. One roadblock to HLS adoption is the lack of an in-system debugging infrastructure. Although designers can run their software code on a workstation, or simulate the register-transfer level, neither can reliably capture the behaviors, and therefore bugs, that may be present in the final system. Debugging hardware circuits in-system requires using signal-tracing to record circuit behavior for later offline analysis. In this paper, we present a debugging architecture, which automatically records key hardware signals, and relates them back to the original software source code. This architecture allows designers to debug HLS circuits in-system, in the context of the original source code. We present several signal-tracing techniques, tailored to HLS circuits, which allow a much longer execution trace to be captured. These techniques include signal compression, dynamically changing which signals are recorded cycle-by-cycle, and offline signal restoration. Compared to using an embedded logic analyzer to perform signal-tracing, our architecture increases the length of execution trace that can be recorded by $127{\times }$. For each 100 Kb of trace buffer memory, our architecture can record 15,369 executed lines of C code.

Journal ArticleDOI
TL;DR: A circuit structure that performs a stateful logic operation on memristor memory based on a nanocrossbar is proposed, which overcomes the M-IMP limitation and eliminates its influence.
Abstract: Memristor-based material implication (M-IMP) logic is widely used for logic operations, offering the possibility of operating on memory directly. However, it has a limitation: the memristor is not able to reach its lowest resistance in M-IMP. In this brief, the M-IMP limitation and its influence are analyzed. In addition, a circuit structure that performs a stateful logic operation on memristor memory based on a nanocrossbar is proposed, which overcomes the M-IMP limitation and eliminates its influence. Moreover, we simulate the proposed circuit design, and the simulation results verify the correctness of the analysis.
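Material implication itself is easy to state in code, including the classic construction of NOT and NAND from IMP steps, which makes IMP functionally complete. This is a functional sketch; real M-IMP operates on memristor resistance states, not Python integers:

```python
def imp(p, q):
    """Material implication: p IMP q == (NOT p) OR q."""
    return int((not p) or q)

def not_gate(p):
    # NOT p = p IMP 0 (implicate into a memristor initialized to 0)
    return imp(p, 0)

def nand_gate(p, q):
    # NAND(p, q) = q IMP (p IMP 0): two IMP steps give a universal gate
    return imp(q, imp(p, 0))
```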

Journal ArticleDOI
TL;DR: Magnetic tunnel junction (MTJ) devices are leveraged to develop a novel full adder (FA) based on 3- and 5-input majority gates, using the spin Hall effect (SHE) to switch the MTJ states, which results in low-energy switching behavior.
Abstract: Magnetic tunnel junction (MTJ)-based devices have been studied extensively as a promising candidate to implement hybrid energy-efficient computing circuits due to their nonvolatility, high integration density, and CMOS compatibility. In this paper, MTJs are leveraged to develop a novel full adder (FA) based on 3- and 5-input majority gates. The spin Hall effect (SHE) is utilized to change the MTJ states, resulting in low-energy switching behavior. SHE-MTJ devices are modeled in Verilog-A using precise physical equations. A SPICE circuit simulator is used to validate the functionality of the 1-bit SHE-based FA. The simulation results show 76% and 32% improvement over a previous voltage-mode MTJ-based FA in terms of energy consumption and device count, respectively. The concatenability of our proposed 1-bit SHE-FA is investigated by developing a 4-bit SHE-FA. Finally, the delay and power consumption of an ${n}$ -bit SHE-based adder have been formulated to provide a basis for developing an energy-efficient SHE-based ${n}$ -bit arithmetic logic unit.
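The 3- and 5-input majority construction of a full adder can be checked exhaustively. This behavioral sketch uses the standard identities carry = MAJ3(a, b, cin) and sum = MAJ5(a, b, cin, NOT carry, NOT carry); the paper's contribution is the SHE-MTJ circuit realization of these gates, not the identities themselves:

```python
def maj3(a, b, c):
    return 1 if a + b + c >= 2 else 0

def maj5(a, b, c, d, e):
    return 1 if a + b + c + d + e >= 3 else 0

def full_adder(a, b, cin):
    """1-bit full adder built from majority gates plus inversion."""
    carry = maj3(a, b, cin)
    s = maj5(a, b, cin, 1 - carry, 1 - carry)
    return s, carry
```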

Journal ArticleDOI
TL;DR: This is the first work that uses the approximate dynamic programming techniques to solve the DEWH’s load management problem and results indicate that these techniques will minimize the energy consumed during load peak periods.
Abstract: In this paper, two techniques based on ${Q}$ -learning and action dependent heuristic dynamic programming (ADHDP) are demonstrated for the demand-side management of domestic electric water heaters (DEWHs). The problem is modeled as a dynamic programming problem, with the state space defined by the temperature of output water, the instantaneous hot water consumption rate, and the estimated grid load. According to simulation, ${Q}$ -learning and ADHDP reduce the cost of energy consumed by DEWHs by approximately 26% and 21%, respectively. The simulation results also indicate that these techniques will minimize the energy consumed during load peak periods. As a result, the customers saved about $466 and $367 annually by using ${Q}$ -learning and ADHDP techniques to control their DEWHs (100 gallons tank size) operation, which is better than the cost reduction that resulted from using the state-of-the-art ($246) control technique under the same simulation parameters. To the best of the authors’ knowledge, this is the first work that uses the approximate dynamic programming techniques to solve the DEWH’s load management problem.
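The Q-learning update used in such controllers is the standard tabular rule. This is a generic sketch: the paper's state space (water temperature, consumption rate, estimated grid load) and reward model for DEWHs are more elaborate, and the state/action names below are invented:

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
    return Q
```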

Journal ArticleDOI
TL;DR: The pressure-propagation delay, an intrinsic physical phenomenon in mVLSI biochips, is minimized in order to reduce the response time for valves, decrease the pattern set-up time, and synchronize valve actuation.
Abstract: Recent advances in flow-based microfluidic biochips have enabled the emergence of lab-on-a-chip devices for bimolecular recognition and point-of-care disease diagnostics. However, the adoption of flow-based biochips is hampered today by the lack of computer-aided design tools. Manual design procedures not only delay product development but they also inhibit the exploitation of the design complexity that is possible with current fabrication techniques. In this paper, we present the first practical problem formulation for automated control-layer design in flow-based microfluidic very large-scale integration (mVLSI) biochips and propose a systematic approach for solving this problem. Our goal is to find an efficient routing solution for control-layer design with a minimum number of control pins. The pressure-propagation delay, an intrinsic physical phenomenon in mVLSI biochips, is minimized in order to reduce the response time for valves, decrease the pattern set-up time, and synchronize valve actuation. Two fabricated flow-based devices and six synthetic benchmarks are used to evaluate the proposed optimization method. Compared with manual control-layer design and a baseline approach, the proposed approach leads to fewer control pins, better timing behavior, and shorter channel length in the control layer.

Journal ArticleDOI
TL;DR: This paper presents “tensor computation” as an alternative general framework for the development of efficient EDA algorithms and tools, and gives a basic tutorial on tensors, and suggests further open EDA problems where the use of tensor computation could be of advantage.
Abstract: Many critical electronic design automation (EDA) problems suffer from the curse of dimensionality, i.e., the very fast-scaling computational burden produced by a large number of parameters and/or unknown variables. This phenomenon may be caused by multiple spatial or temporal factors (e.g., 3-D field solver discretizations and multirate circuit simulation), nonlinearity of devices and circuits, a large number of design or optimization parameters (e.g., full-chip routing/placement and circuit sizing), or extensive process variations (e.g., variability/reliability analysis and design for manufacturability). The computational challenges generated by such high-dimensional problems are generally hard to handle efficiently with traditional EDA core algorithms that are based on matrix and vector computation. This paper presents “tensor computation” as an alternative general framework for the development of efficient EDA algorithms and tools. A tensor is a high-dimensional generalization of a matrix and a vector, and is a natural choice for both storing and efficiently solving high-dimensional EDA problems. This paper gives a basic tutorial on tensors, demonstrates some recent examples of EDA applications (e.g., nonlinear circuit modeling and high-dimensional uncertainty quantification), and suggests further open EDA problems where the use of tensor computation could be of advantage.
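The storage argument for tensors can be made concrete with a rank-1 example (the sizes below are illustrative): a low-rank third-order tensor is an outer product of three vectors, so storing the factors needs the sum of the dimensions rather than their product.

```python
import numpy as np

# A rank-1 third-order tensor is the outer product of three vectors.
a = np.arange(1.0, 31.0)                  # 30 entries
b = np.arange(1.0, 41.0)                  # 40 entries
c = np.arange(1.0, 51.0)                  # 50 entries
T = np.einsum('i,j,k->ijk', a, b, c)      # full 30 x 40 x 50 tensor

dense_count = T.size                       # 30 * 40 * 50 = 60000 entries
factored_count = a.size + b.size + c.size  # 30 + 40 + 50 = 120 entries
```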

Journal ArticleDOI
TL;DR: A new design methodology for radiofrequency circuits is presented that includes electromagnetic (EM) simulation of the inductors into the optimization flow and is illustrated both for a single-objective and a multiobjective optimization of a low noise amplifier.
Abstract: A new design methodology for radiofrequency circuits is presented that includes electromagnetic (EM) simulation of the inductors into the optimization flow. This is achieved by previously generating the Pareto-optimal front (POF) of the inductors using EM simulation. Inductors are selected from the Pareto front and their ${S}$ -parameter matrix is included in the circuit netlist that is simulated using an RF simulator. Generating the EM-simulated POF of inductors is computationally expensive, but once generated, it can be used for any circuit design. The methodology is illustrated both for a single-objective and a multiobjective optimization of a low noise amplifier.
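Selecting inductors from a Pareto-optimal front amounts to filtering nondominated points. A minimal version for objectives minimized in all coordinates, with hypothetical (loss, area) pairs, looks like this:

```python
def pareto_front(points):
    """Return the nondominated subset (minimization in every coordinate).

    A point is kept only if no other point is <= in every coordinate
    while differing in at least one -- e.g. an inductor characterized
    by (loss, area) survives only if nothing beats it on both counts.
    """
    front = []
    for p in points:
        dominated = any(
            all(qi <= pi for qi, pi in zip(q, p)) and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front
```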

Journal ArticleDOI
TL;DR: A two-level finite-state machine (FSM) is proposed to correct erroneous bits generated by environmental variations (e.g., temperature, voltage, and aging variations) to enable lightweight, secure, and reliable PUF-based authentication.
Abstract: Physical unclonable functions (PUFs) can extract chip-unique signatures from integrated circuits (ICs) by exploiting the uncontrollable randomness due to manufacturing process variations. These signatures can then be used for many hardware security applications including authentication, anti-counterfeiting, IC metering, signature generation, and obfuscation. However, most of these applications require error correcting methods to produce consistent PUF responses across different environmental conditions. This paper presents a novel method to enable lightweight, secure, and reliable PUF-based authentication. A two-level finite-state machine (FSM) is proposed to correct erroneous bits generated by environmental variations (e.g., temperature, voltage, and aging variations). In the proposed method, each PUF response is mapped to a key during the design phase. The actual key can be determined from the PUF response only after the chip is fabricated. Because the key is not known to the foundry, the proposed approach prevents counterfeiting. The performance of the proposed method and other applications are also discussed. Our experimental results show that the cost of the proposed self-correcting two-level FSM is significantly less than that of the commonly used error correcting codes. It is shown that the proposed self-correcting FSM consumes about $2{\boldsymbol \times }$ to $10{\boldsymbol \times }$ less area and about $20{\boldsymbol \times }$ to $100{\boldsymbol \times }$ less power than the Bose–Chaudhuri–Hocquenghem codes.

Journal ArticleDOI
TL;DR: In-vehicle network traffic monitoring is proposed to detect increased transmission rates of manipulated message streams and an automatic distribution of detection tasks among selected E/E architecture components, such as a subset of electronic control units.
Abstract: Due to the growing interconnectedness and complexity of in-vehicle networks, in addition to safety, security is becoming an increasingly important topic in the automotive domain. In this paper, we study techniques for detecting security infringements in automotive electrical and electronic (E/E) architectures. Toward this we propose in-vehicle network traffic monitoring to detect increased transmission rates of manipulated message streams. Attacks causing timing violations can disrupt safety-critical functions and have severe consequences. To reduce costs and prevent single points of failure, our approach enables an automatic distribution of detection tasks among selected E/E architecture components, such as a subset of electronic control units. First, we analyze a concrete E/E system architecture to determine the communication parameters and properties necessary for detecting security attacks. These are then used for a parametrization of the corresponding detection algorithms and the distribution of attack detection tasks. We use a lightweight message monitoring method and optimize the placement of detection tasks to ensure a full-coverage of the E/E system architecture and a timely detection of an attack.
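A minimal version of the transmission-rate check described above is a sliding-window counter. This is an illustrative sketch; the paper distributes such monitors across E/E architecture components and derives the limits from the analyzed communication parameters:

```python
from collections import deque

class RateMonitor:
    """Flag a message stream whose observed rate exceeds its spec.

    Keeps timestamps in a sliding window; an attacker injecting extra
    frames of a known periodic message raises the count in the window
    above the limit derived from the specified period.
    """
    def __init__(self, window_s, max_msgs):
        self.window_s = window_s
        self.max_msgs = max_msgs
        self.times = deque()

    def observe(self, t):
        """Record one message at time t; return True on a violation."""
        self.times.append(t)
        while self.times and t - self.times[0] > self.window_s:
            self.times.popleft()
        return len(self.times) > self.max_msgs
```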

Journal ArticleDOI
TL;DR: Simulation results show that the adaptive framework efficiently utilizes on-chip resources to reduce time-to-result without sacrificing the chip’s lifetime; this is the first design-automation framework for quantitative gene expression.
Abstract: Considerable effort has recently been directed toward the implementation of molecular bioassays on digital-microfluidic biochips (DMFBs). However, today’s solutions suffer from the drawback that multiple sample pathways are not supported and on-chip reconfigurable devices are not efficiently exploited. As a result, impractical manual intervention is needed to process protocols for gene-expression analysis. To overcome this problem, we first describe our benchtop experimental studies to understand gene-expression analysis and its relationship to the biochip design specification. We then introduce an integrated framework for quantitative gene-expression analysis using DMFBs. The proposed framework includes: 1) a spatial-reconfiguration technique that incorporates resource-sharing specifications into the synthesis flow; 2) an interactive firmware that collects and analyzes sensor data based on quantitative polymerase chain reaction; and 3) a real-time resource-allocation scheme that responds promptly to decisions about the protocol flow received from the firmware layer. This framework is combined with cyberphysical integration to develop the first design-automation framework for quantitative gene expression. Simulation results show that our adaptive framework efficiently utilizes on-chip resources to reduce time-to-result without sacrificing the chip’s lifetime.

Journal ArticleDOI
TL;DR: The proposed satisfiability-based dilution algorithm outperforms existing dilution algorithms in terms of mixing steps and waste production, and compares favorably with respect to reagent-usage (cost) when 4- and 8-segment rotary mixers are used.
Abstract: Although sample preparation is well studied for digital microfluidic biochips, few prior works have addressed this problem in the context of continuous-flow microfluidics from an algorithmic perspective. In the latter class of chips, microvalves and micropumps are used to manipulate on-chip fluid flow through microchannels in order to execute a biochemical protocol. Dilution of a sample fluid is a special case of sample preparation, where only two input reagents (commonly known as sample and buffer ) are mixed in a desired volumetric ratio. In this paper, we propose a satisfiability-based dilution algorithm assuming the generalized mixing models supported by an ${ \boldsymbol {N}}$ -segment, continuous-flow, rotary mixer. Given a target concentration and an error limit, the proposed algorithm first minimizes the number of mixing operations, and subsequently, reduces reagent-usage. Simulation results demonstrate that the proposed method outperforms existing dilution algorithms in terms of mixing steps (assay time) and waste production, and compares favorably with respect to reagent-usage (cost) when 4- and 8-segment rotary mixers are used. Next, we propose two variants of an algorithm for handling the open problem of ${k}$ -reagent mixture-preparation ( ${k\geq 3}$ ) with an ${N}$ -segment continuous-flow rotary mixer, and report experimental results to evaluate their performance. A software tool called flow-based sample preparation algorithm has also been developed that can be readily used for running the proposed algorithms.
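For the 2-segment special case of the mixing model, dilution reduces to the classic (1:1) bisection scheme: each step mixes the current droplet 1:1 with sample (concentration 1.0) or buffer (0.0), so reachable concentrations are k/2^d after d steps. The sketch below captures only this special case; the paper's algorithm handles general N-segment mixers and also minimizes waste, which this toy version does not:

```python
def bisection_dilution(target, error, max_steps=16):
    """(1:1) mixing sequence approximating a target concentration.

    Returns the mixing choices in execution order and the achieved
    concentration, guaranteed within `error` of the target.
    """
    # d mixing steps give precision 1/2^d; pick the smallest adequate d.
    d = 1
    while 2 ** -d > error and d < max_steps:
        d += 1
    k = round(target * 2 ** d)            # nearest reachable k/2^d
    choices = []
    conc = 0.0
    # The last fluid added carries weight 1/2, so the most significant
    # bit of k must be mixed in last: process bits LSB-first.
    for bit in reversed(format(k, f'0{d}b')):
        fluid = 1.0 if bit == '1' else 0.0
        conc = (conc + fluid) / 2.0       # one 1:1 mixing operation
        choices.append('sample' if bit == '1' else 'buffer')
    return choices, conc
```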

Journal ArticleDOI
TL;DR: This contribution highlights the improvements of ToPoliNano, which is now an innovative and complete tool for the development of iNML technology, including a circuit editor for the custom design of field-coupled nanodevices, improved algorithms for netlist optimization, and new algorithms for the place and route of iNML circuits.
Abstract: In the post-CMOS scenario, field-coupled nanotechnologies represent an innovative and interesting new direction for electronic nanocomputing. Among these technologies, nanomagnet logic (NML) makes it possible to finally embed logic and memory in the same device. To fully analyze the potential of NML circuits, design tools that mimic the CMOS design flow should be used for circuit design. We present, in this paper, the latest and improved version of Torino Politecnico Nanotechnology (ToPoliNano), our design and simulation framework for field-coupled nanotechnologies. ToPoliNano emulates the top-down design process of CMOS technology. Circuits are described with a VHSIC hardware description language netlist, and the layout is then automatically generated considering in-plane NML (iNML) technology. The resulting circuits can be simulated and their performance analyzed. In this paper, we describe several enhancements to the tool itself, such as a circuit editor for the custom design of field-coupled nanodevices, improved algorithms for netlist optimization, and new algorithms for the place and route of iNML circuits. We have validated and analyzed the tool using extensive metrics, on both standard circuits and ISCAS’85 benchmarks. This contribution highlights the improvements of ToPoliNano, which is now an innovative and complete tool for the development of iNML technology.

Journal ArticleDOI
TL;DR: This paper begins by introducing several key concepts in machine learning and data mining, followed by a review of different learning approaches, and describes the experience of developing a practical data mining application.
Abstract: Applying modern data mining in electronic design automation and test has become an area of growing interest in recent years. This paper reviews some of the recent developments in the area. It begins by introducing several key concepts in machine learning and data mining, followed by a review of different learning approaches. Then, the experience of developing a practical data mining application is described, including promises demonstrated through positive results based on industrial settings and challenges explained in the respective application contexts. Future research directions are summarized at the end.

Journal ArticleDOI
TL;DR: This paper proposes workload-aware reliability management (WARM), a fast DRM technique adapting to diverse workload requirements to trade reliability and user experience, and develops an optimal policy for multicores using convex optimization.
Abstract: With CMOS scaling beyond 14 nm, reliability is a major concern for IC manufacturers. Reliability-aware design has a non-negligible overhead and cannot account for user experience in mobile devices. An alternative is dynamic reliability management (DRM), which counteracts degradation by adapting the operating conditions at runtime. In this paper, for the first time we formulate DRM as an optimization problem that accounts for reliability, temperature, and performance. We develop an optimal policy for multicores using convex optimization, and show that it is not feasible to implement on real systems. For this reason, we propose workload-aware reliability management (WARM), a fast DRM technique adapting to diverse workload requirements to trade off reliability and user experience. WARM is implemented and tested on a real Android device. WARM approximates the solution of the convex solver within 5% on average, while executing more than $400\times$ faster. WARM integrates a thermal controller that allocates tasks to meet thermal constraints. This is required since degradation strongly depends on temperature. We show that WARM meets temperature constraints within 5% in 87.5% more cases than the state-of-the-art. We show that WARM task allocation achieves up to one year of lifetime improvement for a multicore platform. It can achieve up to 100% performance improvement on cluster architectures, such as big.LITTLE, while still guaranteeing the reliability target. Finally, we show that it achieves performance within 4% of the maximum for a broad range of applications, while meeting the reliability constraints.
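The flavor of a runtime DRM decision, stripped to a toy: among available operating points, pick the fastest one whose projected temperature and lifetime still meet the constraints. The models and numbers below are invented placeholders, not the paper's convex formulation or WARM itself:

```python
def pick_frequency(freqs, temp_model, lifetime_model, t_max, mttf_min):
    """Greedy stand-in for a DRM policy: return the highest frequency
    whose steady-state temperature and projected lifetime both meet the
    targets; otherwise fall back to the most conservative point."""
    for f in sorted(freqs, reverse=True):
        t = temp_model(f)
        if t <= t_max and lifetime_model(t) >= mttf_min:
            return f
    return min(freqs)
```

Degradation mechanisms such as BTI are strongly temperature-dependent, which is why a thermal model sits inside the decision loop rather than being checked after the fact.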

Journal ArticleDOI
TL;DR: This paper proposes efficient error detection architectures including variants of recomputing with encoded operands and signature-based schemes to detect both transient and permanent faults and shows that the proposed schemes are applicable to the case study of simple lightweight CFB for providing authenticated encryption with associated data.
Abstract: Cryptographic architectures provide different security properties to sensitive usage models. However, unless the reliability of such architectures is guaranteed, these security properties can be undermined through natural or malicious faults. In this paper, two underlying block ciphers which can be used in authenticated encryption algorithms are considered, i.e., the light encryption device (LED) and high security and lightweight (HIGHT) block ciphers. The former is of the Advanced Encryption Standard type and has been considered area-efficient, while the latter constitutes a Feistel network structure and is suitable for low-complexity and low-power embedded security applications. In this paper, we propose efficient error detection architectures, including variants of recomputing with encoded operands and signature-based schemes, to detect both transient and permanent faults. Authenticated encryption is applied in cryptography to provide confidentiality, integrity, and authenticity simultaneously to the message sent in a communication channel. In this paper, we show that the proposed schemes are applicable to the case study of simple lightweight CFB (SILC) for providing authenticated encryption with associated data. The error simulations are performed using the Xilinx Integrated Synthesis Environment tool, and the results are benchmarked for the Xilinx FPGA family Virtex-7 to assess the reliability capability and efficiency of the proposed architectures.
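To see how recomputing with encoded operands catches transient faults, here is a toy recomputing-with-rotated-operands (RERO-style) check on a nibble-parallel S-box layer. The S-box, state size, and function names are made up for illustration and bear no relation to LED or HIGHT:

```python
SBOX = [(7 * x + 3) % 16 for x in range(16)]      # toy 4-bit S-box (hypothetical)

def sub_nibbles(state):
    """Apply the S-box to every nibble (an elementwise-parallel layer)."""
    return [SBOX[n] for n in state]

def rot(state, k):
    """Rotate a list of nibbles left by k positions."""
    k %= len(state)
    return state[k:] + state[:k]

def detect_fault(state, compute):
    """Recompute with rotated operands: an elementwise layer commutes
    with rotation, so compute(state) must equal the un-rotated result of
    compute(rot(state)).  A transient fault that hits only one of the
    two runs breaks the equality and is flagged."""
    first = compute(state)
    second = rot(compute(rot(state, 1)), -1)      # undo the rotation
    return first != second                        # True -> fault detected
```

Because the fault-free equality holds structurally, the check adds no storage for golden outputs; it trades one extra (time-shifted) computation for detection of transient upsets.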

Journal ArticleDOI
TL;DR: The objective of this paper is to provide a unified perspective on the fundamental opportunities and challenges posed by 3-D ICs especially from the context of design tools and methods and conclude with a discussion of the remaining challenges and open problems that must be overcome to make 3- D IC technology commercially viable.
Abstract: Vertically integrated circuits (3-D ICs) may revitalize Moore’s law scaling, which has slowed down in recent years. 3-D stacking is an emerging technology that stacks multiple dies vertically to achieve higher transistor density independent of device scaling. Such stacks provide high-density vertical interconnects, which can reduce interconnect power and delay. Moreover, 3-D ICs can integrate disparate circuit technologies into a single chip, thereby unlocking new system-on-chip architectures that do not exist in 2-D technology. While 3-D integration could bring new architectural opportunities and significant performance enhancement, new thermal, power delivery, signal integrity, and reliability challenges emerge as power consumption grows and device density increases. Moreover, the significant expansion of CPU design space in 3-D requires new architectural models and methodologies for design space exploration (DSE). New design tools and methods are required to address these 3-D-specific challenges. This keynote paper focuses on the state of the art, ongoing advances, and future challenges of 3-D IC design tools and methods. The primary focus of this paper is TSV-based 3-D ICs, although we also discuss recent advances in monolithic 3-D ICs. The objective of this paper is to provide a unified perspective on the fundamental opportunities and challenges posed by 3-D ICs, especially from the perspective of design tools and methods. We also discuss the methodology of co-design to address more complicated and interdependent design problems in 3-D ICs, and conclude with a discussion of the remaining challenges and open problems that must be overcome to make 3-D IC technology commercially viable.

Journal ArticleDOI
TL;DR: This paper addresses the problem by proposing a mathematical transformation scheme to bound the spacing error and building a distributed control algorithm on that basis, which achieves a spacing error satisfying both uniform boundedness and uniform ultimate boundedness.
Abstract: Intelligent transportation has become an essential field of cyber-physical systems. Among various intelligent transportation technologies, the automated highway system (AHS) has its unique advantage of being able to coordinate a platoon of vehicles as a whole unit. The major challenge of building a robust AHS is the nonlinear and (potentially) fast time-varying uncertainty induced by parameter variations and external disturbances. Ultimately reflected as the spacing between neighboring vehicles, such uncertainties can be a serious concern for maintaining safety. This paper addresses the problem by proposing a mathematical transformation scheme to bound the spacing error and building a distributed control algorithm on that basis. The proposed algorithm achieves a spacing error satisfying both uniform boundedness and uniform ultimate boundedness. Our decentralized algorithm is communication-efficient in the sense that it only requires the state information of the preceding car and the acceleration feedback, and does not need to communicate with all other cars.
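A toy double-integrator simulation conveys what a uniformly ultimately bounded spacing error looks like: each follower uses only its predecessor's state, and the error settles into a small neighborhood of zero despite a leader disturbance. The vehicle model and controller gains below are illustrative placeholders, not the paper's algorithm:

```python
def simulate_platoon(n_cars=4, steps=2000, dt=0.01, gap=10.0):
    """Toy platoon: each follower measures only its predecessor's
    position/velocity (plus its own state) and applies a PD law on the
    constant-spacing error e_i = x_{i-1} - x_i - gap.  Returns the
    largest spacing-error magnitude seen over the run."""
    kp, kd = 2.0, 3.0                          # hypothetical gains
    x = [-gap * i for i in range(n_cars)]      # start at desired spacing
    v = [0.0] * n_cars
    max_err = 0.0
    for t in range(steps):
        # Leader applies a speed disturbance (external uncertainty).
        a = [0.5 if t < steps // 2 else -0.5]
        for i in range(1, n_cars):
            e = x[i - 1] - x[i] - gap          # spacing error
            de = v[i - 1] - v[i]               # relative speed
            a.append(kp * e + kd * de)         # preceding-car info only
            max_err = max(max_err, abs(e))
        for i in range(n_cars):
            x[i] += v[i] * dt
            v[i] += a[i] * dt
    return max_err
```

With these (overdamped) gains the error stays a small fraction of the 10 m gap for the whole run, which is the qualitative behavior the boundedness guarantees formalize.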

Journal ArticleDOI
TL;DR: This paper explores the interactions between DRAM and PCM to improve both the performance and the endurance of a DRAM-PCM hybrid main memory and develops a proactive strategy to allocate pages taking both program segments and DRAM conflict misses into consideration.
Abstract: Phase change memory (PCM), given its nonvolatility, potential high density, and low standby power, is a promising candidate to be used as main memory in next generation computer systems. However, to hide its shortcomings of limited endurance and slow write performance, state-of-the-art solutions tend to construct a dynamic RAM (DRAM)-PCM hybrid memory and place write-intensive pages in DRAM. While existing optimizations to this hybrid architecture focus on tuning DRAM configurations to reduce the number of write operations to PCM, this paper explores the interactions between DRAM and PCM to improve both the performance and the endurance of a DRAM-PCM hybrid main memory. Specifically, it exploits the flexibility of mapping virtual pages to physical pages, and develops a proactive strategy to allocate pages, taking both program segments and DRAM conflict misses into consideration, thus distributing those heavily written pages across different DRAM sets. Meanwhile, a lifetime-aware DRAM replacement algorithm and a conflict-aware page remapping strategy are proposed to further reduce DRAM misses and PCM writes. Experiments confirm that the proposed techniques are able to improve average memory hit time and reduce maximum PCM write counts, thus enhancing both the performance and lifetime of a DRAM-PCM hybrid main memory.
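A crude sketch of the placement idea (not the paper's actual policy): rank pages by write intensity, keep the hottest ones in DRAM, and interleave them across DRAM sets so heavily written pages do not pile up in one set. All names and parameters here are hypothetical:

```python
def allocate_pages(write_counts, dram_sets, ways_per_set):
    """Greedy hybrid-memory placement: the most write-intensive pages go
    to DRAM, spread round-robin across sets to avoid conflict misses;
    everything else stays in PCM.  Returns {page: (tier, set_or_None)}."""
    dram_capacity = dram_sets * ways_per_set
    order = sorted(write_counts, key=write_counts.get, reverse=True)
    placement = {}
    for rank, page in enumerate(order):
        if rank < dram_capacity:
            placement[page] = ('dram', rank % dram_sets)  # interleave sets
        else:
            placement[page] = ('pcm', None)
    return placement
```

A real system would refine this online (the paper adds lifetime-aware replacement and conflict-aware remapping), but the static version already shows why spreading hot pages across sets absorbs writes that would otherwise reach PCM.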

Journal ArticleDOI
TL;DR: A complete methodology for designing the private local memories (PLMs) of multiple accelerators based on the memory requirements of each accelerator, which automatically determines an area-efficient architecture for the PLMs to guarantee performance and reduce the memory cost based on technology-related information.
Abstract: In modern system-on-chip architectures, specialized accelerators are increasingly used to improve performance and energy efficiency. The growing complexity of these systems requires the use of system-level design methodologies featuring high-level synthesis (HLS) for generating these components efficiently. Existing HLS tools, however, have limited support for the system-level optimization of memory elements, which typically occupy most of the accelerator area. We present a complete methodology for designing the private local memories (PLMs) of multiple accelerators. Based on the memory requirements of each accelerator, our methodology automatically determines an area-efficient architecture for the PLMs to guarantee performance and reduce the memory cost based on technology-related information. We implemented a prototype tool, called Mnemosyne, that embodies our methodology within a commercial HLS flow. We designed 13 complex accelerators for selected applications from two recently released benchmark suites (Perfect and CortexSuite). With our approach we are able to reduce the memory cost of single accelerators by up to 45%. Moreover, when reusing memory IPs across accelerators, we achieve area savings that range between 17% and 55% compared to the case where the PLMs are designed separately.
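The area saving from reusing memory IPs across accelerators rests on a simple observation: PLMs whose accelerators are never active at the same time can share one physical memory sized for the largest requirement in the group. A greedy first-fit sketch of that grouping (hypothetical interface, not Mnemosyne's algorithm):

```python
def shared_memory_cost(sizes, compatible):
    """First-fit-decreasing grouping: PLMs are added (largest first) to
    the first group whose members they are all compatible with, i.e.
    never active simultaneously.  Each group becomes one memory sized
    for its largest member.  Returns (total cost, groups)."""
    groups = []
    for name in sorted(sizes, key=sizes.get, reverse=True):
        for g in groups:
            if all((name, m) in compatible or (m, name) in compatible
                   for m in g):
                g.append(name)
                break
        else:
            groups.append([name])
    # The first member of each group is its largest, hence its cost.
    return sum(sizes[g[0]] for g in groups), groups
```

Even this naive grouping turns declared non-overlap into area savings; a production flow would additionally weigh banking, ports, and technology-specific memory primitives.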

Journal ArticleDOI
TL;DR: This paper presents a novel method, using on-chip sensors based on ring oscillators (ROSCs), to detect the delay shifts in circuits as a result of aging, using presilicon analysis of the circuit to compute calibration factors that can translate BTI- and HCI-induced delay shift in the ROSC to those in the circuit of interest.
Abstract: The performance of nanometer-scale circuits is adversely affected by aging induced by bias temperature instability (BTI) and hot carrier injection (HCI). Both BTI and HCI impact transistor electrical parameters at a level that depends on the operating environment and usage of the circuit. This paper presents a novel method, using on-chip sensors based on ring oscillators (ROSCs), to detect the delay shifts in circuits as a result of aging. Our method uses presilicon analysis of the circuit to compute calibration factors that can translate BTI- and HCI-induced delay shifts in the ROSC to those in the circuit of interest. Our simulations show that the delay estimates are within 1% of the true values from presilicon analysis. Further, for postsilicon analysis, a refinement strategy is proposed where sensor measurements can be amalgamated with infrequent online delay measurements on the monitored circuit to partially capture its true workloads. This leads to about 8% lower delay guardbanding overheads compared to conventional methods, as demonstrated using benchmark circuits.
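The calibration step can be pictured as a small least-squares problem: presilicon aging simulations supply samples pairing the sensor's (BTI, HCI) delay shifts with the observed circuit delay shift, two factors are fitted, and those factors later translate in-field ROSC readings into circuit estimates. A dependency-free sketch with made-up numbers, not the paper's actual calibration procedure:

```python
def fit_calibration(rosc_shifts, circuit_shifts):
    """Least-squares fit of calibration factors (k_bti, k_hci) from
    presilicon samples: each sample pairs the sensor's (BTI, HCI) delay
    shifts with the circuit delay shift.  Solves the 2x2 normal
    equations directly, so no linear-algebra library is needed."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for (x1, x2), y in zip(rosc_shifts, circuit_shifts):
        s11 += x1 * x1; s12 += x1 * x2; s22 += x2 * x2
        b1 += x1 * y;   b2 += x2 * y
    det = s11 * s22 - s12 * s12
    k_bti = (s22 * b1 - s12 * b2) / det
    k_hci = (s11 * b2 - s12 * b1) / det
    return k_bti, k_hci

def estimate_shift(k_bti, k_hci, rosc_bti, rosc_hci):
    """Translate an on-chip ROSC measurement into a circuit delay shift."""
    return k_bti * rosc_bti + k_hci * rosc_hci
```

Once fitted offline, applying the factors at runtime is two multiplies per reading, which is what makes sensor-based tracking cheap enough to run continuously.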