
Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2015"


Journal ArticleDOI
TL;DR: This work presents TrueNorth, a 65 mW real-time neurosynaptic processor that implements a non-von Neumann, low-power, highly parallel, scalable, and defect-tolerant architecture, and demonstrates TrueNorth-based systems in multiple applications, including visual object recognition.
Abstract: The new era of cognitive computing brings forth the grand challenge of developing systems capable of processing massive amounts of noisy multisensory data. This type of intelligent computing poses a set of constraints, including real-time operation, low-power consumption and scalability, which require a radical departure from conventional system design. Brain-inspired architectures offer tremendous promise in this area. To this end, we developed TrueNorth, a 65 mW real-time neurosynaptic processor that implements a non-von Neumann, low-power, highly-parallel, scalable, and defect-tolerant architecture. With 4096 neurosynaptic cores, the TrueNorth chip contains 1 million digital neurons and 256 million synapses tightly interconnected by an event-driven routing infrastructure. The fully digital 5.4 billion transistor implementation leverages existing CMOS scaling trends, while ensuring one-to-one correspondence between hardware and software. With such aggressive design metrics and the TrueNorth architecture breaking path with prevailing architectures, it is clear that conventional computer-aided design (CAD) tools could not be used for the design. As a result, we developed a novel design methodology that includes mixed asynchronous–synchronous circuits and a complete tool flow for building an event-driven, low-power neurosynaptic chip. The TrueNorth chip is fully configurable in terms of connectivity and neural parameters to allow custom configurations for a wide range of cognitive and sensory perception applications. To reduce the system’s communication energy, we have adapted existing application-agnostic very large-scale integration CAD placement tools for mapping logical neural networks to the physical neurosynaptic core locations on the TrueNorth chips. 
With that, we have successfully demonstrated the use of TrueNorth-based systems in multiple applications, including visual object recognition, with higher performance and orders of magnitude lower power consumption than the same algorithms run on von Neumann architectures. The TrueNorth chip and its tool flow serve as building blocks for future cognitive systems, and give designers an opportunity to develop novel brain-inspired architectures and systems based on the knowledge obtained from this paper.
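The communication-energy reduction described above is at heart a placement problem: logical cores that exchange many spikes should sit close together on the physical core grid. As a loose illustration (not the adapted VLSI CAD flow the authors used), a toy pairwise-swap hill climber over traffic-weighted Manhattan distance looks like this; the traffic table and grid coordinates are invented for the example:

```python
def total_cost(traffic, positions):
    """Traffic-weighted Manhattan distance between communicating cores."""
    return sum(w * (abs(positions[i][0] - positions[j][0]) +
                    abs(positions[i][1] - positions[j][1]))
               for (i, j), w in traffic.items())

def swap_placement(traffic, positions):
    """Pairwise-swap hill climbing: exchange two cores' physical locations
    whenever doing so lowers total communication cost (illustrative only)."""
    pos = list(positions)
    improved = True
    while improved:
        improved = False
        for a in range(len(pos)):
            for b in range(a + 1, len(pos)):
                before = total_cost(traffic, pos)
                pos[a], pos[b] = pos[b], pos[a]
                if total_cost(traffic, pos) < before:
                    improved = True
                else:
                    pos[a], pos[b] = pos[b], pos[a]  # revert the swap
    return pos
```

With cores 0 and 1 exchanging heavy traffic but initially placed far apart, the climber moves them adjacent and the cost drops sharply.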

1,105 citations


Journal ArticleDOI
TL;DR: This paper provides the first in-depth and comprehensive literature overview of helper data algorithms (HDAs) and exposes new threats regarding helper data leakage and manipulation.
Abstract: Security-critical products rely on the secrecy and integrity of their cryptographic keys. This is challenging for low-cost resource-constrained embedded devices, with an attacker having physical access to the integrated circuit (IC). Physically unclonable functions are an emerging technology in this market. They extract bits from unavoidable IC manufacturing variations, remarkably analogous to unique human fingerprints. However, post-processing by helper data algorithms (HDAs) is indispensable to meet the stringent key requirements: reproducibility, high entropy, and control. The novelty of this paper is threefold. We are the first to provide an in-depth and comprehensive literature overview on HDAs. Second, our analysis exposes new threats regarding helper data leakage and manipulation. Third, we identify several hiatuses/open problems in the existing literature.

226 citations


Journal ArticleDOI
TL;DR: This work proposes a framework to mine requirements from a closed-loop model of an industrial-scale control system, such as one specified in Simulink, and demonstrates the scalability and utility of the technique on three complex case studies in the domain of automotive powertrain systems.
Abstract: Formal verification of a control system can be performed by checking if a model of its dynamical behavior conforms to temporal requirements. Unfortunately, adoption of formal verification in an industrial setting is a formidable challenge as design requirements are often vague, nonmodular, evolving, or sometimes simply unknown. We propose a framework to mine requirements from a closed-loop model of an industrial-scale control system, such as one specified in Simulink. The input to our algorithm is a requirement template expressed in parametric signal temporal logic: a logical formula in which concrete signal or time values are replaced with parameters. Given a set of simulation traces of the model, our method infers values for the template parameters to obtain the strongest candidate requirement satisfied by the traces. It then tries to falsify the candidate requirement using a falsification tool. If a counterexample is found, it is added to the existing set of traces and these steps are repeated; otherwise, it terminates with the synthesized requirement. Requirement mining has several usage scenarios: mined requirements can be used to formally validate future modifications of the model, they can be used to gain better understanding of legacy models or code, and can also help enhancing the process of bug finding through simulations. We demonstrate the scalability and utility of our technique on three complex case studies in the domain of automotive powertrain systems: a simple automatic transmission controller, an air-fuel controller with a mean-value model of the engine dynamics, and an industrial-size prototype airpath controller for a diesel engine. We include results on a bug found in the prototype controller by our method.
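The mine-then-falsify loop described above can be sketched for the simplest parametric template, "always y ≤ c". The inner random search below is a hypothetical stand-in for a real falsification tool, and the input domain [0, 1] is an assumption of this sketch:

```python
import random

def mine_threshold(simulate, initial_traces, rounds=50, seed=0):
    """Counterexample-guided mining of the strongest requirement of the
    shape 'always y <= c' (the simplest parametric template).

    `simulate(x)` returns a trace of y values; the random search standing
    in for the falsifier assumes inputs drawn from [0, 1]."""
    rng = random.Random(seed)
    traces = list(initial_traces)
    while True:
        # Strongest candidate bound satisfied by every trace seen so far.
        c = max(max(t) for t in traces)
        counterexample = None
        for _ in range(rounds):           # try to falsify 'always y <= c'
            trace = simulate(rng.uniform(0.0, 1.0))
            if max(trace) > c:
                counterexample = trace
                break
        if counterexample is None:
            return c                      # candidate survived falsification
        traces.append(counterexample)     # refine with the witness, repeat
```

For `simulate(x) = [x, 2*x]` the loop pushes the mined bound toward the true supremum c = 2, terminating once a falsification budget finds no violation.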

220 citations


Journal ArticleDOI
TL;DR: VeriTrust is a novel verification technique for hardware trust that detects hardware Trojans (HTs) inserted at the design stage and is insensitive to the implementation style of HTs.
Abstract: Today's integrated circuit designs are vulnerable to a wide range of malicious alterations, namely hardware Trojans (HTs). HTs serve as backdoors to subvert or augment the normal operation of infected devices, which may lead to functionality changes, sensitive information leakage, or denial-of-service attacks. To tackle such threats, this paper proposes a novel verification technique for hardware trust, namely VeriTrust, which facilitates the detection of HTs inserted at the design stage. Based on the observation that HTs are usually activated by dedicated trigger inputs that are not sensitized with verification test cases, VeriTrust automatically identifies such potential HT trigger inputs by examining verification corners. The key difference between VeriTrust and existing HT detection techniques based on “unused circuit identification” is that VeriTrust is insensitive to the implementation style of HTs. Experimental results show that VeriTrust is able to detect all HTs evaluated in this paper (constructed based on various HT design methodologies shown in this paper) at the cost of moderate extra verification time.

165 citations


Journal ArticleDOI
TL;DR: This paper presents a detailed evaluation of performance- and energy-related parameters, compares the novel SOT-MRAM with several other memory technologies, and shows that a hybrid combination of SRAM for the L1 data cache and SOT-MRAM for the L1 instruction cache and L2 cache can reduce energy consumption by 60% while performance increases by 1% compared to an SRAM-only configuration.
Abstract: Magnetic Random Access Memory (MRAM) is a very promising emerging memory technology because of its various advantages such as nonvolatility, high density and scalability. In particular, Spin Orbit Torque (SOT) MRAM is gaining interest as it comes along with all the benefits of its predecessor Spin Transfer Torque (STT) MRAM, but is supposed to eliminate some of its shortcomings. Especially the split of read and write paths in SOT-MRAM promises faster access times and lower energy consumption compared to STT-MRAM. In this paper, we provide a very detailed analysis of SOT-MRAM at both the circuit- and architecture-level. We present a detailed evaluation of performance and energy related parameters and compare the novel SOT-MRAM with several other memory technologies. Our architecture-level analysis shows that a hybrid combination of SRAM for the L1-Data-cache, SOT-MRAM for the L1-Instruction-cache and L2-cache can reduce the energy consumption by 60% while the performance increases by 1% compared to an SRAM-only configuration. Moreover, the retention failure probability of SOT-MRAM is 27× smaller than the probability of radiation-induced Soft Errors in SRAM, for a 65 nm technology node. All of these advantages together make SOT-MRAM a viable choice for microprocessor caches.

152 citations


Journal ArticleDOI
TL;DR: This work introduces a multiplexer-based locking strategy that preserves test response, allowing IC testing by an untrusted party before activation, and demonstrates a simple yet effective attack against a locked circuit that does not preserve test response.
Abstract: The increasing IC manufacturing cost encourages a business model where design houses outsource IC fabrication to remote foundries. Despite cost savings, this model exposes design houses to IC piracy as remote foundries can manufacture in excess to sell on the black market. Recent efforts in digital hardware security aim to thwart piracy by using XOR-based chip locking, cryptography, and active metering. To counter direct attacks and lower the exposure of unlocked circuits to the foundry, we introduce a multiplexor-based locking strategy that preserves test response allowing IC testing by an untrusted party before activation. We demonstrate a simple yet effective attack against a locked circuit that does not preserve test response, and validate the effectiveness of our locking strategy on IWLS 2005 benchmarks.
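The idea of multiplexer-based locking can be shown on a single gate: a key-controlled MUX selects between the true function and a decoy, so only the correct key restores the intended circuit. A minimal sketch; the key bit value and the decoy function are arbitrary choices for illustration:

```python
def locked_and(a, b, key_bit):
    """A 2-input AND locked by one key-controlled multiplexer.

    With the correct key bit (1 here, an arbitrary choice) the MUX routes
    the true AND output; a wrong key exposes a decoy OR function instead."""
    true_out = a & b
    decoy_out = a | b                    # decoy path seen with the wrong key
    return true_out if key_bit == 1 else decoy_out
```

An unactivated chip holding the wrong key thus computes a different function on some inputs, which is what prevents useful overproduction.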

152 citations


Journal ArticleDOI
TL;DR: This paper develops an analysis-of-variance-based stochastic circuit/microelectromechanical-systems simulator to efficiently extract surrogate models at the low level, and employs tensor-train decomposition at the high level to construct the basis functions and Gauss quadrature points.
Abstract: Hierarchical uncertainty quantification can reduce the computational cost of stochastic circuit simulation by employing spectral methods at different levels. This paper presents an efficient framework to hierarchically simulate some challenging stochastic circuits/systems that include high-dimensional subsystems. Due to the high parameter dimensionality, it is challenging both to extract surrogate models at the low level of the design hierarchy and to handle them in the high-level simulation. In this paper, we develop an efficient analysis-of-variance-based stochastic circuit/microelectromechanical systems simulator to extract the surrogate models at the low level. In order to avoid the curse of dimensionality, we employ tensor-train decomposition at the high level to construct the basis functions and Gauss quadrature points. As a demonstration, we verify our algorithm on a stochastic oscillator with four MEMS capacitors and 184 random parameters. This challenging example is efficiently simulated by our simulator at the cost of only 10 min in MATLAB on a regular personal computer.
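Tensor-train decomposition, which the high-level step relies on to dodge the curse of dimensionality, factors a d-way tensor into a chain of small 3-way cores. A minimal TT-SVD sketch via sequential truncated SVDs (the textbook construction, not the authors' simulator):

```python
import numpy as np

def tt_svd(tensor, tol=1e-10):
    """Tensor-train decomposition by sequential truncated SVDs (TT-SVD).

    Returns 3-way cores G_k of shape (r_{k-1}, n_k, r_k); storage grows
    linearly in the number of dimensions when the TT-ranks stay small."""
    shape = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(shape[0], -1)
    for n in shape[:-1]:
        mat = mat.reshape(r_prev * n, -1)
        U, s, Vt = np.linalg.svd(mat, full_matrices=False)
        r = max(1, int(np.sum(s > tol * s[0])))   # truncation rank
        cores.append(U[:, :r].reshape(r_prev, n, r))
        mat = s[:r, None] * Vt[:r]                # carry remainder forward
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the train of cores back into a full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([out.ndim - 1], [0]))
    return out.reshape([core.shape[1] for core in cores])
```

A separable (rank-1) tensor compresses exactly, with all TT-ranks equal to 1; this is the structure that makes high-dimensional quadrature tractable.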

126 citations


Journal ArticleDOI
TL;DR: A power-efficient framework for analog approximate computing with emerging metal-oxide resistive switching random-access memory (RRAM) devices is proposed: a programmable RRAM-based approximate computing unit (RRAM-ACU) accelerates approximate computation, and a scalable approximate computing framework is built on top of it.
Abstract: Approximate computing is a promising design paradigm for better performance and power efficiency. In this paper, we propose a power-efficient framework for analog approximate computing with the emerging metal-oxide resistive switching random-access memory (RRAM) devices. A programmable RRAM-based approximate computing unit (RRAM-ACU) is introduced first to accelerate approximate computation, and a scalable approximate computing framework is then proposed on top of the RRAM-ACU. In order to program the RRAM-ACU efficiently, we also present a detailed configuration flow, which includes a customized approximator training scheme, an approximator-parameter-to-RRAM-state mapping algorithm, and an RRAM state tuning scheme. Finally, the proposed RRAM-based computing framework is modeled at the system level. A predictive compact model is developed to estimate the configuration overhead of the RRAM-ACU and help explore the application scenarios of RRAM-based analog approximate computing. The simulation results on a set of diverse benchmarks demonstrate that, compared with an x86-64 CPU at 2 GHz, the RRAM-ACU is able to achieve 4.06–196.41× speedup and power efficiency of 24.59–567.98 GFLOPS/W with quality loss of 8.72% on average. The implementation of the hierarchical model and X application demonstrates that the proposed RRAM-based approximate computing framework can achieve >12.8× higher power efficiency than its pure digital implementation counterparts (CPU, graphics processing unit, and field-programmable gate arrays).

114 citations


Journal ArticleDOI
TL;DR: This paper presents a low-power, small-footprint hybrid RO PUF with very high temperature stability, which makes it an ideal candidate for lightweight applications.
Abstract: Ring oscillator (RO)-based physical unclonable function (PUF) is resilient against noise impacts, but its response is susceptible to temperature variations. This paper presents a low-power and small-footprint hybrid RO PUF with a very high temperature stability, which makes it an ideal candidate for lightweight applications. The negative temperature coefficient of the low-power subthreshold operation of current-starved inverters is exploited to mitigate the variations of differential RO frequencies with temperature. The new architecture uses conspicuously simplified circuitries to generate and compare a large number of pairs of RO frequencies. The proposed nine-stage hybrid RO PUF was fabricated using GlobalFoundries 65-nm CMOS technology. The PUF occupies only 250 μm² of chip area and consumes only 32.3 μW per challenge-response pair at 1.2 V and 230 MHz. The measured average and worst-case reliability of its responses are 99.84% and 97.28%, respectively, over a wide temperature range from −40 to 120 °C.
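The response-generation principle is simple to simulate: each bit comes from comparing a pair of RO frequencies, and reliability is the fraction of repeated noisy evaluations that match the noise-free response. An illustrative model (additive Gaussian noise stands in for environmental disturbance; the fabricated chip is of course richer):

```python
import random

def ro_response(freq_pairs, noise_sigma, rng):
    """One evaluation: bit i is 1 iff the first RO of pair i measures
    faster under additive Gaussian measurement noise."""
    return [1 if fa + rng.gauss(0, noise_sigma) > fb + rng.gauss(0, noise_sigma)
            else 0
            for fa, fb in freq_pairs]

def reliability(freq_pairs, noise_sigma, trials=100, seed=3):
    """Fraction of repeated evaluations matching the noise-free response."""
    rng = random.Random(seed)
    golden = [1 if fa > fb else 0 for fa, fb in freq_pairs]
    hits = sum(ro_response(freq_pairs, noise_sigma, rng) == golden
               for _ in range(trials))
    return hits / trials
```

Pairs with a wide frequency gap yield stable bits; temperature drift shrinks that gap, which is exactly what the paper's subthreshold temperature compensation counteracts.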

113 citations


Journal ArticleDOI
TL;DR: This paper shows that even protocols in which it would be computationally infeasible to compute enough challenge and response pairs for a direct machine learning attack can be attacked using machine learning.
Abstract: Physical unclonable functions (PUFs) have emerged as a promising solution for securing resource-constrained embedded devices such as RFID tokens. PUFs use the inherent physical differences of every chip to either securely authenticate the chip or generate cryptographic keys without the need of nonvolatile memory. However, PUFs have shown to be vulnerable to model building attacks if the attacker has access to challenge and response pairs. In these model building attacks, machine learning is used to determine the internal parameters of the PUF to build an accurate software model. Nevertheless, PUFs are still a promising building block and several protocols and designs have been proposed that are believed to be resistant against machine learning attacks. In this paper, we take a closer look at two such protocols, one based on reverse fuzzy extractors and one based on pattern matching. We show that it is possible to attack these protocols using machine learning despite the fact that an attacker does not have access to direct challenge and response pairs. The introduced attacks demonstrate that even highly obfuscated responses can be used to attack PUF protocols. Hence, this paper shows that even protocols in which it would be computationally infeasible to compute enough challenge and response pairs for a direct machine learning attack can be attacked using machine learning.
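The underlying modeling attack is easiest to see on a plain, unprotected arbiter PUF: under the standard additive delay model, a response is the sign of a linear function of parity-transformed challenge bits, so a simple learner clones the device from CRPs. A sketch with a perceptron (the paper attacks far more obfuscated protocols; all sizes and seeds below are arbitrary):

```python
import numpy as np

def parity_features(challenges):
    """Standard arbiter-PUF transform: phi_i = prod_{j>=i} (1 - 2*c_j),
    plus a constant bias feature."""
    signs = 1 - 2 * challenges                       # {0,1} -> {+1,-1}
    phi = np.cumprod(signs[:, ::-1], axis=1)[:, ::-1]
    return np.hstack([phi, np.ones((challenges.shape[0], 1))])

def model_building_attack(stages=32, n_crps=4000, epochs=50, seed=7):
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=stages + 1)             # secret delay vector
    chall = rng.integers(0, 2, size=(n_crps, stages))
    phi = parity_features(chall)
    resp = (phi @ w_true > 0).astype(int)            # observed CRPs
    train, test = slice(0, 3000), slice(3000, None)
    w = np.zeros(stages + 1)                         # attacker's model
    for _ in range(epochs):                          # perceptron training
        for x, y in zip(phi[train], resp[train]):
            pred = 1 if x @ w > 0 else 0
            w += (y - pred) * x                      # mistake-driven update
    return np.mean((phi[test] @ w > 0).astype(int) == resp[test])
```

A few thousand CRPs suffice to predict unseen responses with high accuracy; the paper's contribution is showing that even protocols which hide direct CRPs remain learnable.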

107 citations


Journal ArticleDOI
TL;DR: This work utilizes feedback learning and redundant clip removal to reduce false alarms, and outperforms the winner of the 2012 CAD contest at the International Conference on Computer-Aided Design (ICCAD) in both accuracy and false alarm rate.
Abstract: Because of the widening sub-wavelength lithography gap in advanced fabrication technology, lithography hotspot detection has become an essential task in design for manufacturability. Unlike current state-of-the-art works, which unite pattern matching and machine-learning engines, we fully exploit the strengths of machine learning using novel techniques. By combining topological classification and critical feature extraction, our hotspot detection framework achieves very high accuracy. Furthermore, to speed up the evaluation, we verify only possible layout clips instead of scanning the full layout. We utilize feedback learning and present redundant clip removal to reduce false alarms. Experimental results show that the proposed framework is very accurate and demonstrates rapid training convergence. Moreover, our framework outperforms the winner of the 2012 CAD contest at the International Conference on Computer-Aided Design (ICCAD) in accuracy and false alarm rate.

Journal ArticleDOI
TL;DR: A power aware built-in self-test solution to detect and locate faults in memristors is developed and a hybrid diagnosis scheme that uses a combination of sneak-path and March testing to reduce diagnosis time is proposed.
Abstract: Memristors are an attractive option for use in future memory architectures but are prone to high defect densities due to the nondeterministic nature of nanoscale fabrication. Several works discuss memristor fault models and testing. However, none of them considers the memristor as a multilevel cell (MLC). The ability of memristors to function as an MLC allows for extremely dense, low-power memories. Using a memristor as an MLC introduces fault mechanisms that cannot occur in typical two-level memory cells. In this paper, we develop fault models for memristor-based MLC crossbars. The typical approach to testing a memory subsystem entails testing one memory cell at a time. However, this testing strategy is time consuming and does not scale for dense, memristor memories. We propose an efficient testing technique that exploits sneak-paths inherent in crossbar memories to test several memory cells simultaneously. In this paper, we integrate solutions for detecting and locating faults in memristors. We develop a power aware built-in self-test solution to detect these faults. We also propose a hybrid diagnosis scheme that uses a combination of sneak-path and March testing to reduce diagnosis time. The proposed schemes enable and leverage sneak-paths during fault detection and diagnosis modes, while disabling sneak-paths during normal operation. The proposed hybrid scheme reduces fault detection and diagnosis time by 24.69% and 28%, respectively, compared to traditional March tests.
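March tests, which the hybrid diagnosis scheme combines with sneak-path testing, sweep the array with fixed read/write element sequences. A sketch of March C- on a simulated bit-addressable memory with an injected stuck-at fault (a single-level toy; the paper's MLC fault modes and sneak-path tests are richer):

```python
class FaultyMemory:
    """Bit-addressable memory with one optional stuck-at fault."""
    def __init__(self, size, stuck_at=None):
        self.bits = [0] * size
        self.stuck_at = stuck_at            # (address, forced value) or None

    def write(self, addr, val):
        self.bits[addr] = val

    def read(self, addr):
        if self.stuck_at and addr == self.stuck_at[0]:
            return self.stuck_at[1]         # fault overrides stored value
        return self.bits[addr]

def march_c_minus(mem, n):
    """March C-: {up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0);
    up(r0)}. Returns addresses whose reads mismatched expectations."""
    fails = set()
    for a in range(n):                      # up(w0)
        mem.write(a, 0)
    for order, expect, write_back in [
        (range(n), 0, 1), (range(n), 1, 0),
        (range(n - 1, -1, -1), 0, 1), (range(n - 1, -1, -1), 1, 0),
    ]:
        for a in order:
            if mem.read(a) != expect:
                fails.add(a)
            mem.write(a, write_back)
    for a in range(n):                      # final up(r0)
        if mem.read(a) != 0:
            fails.add(a)
    return sorted(fails)
```

Testing one cell at a time like this is exactly the serial bottleneck the sneak-path technique avoids by exercising several cells per operation.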

Journal ArticleDOI
TL;DR: This paper investigates a novel attack vector against cryptography realized on FPGAs, which poses a serious threat to real-world applications and demonstrates how a targeted bitstream modification can seriously weaken cryptographic algorithms.
Abstract: This paper investigates a novel attack vector against cryptography realized on FPGAs, which poses a serious threat to real-world applications. We demonstrate how a targeted bitstream modification can seriously weaken cryptographic algorithms, which we show with the examples of AES and 3-DES. The attack is performed by modifying the FPGA bitstream that configures the hardware elements during initialization. Recently, it has been shown that cloning of FPGA designs is feasible, even if the bitstream is encrypted. However, due to its proprietary file format, a meaningful modification is challenging. While some previous work addressed bitstream reverse-engineering, so far it has not been evaluated how difficult it is to detect and modify cryptographic elements. We outline two possible practical attacks that have serious security implications. We target the S-boxes of block ciphers that can be implemented in look-up tables or stored as precomputed set of values in the memory of the FPGA. We demonstrate that it is possible to detect and apply meaningful changes to cryptographic elements inside an unknown, proprietary, and undocumented bitstream. Our proposed attack does not require any knowledge of the internal routing. Furthermore, we show how an AES key can be revealed within seconds. Finally, we discuss countermeasures that can raise the bar for an adversary to successfully perform this kind of attack.
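The table-replacement variant of the attack can be illustrated directly: an S-box stored as a 256-byte lookup table is easy to spot by its known contents and easy to overwrite. A sketch on a raw byte blob (a real bitstream adds framing, offsets, and checksums that this ignores):

```python
# First eight bytes of the AES S-box, used as a search signature.
AES_SBOX_PREFIX = bytes([0x63, 0x7C, 0x77, 0x7B, 0xF2, 0x6B, 0x6F, 0xC5])

def patch_sbox(bitstream: bytes) -> bytes:
    """Locate a 256-byte AES S-box table by its known prefix and overwrite
    it with the identity map, turning SubBytes into a no-op and gutting
    the cipher's nonlinearity."""
    idx = bitstream.find(AES_SBOX_PREFIX)
    if idx < 0:
        raise ValueError("no S-box table found")
    return bitstream[:idx] + bytes(range(256)) + bitstream[idx + 256:]
```

With SubBytes reduced to the identity, the remaining AES rounds are affine in the key, which is what lets the key be recovered within seconds.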

Journal ArticleDOI
TL;DR: This paper develops two different mathematical attacks on previously proposed lightweight PUF circuits, namely composite PUF and the multibit output lightweight secure PUF (LSPUF), and elucidate a special property of the output network of LSPUF to show how it can be leveraged by an adversary to perform an intelligent model building attack.
Abstract: Due to their unique physical properties, physically unclonable functions (PUFs) have been widely proposed as versatile cryptographic primitives. It is desirable that silicon PUF circuits should be lightweight, i.e., have low hardware resource requirements. However, it is also of primary importance that such demands for low hardware overhead should not compromise the security aspects of PUF circuits. In this paper, we develop two different mathematical attacks on previously proposed lightweight PUF circuits, namely the composite PUF and the multibit output lightweight secure PUF (LSPUF). We show that the independence of various components of the composite PUF can be used to develop divide-and-conquer attacks which can determine the responses to unknown challenges. We reduce the complexity of the attack using a machine learning-based modeling analysis. In addition, we elucidate a special property of the output network of the LSPUF to show how such a feature can be leveraged by an adversary to perform an intelligent model building attack. The theoretical inferences are validated through experimental results. More specifically, the proposed attacks on the composite PUF are validated using the challenge-response pairs (CRPs) from its field programmable gate array (FPGA) implementation, and the attack on the LSPUF is validated using the CRPs of both simulated and FPGA-implemented LSPUFs.

Journal ArticleDOI
TL;DR: This paper explores the general framework and prospects for on-chip and off-chip wireless interconnects implemented for high-performance computing (HPC) systems in the context of micro power wireless design.
Abstract: This paper explores the general framework and prospects for on-chip and off-chip wireless interconnects implemented for high-performance computing (HPC) systems in the context of micro power wireless design. HPC interconnects demand very high (≥ 10 Gb/s) transmission rates using ultraefficient (~1 pJ/bit) transceivers over extremely short (≤ 100 cm) ranges. In an attempt to design such wireless interconnects, first a model for the wireless communication channel properties is developed. The use of CMOS-based energy-efficient on–off keying (OOK) transceiver architectures operating in the 60–90 GHz bands is considered as a practical solution. In order to address the strict performance requirements of wireless HPC interconnects, and taking advantage of recent developments in device scaling, compact low-power and innovative circuits based on novel double-gate MOSFETs (DG-MOSFETs) are proposed in the implementation of the architecture. The performance of a compact low-noise amplifier (LNA) design using common source (CS) inductive degeneration with 32 nm DG-MOSFETs is investigated by quantitative analysis and simulation. The proposed inductor-less two-stage cascode cascade LNA is optimized for 90 GHz operation and has the advantage of gain switching over its CMOS counterpart without the use of additional switching transistors, which makes it remarkably power efficient and faster. As further examples of efficient and compact DG-MOSFET circuits for OOK transceiver design, a three-stage CS 5 dB tunable power amplifier operating up to 90 GHz and a novel 90 GHz voltage-controlled oscillator are also presented. This is followed by the proposal of an array of four monopole antennas studied using a full-wave EM solver.

Journal ArticleDOI
TL;DR: This paper presents a novel protection method for fine-grained access management in complex reconfigurable scan networks based on a challenge-response authentication protocol that scales well with the number of protected instruments and offers a high level of security.
Abstract: Modern very large scale integration designs incorporate a high amount of instrumentation that supports post-silicon validation and debug, volume test and diagnosis, as well as in-field system monitoring and maintenance. Reconfigurable scan architectures, as allowed by the novel IEEE Std 1149.1-2013 (JTAG) and IEEE Std 1687-2014 [Internal JTAG (IJTAG)], emerge as a scalable mechanism for access to such on-chip instruments. While the on-chip instrumentation is crucial for meeting quality, dependability, and time-to-market goals, it is prone to abuse and threatens system safety and security. A secure access management method is mandatory to assure that critical instruments be accessible to authorized entities only. This paper presents a novel protection method for fine-grained access management in complex reconfigurable scan networks based on a challenge-response authentication protocol. The target scan network is extended with an authorization instrument and secure segment insertion bits that together control the accessibility of individual instruments. To the best of the authors’ knowledge, this is the first fine-grained access management scheme that scales well with the number of protected instruments and offers a high level of security. Compared with recent state-of-the-art techniques, this scheme is more favorable with respect to implementation cost, performance overhead, and provided security level.
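The challenge-response core of such an access scheme is compact enough to sketch. HMAC-SHA256 is an illustrative choice here, and the IJTAG segment-insertion-bit wiring that physically gates the scan segment is abstracted away:

```python
import hashlib
import hmac
import secrets

def issue_challenge():
    """Fresh random nonce issued by the on-chip authorization instrument."""
    return secrets.token_bytes(16)

def respond(shared_key, challenge, instrument_id):
    """Authorized user proves knowledge of the key for one instrument by
    binding the nonce to the instrument identifier (0-255 here)."""
    return hmac.new(shared_key, challenge + bytes([instrument_id]),
                    hashlib.sha256).digest()

def unlock(shared_key, challenge, instrument_id, response):
    """On-chip check; constant-time compare resists timing probes."""
    expected = respond(shared_key, challenge, instrument_id)
    return hmac.compare_digest(expected, response)
```

Binding the instrument identifier into the MAC is what makes the access fine-grained: a response authorizing one instrument is useless for any other.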

Journal ArticleDOI
TL;DR: An isomorphism algorithm is developed that reduces a given set of circuits to its unique representatives, one of the first methodologies to address this issue, and the results demonstrate the claimed feasibility and applicability of the synthesis framework in general and in the context of system design.
Abstract: This paper proposes a new methodology for automated analog circuit synthesis, aiming to address the challenges known from other analog synthesis approaches: unsatisfactory time predictability due to stochastic-driven circuit generation methods, the dereliction of the creative part during the design process, and the inflexibility leading to synthesis tools which mostly handle just one circuit class. This contribution presents the underlying concepts and ideas to provide the predictability, flexibility, and creative freedom needed to elevate analog circuit design to the next step. A circuit generation algorithm is presented which allows a full design-space exploration. Furthermore, an isomorphism algorithm is developed which reduces a given set of circuits to its unique representatives, being one of the first methodologies addressing this issue. Thus, the algorithm handles vast amounts of circuits in a very efficient manner. The results demonstrate the claimed feasibility and applicability of the synthesis framework in general and in the context of system design.
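Reducing a circuit set to its unique representatives means deduplicating up to isomorphism. The brute-force version of that idea, with circuits viewed as graphs, fits in a few lines; it enumerates all n! relabelings, so it only works for tiny examples, whereas the paper's algorithm is built to scale:

```python
from itertools import permutations

def canonical_form(edges, n):
    """Lexicographically smallest relabeled edge list: two circuits (viewed
    as graphs on n nodes) are isomorphic iff their canonical forms match."""
    best = None
    for perm in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((perm[a], perm[b])))
                                 for a, b in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best

def unique_circuits(circuits, n):
    """Reduce a set of circuits to its unique representatives."""
    seen, unique = set(), []
    for edges in circuits:
        form = canonical_form(edges, n)
        if form not in seen:
            seen.add(form)
            unique.append(edges)
    return unique
```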

Journal ArticleDOI
TL;DR: Three innovative low-overhead approaches for run-time Trojan detection are proposed that exploit the thermal sensors already available in many modern systems to detect deviations in power/thermal profiles caused by Trojan activation.
Abstract: The hardware Trojan threat has motivated development of Trojan detection schemes at all stages of the integrated circuit (IC) lifecycle. While the majority of existing schemes focus on ICs at test-time, there are many unique advantages offered by post-deployment/run-time Trojan detection. However, run-time approaches have been underutilized, with prior work highlighting the challenges of implementing them with limited hardware resources. In this paper, we propose three innovative low-overhead approaches for run-time Trojan detection which exploit the thermal sensors already available in many modern systems to detect deviations in power/thermal profiles caused by Trojan activation. The first one is a local sensor-based approach that uses information from thermal sensors together with hypothesis testing to make a decision. The second one is a global approach that exploits correlation between sensors and keeps track of the IC's thermal profile using a Kalman filter (KF). The third approach incorporates leakage power into the system dynamic model and applies an extended KF (EKF) to track the IC's thermal profile. Simulation results using state-of-the-art tools on ten publicly available Trojan benchmarks verify that all three proposed approaches can detect active Trojans quickly and with few false positives. Among the three approaches, EKF is flawless on the ten benchmarks tested but would require the most overhead.
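The KF-based idea can be sketched for a single sensor: track the thermal reading with a random-walk model and flag any sample whose innovation is too many standard deviations out. All noise parameters below are illustrative, and the paper's KF/EKF track many correlated sensors at once:

```python
def kf_trojan_monitor(temps, q=0.01, r=0.25, threshold=3.0):
    """Scalar Kalman filter over one thermal sensor trace; returns the
    indices of samples flagged as possible Trojan activation.

    q: process noise variance (random-walk drift), r: measurement noise
    variance, threshold: alarm bound in innovation standard deviations."""
    x, p = temps[0], 1.0
    alarms = []
    for k, z in enumerate(temps[1:], start=1):
        p_pred = p + q                       # predict
        innovation = z - x
        s = p_pred + r                       # innovation variance
        if abs(innovation) > threshold * s ** 0.5:
            alarms.append(k)                 # anomaly: skip the update so
            continue                         # outliers don't corrupt state
        gain = p_pred / s
        x += gain * innovation               # update state estimate
        p = (1 - gain) * p_pred
    return alarms
```

A sudden sustained rise, as a Trojan's extra switching power would produce, blows past the innovation bound immediately, while slow ambient drift is absorbed by the filter.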

Journal ArticleDOI
TL;DR: An electrostatics-based placement algorithm for large-scale mixed-size circuits (ePlace-MS) is proposed, which outperforms all related works in the literature with better quality and efficiency.
Abstract: We propose an electrostatics-based placement algorithm for large-scale mixed-size circuits (ePlace-MS). ePlace-MS is generalized, flat, analytic, and nonlinear. The density modeling method eDensity is extended to handle mixed-size placement. We conduct detailed analysis on the correctness of the gradient formulation and the numerical solution, as well as the rationale of dc removal and the advantages over prior density functions. Nesterov’s method is used as the nonlinear solver, which shows high yet stable performance over mixed-size circuits. The step length is set as the inverse of the Lipschitz constant of the gradient function, while we develop a backtracking method to prevent overestimation. An approximated nonlinear preconditioner is developed to minimize the topological and physical differences between large macros and standard cells. Besides, we devise a simulated annealer to legalize the layout of macros and use a second-phase global placement to reoptimize the standard cell layout. All the above innovations are integrated into our mixed-size placement prototype ePlace-MS, which outperforms all the related works in the literature with better quality and efficiency. Compared to the leading-edge mixed-size placer NTUplace3, ePlace-MS produces up to 22.98% and on average 8.22% shorter wirelength over all 16 modern mixed-size benchmark circuits with the same runtime.
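Nesterov's method with a 1/L step, the nonlinear solver described above, is compact enough to sketch in full. The quadratic test objective and the fixed, known Lipschitz constant below stand in for the placer's wirelength-plus-density objective and its backtracked estimate:

```python
def nesterov(grad, x0, lipschitz, iters=100):
    """Nesterov's accelerated gradient with step length 1/L (the placer
    additionally backtracks when its Lipschitz estimate is too
    optimistic; that refinement is omitted here)."""
    x, y, a_prev = list(x0), list(x0), 1.0
    for _ in range(iters):
        g = grad(y)
        x_new = [yi - gi / lipschitz for yi, gi in zip(y, g)]
        a = (1.0 + (1.0 + 4.0 * a_prev * a_prev) ** 0.5) / 2.0
        y = [xn + ((a_prev - 1.0) / a) * (xn - xo)   # momentum extrapolation
             for xn, xo in zip(x_new, x)]
        x, a_prev = x_new, a
    return x
```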

Journal ArticleDOI
TL;DR: An efficient compiler framework for cache bypassing on GPUs is proposed and efficient algorithms that judiciously select global load instructions for cache access or bypass are presented.
Abstract: Graphics processing units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs only employ scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for those general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in the recent generations of GPUs. The caches on GPUs are highly configurable. The programmer or compiler can explicitly control cache access or bypass for global load instructions. This highly configurable feature of GPU caches opens up the opportunities for optimizing the cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate the cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we present techniques to explore the unified cache and shared memory design space. We integrate our techniques into an automatic compiler framework that leverages parallel thread execution instruction set architecture to enable cache bypassing for GPUs. Experimental evaluation on an NVIDIA GTX680 using a variety of applications demonstrates that, compared to cache-all and bypass-all solutions, our techniques improve the performance by 4.6% to 13.1% for a 16 KB L1 cache.
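At its core, the compile-time decision is a ranking problem: loads whose data the metrics estimate will be reused often should go through the cache, the rest should bypass it. A deliberately simplified stand-in for the paper's reuse and memory-traffic metrics (load names and counts are hypothetical):

```python
def choose_bypass(load_reuse_estimates, cache_capacity):
    """Greedy policy: cache the loads with the highest estimated reuse,
    bypass the rest. `load_reuse_estimates` maps a global-load name to
    its estimated reuse count; `cache_capacity` bounds how many loads
    may be marked for cache access."""
    ranked = sorted(load_reuse_estimates, key=load_reuse_estimates.get,
                    reverse=True)
    cached = set(ranked[:cache_capacity])
    return {load: ("cache" if load in cached else "bypass")
            for load in load_reuse_estimates}
```

In the real framework the per-load marking is then emitted as PTX cache-operator annotations rather than a Python dictionary.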

Journal ArticleDOI
TL;DR: This work proposes two new TIs of AES that, starting from a common previously published implementation, illustrate possible trade-offs and provides concrete ASIC implementation results for all three designs using the same library, and evaluates the practical security of all three Designs on the same FPGA platform.
Abstract: Embedded cryptographic devices are vulnerable to power analysis attacks. Threshold implementations (TIs) provide provable security against first-order power analysis attacks for hardware and software implementations. Like masking, the approach relies on secret sharing, but it differs in the implementation of logic functions. While masking can fail to provide protection due to glitches in the circuit, TIs rely on few assumptions about the hardware and are fully compatible with standard design flows. We investigate two important properties of TIs in detail and point out interesting trade-offs between circuit area and randomness requirements. We propose two new TIs of AES that, starting from a common previously published implementation, illustrate possible trade-offs. We provide concrete ASIC implementation results for all three designs using the same library, and we evaluate the practical security of all three designs on the same FPGA platform. Our analysis allows us to directly compare the security provided by the different trade-offs, and to quantify the associated hardware cost.

Journal ArticleDOI
TL;DR: A systematic study on triple patterning layout decomposition problem, which is shown to be NP-hard, and a novel semidefinite programming (SDP)-based algorithm that can achieve great speedup even compared with accelerated ILP.
Abstract: As minimum feature size and pitch spacing further scale down, triple patterning lithography is a likely 193 nm extension along the paradigm of double patterning lithography for the 14-nm technology node. Layout decomposition, which divides the input layout into several masks to minimize the conflict and stitch numbers, is a crucial design step for double/triple patterning lithography. In this paper, we present a systematic study of the triple patterning layout decomposition problem, which is shown to be NP-hard. Because of the NP-hardness, the runtime required to solve it exactly increases dramatically with the problem size. We first propose a set of graph division techniques to reduce the problem size. Then, we develop an integer linear programming (ILP) formulation to solve it. For large layouts, even with the graph-division techniques, ILP may still suffer from serious runtime overhead. To achieve a better trade-off between runtime and performance, we present a novel semidefinite programming (SDP)-based algorithm. A subsequent mapping process translates the SDP solutions into the final decomposition solutions. Experimental results show that the graph division can reduce runtime dramatically. In addition, the SDP-based algorithm can achieve great speedup even compared with accelerated ILP, with very comparable results in terms of the stitch number and the conflict number.
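At its core, the decomposition problem the abstract describes assigns each layout feature to one of three masks so that conflicting features land on different masks, i.e., 3-coloring of a conflict graph. The brute-force stand-in below (our simplification; the paper uses ILP and an SDP relaxation with graph division to scale) makes the objective concrete:

```python
from itertools import product

def best_3coloring(n_features, conflict_edges):
    """Assign each of n_features layout features to one of 3 masks,
    minimizing the number of conflict edges whose endpoints share a
    mask. Brute force is only viable for tiny graphs; the paper
    scales via graph division plus ILP or an SDP relaxation."""
    best, best_cost = None, float("inf")
    for masks in product(range(3), repeat=n_features):
        cost = sum(masks[u] == masks[v] for u, v in conflict_edges)
        if cost < best_cost:
            best, best_cost = masks, cost
    return best, best_cost

# usage: four mutually conflicting features (K4) cannot be 3-colored
# conflict-free, so exactly one conflict must remain
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
masks, conflicts = best_3coloring(4, k4)  # conflicts == 1
```

Stitch insertion, which the paper also optimizes, would let a single feature be split across masks and is omitted here.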

Journal ArticleDOI
Xiaoqing Xu1, Brian Cline, Greg Yeric, Bei Yu1, David Z. Pan1 
TL;DR: A coherent framework is proposed that uses depth-first search, mixed integer linear programming, and a backtracking method to enable LELE-friendly Via-1 design and simultaneously optimize SADP-based local pin access and within-cell connections; it improves pin access of the SCs and maximizes pin access flexibility for routing.
Abstract: Self-aligned double patterning (SADP) is being considered for use at the 10-nm technology node and below for routing layers with pitches down to ~50 nm because it has better line edge roughness and overlay control compared to other multiple patterning candidates. To date, most of the SADP-related literature has focused on enabling SADP-legal routing in physical design tools, while few attempts have been made to address the impact SADP routing has on local, standard cell (SC) I/O pin access. At the same time, via layers are used to connect the local SADP routing layers to the I/O pins on lower metal layers. Due to the high via density on the Via-1 layer, litho-etch-litho-etch (LELE)-aware Via-1 design becomes a necessity to achieve legal pin access at the SC level. In this paper, we present the first study on SADP-aware pin access and layout optimization at the SC level. Accounting for SADP-specific and Via-1 design rules, we propose a coherent framework that uses depth-first search, mixed integer linear programming, and a backtracking method to enable LELE-friendly Via-1 design and simultaneously optimize SADP-based local pin access and within-cell connections. Our experimental results show that, compared with the conventional approach, our framework effectively improves pin access of the SCs and maximizes pin access flexibility for routing.

Journal ArticleDOI
TL;DR: Analysis of the entire processor gives more insight into the relative contributions of combinational and sequential SER, and can assist circuit designers to adopt effective hardening techniques to reduce the overall SER while meeting the required power and performance constraints.
Abstract: Radiation-induced soft errors have become a key challenge in advanced commercial electronic components and systems. We present the results of a soft error rate (SER) analysis of an embedded processor. Our SER analysis platform accurately models generation, propagation, and masking effects starting from a technology response model derived using TCAD simulations at the device level all the way up to application-level masking. The platform employs a combination of accurate models at the device level, analytical error propagation at the gate level, and fault emulation at the architecture/application level to provide the detailed contribution of each component (flip-flops, combinational gates, and SRAMs) to the overall SER. At each stage in the modeling hierarchy, an appropriate level of abstraction is used to propagate the effect of errors to the next higher level. Unlike previous studies, which are based on very simple test chips, our analysis of the entire processor gives more insight into the relative contributions of combinational and sequential SER. The results of this analysis can assist circuit designers in adopting effective hardening techniques to reduce the overall SER while meeting the required power and performance constraints.

Journal ArticleDOI
TL;DR: A novel computer architecture is proposed, called homomorphically encrypted one instruction computation, which contrary to previous work in the area does not require a secret key installed inside the microprocessor chip, and leverages the powerful properties of homomorphic encryption combined with the simplicity of one instruction set computing.
Abstract: Outsourcing computation to the cloud has recently become a very attractive option for enterprises and consumers, due mostly to reduced cost and extensive scalability. At the same time, however, concerns about the privacy of the data entrusted to cloud providers keep rising. To address these concerns and thwart potential attackers, cloud providers today resort to numerous security controls as well as data encryption. Since the actual computation is still unencrypted inside cloud microprocessor chips, it is only a matter of time until new attacks and side channels are devised to leak sensitive information. To address the challenge of securing general-purpose computation inside microprocessor chips, we propose a novel computer architecture and present a complete framework for general-purpose encrypted computation without shared keys, enabling secure data processing. This new architecture, called homomorphically encrypted one instruction computation, contrary to previous work in the area, does not require a secret key installed inside the microprocessor chip. Instead, it leverages the powerful properties of homomorphic encryption combined with the simplicity of one instruction set computing. The proposed framework introduces: 1) an RTL implementation for reconfigurable hardware and 2) a ready-to-deploy virtual machine, which can be readily ported to existing server processor architectures.
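One instruction set computing, which the abstract pairs with homomorphic encryption, is typified by machines such as subleq ("subtract and branch if less than or equal to zero"). The toy, entirely unencrypted interpreter below illustrates only the computation substrate; the homomorphic layer, and any correspondence to the paper's actual ISA or encodings, is omitted and assumed.

```python
def run_subleq(mem, pc=0, max_steps=10_000):
    """Interpreter for the canonical one-instruction machine subleq:
    mem[b] -= mem[a]; branch to c if the result is <= 0. Halts when
    the program counter leaves memory. An encrypted variant would
    operate on ciphertexts; that layer is omitted here."""
    steps = 0
    while 0 <= pc <= len(mem) - 3 and steps < max_steps:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        steps += 1
    return mem

# usage: addition built from three subleq instructions
program = [9, 11, 3,    # Z -= A   (Z becomes -A; branch falls through)
           11, 10, 6,   # B -= Z   (B becomes B + A)
           11, 11, -1,  # Z -= Z   (clear Z; branch to -1 halts)
           20, 22, 0]   # data: A=20 at addr 9, B=22 at 10, Z=0 at 11
result = run_subleq(program)  # result[10] == 42
```

Because every program is a uniform stream of identical subtract-and-branch triples, the single instruction composes naturally with homomorphic evaluation of its operands.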

Journal ArticleDOI
TL;DR: This paper enables DRV PUFs by proposing a DRV-based hash function that is insensitive to temperature, and introduces a new circuit-level reliability knob as an alternative to error correcting codes.
Abstract: Physical unclonable functions (PUFs) are circuits that produce outputs determined by random physical variations from fabrication. The PUF studied in this paper utilizes the variation sensitivity of static random access memory (SRAM) data retention voltage (DRV), the minimum voltage at which each cell can retain state. Prior work shows that DRV can uniquely identify circuit instances with 28% greater success than SRAM power-up states that are used in PUFs [1] . However, DRV is highly sensitive to temperature, and until now this has made it unreliable and unsuitable for use in a PUF. In this paper, we enable DRV PUFs by proposing a DRV-based hash function that is insensitive to temperature. The new hash function, denoted DRV-based hashing (DH), is reliable across temperatures because it utilizes the temperature-insensitive ordering of DRVs across cells, instead of using the DRVs in absolute terms. To evaluate the security and performance of the DRV PUF, we use DRV measurements from commercially available SRAM chips, and use data from a novel DRV prediction algorithm. The prediction algorithm uses machine learning for fast and accurate simulation-free estimation of any cell’s DRV, and the prediction error in comparison to circuit simulation has a standard deviation of 0.35 mV. We demonstrate the DRV PUF using two applications—secret key generation and identification. In secret key generation, we introduce a new circuit-level reliability knob as an alternative to error correcting codes. In the identification application, our approach is compared to prior work and shown to result in a smaller false-positive identification rate for any desired true-positive identification rate.
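The key idea of the temperature-insensitive hash, using the ordering of DRVs across cells rather than their absolute values, can be sketched as follows. The pairwise-comparison scheme and the numbers below are a hypothetical simplification, not the paper's exact DH construction or measured data.

```python
def dh_response(drv_mv, pairs):
    """Ordering-based response bits: compare the DRVs of cell pairs
    instead of using absolute values. A uniform temperature-induced
    shift moves every DRV but preserves their ordering, so the
    response stays stable. Pairing scheme and values are
    hypothetical, not the paper's exact DH construction."""
    return tuple(int(drv_mv[i] < drv_mv[j]) for i, j in pairs)

# usage: the same four cells "measured" cold and hot, where heating
# shifts every DRV uniformly by +12 mV (made-up numbers)
cold = [231.5, 228.0, 240.2, 235.1]        # DRVs in mV
hot = [v + 12.0 for v in cold]
pairs = [(0, 1), (0, 2), (1, 3), (2, 3)]
assert dh_response(cold, pairs) == dh_response(hot, pairs)  # stable bits
```

In practice temperature shifts are only approximately uniform, which is why the paper pairs the ordering idea with a reliability knob rather than relying on it alone.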

Journal ArticleDOI
TL;DR: Two statistical methods for identifying recycled integrated circuits through the use of one-class classifiers and degradation curve sensitivity analysis are introduced and experimental results confirm their effectiveness in distinguishing between new and aged ICs.
Abstract: We introduce two statistical methods for identifying recycled integrated circuits (ICs) through the use of one-class classifiers and degradation curve sensitivity analysis. Both methods rely on statistically learning the parametric behavior of known new devices and using it as a reference point to determine whether a device under authentication has previously been used. The proposed methods are evaluated using actual measurements and simulation data from digital and analog devices, with experimental results confirming their effectiveness in distinguishing between new and aged ICs and their superiority over previously proposed methods.

Journal ArticleDOI
TL;DR: Several circuit examples designed in nanoscale technologies demonstrate that the proposed scaled-sigma sampling method achieves superior accuracy over the traditional importance sampling technique when the dimensionality of the variation space is more than a few hundred.
Abstract: Accurately estimating the rare failure rates for nanoscale circuit blocks (e.g., static random-access memory, D flip-flop, etc.) is a challenging task, especially when the variation space is high-dimensional. In this paper, we propose a novel scaled-sigma sampling (SSS) method to address this technical challenge. The key idea of SSS is to generate random samples from a distorted distribution for which the standard deviation (i.e., sigma) is scaled up. Next, the failure rate is accurately estimated from these scaled random samples by using an analytical model derived from the theorem of “soft maximum.” Our proposed SSS method can simultaneously estimate the rare failure rates for multiple performances and/or specifications with only a single set of transistor-level simulations. To quantitatively assess the accuracy of SSS, we estimate the confidence interval of SSS based on bootstrap. Several circuit examples designed in nanoscale technologies demonstrate that the proposed SSS method achieves significantly better accuracy than the traditional importance sampling technique when the dimensionality of the variation space is more than a few hundred.
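The flavor of scaled-sigma sampling can be illustrated on a one-dimensional toy problem: sample at several inflated sigmas where the failure is common, then extrapolate the failure rate back to the nominal sigma. The linear fit of log P against 1/s² below is our simplified stand-in for the paper's soft-maximum-derived analytical model; the threshold failure and all parameters are assumptions for illustration.

```python
import math
import random

def sss_estimate(threshold, scales, n_samples, seed=0):
    """Toy scaled-sigma sampling: draw from N(0, s^2) for scale
    factors s > 1 so the rare event x > threshold becomes common,
    then fit log(P_fail) ~ a + b/s^2 and extrapolate to s = 1.
    The linear model is a simplification of the paper's
    soft-maximum-derived analytical model."""
    rng = random.Random(seed)
    xs, ys = [], []
    for s in scales:
        fails = sum(rng.gauss(0.0, s) > threshold for _ in range(n_samples))
        xs.append(1.0 / s ** 2)
        ys.append(math.log(fails / n_samples))
    # ordinary least-squares line y = a + b*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return math.exp(a + b)  # extrapolation to s = 1, i.e., 1/s^2 = 1

# usage: true P(N(0,1) > 4) is about 3.2e-5, far too rare to observe
# with a few thousand plain Monte Carlo samples; the extrapolation
# recovers its order of magnitude from common-event simulations only
est = sss_estimate(4.0, scales=[2.0, 2.5, 3.0, 4.0], n_samples=20000)
```

Unlike importance sampling, no likelihood ratios appear, which is why the approach stays stable in high-dimensional variation spaces.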

Journal ArticleDOI
TL;DR: A modified 2D placement technique coupled with a post-placement partitioning step is sufficient to produce high-quality M3D placement solutions, and a commercial router-based monolithic intertier via insertion methodology that improves the routability of M3D ICs is presented.
Abstract: Monolithic 3D (M3D) is an emerging technology that enables integration density which is orders of magnitude higher than that offered by through-silicon vias. In this paper, we demonstrate that a modified 2D placement technique coupled with a post-placement partitioning step is sufficient to produce high-quality M3D placement solutions. We also present a commercial router-based monolithic intertier via insertion methodology that improves the routability of M3D ICs. We demonstrate that, unlike in 2D ICs, the routing supply and demand in M3D ICs are not completely independent of each other. We develop a routing demand model for M3D ICs, and use it to develop an O(N) min-overflow partitioner that enhances routability by off-loading demand from one tier to another. This technique reduces the routed wirelength and the power delay product by up to 7.44% and 4.31%, respectively. This allows a two-tier M3D IC to achieve, on average, 19.9% and 11.8% improvement in routed wirelength and power delay product over 2D, even with reduced metal layer usage.

Journal ArticleDOI
TL;DR: A general, transform-based approach to the analysis and synthesis of SC circuits, implemented in a program, spectral transform use in stochastic circuit synthesis (STRAUSS), which also includes a method of optimizing stochastic number-generation circuitry.
Abstract: Stochastic computing (SC) is an approximate computing technique that processes data in the form of long pseudorandom bit-streams which can be interpreted as probabilities. Its key advantages are low-complexity hardware and high error tolerance. SC has recently been finding application in several important areas, including image processing, artificial neural networks, and low-density parity check decoding. Despite a long history, SC still lacks a comprehensive design methodology, so existing designs tend to be either ad hoc or based on specialized design methods. In this paper, we demonstrate a fundamental relation between stochastic circuits and spectral transforms. Based on this, we propose a general, transform-based approach to the analysis and synthesis of SC circuits. We implemented this approach in a program, spectral transform use in stochastic circuit synthesis (STRAUSS), which also includes a method of optimizing stochastic number-generation circuitry. Finally, we show that the area cost of the circuits generated by STRAUSS is significantly smaller than that of previous work. (Parts of this paper are based on “A spectral transform approach to stochastic circuits,” presented at the International Conference on Computer Design, Oct. 2012 [3].)
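The basic SC encoding the abstract relies on, bit-streams interpreted as probabilities, makes arithmetic almost free in hardware: an AND gate multiplies the probabilities of two independent streams, and a 2:1 multiplexer with a fair select bit computes a scaled sum. A software simulation of these two classic gates (our toy illustration, unrelated to STRAUSS itself):

```python
import random

def to_stream(p, n, rng):
    """Encode probability p as a pseudorandom bit-stream of length n."""
    return [int(rng.random() < p) for _ in range(n)]

def from_stream(bits):
    """Decode a bit-stream back to the probability it carries."""
    return sum(bits) / len(bits)

rng = random.Random(1)
n = 100_000
a = to_stream(0.5, n, rng)    # independent input streams
b = to_stream(0.6, n, rng)
sel = to_stream(0.5, n, rng)  # fair select stream for the MUX

prod = [x & y for x, y in zip(a, b)]                  # AND: 0.5 * 0.6
summ = [x if s else y for s, x, y in zip(sel, a, b)]  # MUX: (0.5 + 0.6) / 2

# from_stream(prod) is close to 0.30 and from_stream(summ) to 0.55
```

The approximation error shrinks with stream length, which is the source of both SC's error tolerance and its long latency.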