
Showing papers in "ACM Transactions on Design Automation of Electronic Systems in 2015"


Journal ArticleDOI
TL;DR: A novel placement density function eDensity is developed, which models every object as positive charge and the density cost as the potential energy of the electrostatic system, which is more effective, generalized, simpler, and faster than previous works.
Abstract: We develop a flat, analytic, and nonlinear placement algorithm, ePlace, which is more effective, generalized, simpler, and faster than previous works. Based on the analogy between placement instance and electrostatic system, we develop a novel placement density function eDensity, which models every object as positive charge and the density cost as the potential energy of the electrostatic system. The electric potential and field distribution are coupled with density using a well-defined Poisson's equation, which is numerically solved by spectral methods based on fast Fourier transform (FFT). Instead of using the conjugate gradient (CG) nonlinear solver in previous placers, we propose to use Nesterov's method which achieves faster convergence. The efficiency bottleneck on line search is resolved by predicting the steplength using a closed-form equation of Lipschitz constant. The placement performance is validated through experiments on the ISPD 2005 and ISPD 2006 benchmark suites, where ePlace outperforms all state-of-the-art placers (Capo10.5, FastPlace3.0, RQL, MAPLE, ComPLx, BonnPlace, POLAR, APlace3, NTUPlace3, mPL6) with much shorter wirelength and shorter or comparable runtime. On average, of all the ISPD 2005 benchmarks, ePlace outperforms the leading placer BonnPlace with 2.83% shorter wirelength and runs 3.05× faster; and on average, of all the ISPD 2006 benchmarks, ePlace outperforms the leading placer MAPLE with 4.59% shorter wirelength and runs 2.84× faster.
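The abstract does not give the closed-form steplength equation, so the following is only a minimal sketch of the general idea: Nesterov's accelerated gradient method in which each step length is predicted as the inverse of a finite-difference Lipschitz estimate rather than found by line search. The function name `nesterov_minimize`, the `init_step` parameter, and the toy objective in the usage note are all assumptions made for illustration, not part of ePlace.

```python
import math

def nesterov_minimize(grad, x0, iters=200, init_step=1e-3):
    """Nesterov's method with a predicted step length (illustrative sketch).

    The step is the inverse of a local Lipschitz estimate
    |grad(v_k) - grad(v_{k-1})| / |v_k - v_{k-1}|, so no line search is run.
    """
    u = list(x0)            # main solution sequence u_k
    v = list(x0)            # look-ahead (reference) sequence v_k
    a = 1.0                 # momentum parameter a_k
    g = grad(v)
    v_prev, g_prev = None, None
    for _ in range(iters):
        if v_prev is None:
            step = init_step  # no history yet: fall back to a small guess
        else:
            dv = math.sqrt(sum((p - q) ** 2 for p, q in zip(v, v_prev)))
            dg = math.sqrt(sum((p - q) ** 2 for p, q in zip(g, g_prev)))
            if dg == 0.0:     # gradient stopped changing: treat as converged
                break
            step = dv / dg    # inverse of the local Lipschitz estimate
        # Gradient step from the look-ahead point.
        u_new = [vi - step * gi for vi, gi in zip(v, g)]
        # Standard Nesterov momentum update.
        a_new = (1.0 + math.sqrt(4.0 * a * a + 1.0)) / 2.0
        coeff = (a - 1.0) / a_new
        v_new = [un + coeff * (un - uo) for un, uo in zip(u_new, u)]
        v_prev, g_prev = v, g
        u, v, a = u_new, v_new, a_new
        g = grad(v)
    return u
```

As a usage example, minimizing the toy quadratic f(x) = |x - t|² with t = (3, -4), whose gradient is 2(x - t), drives the iterate to t. A real placement objective is far less well behaved, so a production solver would likely need safeguards that this sketch omits.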

68 citations


Journal ArticleDOI
Xing Huang1, Genggeng Liu1, Wenzhong Guo1, Yuzhen Niu1, Guolong Chen1 
TL;DR: This is the first time to specially solve the single-layer obstacle-avoiding problem in X-architecture for a given set of pins and obstacles and achieves the best solution quality in a reasonable runtime among the existing algorithms.
Abstract: Obstacle-avoiding Steiner minimal tree (OASMT) construction has become a focus problem in the physical design of modern very large-scale integration (VLSI) chips. In this article, an effective algorithm is presented to construct an OASMT based on X-architecture for a given set of pins and obstacles. First, a kind of special particle swarm optimization (PSO) algorithm is proposed that successfully combines the classic genetic algorithm (GA), and greatly improves its own search capability. Second, a pretreatment strategy is put forward to deal with obstacles and pins, which can provide a fast information inquiry for the whole algorithm by generating a precomputed lookup table. Third, we present an efficient adjustment method, which enables particles to avoid all the obstacles by introducing some corner points of obstacles. Finally, an excellent refinement method is discussed to further enhance the quality of the final routing tree, which can improve the quality of the solution by 7.93% on average. To our best knowledge, this is the first time to specially solve the single-layer obstacle-avoiding problem in X-architecture. Experimental results show that the proposed algorithm can further shorten wirelength in the presence of obstacles. And it achieves the best solution quality in a reasonable runtime among the existing algorithms.

64 citations


Journal ArticleDOI
TL;DR: A general security-aware design methodology is proposed to address security with other design constraints in a holistic framework and optimize design objectives and indicates that it is necessary to consider security together with other metrics during design stages.
Abstract: In this article, we address both security and safety requirements and solve security-aware design problems for the controller area network (CAN) protocol and time division multiple access (TDMA)-based protocols. To provide insights and guidelines for other similar security problems with limited resources and strict timing constraints, we propose a general security-aware design methodology to address security with other design constraints in a holistic framework and optimize design objectives. The security-aware design methodology is further applied to solve a security-aware design problem for vehicle-to-vehicle (V2V) communications with dedicated short-range communication (DSRC) technology. Experimental results demonstrate the effectiveness of our approaches in system design without violating design constraints and indicate that it is necessary to consider security together with other metrics during design stages.

54 citations


Journal ArticleDOI
TL;DR: This article presents the first generalized formal model that considers structural and functional dependencies of reconfigurable scan networks and is directly applicable to 1687-2014-based and 1149.1-2013-based scan architectures, and enables efficient formal verification of complex scan networks, as well as automatic generation of access patterns.
Abstract: Efficient access to on-chip instrumentation is a key requirement for post-silicon validation, test, debug, bringup, and diagnosis. Reconfigurable scan networks, as proposed by, for example, IEEE Std 1687-2014 and IEEE Std 1149.1-2013, emerge as an effective and affordable means to cope with the increasing complexity of on-chip infrastructure. Reconfigurable scan networks are often hierarchical and may have complex structural and functional dependencies. Common approaches for scan verification based on static structural analysis and functional simulation are not sufficient to ensure correct operation of these types of architectures. To access an instrument in a reconfigurable scan network, a scan-in bit sequence must be generated according to the current state and structure of the network. Due to sequential and combinational dependencies, the access pattern generation process (pattern retargeting) poses a complex decision and optimization problem. This article presents the first generalized formal model that considers structural and functional dependencies of reconfigurable scan networks and is directly applicable to 1687-2014-based and 1149.1-2013-based scan architectures. This model enables efficient formal verification of complex scan networks, as well as automatic generation of access patterns. The proposed pattern generation method supports concurrent access to multiple target scan registers (access merging) and generates short scan-in sequences.

45 citations


Journal ArticleDOI
TL;DR: This article proposes to reconfigure both the physical unclonable functions (PUFs) and the locking scheme of the finite state machine (FSM) in order to defeat the replay attack, and demonstrates how replay attack would fail in attacking systems protected by the reconfigurable binding method.
Abstract: The FPGA replay attack, where an attacker downgrades an FPGA-based system to the previous version with known vulnerabilities, has become a serious security and privacy concern for FPGA design. Current FPGA intellectual property (IP) protection mechanisms target the protection of FPGA configuration bitstreams by watermarking or encryption or binding. However, these mechanisms fail to prevent replay attacks. In this article, based on a recently reported PUF-FSM binding method that protects the usage of configuration bitstreams, we propose to reconfigure both the physical unclonable functions (PUFs) and the locking scheme of the finite state machine (FSM) in order to defeat the replay attack. We analyze the proposed scheme and demonstrate how replay attack would fail in attacking systems protected by the reconfigurable binding method. We implement two ways to build reconfigurable PUFs and propose two practical methods to reconfigure the locking scheme. Experimental results show that the two reconfigurable PUFs can generate significantly distinct responses with average reconfigurability of more than 40%. The reconfigurable locking schemes only incur a timing overhead less than 1%.

32 citations


Journal ArticleDOI
TL;DR: This work proposes an adaptive wear-leveling mechanism to prevent any PCM cell from being worn out prematurely by selecting appropriate data for swapping with constant search/sort cost and the concept of indirect pointers is designed in the proposed mechanism to swap data without any modification to the file system's indexes.
Abstract: Improving the performance of storage systems without losing the reliability and sanity/integrity of file systems is a major issue in storage system designs. In contrast to existing storage architectures, we consider a PCM-based storage architecture to enhance the reliability of storage systems. In PCM-based storage systems, the major challenge falls on how to prevent the frequently updated (meta)data from wearing out their residing PCM cells without excessively searching and moving metadata around the PCM space and without extensively updating the index structures of file systems. In this work, we propose an adaptive wear-leveling mechanism to prevent any PCM cell from being worn out prematurely by selecting appropriate data for swapping with constant search/sort cost. Meanwhile, the concept of indirect pointers is designed in the proposed mechanism to swap data without any modification to the file system's indexes. Experiments were conducted based on well-known benchmarks and realistic workloads to evaluate the effectiveness of the proposed design, for which the results are encouraging.

32 citations


Journal ArticleDOI
TL;DR: This work presents an aging- and variation-aware representative path selection technique based on machine learning that makes it possible to measure the delay of a small set of paths and infer the delay of a larger pool of paths that are likely to fail due to delay variations.
Abstract: Process together with runtime variations in temperature and voltage, as well as transistor aging, degrade path delay and may eventually induce circuit failure due to timing variations. Therefore, in-field tracking of path delays is essential, and to respond to this need, several delay sensor designs have been proposed in the literature. However, due to the significant overhead of these sensors and the large number of critical paths in today's IC, it is infeasible to monitor the delay of every critical path in silicon. We present an aging- and variation-aware representative path selection technique based on machine learning that makes it possible to measure the delay of a small set of paths and infer the delay of a larger pool of paths that are likely to fail due to delay variations. Simulation results for benchmark circuits highlight the accuracy of the proposed approach for predicting critical-path delay based on the selected representative paths.

30 citations


Journal ArticleDOI
TL;DR: This work combines for the first time the versatility of event-based timing simulation and multi-dimensional parallelism used in GPU-based gate-level simulators to provide a throughput-optimized timing simulation algorithm.
Abstract: Many EDA tasks such as test set characterization or the precise estimation of power consumption, power droop and temperature development, require a very large number of time-aware gate-level logic simulations. Until now, such characterizations have been feasible only for rather small designs or with reduced precision due to the high computational demands. The new simulation system presented here is able to accelerate such tasks by more than two orders of magnitude and provides for the first time fast and comprehensive timing simulations for industrial-sized designs. Hazards, pulse-filtering, and pin-to-pin delay are supported for the first time in a GPGPU accelerated simulator, and the system can easily be extended to even more realistic delay models and further applications. A sophisticated mapping with efficient memory utilization and access patterns as well as minimal synchronizations and control flow divergence is able to use the full potential of GPGPU architectures. To provide such a mapping, we combine for the first time the versatility of event-based timing simulation and multi-dimensional parallelism used in GPU-based gate-level simulators. The result is a throughput-optimized timing simulation algorithm, which runs many simulation instances in parallel and at the same time fully exploits gate-parallelism within the circuit.

29 citations


Journal ArticleDOI
TL;DR: This work investigates the conditional diagnosability of Cayley graphs generated by transposition trees under the PMC model and shows that it is 4n-11 for n ≥ 4 except for the n-dimensional star graph, for which it has been shown to be 8n-21 for n ≥ 5.
Abstract: Processor fault diagnosis has played an essential role in measuring the reliability of a multiprocessor system. The diagnosability of many well-known multiprocessor systems has been widely investigated. Conditional diagnosability is a novel measure of diagnosability by adding a further condition that any fault set cannot contain all the neighbors of every node in the system. Several known structural properties of Cayley graphs are exhibited. Based on these properties, we investigate the conditional diagnosability of Cayley graphs generated by transposition trees under the PMC model and show that it is 4n-11 for n ≥ 4 except for the n-dimensional star graph, for which it has been shown to be 8n-21 for n ≥ 5 (refer to Chang and Hsieh [2014]).
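To make the graph family concrete, here is a small sketch that builds the Cayley graph generated by a transposition tree and encodes the two closed-form values quoted in the abstract. The function names, the 0-based position-swapping convention, and the `is_star` flag are assumptions for illustration; the sketch does not reproduce the paper's proof or any diagnosis procedure.

```python
from itertools import permutations

def cayley_graph(n, tree_edges):
    """Adjacency lists of the Cayley graph on the permutations of n symbols.

    Each tree edge (i, j) is read as the transposition that swaps positions
    i and j (0-based), so every vertex has one neighbor per tree edge.
    """
    adj = {}
    for p in permutations(range(n)):
        neighbors = []
        for i, j in tree_edges:
            q = list(p)
            q[i], q[j] = q[j], q[i]
            neighbors.append(tuple(q))
        adj[p] = neighbors
    return adj

def conditional_diagnosability(n, is_star):
    """Values stated in the abstract for the PMC model."""
    if is_star:
        if n < 5:
            raise ValueError("the star-graph value is stated for n >= 5")
        return 8 * n - 21
    if n < 4:
        raise ValueError("the general value is stated for n >= 4")
    return 4 * n - 11
```

For instance, the path tree `[(0, 1), (1, 2), (2, 3)]` with n = 4 yields a graph with 24 vertices, each of degree 3, while the star tree `[(0, i) for i in range(1, n)]` yields the n-dimensional star graph that the abstract treats as the exception.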

29 citations


Journal ArticleDOI
TL;DR: Experimental results show that the Lazy-RTGC technique can significantly improve both the average and worst system performance with very low extra flash-space requirements.
Abstract: Due to many attractive and unique properties, NAND flash memory has been widely adopted in mission-critical hard real-time systems and some soft real-time systems. However, the nondeterministic garbage collection operation in NAND flash memory makes it difficult to predict the system response time of each data request. This article presents Lazy-RTGC, a real-time lazy garbage collection mechanism for NAND flash memory storage systems. Lazy-RTGC adopts two design optimization techniques: on-demand page-level address mappings, and partial garbage collection. On-demand page-level address mappings can achieve high performance of address translation and can effectively manage the flash space with the minimum RAM cost. On the other hand, partial garbage collection can provide the guaranteed system response time. By adopting these techniques, Lazy-RTGC jointly optimizes both the average and the worst system response time, and provides a lower bound of reclaimed free space. Lazy-RTGC is implemented in FlashSim and compared with representative real-time NAND flash memory management schemes. Experimental results show that our technique can significantly improve both the average and worst system performance with very low extra flash-space requirements.

26 citations


Journal ArticleDOI
TL;DR: A key finding of this modeling is that, counter to prevailing wisdom, wearout in the CMP's on-chip interconnect is correlated with lack of load observed in the NoC routers rather than high load, and a novel wearout-decelerating scheme is developed, which yields an ∼2,300× decrease in the rate of wear.
Abstract: Moore's Law scaling continues to yield higher transistor density with each succeeding process generation, leading to today's many-core chip multiprocessors (CMPs) with tens or even hundreds of interconnected cores or tiles. Unfortunately, deep submicron CMOS process technology is marred by increasing susceptibility to wear. Prolonged operational stress gives rise to accelerated wearout and failure due to several physical failure mechanisms, including hot-carrier injection (HCI) and negative-bias temperature instability (NBTI). Each failure mechanism correlates with different usage-based stresses, all of which can eventually generate permanent faults. While the wearout of an individual core in many-core CMPs may not necessarily be catastrophic, a single fault in the interprocessor network-on-chip (NoC) fabric could render the entire chip useless, as it could lead to protocol-level deadlocks, or even partition away vital components such as the memory controller or other critical I/O. In this article, we study HCI- and NBTI-induced wear due to actual stresses caused by real workloads, applied onto the interconnect microarchitecture and develop a critical path model for NBTI-induced wearout. A key finding of this modeling is that, counter to prevailing wisdom, wearout in the CMP's on-chip interconnect is correlated with lack of load observed in the NoC routers rather than high load. We then develop a novel wearout-decelerating scheme in which routers under low load have their wear-sensitive components exercised without significantly impacting cycle time, pipeline depth, area, or power consumption of the overall router. A novel deterministic approach is proposed for the generation of appropriate exercise-mode data, ensuring design parameter targets are met. We subsequently show that the proposed design yields an ∼2,300× decrease in the rate of wear.

Journal ArticleDOI
TL;DR: A novel methodology for multi-objective global routing based on fuzzy logic, called FuzzRoute, achieves balanced superiority in terms of routability, runtime, and wirelength over others.
Abstract: The high density of interconnects, closer proximity of modules, and routing phase are pivotal during the layout of a performance-centric three-dimensional integrated circuit (3D IC). Heuristic-based approaches are typically used to handle such NP-complete problems of global routing in 3D ICs. To overcome the inherent limitations of deterministic approaches, a novel methodology for multi-objective global routing based on fuzzy logic has been proposed in this article. The guiding information generated after the placement phase is used during routing with the help of a fuzzy expert system to achieve thermally efficient and congestion-free routing. A complete global routing solution is designed based on the proposed algorithms and the results are compared with selected fully established global routers, namely Labyrinth, FastRoute3.0, NTHU-R, BoxRouter 2.0, FGR, NTHU-Route2.0, FastRoute4.0, NCTU-GR, MGR, and NCTU-GR2.0. Experiments are performed over ISPD 1998 and 2008 benchmarks. The proposed router, called FuzzRoute, achieves balanced superiority in terms of routability, runtime, and wirelength over others. The improvements on routing time for Labyrinth, BoxRouter 2.0, and FGR are 91.81%, 86.87%, and 32.16%, respectively, for ISPD 1998 benchmarks. It may be noted that, though FastRoute3.0 achieves fastest runtime, it fails to generate congestion-free solutions for all benchmarks, which is overcome by the proposed FuzzRoute of the current article. It also shows wirelength improvements of 17.35%, 2.88%, 2.44%, 2.83%, and 2.10%, respectively, over others for ISPD 1998 benchmarks. For ISPD 2008 benchmark circuits it also provides 2.5%, 2.6%, 1%, 1.1%, and 0.3% shorter wirelength and on average runs 1.68×, 6.42×, 2.21×, 0.76×, and 1.54× faster than NTHU-Route2.0, FastRoute4.0, NCTU-GR, MGR, and NCTU-GR2.0, respectively.

Journal ArticleDOI
TL;DR: A new mixing algorithm based on a number-partitioning technique that determines a layout-aware mixing tree corresponding to a given target ratio of a number of fluids is presented and a routing-aware resource-allocation scheme is proposed that can be used to improve the performance of a given mixing algorithm on a chip layout.
Abstract: The recent proliferation of digital microfluidic (DMF) biochips has enabled rapid on-chip implementation of many biochemical laboratory assays or protocols. Sample preprocessing, which includes dilution and mixing of reagents, plays an important role in the preparation of assays. The automation of sample preparation on a digital microfluidic platform often mandates the execution of a mixing algorithm, which determines a sequence of droplet mix-split steps (usually represented as a mixing graph). However, the overall cost and performance of on-chip mixture preparation not only depends on the mixing graph but also on the resource allocation and scheduling strategy, for instance, the placement of boundary reservoirs or dispensers, mixer modules, storage units, and physical design of droplet-routing pathways. In this article, we first present a new mixing algorithm based on a number-partitioning technique that determines a layout-aware mixing tree corresponding to a given target ratio of a number of fluids. The mixing graph produced by the proposed method can be implemented on a chip with a fewer number of crossovers among droplet-routing paths as well as with a reduced reservoir-to-mixer transportation distance. Second, we propose a routing-aware resource-allocation scheme that can be used to improve the performance of a given mixing algorithm on a chip layout. The design methodology is evaluated on various test cases to demonstrate its effectiveness in mixture preparation with the help of two representative mixing algorithms. Simulation results show that on average, the proposed scheme can reduce the number of crossovers among droplet-routing paths by 89.7% when used in conjunction with the new mixing algorithm, and by 75.4% when an earlier algorithm [Thies et al. 2008] is used.
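The notion of a mixing tree can be illustrated with a short sketch that evaluates which fluid ratio a given tree produces. It assumes the (1:1) mix-split model, in which two unit droplets are merged and split evenly; the abstract does not state the mixing model, and the nested-tuple tree encoding and the names `mix` and `evaluate` are hypothetical. This is not the number-partitioning algorithm itself, only a way to check the output of such an algorithm.

```python
from fractions import Fraction

def mix(left, right):
    """One (1:1) mix-split step: the result holds half of each input droplet."""
    out = {}
    for droplet in (left, right):
        for fluid, share in droplet.items():
            out[fluid] = out.get(fluid, Fraction(0)) + share / 2
    return out

def evaluate(tree):
    """Return the composition produced by a mixing tree.

    A tree is either a fluid name (a leaf, i.e. a pure droplet) or a pair
    (left_subtree, right_subtree) describing one mix-split step.
    """
    if isinstance(tree, str):
        return {tree: Fraction(1)}
    left, right = tree
    return mix(evaluate(left), evaluate(right))
```

For example, `evaluate((("A", "B"), "A"))` mixes A with B and then mixes the result with a fresh droplet of A, giving a composition of 3/4 A and 1/4 B, that is, a 3:1 target ratio reached in two mix-split steps.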

Journal ArticleDOI
TL;DR: This work presents an efficient built-in self-test (BIST) architecture for targeting defects in dies and in the interposer interconnects, and describes a test scheduling and optimization technique under power constraints to reduce the overall test cost.
Abstract: Interposer-based 2.5D integrated circuits (ICs) are seen today as a precursor to 3D ICs based on through-silicon vias (TSVs). All the dies and the interposer in a 2.5D IC must be adequately tested for product qualification. We present an efficient built-in self-test (BIST) architecture for targeting defects in dies and in the interposer interconnects. The proposed BIST architecture can also be used for fault diagnosis during interconnect testing. To reduce the overall test cost, we describe a test scheduling and optimization technique under power constraints. We present simulation results to validate the BIST architecture and demonstrate fault detection, synthesis results to evaluate the area overhead of the proposed BIST architecture, and test scheduling results to highlight the effectiveness of the optimization approach.

Journal ArticleDOI
TL;DR: This work uses several intra-pipeline combinational logic circuits at the 32nm technology node, investigates several different standard cell placements of each design, and analyzes them with a novel, physically realistic transient injection and simulation method.
Abstract: As fabrication technology scales towards smaller transistor sizes and lower critical charge, single-event radiation effects are more likely to cause errant behavior in multiple, physically adjacent devices in modern integrated circuits (ICs), and with higher operating frequencies, this risk increasingly impacts design logic over memory as well. In order to increase future system reliability, circuit designers need greater awareness of multiple-transient charge-sharing effects during the early stages of their design flow with standard cell placement and routing. To measure the propagation and observability of multiple transients from single radiation events, this work uses several intra-pipeline combinational logic circuits at the 32nm technology node, investigates several different standard cell placements of each design, and analyzes those placements with a novel, physically realistic transient injection and simulation method. It is shown that (1) this simulation methodology, informed by experimental data, provides an increased realism over other works in traditional fault injection fields, (2) different placements of the same circuit where standard cells are grouped by logical hierarchy can result in different reliability behavior and benefits especially useful within the area of approximate computing, and (3) improved reliability through charge-sharing transient mitigation can be gained with no area penalty and minimal speed and power penalties by adjusting the placement of standard cells.

Journal ArticleDOI
TL;DR: This article summarizes mechanisms of both soft and hard errors of ReRAM cells and proposes a unified model to characterize different failure behaviors, which can extend the lifetime of ReRAM up to 75% over a design without hard error detection and up to 12% over the design with a “write-verify” detection mechanism.
Abstract: Resistive random access memory (ReRAM) technology is an emerging candidate for next-generation nonvolatile memory (NVM) architecture due to its simple structure, low programming voltage, fast switching speed, high on/off ratio, excellent scalability, good endurance, and great compatibility with silicon CMOS technology. The most attractive of the characteristics of ReRAM is its cross-point structure, which features a 4F² cell size. In a cross-point structure, the existence of sneak current and resulting voltage loss due to the wire's resistance might cause read and write failures if not designed properly. In addition, a robust ReRAM design needs to deal with both soft and hard errors. In this article, we summarize mechanisms of both soft and hard errors of ReRAM cells and propose a unified model to characterize different failure behaviors. We quantitatively analyze the impact of cell failure types on the reliability of the cross-point array. We also propose an error-resilient architecture, which avoids unnecessary writes in the hard error detection unit. Assuming constant soft error rate, our approach can extend the lifetime of ReRAM up to 75% over a design without hard error detection and up to 12% over the design with a “write-verify” detection mechanism. Our approach yields a more significant lifetime improvement when considering postcycling retention degradation.

Journal ArticleDOI
TL;DR: PAU is an application-specific instruction-set processor (ASIP) whose instruction set is customized to reflect common features of various DPA methods whose ASIP approach can be successfully applicable to complex DPA schemes while providing hardware-backed power in performance and software-based flexibility in analysis.
Abstract: In recent years, dynamic program analysis (DPA) has been widely used in various fields such as profiling, finding bugs, and security. However, existing solutions have their own weaknesses. Software solutions provide flexibility in DPA but they suffer from tremendous performance overhead. In contrast, core-level hardware engines rely on specialized integrated logics and attain extremely fast computation, but they have a limited functional extensibility because the logics are tightly coupled with the host processor. To mend this, a prior system-level approach utilizes an existing channel to integrate their hardware without necessitating the host architecture modification and introduced great potential in performance. Nevertheless, the prior work does not address the detailed design and implementation of the engine, which is quite essential to leverage the deployment on real systems. To address this, in this article, we propose an implementation of programmable DPA hardware engine, called program analysis unit (PAU). PAU is an application-specific instruction-set processor (ASIP) whose instruction set is customized to reflect common features of various DPA methods. With the specialized architecture and programmability of software, our PAU aims at fast computation and sufficient flexibility. In our case studies on several DPA techniques, we show that our ASIP approach can be successfully applicable to complex DPA schemes while providing hardware-backed power in performance and software-based flexibility in analysis. Recent experiments on our FPGA prototype revealed that the performance of PAU is 4.7-13.6 times faster than pure software DPA, and the power/area consumption is also acceptably small compared to today's mobile processors.

Journal ArticleDOI
TL;DR: An algorithm is presented that transforms component-based design spaces, expressible in CoDeL, to an SMT program, which determines the satisfiability of the synthesis problem, and delivers a correct-by-construction system configuration.
Abstract: Constraint programming solvers, such as Satisfiability Modulo Theory (SMT) solvers, are capable tools in finding preferable configurations for embedded systems from large design spaces. However, constructing SMT constraint programs is not trivial, in particular for complex systems that exhibit multiple viewpoints and models. In this article we propose CoDeL: a component-based description language that allows system designers to express components as reusable building blocks of the system with their parameterizable properties, models, and interconnectivity. Systems are synthesized by allocating, connecting, and parameterizing the components to satisfy the requirements of an application. We present an algorithm that transforms component-based design spaces, expressible in CoDeL, to an SMT program, which, solved by state-of-the-art SMT solvers, determines the satisfiability of the synthesis problem, and delivers a correct-by-construction system configuration. Evaluation results for use cases in the domain of scheduling and mapping of distributed real-time processes confirm, first, the performance gain of SMT compared to traditional design space exploration approaches, second, the usability gains by expressing design problems in CoDeL, and third, the capability of the CoDeL/SMT approach to support the design of embedded systems.

Journal ArticleDOI
TL;DR: This article discusses how the optimal read voltage thresholds can be determined and assess the benefit of cancelling cell-to-cell interference in terms of cycling endurance, data retention, and resilience to read disturb.
Abstract: NAND flash memory is not only the ubiquitous storage medium in consumer applications but has also started to appear in enterprise storage systems as well. MLC and TLC flash technology made it possible to store multiple bits in the same silicon area as SLC, thus reducing the cost per amount of data stored. However, at current sub-20nm technology nodes, MLC flash devices fail to provide the levels of raw reliability, mainly cycling endurance, that are required by typical enterprise applications. Advanced signal processing and coding schemes are needed to improve the flash bit error rate and thus elevate the device reliability to the desired level. In this article, we report on the use of adaptive voltage thresholds and cell-to-cell interference cancellation in the read operation of NAND flash devices. We discuss how the optimal read voltage thresholds can be determined and assess the benefit of cancelling cell-to-cell interference in terms of cycling endurance, data retention, and resilience to read disturb.

Journal ArticleDOI
TL;DR: The proposed two-part STT-RAM-based L2 cache exploits a dynamic threshold regulator (DTR) to efficiently regulate the write threshold for migration of the data blocks from HR to LR, based on the behavior of the applications.
Abstract: Future GPUs should have larger L2 caches based on the current trends in VLSI technology and GPU architectures toward increase of processing core count. Larger L2 caches inevitably have proportionally larger power consumption. In this article, having investigated the behavior of GPGPU applications, we present an efficient L2 cache architecture for GPUs based on STT-RAM technology. Due to its high-density and low-power characteristics, STT-RAM technology can be utilized in GPUs where numerous cores leave a limited area for on-chip memory banks. They have, however, two important issues, high energy and latency of write operations, that have to be addressed. Low retention time STT-RAMs can reduce the energy and delay of write operations. Nevertheless, employing STT-RAMs with low retention time in GPUs requires a thorough study on the behavior of GPGPU applications. Based on this investigation, we have designed a two-part STT-RAM-based L2 cache with low-retention (LR) and high-retention (HR) parts. The proposed two-part L2 cache exploits a dynamic threshold regulator (DTR) to efficiently regulate the write threshold for migration of the data blocks from HR to LR, based on the behavior of the applications. Also, a Data and Access type Aware Cache Search mechanism (DAACS) is employed for handling the search of the requested data blocks in two parts of the cache. The STT-RAM L2 cache architecture proposed in this article can improve IPC by up to 171% (20% on average), and reduce the average consumed power by 28.9% compared to a conventional L2 cache architecture with equal on-chip area.

Journal ArticleDOI
TL;DR: This article identifies a time-predictable network-on-chip architecture and shows that its timing behaviour can be predicted using models which are far less complex than the architecture itself.
Abstract: An increasingly time-consuming part of the design flow of on-chip multiprocessors is the simulation of the interconnect architecture. The accurate simulation of state-of-the-art network-on-chip interconnects can take hours, and this process is repeated for each design iteration because it provides valuable insights on communication latencies that can greatly affect the overall performance of the system. In this article, we identify a time-predictable network-on-chip architecture and show that its timing behaviour can be predicted using models which are far less complex than the architecture itself. We then explore such a feature to produce simplified and lightweight simulation models that can produce latency figures with more than 90% accuracy and simulate more than 1,000 times faster when compared to a cycle-accurate model of the same interconnect.

Journal ArticleDOI
TL;DR: The feasibility of designing digitally programmable delay elements (PDEs) employing neuron-MOS mechanism is investigated and both types of suggested PDE circuits achieve improved or fair performances over the robustness, power consumption, and linearity.
Abstract: The feasibility of designing digitally programmable delay elements (PDEs) employing neuron-MOS mechanism is investigated in this work. By coupling the capacitors on the gate of the MOS transistor, the current flowing through the transistor can be digitally tuned without additional static power consumption. Various switching delays are generated by a clock buffer stage in this manner. Two types of neuron-MOS-based PDEs are suggested in this article. One of them is realized by directly applying capacitor-coupling technology on the transistors of an inverter as a clock buffer. The delay programmability is realized by tuning the charging/discharging current through the neuron-MOS inverter digitally. Since no additional transistor is introduced into the charging/discharging path, the performance fluctuation due to process variations on MOS transistors is reduced. The temperature effect is also partially compensated by the proposed neuron-MOS implementation. Another type of PDE circuit is proposed by employing a reliable reference-current-generator, where the neuron-MOS transistor acts as a linearly tunable resistance. A stable reference current is generated and used for charging/discharging the inverter as a clock buffer. As a result, the switching delay of the inverter is linearly programmed by digital input patterns. In general, both types of suggested PDE circuits achieve improved or fair performances over the robustness, power consumption, and linearity.

Journal ArticleDOI
TL;DR: An automated pipelining approach for optimally balanced pipeline implementation that achieves low area cost as well as meeting timing requirements while offering dramatically reduced computational complexity is described.
Abstract: We describe an automated pipelining approach for optimally balanced pipeline implementation that achieves low area cost as well as meeting timing requirements. Most previous automatic pipelining methods have focused on Instruction Set Architecture (ISA)-based designs and the main goal of such methods generally has been maximizing performance as measured in terms of instructions per clock (IPC). By contrast, we focus on datapath-oriented designs (e.g., DSP filters for image or communication processing applications) in ASIC design flows. The goal of the proposed pipelining approach is to find the optimally pipelined design that not only meets the user-specified target clock frequency, but also seeks to minimize area cost of a given design. Unlike most previous approaches, the proposed methods incorporate the use of accurate area and timing information (iteratively achieved by synthesizing every interim pipelined design) to achieve higher accuracy during design exploration. When compared with exhaustive design exploration that considers all possible pipeline patterns, the two heuristic pipelining methods presented here involve only a small area penalty (typically under 5%) while offering dramatically reduced computational complexity. Experimental validation is performed with commercial ASIC design tools and described for applications including polynomial function evaluation, FIR filters, matrix multiplication, and discrete wavelet transform filter designs with a 90nm standard cell library.

Journal ArticleDOI
TL;DR: This article identifies two types of interference, namely, queuing delay (QD) interference and garbage collection (GC) interference, in a shared SSD and proposes a framework called VSSD, which is effective in eliminating the interference and achieving performance isolation between users.
Abstract: Performance isolation is critical in shared storage systems, a popular storage solution. In a shared storage system, interference between requests from different users can affect the accuracy of I/O cost accounting, resulting in poor performance isolation. Recently, NAND flash-memory-based solid-state drives (SSDs) have been increasingly used in shared storage systems. However, interference in SSD-based shared storage systems has not been addressed. In this article, two types of interference, namely, queuing delay (QD) interference and garbage collection (GC) interference, are identified in a shared SSD. Additionally, a framework called VSSD is proposed to address these types of interference. VSSD is composed of two components: the FACO credit-based I/O scheduler designed to address QD interference and the ViSA flash translation layer designed to address GC interference. The VSSD framework aims to be implemented in the firmware running on an SSD controller. With VSSD, interference in an SSD can be eliminated and performance isolation can be ensured. Both synthetic and application workloads are used to evaluate the effectiveness of the proposed VSSD framework. The performance results show the following. First, QD and GC interference exists and can result in poor performance isolation between users on SSD-based shared storage systems. Second, VSSD is effective in eliminating the interference and achieving performance isolation between users. Third, the overhead of VSSD is insignificant.

Journal ArticleDOI
TL;DR: A new design-for-testability (DFT) scheme for launch-on-shift (LOS) testing, which ensures that the combinational logic remains undisturbed between the interleaved capture phases, providing computer-aided-design (CAD) tools with extra search space for minimizing launch-to-capture switching activity through test pattern ordering (TPO).
Abstract: Scan-based testing is crucial to ensuring correct functioning of chips. In this scheme, the scan and capture phases are interleaved. It is well known that for large designs, excessive switching activity during the launch-to-capture window leads to high voltage droop on the power grid, ultimately resulting in false delay failures during at-speed test. This article proposes a new design-for-testability (DFT) scheme for launch-on-shift (LOS) testing, which ensures that the combinational logic remains undisturbed between the interleaved capture phases, providing computer-aided-design (CAD) tools with extra search space for minimizing launch-to-capture switching activity through test pattern ordering (TPO). We further propose a new TPO algorithm that keeps track of the don't cares during the ordering process, so that the don't care filling step after the ordering process yields a better reduction in launch-to-capture switching activity compared to any other technique in the literature. The proposed DFT-assisted technique, when applied to circuits in ITC99 benchmark suite, produces an average reduction of 17.68% in peak launch-to-capture switching activity (CSA) compared to the best known low-power TPO technique. Even for circuits whose test cubes are not rich in don't care bits, the proposed technique produces an average reduction of 15% in peak CSA, while for the circuits with test cubes rich in don't care bits (≥75%), the average reduction is 24%. The proposed technique also reduces the average power dissipation (considering both scan cells and combinational logic) during the scan phase by about 43.5% on average, compared to the adjacent filling technique.
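
The general idea of test pattern ordering with don't-care bits can be illustrated with a small sketch. This is not the TPO algorithm of the article, whose details the abstract does not give; it is a generic greedy nearest-neighbour ordering, and the names `conflicts` and `greedy_order`, the string encoding of test cubes (`0`, `1`, `X` for don't care), and the use of bit conflicts between consecutive cubes as a proxy for switching activity are all assumptions made for illustration.

```python
from typing import List


def conflicts(a: str, b: str) -> int:
    """Count positions where both cubes are specified and differ.

    'X' marks a don't-care bit; it can later be filled to match its
    neighbour, so it is assumed to cause no toggle.
    """
    return sum(1 for x, y in zip(a, b) if x != "X" and y != "X" and x != y)


def greedy_order(cubes: List[str]) -> List[str]:
    """Order cubes so each next cube has the fewest conflicts with the last.

    Hypothetical greedy heuristic: starts from the first cube and breaks
    ties by original position.
    """
    if not cubes:
        return []
    remaining = list(cubes)
    order = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda c: conflicts(order[-1], c))
        remaining.remove(nxt)
        order.append(nxt)
    return order


if __name__ == "__main__":
    print(greedy_order(["0000", "1111", "00X1", "1X10"]))
```

Keeping the don't cares unfilled during ordering, as in this sketch, leaves the filling step free to copy neighbouring values afterwards.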

Journal ArticleDOI
TL;DR: This article shows that one of the inefficiency sources in current schemes, even when wear-leveling algorithms are used, is the nonuniform write endurance limit incurred by process variation, that is, when some memory pages have reached their endurance limit, other pages may be far from their limit.
Abstract: With current memory scalability challenges, Phase-Change Memory (PCM) is viewed as an attractive replacement to DRAM. The preliminary concern for PCM applicability is its limited write endurance that results in fast wear-out of memory cells. Worse, process variation in the deep-nanometer regime increases the variation in cell lifetime, resulting in an early and sudden reduction in main memory capacity due to the wear-out of a few cells. Recent studies have proposed redirection or correction schemes to alleviate this problem, but all suffer poor throughput or latency. In this article, we show that one of the inefficiency sources in current schemes, even when wear-leveling algorithms are used, is the nonuniform write endurance limit incurred by process variation, that is, when some memory pages have reached their endurance limit, other pages may be far from their limit. In this line, we present a technique that aims to displace a faulty page to a healthy page. This technique, called On-Demand Page Paired PCM (OD3P, for short), when applied at page level, can improve PCM time-to-failure by 20% on average for different multithreaded and multiprogrammed workloads while also improving IPC by 14% on average compared to previous page-level techniques. The comparison between line-level OD3P and previous line-level techniques reveals about 2× improvement of lifetime and performance.

Journal ArticleDOI
TL;DR: A new technology mapping algorithm for parameterised designs, called TCONMAP, is presented that can be used to produce parameterised configurations in which both the configuration of the logic blocks and routing is a function of the parameters.
Abstract: Parameterised configurations are FPGA configuration bitstreams in which the bits are defined as functions of user-defined parameters. From a parameterised configuration, it is possible to quickly and efficiently derive specialised, regular configuration bitstreams by evaluating these functions. The specialised bitstreams have different properties and functionality depending on the chosen values of the parameters. The most important application of parameterised configurations is the generation of specialised configuration bitstreams for Dynamic Circuit Specialisation, a technique for optimising circuits at runtime using partial reconfiguration of the FPGA. Generating and using parameterised configurations requires a new FPGA tool flow. In this article, we present a new technology mapping algorithm for parameterised designs, called TCONMAP, that can be used to produce parameterised configurations in which both the configuration of the logic blocks and routing is a function of the parameters. In our experiments, we demonstrate that in using TCONMAP, the depth and area of the mapped circuit is close to the minimal depth and area attainable. Both Dynamic Circuit Specialisation and fine-grained modular reconfiguration are extracted by TCONMAP from the HDL description of the design requiring only simple parameter annotations.
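
The notion of a parameterised configuration, where each bit is a function of the parameters and a specialised bitstream is obtained by evaluating those functions, can be sketched as follows. This is a toy model and not TCONMAP or any real FPGA bitstream format; the names `BitFunction`, `xor_lut`, and `specialise`, and the one-input LUT example, are hypothetical.

```python
from typing import Callable, Dict, List

Params = Dict[str, bool]
BitFunction = Callable[[Params], bool]

# Hypothetical parameterised truth table for a 1-input LUT computing
# out = a XOR p, where `a` is a regular input and `p` is a parameter.
# Entry i holds the LUT output for a == i as a function of the parameters.
xor_lut: List[BitFunction] = [
    lambda params: params["p"],      # a = 0 -> out = p
    lambda params: not params["p"],  # a = 1 -> out = NOT p
]


def specialise(config: List[BitFunction], params: Params) -> List[int]:
    """Evaluate every bit function to obtain a regular, constant bitstream."""
    return [int(bit(params)) for bit in config]


if __name__ == "__main__":
    print(specialise(xor_lut, {"p": True}))
    print(specialise(xor_lut, {"p": False}))
```

With `p` true the LUT is specialised to an inverter, and with `p` false to a buffer, so changing the parameter only requires re-evaluating the bit functions rather than rerunning the tool flow.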

Journal ArticleDOI
TL;DR: This article proposes a cascade fault localization method to help speed up this labor-intensive process via a combination of weakest precondition computation and constraint solving, and produces a cause tree, where each node is a potential cause of the failure and each edge represents a causal relationship between two causes.
Abstract: During software debugging, a significant amount of effort is required for programmers to identify the root cause of a manifested failure. In this article, we propose a cascade fault localization method to help speed up this labor-intensive process via a combination of weakest precondition computation and constraint solving. Our approach produces a cause tree, where each node is a potential cause of the failure and each edge represents a causal relationship between two causes. There are two main contributions of this article that differentiate our approach from existing methods. First, our method systematically computes all potential causes of a failure and augments each cause with a proper context for ease of comprehension by the user. Second, our method organizes the potential causes in a tree structure to enable on-the-fly pruning based on domain knowledge and feedback from the user. We have implemented our new method in a software tool called CaFL, which builds upon the LLVM compiler and KLEE symbolic virtual machine. We have conducted experiments on a large set of public benchmarks, including real applications from GNU Coreutils and Busybox. Our results show that in most cases the user has to examine only a small fraction of the execution trace before identifying the root cause of the failure.

Journal ArticleDOI
TL;DR: This work presents a technique that aims to increase the effective yield of FPGA manufacturing by re-claiming a portion of chips that would be ordinarily classified as unusable by modifying existing commercial toolchain flows to make them fault aware.
Abstract: As the size and density of silicon chips continue to increase, maintaining acceptable manufacturing yields has become increasingly difficult. Recent works suggest that lithography techniques are reaching their limits with respect to enabling high yield fabrication of small-scale devices, thus there is an increasing need for techniques that can tolerate fabrication time defects. One candidate technology to help combat these defects is reconfigurable hardware. The flexible nature of reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), makes it possible for them to route around defective areas of a chip after the device has been packaged and deployed into the field. This work presents a technique that aims to increase the effective yield of FPGA manufacturing by re-claiming a portion of chips that would be ordinarily classified as unusable. In brief, we propose a modification to existing commercial toolchain flows to make them fault aware. A phase is added to identify faults within the chip. The locations of these faults are then used by the toolchain to avoid faults during the placement and routing phase. Specifically, we have applied our approach to the Xilinx commercial toolchain flow and evaluated its tolerance to both logic and routing resource faults. Our findings show that, at a cost of 5-10% in device frequency performance, the modified toolchain flow can tolerate up to 30% of logic resources being faulty and, depending on the nature of the target application, can tolerate 1-30% of the device's routing resources being faulty. These results provide strong evidence that commercial toolchains not designed for the purpose of tolerating faults can still be greatly leveraged in the presence of faults to place and route circuits in an efficient manner.

Journal ArticleDOI
TL;DR: Based on network calculus, this work presents and proves theorems to derive per-flow end-to-end Equivalent Service Curves (ESC), which are in turn used for computing Least Upper Delay Bounds (LUDBs) of individual flows.
Abstract: Real-time applications such as multimedia and gaming require stringent performance guarantees, usually enforced by a tight upper bound on the maximum end-to-end delay. For FIFO multiplexed on-chip packet switched networks we consider worst-case delay bounds for Variable Bit-Rate (VBR) flows with aggregate scheduling, which schedules multiple flows as an aggregate flow. VBR flows are characterized by a maximum transfer size (L), peak rate (p), burstiness (σ), and average sustainable rate (ρ). Based on network calculus, we present and prove theorems to derive per-flow end-to-end Equivalent Service Curves (ESC), which are in turn used for computing Least Upper Delay Bounds (LUDBs) of individual flows. In a realistic case study we find that the end-to-end delay bound is up to 46.9% more accurate than the case without considering the traffic peak behavior. Likewise, results also show similar improvements for synthetic traffic patterns. The proposed methodology is implemented in C++ and has low run-time complexity, enabling quick evaluation for large and complex SoCs.
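
To give a feel for how the four VBR parameters enter a network-calculus delay bound, the sketch below computes the textbook single-node bound for one flow. It is not the ESC or LUDB derivation of the article, which handles aggregate FIFO scheduling across multiple nodes. It assumes the standard arrival curve α(t) = min(L + p·t, σ + ρ·t) and a rate-latency service curve β(t) = R·max(0, t − T); the service rate `R`, the latency `T`, and both function names are assumptions introduced here.

```python
def vbr_arrival(t: float, L: float, p: float, sigma: float, rho: float) -> float:
    """Arrival curve alpha(t) = min(L + p*t, sigma + rho*t) for t > 0."""
    return min(L + p * t, sigma + rho * t)


def single_node_delay_bound(L: float, p: float, sigma: float, rho: float,
                            R: float, T: float) -> float:
    """Worst-case delay of one VBR flow through a rate-latency server.

    The bound is the maximum horizontal distance between alpha and
    beta(t) = R * max(0, t - T). Assumes p > rho and sigma >= L.
    """
    if p <= rho or sigma < L:
        raise ValueError("expected p > rho and sigma >= L")
    if R < rho:
        raise ValueError("service rate below sustainable rate: unbounded delay")
    if R >= p:
        # The server keeps up with the peak rate; only the first burst L waits.
        return T + L / R
    # The two linear pieces of alpha meet at theta; the delay peaks there.
    theta = (sigma - L) / (p - rho)
    return T + L / R + theta * (p - R) / R


if __name__ == "__main__":
    print(single_node_delay_bound(L=1, p=4, sigma=5, rho=1, R=2, T=1))
```

Dropping the peak-rate term (treating the flow as a pure σ, ρ token bucket) gives the looser bound T + σ/R, which is one way to see why modelling peak behaviour can tighten the result.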