
Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2013"


Journal ArticleDOI
TL;DR: This paper proposes logic complexity reduction at the transistor level as an alternative approach to take advantage of the relaxation of numerical accuracy, and demonstrates the utility of these approximate adders in two digital signal processing architectures with specific quality constraints.
Abstract: Low power is an imperative requirement for portable multimedia devices employing various signal processing algorithms and architectures. In most multimedia applications, human beings can gather useful information from slightly erroneous outputs. Therefore, we do not need to produce exactly correct numerical outputs. Previous research in this context exploits error resiliency primarily through voltage overscaling, utilizing algorithmic and architectural techniques to mitigate the resulting errors. In this paper, we propose logic complexity reduction at the transistor level as an alternative approach to take advantage of the relaxation of numerical accuracy. We demonstrate this concept by proposing various imprecise or approximate full adder cells with reduced complexity at the transistor level, and utilize them to design approximate multi-bit adders. In addition to the inherent reduction in switched capacitance, our techniques result in significantly shorter critical paths, enabling voltage scaling. We design architectures for video and image compression algorithms using the proposed approximate arithmetic units and evaluate them to demonstrate the efficacy of our approach. We also derive simple mathematical models for error and power consumption of these approximate adders. Furthermore, we demonstrate the utility of these approximate adders in two digital signal processing architectures (discrete cosine transform and finite impulse response filter) with specific quality constraints. Simulation results indicate up to 69% power savings using the proposed approximate adders, when compared to existing implementations using accurate adders.
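As a rough illustration of the idea (not the specific transistor-level cells proposed in the paper), the sketch below models a hypothetical approximate full adder that simplifies the sum logic and enumerates its error rate over all input combinations:

```python
# Hypothetical approximate full adder: carry-out is computed exactly,
# but the sum bit is approximated as the complement of the carry-out.
# This is only an illustrative simplification, not the cells from the paper.

def exact_fa(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (b & cin) | (a & cin)
    return s, cout

def approx_fa(a, b, cin):
    _, cout = exact_fa(a, b, cin)
    return 1 - cout, cout        # sum approximated as NOT(cout)

errors = 0
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            if approx_fa(a, b, cin) != exact_fa(a, b, cin):
                errors += 1

print(f"erroneous input combinations: {errors}/8")   # 2/8 for this approximation
```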

637 citations


Journal ArticleDOI
TL;DR: An algorithm for computing depth-optimal decompositions of logical operations, leveraging a meet-in-the-middle technique to provide a significant speedup over simple brute force algorithms is presented.
Abstract: We present an algorithm for computing depth-optimal decompositions of logical operations, leveraging a meet-in-the-middle technique to provide a significant speedup over simple brute force algorithms. As an illustration of our method, we implemented this algorithm and found factorizations of commonly used quantum logical operations into elementary gates in the Clifford+T set. In particular, we report a decomposition of the Toffoli gate over the set of Clifford and T gates. Our decomposition achieves a total T-depth of 3, thereby providing a 40% reduction over the previously best known decomposition for the Toffoli gate. Due to the size of the search space, the algorithm is only practical for small parameters, such as the number of qubits, and the number of gates in an optimal implementation.
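To show the search strategy only (not the quantum-specific Clifford+T machinery or the T-depth cost function), here is a toy meet-in-the-middle search over a made-up gate library acting on a tiny state space: all half-length sequences are enumerated once and indexed, so a matching full-length decomposition is found without enumerating the full space.

```python
# Toy meet-in-the-middle search: find a sequence of "gates" (functions on a
# small state space) whose composition equals a target function.
from itertools import product

STATES = tuple(range(4))

# A tiny gate library, each gate given as a tuple mapping state -> state.
GATES = {
    "inc": tuple((s + 1) % 4 for s in STATES),
    "swap01": (1, 0, 2, 3),
    "swap23": (0, 1, 3, 2),
}

def compose(seq):
    """Function table of applying the gates in seq, left to right."""
    table = list(STATES)
    for g in seq:
        table = [GATES[g][s] for s in table]
    return tuple(table)

def mitm_search(target, half_len):
    # Enumerate all sequences of length half_len once, index them by their table.
    first_half = {}
    for seq in product(GATES, repeat=half_len):
        first_half.setdefault(compose(seq), seq)
    # seq1 + seq2 matches the target iff compose(seq1) equals the target
    # "undone" by seq2; all gates here are bijections, so we can invert seq2.
    for seq2 in product(GATES, repeat=half_len):
        t2 = compose(seq2)
        needed = tuple(t2.index(target[s]) for s in STATES)
        if needed in first_half:
            return first_half[needed] + seq2
    return None

target = compose(("inc", "swap01", "inc", "swap23"))
print(mitm_search(target, 2))
```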

495 citations


Journal ArticleDOI
TL;DR: The proposed SPICE model builds on existing models and is correlated against several published device characterization data with an average error of 6.04%.
Abstract: This paper presents a SPICE model for memristive devices. It builds on existing models and is correlated against several published device characterization data with an average error of 6.04%. When compared to existing alternatives, the proposed model can more accurately simulate a wide range of published memristors. The model is also tested in large circuits with up to 256 memristors, and was less likely to cause convergence errors when compared to other models. We show that the model can be used to study the impact of memristive device variation within a circuit. We examine the impact of nonuniformity in device state variable dynamics and conductivity on individual memristors as well as a four memristor read/write circuit. These studies show that the model can be used to predict how variation in a memristor wafer may impact circuit performance.
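For readers unfamiliar with memristor modeling, the following minimal sketch integrates the classic linear ion drift model (Strukov et al.), in which a normalized state variable controls the device resistance. It is not the generalized SPICE model of the paper; it only shows the state-variable/conductivity coupling that such models capture.

```python
# Minimal sketch of the classic linear ion drift memristor model, not the
# generalized SPICE model from the paper. State x = w/D in [0, 1] sets R.
import math

R_ON, R_OFF = 100.0, 16e3          # ohm
D = 10e-9                          # device thickness, m
MU_V = 1e-14                       # ion mobility, m^2 s^-1 V^-1

def simulate(v_of_t, dt, steps, x0=0.1):
    x = x0                          # normalized state w/D
    out = []
    for n in range(steps):
        v = v_of_t(n * dt)
        r = R_ON * x + R_OFF * (1.0 - x)
        i = v / r
        x += MU_V * R_ON / D**2 * i * dt   # dx/dt = mu_v * R_ON / D^2 * i
        x = min(max(x, 0.0), 1.0)          # hard window (simplest choice)
        out.append((n * dt, v, i, r))
    return out

# Drive with a 1 V, 1 Hz sine wave; the i-v trace forms the pinched
# hysteresis loop characteristic of memristive devices.
trace = simulate(lambda t: math.sin(2 * math.pi * t), dt=1e-4, steps=20000)
print("final resistance: %.1f ohm" % trace[-1][3])
```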

217 citations


Journal ArticleDOI
TL;DR: In this article, an intrusive spectral simulator for statistical circuit analysis is presented, which employs the recently developed generalized polynomial chaos expansion to perform uncertainty quantification of nonlinear transistor circuits with both Gaussian and non-Gaussian random parameters.
Abstract: Uncertainties have become a major concern in integrated circuit design. In order to avoid the huge number of repeated simulations in conventional Monte Carlo flows, this paper presents an intrusive spectral simulator for statistical circuit analysis. Our simulator employs the recently developed generalized polynomial chaos expansion to perform uncertainty quantification of nonlinear transistor circuits with both Gaussian and non-Gaussian random parameters. We modify the nonintrusive stochastic collocation (SC) method and develop an intrusive variant called stochastic testing (ST) method. Compared with the popular intrusive stochastic Galerkin (SG) method, the coupled deterministic equations resulting from our proposed ST method can be solved in a decoupled manner at each time point. At the same time, ST requires fewer samples and allows more flexible time step size controls than directly using a nonintrusive SC solver. These two properties make ST more efficient than SG and than existing SC methods, and more suitable for time-domain circuit simulation. Simulation results of several digital, analog and RF circuits are reported. Since our algorithm is based on generic mathematical models, the proposed ST algorithm can be applied to many other engineering problems.
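As a minimal sketch of the flavor of the approach (a scalar toy model, not the paper's intrusive transistor-level simulator), the snippet below fits a generalized polynomial chaos expansion in a single Gaussian parameter by evaluating the model at a small set of testing points and solving a square linear system, then compares the resulting mean with Monte Carlo.

```python
# gPC expansion in one Gaussian parameter, fitted by evaluating the model at a
# few testing points and solving a square system. Illustrative only.
import numpy as np
from numpy.polynomial import hermite_e as He   # probabilists' Hermite polynomials

def model(xi):
    """Toy circuit response as a function of a standard-normal parameter xi."""
    return np.exp(0.3 * xi) + 0.1 * xi**2

order = 4
# Use Gauss-Hermite(e) nodes as the testing points (one per unknown coefficient).
nodes, _ = He.hermegauss(order + 1)

# Matrix whose column k holds He_k evaluated at the testing points.
V = np.column_stack([He.hermeval(nodes, np.eye(order + 1)[k])
                     for k in range(order + 1)])
coeffs = np.linalg.solve(V, model(nodes))

# The mean of the response is the 0th coefficient; compare with Monte Carlo.
mc = model(np.random.default_rng(0).standard_normal(200_000)).mean()
print("gPC mean estimate:", coeffs[0], " Monte Carlo mean:", mc)
```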

167 citations


Journal ArticleDOI
TL;DR: This paper outlines specific sensing mechanisms that have been developed and their potential use in building underdesigned and opportunistic computing machines, whose software stack opportunistically adapts to sensed or modeled hardware.
Abstract: Microelectronic circuits exhibit increasing variations in performance, power consumption, and reliability parameters across the manufactured parts and across use of these parts over time in the field. These variations have led to increasing use of overdesign and guardbands in design and test to ensure yield and reliability with respect to a rigid set of datasheet specifications. This paper explores the possibility of constructing computing machines that purposely expose hardware variations to various layers of the system stack including software. This leads to the vision of underdesigned hardware that utilizes a software stack that opportunistically adapts to a sensed or modeled hardware. The envisioned underdesigned and opportunistic computing (UnO) machines face a number of challenges related to the sensing infrastructure and software interfaces that can effectively utilize the sensory data. In this paper, we outline specific sensing mechanisms that we have developed and their potential use in building UnO machines.

153 citations


Journal ArticleDOI
TL;DR: Two bounded-length maze routing (BLMR) algorithms are presented that perform much faster routing than traditional maze routing algorithms and a rectilinear Steiner minimum tree aware routing scheme is proposed to guide heuristic-BLMR and monotonic routing to build a routing tree with shorter wirelength.
Abstract: Modern global routers employ various routing methods to improve routing speed and quality. Maze routing is the most time-consuming process for existing global routing algorithms. This paper presents two bounded-length maze routing (BLMR) algorithms (optimal-BLMR and heuristic-BLMR) that perform much faster routing than traditional maze routing algorithms. In addition, a rectilinear Steiner minimum tree aware routing scheme is proposed to guide heuristic-BLMR and monotonic routing to build a routing tree with shorter wirelength. This paper also proposes a parallel multithreaded collision-aware global router based on a previous sequential global router (SGR). Unlike the partitioning-based strategy, the proposed parallel router uses a task-based concurrency strategy. Finally, a 3-D wirelength optimization technique is proposed to further refine the 3-D routing results. Experimental results reveal that the proposed SGR uses less wirelength and runs faster than most other state-of-the-art global routers with different sets of parameters. Compared to the proposed SGR, the proposed parallel router yields almost the same routing quality with average 2.71- and 3.12-fold speedups on overflow-free and hard-to-route cases, respectively, when running on a 4-core system.
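To make the bounded-length idea concrete (without the cost functions of optimal-/heuristic-BLMR or the parallel router), here is a minimal grid BFS that refuses to expand any path whose length could exceed a given wirelength bound:

```python
# Minimal sketch of bounded-length maze routing on a grid: a BFS that prunes
# any expansion whose length plus remaining Manhattan distance exceeds the bound.
from collections import deque

def bounded_maze_route(grid, src, dst, max_len):
    """grid[y][x] == 1 marks a blocked cell; returns a path or None."""
    h, w = len(grid), len(grid[0])
    best = {src: 0}
    prev = {}
    q = deque([src])
    while q:
        x, y = q.popleft()
        if (x, y) == dst:
            path = [(x, y)]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < w and 0 <= ny < h and not grid[ny][nx]:
                d = best[(x, y)] + 1
                # Prune: even the remaining Manhattan distance must fit the bound.
                if d + abs(dst[0] - nx) + abs(dst[1] - ny) > max_len:
                    continue
                if d < best.get((nx, ny), float("inf")):
                    best[(nx, ny)] = d
                    prev[(nx, ny)] = (x, y)
                    q.append((nx, ny))
    return None

blocked = [[0, 0, 0, 0],
           [0, 1, 1, 0],
           [0, 0, 0, 0]]
print(bounded_maze_route(blocked, (0, 0), (3, 0), max_len=5))
```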

142 citations


Journal ArticleDOI
TL;DR: A physical-aware system reconfiguration technique that uses sensor data at intermediate checkpoints to dynamically reconfigure the biochip and a cyberphysical resynthesis technique is used to recompute electrode-actuation sequences, thereby deriving new schedules, module placement, and droplet routing pathways, with minimum impact on the time-to-response.
Abstract: Droplet-based digital microfluidics technology has now come of age, and software-controlled biochips for healthcare applications are starting to emerge. However, today's digital microfluidic biochips suffer from the drawback that there is no feedback to the control software from the underlying hardware platform. Due to the lack of precision inherent in biochemical experiments, errors are likely during droplet manipulation; error recovery based on the repetition of experiments leads to wastage of expensive reagents and hard-to-prepare samples. By exploiting recent advances in the integration of optical detectors (sensors) into a digital microfluidics biochip, we present a physical-aware system reconfiguration technique that uses sensor data at intermediate checkpoints to dynamically reconfigure the biochip. A cyberphysical resynthesis technique is used to recompute electrode-actuation sequences, thereby deriving new schedules, module placement, and droplet routing pathways, with minimum impact on the time-to-response.

126 citations


Journal ArticleDOI
TL;DR: The DVFS transition overhead is redefined including the underclocking-related losses in a DVFS-enabled microprocessor, additional inductor IR losses, and power losses due to discontinuous-mode DC-DC conversion.
Abstract: Dynamic voltage and frequency scaling (DVFS) has been studied for well over a decade. Nevertheless, existing DVFS transition overhead models suffer from significant inaccuracies; for example, by incorrectly accounting for the effect of DC–DC converters, frequency synthesizers, voltage, and frequency change policies on energy losses incurred during mode transitions. Incorrect and/or inaccurate DVFS transition overhead models prevent one from determining the precise break-even time and thus forfeit some of the energy saving that is ideally achievable. This paper introduces accurate DVFS transition overhead models for both energy consumption and delay. In particular, we redefine the DVFS transition overhead to include the underclocking-related losses in a DVFS-enabled microprocessor, additional inductor IR losses, and power losses due to discontinuous-mode DC–DC conversion. We report the transition overheads for representative desktop, mobile, and low-power processors. We also present a DVFS transition overhead macromodel for use by high-level DVFS schedulers.
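To illustrate the role the transition overhead plays for a scheduler, the small sketch below computes a break-even residence time from a lumped per-transition energy loss. The constants and the simple two-mode model are invented for illustration; the paper derives the overhead from detailed converter and processor models.

```python
# Illustrative break-even computation: how long must the processor stay in the
# low-voltage/low-frequency mode for the saved power to amortize the energy
# overhead of the two mode transitions? Constants below are made up.

P_HIGH = 12.0      # W, power in the high V/f mode
P_LOW = 5.0        # W, power in the low V/f mode
E_TRANS = 40e-6    # J, energy lost per transition

def break_even_time(p_high, p_low, e_trans, transitions=2):
    """Minimum residence time in the low mode that pays back the transitions."""
    return transitions * e_trans / (p_high - p_low)

t_be = break_even_time(P_HIGH, P_LOW, E_TRANS)
print(f"break-even residence time: {t_be * 1e6:.1f} us")
```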

122 citations


Journal ArticleDOI
TL;DR: This paper surveys key design for manufacturing issues for extreme scaling with emerging nanolithography technologies, including double/multiple patterning lithography, extreme ultraviolet lithography, and electron-beam lithography.
Abstract: In this paper, we survey key design for manufacturing issues for extreme scaling with emerging nanolithography technologies, including double/multiple patterning lithography, extreme ultraviolet lithography, and electron-beam lithography. These nanolithography and nanopatterning technologies have different manufacturing processes and pose their own unique challenges to very large scale integration (VLSI) physical design, mask synthesis, and so on. It is essential to have close VLSI design and underlying process technology co-optimization to achieve high product quality (power/performance, etc.) and yield while making future scaling cost-effective and worthwhile. Recent results and examples will be discussed to show the enablement and effectiveness of such design and process integration, including lithography model/analysis, mask synthesis, and lithography friendly physical design.

113 citations


Journal ArticleDOI
TL;DR: This paper presents a 3-D mesh-based ONoC for MPSoCs, and new low-cost nonblocking 4 × 4, 5 × 5, 6 × 6, and 7 × 7 optical routers for dimension-order routing in the 3-D mesh-based ONoC, and proposes an optimized floorplan.
Abstract: Optical networks-on-chip (ONoCs) are emerging communication architectures that can potentially offer ultrahigh communication bandwidth and low latency to multiprocessor systems-on-chip (MPSoCs). In addition to ONoC architectures, 3-D integrated technologies offer an opportunity to continue performance improvements with higher integration densities. In this paper, we present a 3-D mesh-based ONoC for MPSoCs, and new low-cost nonblocking 4 × 4, 5 × 5, 6 × 6, and 7 × 7 optical routers for dimension-order routing in the 3-D mesh-based ONoC. Besides, we propose an optimized floorplan for the 3-D mesh-based ONoC. The floorplan follows the regular 3-D mesh topology but implements all optical routers in a single optical layer. The floorplan is optimized to minimize the number of extra waveguide crossings caused when merging the 3-D ONoC to one optical layer. Based on a set of real applications and uniform traffic pattern, we develop a SystemC-based cycle-accurate NoC simulator and compare the 3-D mesh-based ONoC with the matched 2-D mesh-based ONoC and 2-D electronic NoC for performance and energy efficiency. Additionally, we quantitatively analyze thermal effects on the 3-D 8 × 8 × 2 mesh-based ONoC.

104 citations


Journal ArticleDOI
TL;DR: This work proposes a smart diagnosis method based on two ML classification models, namely, artificial neural networks (ANNs) and support-vector machines (SVMs) that can learn from repair history and accurately localize the root cause of a failure.
Abstract: Increasing integration densities and high operating speeds lead to subtle manifestation of defects at the board level. Functional fault diagnosis is, therefore, necessary for board-level product qualification. However, ambiguous diagnosis results lead to long debug times and even wrong repair actions, which significantly increase repair cost and adversely impact yield. Advanced machine-learning (ML) techniques offer an unprecedented opportunity to increase the accuracy of board-level functional diagnosis and reduce high-volume manufacturing cost through successful repair. We propose a smart diagnosis method based on two ML classification models, namely, artificial neural networks (ANNs) and support-vector machines (SVMs) that can learn from repair history and accurately localize the root cause of a failure. Fine-grained fault syndromes extracted from failure logs and corresponding repair actions are used to train the classification models. We also propose a decision machine based on weighted-majority voting, which combines the benefits of ANNs and SVMs. Three complex boards from the industry, currently in volume production, and additional synthetic data, are used to validate the proposed methods in terms of diagnostic accuracy, resolution, and quantifiable improvement over current diagnostic software.
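A minimal sketch of the two-classifier idea is shown below: train a neural network and an SVM on (fault-syndrome, repair-action) pairs and combine them by weighted-majority voting. scikit-learn is assumed, and the syndrome data here is synthetic, not board repair history.

```python
# Train an MLP and an SVM on synthetic syndrome data and combine them by
# weighted voting; illustrative only, not the paper's industrial setup.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_syndromes, n_classes = 40, 4
X = rng.integers(0, 2, size=(600, n_syndromes)).astype(float)
# Synthetic ground truth: the repair action depends on a few syndrome bits.
y = (X[:, 0] + 2 * X[:, 1] + X[:, 2] * X[:, 3]).astype(int) % n_classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)
svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_tr)

# Weighted-majority vote: weight each model by its training accuracy.
w_ann, w_svm = ann.score(X_tr, y_tr), svm.score(X_tr, y_tr)
votes = w_ann * ann.predict_proba(X_te) + w_svm * svm.predict_proba(X_te)
combined = votes.argmax(axis=1)

print("ANN:", ann.score(X_te, y_te),
      "SVM:", svm.score(X_te, y_te),
      "vote:", (combined == y_te).mean())
```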

Journal ArticleDOI
TL;DR: The automatic layout generation is demonstrated here using the LAYGEN II tool for typical analog circuit structures, and the results in GDSII format were validated using the industrial grade verification tool Calibre®.
Abstract: This paper describes an innovative design automation tool, LAYGEN II, for analog integrated circuit (IC) layout generation based on template descriptions and on evolutionary computation techniques. LAYGEN II was developed giving special emphasis to the reusability of expert knowledge and to the efficiency of retargeting operations. The designer specifies the sized circuit-level structure, the required technology and also, the layout template consisting of technology and specification independent high-level layout guidelines. For placement, the topological relations present in the template are extracted to a nonslicing B*-tree layout representation, and the tool automatically merges devices and improves the floorplan quality. For routing an optimization kernel consisting of a tailored version of the multiobjective multiconstraint evolutionary algorithm NSGA-II is used. The Router optimizes all nets simultaneously and uses a built-in engine to evaluate each of the layout solutions. The automatic layout generation is demonstrated here using the LAYGEN II tool for typical analog circuit structures, and the results in GDSII format were validated using the industrial grade verification tool Calibre®.

Journal ArticleDOI
TL;DR: A floating random walk (FRW) solver, called RWCap, is presented for the capacitance extraction of very-large-scale integration (VLSI) interconnects and it is demonstrated that the parallel RWCap is over 6× faster than its serial-computing version.
Abstract: A floating random walk (FRW) solver, called RWCap, is presented for the capacitance extraction of very-large-scale integration (VLSI) interconnects. An approach, including the numerical characterization of the cross-interface transition probability and weight value, is proposed to accelerate the extraction of structures with multiple dielectric layers. A comprehensive variance reduction scheme based on importance sampling and stratified sampling is proposed to improve the convergence rate of the FRW algorithm. Finally, the space management technique using an octree data structure and the parallel computing technique are presented to further improve the efficiency. Numerical experiments are carried out with test cases generated under the 180 and 45-nm process technologies. They demonstrate that the proposed multidielectric FRW algorithm achieves up to 160× speedup over the FRW algorithm using spherical transition domains to cross dielectric interfaces, with very small memory overhead. The variance reduction techniques further bring 3× or more speedup without memory overhead or loss of accuracy. The RWCap also outperforms other existing FRW algorithms and fast boundary element method solvers in terms of computational time or scalability. The experiments on an 8-core CPU machine show that the parallel RWCap is over 6× faster than its serial-computing version.
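The Monte Carlo primitive underlying floating-random-walk field solvers can be seen in a trivially simple setting: a random walk between two conductors estimates the electrostatic potential, which in 1-D has a known closed form. The sketch below shows only this primitive, not the paper's 3-D multidielectric transitions or variance-reduction machinery.

```python
# Toy 1-D problem: conductors at x = 0 (held at 0 V) and x = N (1 V).
# A symmetric random walk started at x0 hits the 1 V conductor with
# probability x0/N, which equals the Laplace-equation potential at x0.
import random

def walk_potential(x0, N, n_walks=200_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_walks):
        x = x0
        while 0 < x < N:
            x += rng.choice((-1, 1))
        hits += (x == N)          # reached the 1 V conductor
    return hits / n_walks

N, x0 = 10, 3
print("estimated potential:", walk_potential(x0, N), " exact:", x0 / N)
```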

Journal ArticleDOI
TL;DR: The results show that GoldMine can generate complex, high coverage assertions for sequential as well as combinational designs in RTL, thereby minimizing human effort in this process.
Abstract: We present GoldMine, a methodology for generating assertions automatically in hardware. Our method involves a combination of data mining and static analysis of the register transfer level (RTL) design. The RTL design is first simulated to generate data about the design's dynamic behavior. The generated data is then mined for “candidate assertions” that are likely to be invariants. The data mining algorithm is a decision-tree-based supervised learning algorithm. These candidate assertions are then passed through a formal verification engine to filter out the spurious candidates. The assertions that are attested as true by the formal engine are system invariants. These are then evaluated by a process of designer ranking that is provided as feedback to the data mining engine. We demonstrate the scalability of GoldMine by showing assertion generation of the RTL of Sun's OpenSparc T2 many-threaded processor. Our results show that GoldMine can generate complex, high coverage assertions for sequential as well as combinational designs in RTL, thereby minimizing human effort in this process. GoldMine assertions distill the random input stimulus space and can be used for calibrating directed tests. They can be used in a regression test suite of an evolving RTL. They are also useful in providing differing perspectives from the designer, as well as hints to designers for manually writing assertions.
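The mine-then-check flow can be illustrated on a toy scale: propose candidate implications that hold on every simulated sample, then check the survivors exhaustively (standing in for the formal engine). GoldMine itself mines with decision-tree learning and checks with formal verification; the block and the mining rule below are invented for illustration.

```python
# Tiny sketch of the mine-then-check flow on a combinational block:
# candidates are single-literal implications (input_i == v) -> (output == w).
from itertools import product

def dut(a, b, c):                 # design under test: a simple combinational block
    return (a & b) | c

inputs = ["a", "b", "c"]

# 1) "Simulate": sample a subset of the input space.
trace = [dict(zip(inputs, vec), out=dut(*vec)) for vec in
         [(0, 0, 1), (1, 1, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]]

# 2) Mine candidates: implications consistent with every trace sample.
candidates = []
for var, val, out_val in product(inputs, (0, 1), (0, 1)):
    if all(s["out"] == out_val for s in trace if s[var] == val) and \
       any(s[var] == val for s in trace):
        candidates.append((var, val, out_val))

# 3) "Formally" check the candidates (exhaustive here because the DUT is tiny).
proven = [c for c in candidates
          if all(dut(*vec) == c[2]
                 for vec in product((0, 1), repeat=3)
                 if dict(zip(inputs, vec))[c[0]] == c[1])]

print("candidates:", candidates)
print("proven assertions:", proven)   # only ('c', 1, 1): c == 1 -> out == 1
```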

Journal ArticleDOI
TL;DR: A novel TSV repair framework is presented, including a hardware redundancy architecture that enables faulty TSVs to be repaired by redundant TSVs that are farther apart, the corresponding repair algorithm and the redundancy architecture construction, which can improve the manufacturing yield for 3-D-stacked ICs.
Abstract: 3-D-stacked integrated circuits (ICs) that employ through-silicon vias (TSVs) to connect multiple dies vertically have gained wide-spread interest in the semiconductor industry. In order to be commercially viable, the assembly yield for 3-D-stacked ICs must be as high as possible, requiring TSVs to be reparable. Existing techniques typically assume TSV faults to be uniformly distributed and use neighboring TSVs to repair faulty ones, if any. In practice, however, clustered TSV faults are quite common due to the fact that the TSV bonding quality depends on surface roughness and cleanness of silicon dies, rendering prior TSV redundancy solutions less effective. Furthermore, existing techniques consume a lot of redundant TSVs that are still costly in the current TSV process. This inefficient TSV redundancy can limit the amount of TSVs that is allowed to use and may even become the obstacle to commercial production. To resolve this problem, we present a novel TSV repair framework, including a hardware redundancy architecture that enables faulty TSVs to be repaired by redundant TSVs that are farther apart, the corresponding repair algorithm and the redundancy architecture construction. By doing so, the manufacturing yield for 3-D-stacked ICs can be dramatically improved, as demonstrated in our experimental results.

Journal ArticleDOI
TL;DR: This paper addresses formal verification of combinational arithmetic circuits over Galois fields of the type F_{2^k} using a computer-algebra/algebraic-geometry-based approach and demonstrates the ability of this approach to verify the correctness of, and detect bugs in, up to 163-bit circuits in F_{2^{163}}, whereas verification utilizing contemporary techniques proves infeasible.
Abstract: Galois field arithmetic is a critical component in communication and security-related hardware, requiring dedicated arithmetic architectures for better performance. In many Galois field applications, such as cryptography, the data-path size in the circuits can be very large. Formal verification of such circuits is beyond the capabilities of contemporary verification techniques. This paper addresses formal verification of combinational arithmetic circuits over Galois fields of the type F_{2^k} using a computer-algebra/algebraic-geometry-based approach. The verification problem is formulated as membership testing of a given specification polynomial in a corresponding ideal generated by the circuit constraints. Ideal membership testing requires the computation of a Gröbner basis, which is computationally very expensive. To overcome this limitation, we analyze the circuit topology and derive a term order to represent the polynomials. Subsequently, using the theory of Gröbner bases over F_{2^k}, we show that this term order renders the set of polynomials itself a minimal Gröbner basis of this ideal. Consequently, the verification test reduces to a much simpler case of Gröbner basis reduction via polynomial division, significantly enhancing verification efficiency. To further improve our approach, we exploit the concepts presented in the F4 algorithm for Gröbner bases, and show that the verification test can be formulated as Gaussian elimination on a matrix representation of the problem. Finally, we demonstrate the ability of our approach to verify the correctness of, and detect bugs in, up to 163-bit circuits in F_{2^{163}}, whereas verification utilizing contemporary techniques proves infeasible.
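The "reduce the specification polynomial by the circuit polynomials" idea can be shown on a toy scale over GF(2) in the Boolean ring, using substitution in reverse topological order. This is only an illustration of polynomial reduction; it does not reproduce the paper's Gröbner-basis theory over F_{2^k} or its F4-style matrix formulation.

```python
# A polynomial over GF(2) is a set of monomials; a monomial is a frozenset of
# variables (x^2 = x in the Boolean ring). Reducing the spec to the empty set
# means the circuit implements the specification.

def poly(*monomials):
    return set(frozenset(m) for m in monomials)

def add(p, q):          # XOR of monomial sets (coefficients mod 2)
    return p ^ q

def mul(p, q):          # product with idempotent variables (x*x = x)
    r = set()
    for a in p:
        for b in q:
            r ^= {a | b}
    return r

def substitute(p, var, defn):
    """Replace 'var' in p by the polynomial 'defn'."""
    out = set()
    for m in p:
        if var in m:
            out = add(out, mul(poly(m - {var}), defn))
        else:
            out = add(out, poly(m))
    return out

# Circuit: the sum bit of a full adder built from two XOR gates:
#   t = a + b, s = t + cin   (XOR is addition over GF(2))
gates = [("s", poly({"t"}, {"cin"})),
         ("t", poly({"a"}, {"b"}))]

# Specification polynomial: s + a + b + cin, which should reduce to 0.
spec = poly({"s"}, {"a"}, {"b"}, {"cin"})

# Reduce by substituting gate definitions in reverse topological order.
for out_var, defn in gates:
    spec = substitute(spec, out_var, defn)

print("reduced spec:", spec)   # empty set -> circuit matches the specification
```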

Journal ArticleDOI
TL;DR: An adaptive sparse matrix solver called NICSLU is proposed, which uses a multithreaded parallel LU factorization algorithm on shared-memory computers with multicore/multisocket central processing units to accelerate circuit simulation.
Abstract: The sparse matrix solver has become a bottleneck in simulation program with integrated circuit emphasis (SPICE)-like circuit simulators. It is difficult to parallelize the solver because of the high data dependency during the numeric LU factorization and the irregular structure of circuit matrices. This paper proposes an adaptive sparse matrix solver called NICSLU, which uses a multithreaded parallel LU factorization algorithm on shared-memory computers with multicore/multisocket central processing units to accelerate circuit simulation. The solver can be used in all the SPICE-like circuit simulators. A simple method is proposed to predict whether a matrix is suitable for parallel factorization, such that each matrix can achieve optimal performance. The experimental results on 35 matrices reveal that NICSLU achieves speedups of 2.08×-8.57× (on the geometric mean), compared with KLU, with 1-12 threads, for the matrices which are suitable for the parallel algorithm. NICSLU can be downloaded from http://nicslu.weebly.com.

Journal ArticleDOI
TL;DR: This paper proposes a new 3-D cell placement algorithm that can additionally consider the sizes of TSVs and the physical positions for TSV insertion during placement, and can achieve the best routed wirelength, TSV counts, and total silicon area, in shortest running time.
Abstract: Through-silicon vias (TSVs) are required for transmitting signals among different dies for the 3-D integrated circuit (IC) technology. The significant silicon areas occupied by TSVs bring critical challenges for 3-D IC placement. Unlike most published 3-D placement works that only minimize the number of TSVs during placement due to the limitations in their techniques, this paper proposes a new 3-D cell placement algorithm that can additionally consider the sizes of TSVs and the physical positions for TSV insertion during placement. The algorithm consists of three stages: 1) 3-D analytical global placement with density optimization and whitespace reservation for TSVs; 2) TSV insertion and TSV-aware legalization; and 3) layer-by-layer detailed placement. In particular, the global placement is based on a novel weighted-average (WA) wirelength model, giving the first published model that can outperform the well-known log-sum-exp wirelength model theoretically and empirically. Also, a scheme is proposed to enhance the numerical stability of the WA wirelength model. Furthermore, 3-D routing can easily be accomplished by traditional 2-D routers since the physical positions of TSVs are determined during placement. Experimental results show the effectiveness of our algorithm. Compared with state-of-the-art 3-D cell placement works, our algorithm can achieve the best routed wirelength, TSV counts, and total silicon area, in shortest running time.
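For reference, the weighted-average (WA) and log-sum-exp (LSE) wirelength models are both smooth approximations of half-perimeter wirelength along one axis. The formulas below follow the standard definitions in the placement literature; the paper's contribution is, among other things, a numerical-stability scheme for the WA model and its use inside a full TSV-aware 3-D placer.

```python
# WA versus LSE smooth wirelength along one axis, compared with exact HPWL.
import numpy as np

def hpwl(x):
    return x.max() - x.min()

def lse_wl(x, gamma):
    return gamma * (np.log(np.exp(x / gamma).sum())
                    + np.log(np.exp(-x / gamma).sum()))

def wa_wl(x, gamma):
    ex, enx = np.exp(x / gamma), np.exp(-x / gamma)
    return (x * ex).sum() / ex.sum() - (x * enx).sum() / enx.sum()

pins = np.array([2.0, 3.5, 7.0, 9.0])
for gamma in (4.0, 1.0, 0.25):
    print(f"gamma={gamma:5}: HPWL={hpwl(pins):.2f}  "
          f"LSE={lse_wl(pins, gamma):.2f}  WA={wa_wl(pins, gamma):.2f}")
```

As gamma shrinks, both models converge to the exact half-perimeter value; the interest of the WA form is its tighter approximation for a given gamma.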

Journal ArticleDOI
TL;DR: It is argued that a split manufacturing approach to hardware trust based on 3-D integration is viable and provides several advantages over other approaches.
Abstract: Securing the supply chain of integrated circuits is of utmost importance to computer security. In addition to counterfeit microelectronics, the theft or malicious modification of designs in the foundry can result in catastrophic damage to critical systems and large projects. In this letter, we describe a 3-D architecture that splits a design into two separate tiers: one tier that contains critical security functions is manufactured in a trusted foundry; another tier is manufactured in an unsecured foundry. We argue that a split manufacturing approach to hardware trust based on 3-D integration is viable and provides several advantages over other approaches.

Journal ArticleDOI
TL;DR: This paper examines the GC process and proposes a semipreemptible GC (PGC) scheme that allows GC processing to be preempted while pending I/O requests in the queue are serviced, and further enhances flash performance by pipelining internal GC operations and merging them with pending I/O requests whenever possible.
Abstract: Unlike hard disks, flash devices use out-of-place update operations and require a garbage collection (GC) process to reclaim invalid pages and create free blocks. This GC process is a major cause of performance degradation when running concurrently with other I/O operations, as internal bandwidth is consumed to reclaim these invalid pages. The invocation of the GC process is generally governed by a low watermark on free blocks and other internal device metrics that different workloads meet at different intervals. This results in an I/O performance that is highly dependent on workload characteristics. In this paper, we examine the GC process and propose a semipreemptible GC (PGC) scheme that allows GC processing to be preempted while pending I/O requests in the queue are serviced. Moreover, we further enhance flash performance by pipelining internal GC operations and merging them with pending I/O requests whenever possible. Our experimental evaluation of this semi-PGC scheme with realistic workloads demonstrates both improved performance and reduced performance variability. Write-dominant workloads show up to a 66.56% improvement in average response time with a 83.30% reduced variance in response time compared to the non-PGC scheme. In addition, we explore opportunities of a new NAND flash device that supports suspend/resume commands for read, write, and erase operations for fully PGC (F-PGC). Our experiments with an F-PGC enabled flash device show that request response time can be improved by up to 14.57% compared to semi-PGC.
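A toy event-loop sketch of the core preemption decision follows: between individual page copies of a GC victim block, check the host I/O queue and service pending requests first. All structures, timings, and the workload are invented for illustration; the paper additionally pipelines and merges GC operations with host I/O.

```python
# Toy model: GC copies valid pages one at a time and, if semipreemptible,
# yields to any host I/O that has arrived between page copies.
from collections import deque

PAGE_COPY_US, HOST_IO_US = 200, 100

def run(valid_pages_to_move, host_arrivals, preemptible=True):
    """host_arrivals: request arrival times (us). Returns response times."""
    queue, t, responses, moved = deque(), 0, [], 0
    arrivals = deque(sorted(host_arrivals))
    while moved < valid_pages_to_move or arrivals or queue:
        while arrivals and arrivals[0] <= t:
            queue.append(arrivals.popleft())
        if queue and (preemptible or moved >= valid_pages_to_move):
            arrived = queue.popleft()
            t += HOST_IO_US
            responses.append(t - arrived)
        elif moved < valid_pages_to_move:
            t += PAGE_COPY_US          # copy one valid page of the victim block
            moved += 1
        else:
            t = arrivals[0]            # idle until the next request arrives
    return responses

arrivals = [50, 250, 450]
for mode in (False, True):
    r = run(valid_pages_to_move=8, host_arrivals=arrivals, preemptible=mode)
    print(("semi-preemptible" if mode else "non-preemptible"),
          "avg response:", sum(r) / len(r), "us")
```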

Journal ArticleDOI
TL;DR: A concurrent error detection technique called recomputing with permuted operands (REPO) is developed that is cost effective in advanced encryption standard (AES) and a secure hash function Grøstl and achieves close to 100% fault coverage for multiple byte faults.
Abstract: Naturally occurring and maliciously injected faults reduce the reliability of cryptographic hardware and may leak confidential information. We develop a concurrent error detection (CED) technique called recomputing with permuted operands (REPO). We show that it is cost effective in the advanced encryption standard (AES) and the secure hash function Grøstl. We provide experimental results and formal proofs to show that REPO detects all single-bit and single-byte faults. Experimental results show that REPO achieves close to 100% fault coverage for multiple byte faults. The hardware and throughput overheads are compared with those of previously reported CED techniques on two Xilinx Virtex FPGAs. The hardware overhead is 12.4%-27.3%, and the throughput is 1.2-23 Gbps, depending on the AES architecture, FPGA family, and detection latency. The performance overhead ranges from 10% to 100% depending on the security level. Moreover, the proposed technique can be integrated into various block cipher modes of operation. We also discuss the limitation of REPO and its potential vulnerabilities.
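The recompute-with-permuted-operands principle can be illustrated on a toy byte-wise operation: compute normally, recompute with byte-permuted operands, un-permute the result, and compare. A fault pinned to one physical byte lane corrupts different logical bytes in the two runs, so the comparison catches it. The datapath, permutation, and fault model below are invented; the paper applies the idea to AES and Grøstl hardware.

```python
# REPO-style check on a toy per-lane operation with an optional stuck lane.
LANES = 4
PERM = [1, 2, 3, 0]                      # rotate the byte lanes by one

def bytewise_op(a, b, fault_lane=None):
    """Toy datapath: per-lane modular add; optionally a stuck fault in one lane."""
    out = [(x + y) % 256 for x, y in zip(a, b)]
    if fault_lane is not None:
        out[fault_lane] = 0xFF           # stuck-at fault in that physical lane
    return out

def permute(v):
    return [v[PERM[i]] for i in range(LANES)]

def unpermute(v):
    out = [0] * LANES
    for i in range(LANES):
        out[PERM[i]] = v[i]
    return out

def repo_check(a, b, fault_lane=None):
    normal = bytewise_op(a, b, fault_lane)
    permuted = unpermute(bytewise_op(permute(a), permute(b), fault_lane))
    return normal == permuted            # False -> fault detected

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print("fault-free consistent:", repo_check(a, b))            # True
print("faulty lane detected:", not repo_check(a, b, 2))       # True
```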

Journal ArticleDOI
TL;DR: Two polynomial time algorithms, RDPM and RDPM-DUP, have been proposed to generate near-optimal data placement with minimum total cost, effectively utilizing SPMs on multicore systems.
Abstract: Scratch pad memories (SPM) are attractive alternatives for caches on multicore systems since caches are relatively expensive in terms of area and energy consumption. The key to effectively utilizing SPMs on multicore systems is the data placement algorithm. In this paper, two polynomial time algorithms, regional data placement for multicore (RDPM) and regional data placement for multicore with duplication (RDPM-DUP), have been proposed to generate near-optimal data placement with minimum total cost. RDPM keeps only one copy of each data item, while RDPM-DUP allows data duplication. Experimental results show that the proposed RDPM algorithm alone can reduce the time cost of memory accesses by 32.68% on average compared with existing algorithms. With data duplication, the RDPM-DUP algorithm further reduces the time cost by 40.87%. In terms of energy consumption, the proposed RDPM algorithm with exclusive copies can reduce the total cost by 33.47% on average. When RDPM-DUP is applied, the improvement increases up to 38.15% on average.

Journal ArticleDOI
TL;DR: A rigorous analytical thermal model has been formulated for the analysis of self-heating effects in FinFETs, under both steady-state and transient stress conditions, which is critical for improving circuit performance and electrical overstress/electrostatic discharge (ESD) reliability.
Abstract: A rigorous analytical thermal model has been formulated for the analysis of self-heating effects in FinFETs, under both steady-state and transient stress conditions. 3-D self-consistent electrothermal simulations, tuned with experimentally measured electrical characteristics, were used to understand the nature of self-heating in FinFETs and calibrate the proposed model. The accuracy of the model has been demonstrated for a wide range of multifin devices by comparing it against finite element simulations. The model has been applied to carry out a detailed sensitivity analysis of self-heating with respect to various FinFET parameters and structures, which are critical for improving circuit performance and electrical overstress/electrostatic discharge (ESD) reliability. The transient model has been used to estimate the thermal time constants of these devices and predict the sensitivity of power-to-failure to various device parameters, for both long and short pulse ESD situations. Suitable modifications to the model are also proposed for evaluating the thermal characteristics of production level FinFET (or Tri-gate FET) structures involving metal-gates, body-tied bulk FinFETs, and trench contacts.

Journal ArticleDOI
TL;DR: A comprehensive comparative analysis of virtual channels and multiple physical networks, including an analytical model, synthesis-based designs with both FPGAs and standard-cell libraries, and system-level simulations identifies the scenarios where each method is best suited to achieve high performance, very low power dissipation, and increased design flexibility.
Abstract: Virtual channels (VC) and multiple physical (MP) networks are two alternative methods to provide better performance, support quality-of-service, and avoid protocol deadlocks in packet-switched network-on-chip design. Since contention can be dynamically resolved, VCs give lower zero-load packet latency than MPs; however, MPs can be built with simpler routers and narrower channels, which improves the target clock frequency, power dissipation, and area occupation. In this paper, we present a comprehensive comparative analysis of these two design approaches, including an analytical model, synthesis-based designs with both FPGAs and standard-cell libraries, and system-level simulations. The result of our analysis shows that one solution does not outperform the other in all the tested scenarios. Instead, each approach has its own specific strengths and weaknesses. Hence, we identify the scenarios where each method is best suited to achieve high performance, very low power dissipation, and increased design flexibility.

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed technique can achieve reliability-oriented placement for DMFBs without excessive actuation in each electrode, while optimizing bioassay completion time.
Abstract: In recent studies, digital microfluidic biochips (DMFBs) have been a promising solution for lab-on-a-chip and bio-assay experiments because of their flexible application and low fabrication cost. However, the reliability problem is an imperative issue to guarantee the valid function of DMFBs. The reliability of DMFBs decreases when electrodes are excessively actuated, preventing droplets on DMFBs from being controlled successfully. Because the placement for bio-assays in DMFBs is a key step in generating corresponding actuating signals, the reliability of DMFBs must be considered during biochip placement to avoid excessive actuation. Although researchers have proposed several DMFB placement algorithms, they have failed to consider the reliability issue. In addition, previous algorithms were all based on the simulated-annealing (SA) method, which is time consuming and does not guarantee an optimal solution. This paper proposes the first reliability-oriented non-SA placement algorithm for DMFBs. This approach considers the reliability problem during placement, and uses the 3-D deferred decision making (3D-DDM) technique to enumerate only possible placement solutions. Large-scale DMFB placement can be synthesized efficiently by partitioning the operation sequential graph of bioassays. Experimental results demonstrate that the proposed technique can achieve reliability-oriented placement for DMFBs without excessive actuation in each electrode, while optimizing bioassay completion time.

Journal ArticleDOI
TL;DR: A multitarget sample preparation algorithm that extensively exploits the ideas of waste recycling and intermediate droplet sharing to reduce both reactant usage and waste amount for digital microfluidic biochips is proposed.
Abstract: Sample preparation is one of the essential processes in biochemical reactions. Raw reactants are diluted in this process to achieve given target concentrations. A bioassay may require several different target concentrations of a reactant. Both the dilution operation count and the reactant usage can be minimized if multiple target concentrations are considered simultaneously during sample preparation. Hence, in this paper, we propose a multitarget sample preparation algorithm that extensively exploits the ideas of waste recycling and intermediate droplet sharing to reduce both reactant usage and waste amount for digital microfluidic biochips. Experimental results show that our waste recycling algorithm can reduce the waste and operation count by 48% and 37%, respectively, as compared to an existing state-of-the-art multitarget sample preparation method if the number of target concentrations is ten. The reduction can be up to 97% and 73% when the number of target concentrations goes even higher.
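For context, under the (1:1) mixing model commonly assumed on digital microfluidic biochips, each mix step averages the concentrations of two unit droplets, so a single target concentration with n bits of precision is reachable by a bit-by-bit dilution sequence of n-1 mixes. The sketch below shows only this single-target baseline, not the paper's multitarget waste-recycling and droplet-sharing algorithm.

```python
# Baseline bit-by-bit dilution under the (1:1) mixing model.
def dilution_sequence(target, bits=8):
    """Return mix steps reaching `target` (in [0,1], precision 2**-bits)."""
    code = round(target * (1 << bits))
    seq, conc = [], None
    for i in range(bits):                        # scan bits from LSB to MSB
        unit = 1.0 if (code >> i) & 1 else 0.0   # pure reactant (1) or buffer (0)
        if conc is None:
            conc = unit
        else:
            seq.append((conc, unit, (conc + unit) / 2))
            conc = (conc + unit) / 2
    return conc, seq

final, steps = dilution_sequence(0.3671875)      # = 94/256
print("achieved concentration:", final)
for c1, c2, out in steps:
    print(f"mix {c1:.6f} with {c2:.6f} -> {out:.6f}")
```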

Journal ArticleDOI
TL;DR: This paper presents a versatile prebond TSV test method applicable before wafer thinning, when the deep end of the TSV is inaccessible because it is buried in the still-thick wafer.
Abstract: Testing the quality of prebond through-silicon vias (TSV) is a vital part of the Known-Good-Die test that is often necessary to retain a high compound yield for 3-D stacked integrated circuits. In this paper, we present a versatile prebond TSV test method applicable before wafer thinning, when the deep end of the TSV is inaccessible because it is buried in the still-thick wafer. Technical merits include: 1) the ability to handle both the resistive open fault and the leakage fault in the same test structure; 2) a capability that allows a user to have a better measure of the severity of the fault; and 3) an all-digital, easy-to-implement design-for-testability circuit.

Journal ArticleDOI
TL;DR: This paper proposes an accurate model of SystemC and three complementary encodings of SystemC to finite-state processes, sequential and threaded programming models, and shows the effectiveness of the threaded and of the finite-model encodings to prove and disprove properties, respectively.
Abstract: SystemC is an increasingly used language for writing executable specifications of systems-on-chip. The verification of SystemC, however, is a very difficult challenge. Simulation features great scalability, but can miss important defects. On the other hand, formal verification of SystemC is extremely hard because of the presence of threads, and the intricacies of the communication and scheduling mechanisms. In this paper, we explore formal verification for SystemC by means of software model checking techniques, which have demonstrated substantial progress in recent years. We propose an accurate model of SystemC and three complementary encodings of SystemC to finite-state processes, sequential and threaded programming models. We implement the proposed approaches in a tool chain and carry out a thorough experimental evaluation using several benchmarks taken from the literature on SystemC verification, and experimenting with different state-of-the-art software model checkers. The results clearly show the applicability and efficiency of the proposed approaches. In particular, the results show the effectiveness of the threaded and of the finite-model encodings to prove and disprove properties, respectively.

Journal ArticleDOI
TL;DR: This paper is the first to formally describe the global charge allocation problem in HEES systems, namely, distributing a specified level of incoming power to a subset of destination EES banks so that maximum charge allocation efficiency is achieved.
Abstract: A hybrid electrical energy storage (HEES) system consists of multiple banks of heterogeneous electrical energy storage (EES) elements placed between a power source and some load devices and providing charge storage and retrieval functions. For an HEES system to perform its desired functions of 1) reducing electricity costs by storing electricity obtained from the power grid at off-peak times when its price is lower, for use at peak times instead of electricity that must be bought then at higher prices, and 2) alleviating problems, such as excessive power fluctuation and undependable power supply, which are associated with the use of large amounts of renewable energy on the grid, appropriate charge management policies must be developed in order to efficiently store and retrieve electrical energy while attaining performance metrics that are close to the respective best values across the constituent EES banks in the HEES system. This paper is the first to formally describe the global charge allocation problem in HEES systems, namely, distributing a specified level of incoming power to a subset of destination EES banks so that maximum charge allocation efficiency is achieved. The problem is formulated as a mixed integer nonlinear program with the objective function set to the global charge allocation efficiency and the constraints capturing key requirements and features of the system, such as the energy conservation law, power conversion losses in the chargers, the rate capacity, and self-discharge effects in the EES elements. A rigorous algorithm is provided to obtain near-optimal charge allocation efficiency under a daily charge allocation schedule. A photovoltaic array is used as an example of the power source for the charge allocation process, and a heuristic is provided to predict the solar radiation level with a high accuracy. Simulation results using this photovoltaic cell array and a representative HEES system demonstrate up to 25% gain in the charge allocation efficiency by employing the proposed algorithm.

Journal ArticleDOI
TL;DR: The experimental results show that this analytical approach is effective for achieving tradeoffs between the wirelength and the through-silicon-via (TSV) number, and suggest that considering the thermal effects of TSVs is necessary and effective during the placement stage.
Abstract: In this paper, we present a high-quality analytical 3-D placement framework. We propose using a Huber-based local smoothing technique to work with a Helmholtz-based global smoothing technique to handle the nonoverlapping constraints. The experimental results show that this analytical approach is effective for achieving tradeoffs between the wirelength and the through-silicon-via (TSV) number. Compared to the state-of-the-art 3-D placer ntuplace3d, our placer achieves more than 20% wirelength reduction, on average, with a similar number of TSVs. Furthermore, we extend this analytical 3-D placement framework with thermal awareness. While 2-D thermal-aware placement simply follows uniform power distribution to minimize temperature, we show that the same criterion does not work for 3-D ICs. Instead, we are able to prove that when the TSV area in each bin is proportional to the lumped power consumption of that bin and the bins in all tiers directly above it, the peak temperature is minimized. Based on this criterion, we implement thermal awareness in our analytical 3-D placement framework. Compared with a TSV oblivious method, which only results in an 8% peak temperature reduction, our method reduces the peak temperature by 34%, on average, with slightly less wirelength overhead. These results suggest that considering the thermal effects of TSVs is necessary and effective during the placement stage.