Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2011"


Journal ArticleDOI
Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, Zhiru Zhang
TL;DR: AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx are used as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains.
Abstract: Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoptions of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS methodology is happening now, especially for field-programmable gate array (FPGA) designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper, we use AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. In particular, the experiment on a sphere decoder shows that the HLS solution can achieve an 11-31% reduction in FPGA resource usage with improved design productivity compared to hand-coded design.

728 citations


Journal ArticleDOI
TL;DR: QLMOR demonstrates that Volterra-kernel based nonlinear MOR techniques can in fact have far broader applicability than previously suspected, possibly being competitive with trajectory-based methods (e.g., trajectory piece-wise linear reduced order modeling) and nonlinear-projection based methods (e.g., maniMOR).
Abstract: We present a projection-based nonlinear model order reduction method, named model order reduction via quadratic-linear systems (QLMOR). QLMOR employs two novel ideas: 1) we show that nonlinear ordinary differential equations, and more generally differential-algebraic equations (DAEs) with many commonly encountered nonlinear kernels can be rewritten equivalently in a special representation, quadratic-linear differential algebraic equations (QLDAEs), and 2) we perform a Volterra analysis to derive the Volterra kernels, and we adapt the moment-matching reduction technique of nonlinear model order reduction method (NORM) [1] to reduce these QLDAEs into QLDAEs of much smaller size. Because of the generality of the QLDAE representation, QLMOR has significantly broader applicability than Taylor-expansion-based methods [1]-[3] since there is no approximation involved in the transformation from original DAEs to QLDAEs. Because the reduced model has only quadratic nonlinearities, its computational complexity is less than that of similar prior methods. In addition, QLMOR, unlike NORM, totally avoids explicit moment calculations, hence it has improved numerical stability properties as well. We compare QLMOR against prior methods [1]-[3] on a circuit and a biochemical reaction-like system, and demonstrate that QLMOR-reduced models retain accuracy over a significantly wider range of excitation than Taylor-expansion-based methods [1]-[3]. QLMOR, therefore, demonstrates that Volterra-kernel based nonlinear MOR techniques can in fact have far broader applicability than previously suspected, possibly being competitive with trajectory-based methods (e.g., trajectory piece-wise linear reduced order modeling [4]) and nonlinear-projection based methods (e.g., maniMOR [5]).

173 citations
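
As a small illustration of the quadratic-linear rewriting described in the QLMOR abstract above, consider a hypothetical scalar system with an exponential nonlinearity. The system and the auxiliary variable y are assumptions chosen for illustration, not taken from the paper. Introducing y = e^x yields an equivalent system whose right-hand side is at most quadratic in the states and bilinear in state and input, with no Taylor truncation:

```latex
\begin{align}
\dot{x} &= -x - e^{x} + u
  && \text{(original system, input } u\text{)} \\
y &:= e^{x}, \qquad \dot{y} = e^{x}\,\dot{x} = y\,(-x - y + u)
  && \text{(chain rule on the new state)} \\
\dot{x} &= -x - y + u
  && \text{(linear in } x, y, u\text{)} \\
\dot{y} &= -x\,y - y^{2} + y\,u
  && \text{(quadratic in the states, bilinear in } y, u\text{)}
\end{align}
```

With the initial condition y(0) = e^{x(0)}, the two-state system reproduces the original trajectory exactly; the price is one extra state variable per nonlinear kernel.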


Journal ArticleDOI
TL;DR: A methodology for characterizing and modeling fundamental photonic building blocks, which can subsequently be combined to form full photonic network architectures, is presented, and a set of tools which can be utilized to assess the physical-layer and system-level performance properties of a photonic network is described.
Abstract: Photonic technology is becoming an increasingly attractive solution to the problems facing today's electronic chip-scale interconnection networks. Recent progress in silicon photonics research has enabled the demonstration of all the necessary optical building blocks for creating extremely high-bandwidth density and energy-efficient links for on-chip and off-chip communications. From the feasibility and architecture perspective however, photonics represents a dramatic paradigm shift from traditional electronic network designs due to fundamental differences in how electronics and photonics function and behave. As a result of these differences, new modeling and analysis methods must be employed in order to properly realize a functional photonic chip-scale interconnect design. In this paper, we present a methodology for characterizing and modeling fundamental photonic building blocks which can subsequently be combined to form full photonic network architectures. We also describe a set of tools which can be utilized to assess the physical-layer and system-level performance properties of a photonic network. The models and tools are integrated in a novel open-source design and simulation environment. We present a case study of two different photonic networks-on-chip to demonstrate how our improved understanding and modeling of the physical-layer details of photonic communications can be used to better understand the system-level performance impact.

117 citations


Journal ArticleDOI
TL;DR: Techniques inspired by computational intelligence are used to speed up yield optimization without sacrificing accuracy, and the resulting ORDE algorithm can achieve approximately a tenfold improvement in computational effort compared to an improved MC-based yield optimization algorithm integrating the infeasible sampling and Latin-hypercube sampling techniques.
Abstract: In nanometer complementary metal-oxide-semiconductor technologies, worst-case design methods and response-surface-based yield optimization methods face challenges in accuracy. Monte-Carlo (MC) simulation is general and accurate for yield estimation, but its efficiency is not high enough to make MC-based analog yield optimization, which requires many yield estimations, practical. In this paper, techniques inspired by computational intelligence are used to speed up yield optimization without sacrificing accuracy. A new sampling-based yield optimization approach, which determines the device sizes to optimize yield, is presented, called the ordinal optimization (OO)-based random-scale differential evolution (ORDE) algorithm. By proposing a two-stage estimation flow and introducing the OO technique in the first stage, sufficient samples are allocated to promising solutions, and repeated MC simulations of non-critical solutions are avoided. By the proposed evolutionary algorithm that uses differential evolution for global search and a random-scale mutation operator for fine tunings, the convergence speed of the yield optimization can be enhanced significantly. With the same accuracy, the resulting ORDE algorithm can achieve approximately a tenfold improvement in computational effort compared to an improved MC-based yield optimization algorithm integrating the infeasible sampling and Latin-hypercube sampling techniques. Furthermore, ORDE is extended from plain yield optimization to process-variation-aware single-objective circuit sizing.

111 citations
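
The ORDE abstract above compares against a Monte-Carlo baseline that uses Latin-hypercube sampling. The following is a minimal sketch of that baseline idea only, not of ORDE itself: it draws stratified samples in the unit hypercube and estimates yield as the fraction of samples that pass a specification. The function names, the two-dimensional variation space, and the pass/fail rule are all hypothetical.

```python
import random

def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims, one per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each of the n_samples equal-width strata.
        col = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(col)  # decouple the strata across dimensions
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

def estimate_yield(samples, meets_spec):
    """Yield estimate: fraction of sampled process points that satisfy the spec."""
    return sum(1 for s in samples if meets_spec(s)) / len(samples)

rng = random.Random(0)
points = latin_hypercube(1000, 2, rng)
# Hypothetical spec: the design passes when the two normalized variations sum to < 1.5.
yield_hat = estimate_yield(points, lambda p: p[0] + p[1] < 1.5)
```

In a real flow, `meets_spec` would wrap a circuit simulation, which is why the abstract stresses avoiding repeated evaluations of non-critical candidates.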


Journal ArticleDOI
TL;DR: A theoretical basis for deriving the optimal policies and computationally efficient implementations of online thermal management techniques for multicore processors is provided, and the effectiveness of the DVFS and task-to-core allocation techniques is demonstrated by numerical simulations.
Abstract: Extracting high performance from multi-core processors requires increased use of thermal management techniques. In contrast to offline thermal management techniques, online techniques are capable of sensing changes in the workload distribution and setting the processor controls accordingly. Hence, online solutions are more accurate and are able to extract higher performance than the offline techniques. This paper presents performance optimal online thermal management techniques for multicore processors. The techniques include dynamic voltage and frequency scaling and task-to-core allocation or task migration. The problem formulation includes accurate power and thermal models, as well as leakage dependence on temperature. This paper provides a theoretical basis for deriving the optimal policies and computationally efficient implementations. The effectiveness of our DVFS and task-to-core allocation techniques is demonstrated by numerical simulations. The proposed task-to-core allocation method showed a 20.2% improvement in performance over a power-based thread migration approach. The techniques have been incorporated in a thermal-aware architectural-level simulator called MAGMA that allows for design space exploration, offline, and online dynamic thermal management. The simulator is capable of handling simulations of hundreds of cores within reasonable time.

106 citations


Journal ArticleDOI
TL;DR: A library of static ambipolar carbon nanotube field effect transistor (CNTFET) gates based on generalized NOR-NAND-AOI-OAI primitives, which efficiently implements XOR-based functions, is proposed, resulting in ambipolar gates with a higher expressive power than conventional complementary metal-oxide-semiconductor (CMOS) libraries.
Abstract: Recently, several emerging technologies have been reported as potential candidates for controllable ambipolar devices. Controllable ambipolarity is a desirable property that enables the on-line configurability of n-type and p-type device polarity. In this paper, we introduce a new design methodology for logic gates based on controllable ambipolar devices, with an emphasis on carbon nanotubes as the candidate technology. Our technique results in ambipolar gates with a higher expressive power than conventional complementary metal-oxide-semiconductor (CMOS) libraries. We propose a library of static ambipolar carbon nanotube field effect transistor (CNTFET) gates based on generalized NOR-NAND-AOI-OAI primitives, which efficiently implements XOR-based functions. Technology mapping of several multi-level logic benchmarks that extensively use the XOR function, including multipliers, adders, and linear circuits, with ambipolar CNTFET logic gates indicates that on average, it is possible to reduce the number of logic levels by 42%, the delay by 26%, and the power consumption by 32%, resulting in an energy-delay-product (EDP) reduction of 59% over the same circuits mapped with unipolar CNTFET logic gates. Based on the projections in [1], where it is stated that defect-free CNTFETs will provide a 5x performance improvement over metal-oxide-semiconductor field effect transistors, the ambipolar library provides a performance improvement of 7x, a 57% reduction in power consumption, and a 20x improvement in EDP over the CMOS library.

99 citations


Journal ArticleDOI
TL;DR: The optimized self-tuning approach satisfies performance constraints at all times, and maximizes a lifetime computational power efficiency (LCPE) metric, which is defined as the total number of clock cycles achieved over lifetime divided by the total energy consumed over lifetime.
Abstract: This paper presents an integrated framework, together with control policies, for optimizing dynamic control of self-tuning parameters of a digital system over its lifetime in the presence of circuit aging. A variety of self-tuning parameters such as supply voltage, operating clock frequency, and dynamic cooling are considered, and jointly optimized using efficient algorithms described in this paper. Our optimized self-tuning approach satisfies performance constraints at all times, and maximizes a lifetime computational power efficiency (LCPE) metric, which is defined as the total number of clock cycles achieved over lifetime divided by the total energy consumed over lifetime. We present three control policies: 1) progressive-worst-case-aging (PWCA), which assumes worst-case aging at all times; 2) progressive-on-state-aging (POSA), which estimates aging by tracking active/sleep modes, and then assumes worst-case aging in active mode and long recovery effects in sleep mode; and 3) progressive-real-time-aging-assisted (PRTA), which acquires real-time information and initiates optimized control actions. Various flavors of these control policies for systems with dynamic voltage and frequency scaling (DVFS) are also analyzed. Simulation results on benchmark circuits, using aging models validated by 45 nm measurements, demonstrate the effectiveness and practicality of our approach in significantly improving LCPE and/or lifetime compared to traditional one-time worst-case guardbanding. We also derive system design guidelines to maximize self-tuning benefits.

89 citations
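
The LCPE metric in the abstract above is defined as total clock cycles over lifetime divided by total energy over lifetime. A minimal sketch of that ratio follows, assuming the lifetime is described as a list of operating intervals with constant frequency and power; that representation and the numbers are hypothetical.

```python
def lcpe(intervals):
    """Lifetime computational power efficiency: total cycles / total energy.

    intervals: iterable of (duration_s, clock_hz, power_w) operating intervals.
    """
    total_cycles = sum(t * f for t, f, _ in intervals)
    total_energy = sum(t * p for t, _, p in intervals)
    return total_cycles / total_energy

# Hypothetical lifetime split into two phases (values are made up for illustration).
phases = [(10.0, 2.0e9, 1.0), (10.0, 1.0e9, 0.5)]
efficiency = lcpe(phases)  # cycles per joule
```

Because the metric is a ratio of lifetime totals rather than an average of per-interval ratios, a control policy can trade frequency against power differently as the circuit ages.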


Journal ArticleDOI
TL;DR: Experimental results show that the algorithms are very effective in reducing not only flip-flop power consumption but also clock tree and signal net wirelength, and the power consumption of the clock network is minimized.
Abstract: Optimization for power is always one of the most important design objectives in modern nanometer integrated circuit design. Recent studies have shown the effectiveness of applying multi-bit flip-flops to save the power consumption of the clock network. This paper presents: 1) a novel design methodology of applying multi-bit flip-flops at the post-placement stage, which can be seamlessly integrated in modern design flow; 2) a new problem formulation for post-placement optimization with multi-bit flip-flops; 3) flip-flop clustering and placement algorithms to simultaneously minimize flip-flop power consumption and interconnecting wirelength; and 4) a progressive window-based optimization technique to reduce placement deviation and improve runtime efficiency of our algorithms. Experimental results show that our algorithms are very effective in reducing not only flip-flop power consumption but also clock tree and signal net wirelength. Consequently, the power consumption of the clock network is minimized.

88 citations


Journal ArticleDOI
TL;DR: This work proposes a statistical physics inspired approach to capture the traffic dynamics in multicore systems and opens up new research directions into NoC optimization which require accurate models of time-dependent and space-dependent traffic behavior.
Abstract: Networks-on-chip (NoCs) have been proposed as a viable solution to solving the communication problem in multicore systems. In this new setup, mapping multiple applications on available computational resources leads to interaction and contention at various network resources. Consequently, taking into account the traffic characteristics becomes of crucial importance for performance analysis and optimization of the communication infrastructure, as well as proper resource management. Although queuing-based approaches have been traditionally used for performance analysis purposes, they cannot properly account for many of the traffic characteristics (e.g., non-stationarity, self-similarity) that are crucial for multicore platform design. To overcome these limitations, we propose a statistical physics inspired approach to capture the traffic dynamics in multicore systems. As shown later in this paper, this is of fundamental significance for re-thinking the very basis of multicore systems design; it also opens up new research directions into NoC optimization which require accurate models of time-dependent and space-dependent traffic behavior.

86 citations


Journal ArticleDOI
TL;DR: This paper describes the electrode work-function, oxide thickness, gate-source/drain underlap, and silicon thickness optimization required to realize dual-Vth independent-gate FinFETs, enabling a new class of compact logic gates with higher expressive power and flexibility than conventional CMOS gates.
Abstract: This paper describes the electrode work-function, oxide thickness, gate-source/drain underlap, and silicon thickness optimization required to realize dual-Vth independent-gate FinFETs. Optimum values for these FinFET design parameters are derived using the physics-based University of Florida SPICE model for double-gate devices, and the optimized FinFETs are simulated and validated using Sentaurus TCAD simulations. Dual-Vth FinFETs with independent gates enable series and parallel merge transformations in logic gates, realizing compact low power alternative gates with competitive performance and reduced input capacitance in comparison to conventional FinFET gates. Furthermore, they also enable the design of a new class of compact logic gates with higher expressive power and flexibility than conventional CMOS gates, e.g., implementing 12 unique Boolean functions using only four transistors. Circuit designs that balance and improve the performance of the novel gates are described. The gates are designed and calibrated using the University of Florida double-gate model into conventional and enhanced technology libraries. Synthesis results for 16 benchmark circuits from the ISCAS and OpenSPARC suites indicate that on average at 2 GHz, the enhanced library reduces total power and the number of fins by 36% and 37%, respectively, over a conventional library designed using shorted-gate FinFETs in 32 nm technology.

86 citations


Journal ArticleDOI
TL;DR: A design flow to efficiently map multiple multi-core applications on a dynamically reconfigurable SoC is presented; it is able to extract similarities among the applications, achieving an average improvement of 29% in reconfiguration latency with respect to a communication-oriented approach.
Abstract: Nowadays, multi-core systems-on-chip (SoCs) are typically required to execute multiple complex applications, which demand a large set of heterogeneous hardware cores with different sizes. In this context, the popularity of dynamically reconfigurable platforms is growing, as they increase the ability of the initial design to adapt to future modifications. This paper presents a design flow to efficiently map multiple multi-core applications on a dynamically reconfigurable SoC. The proposed methodology is tailored for a reconfigurable hardware architecture based on a flexible communication infrastructure, and exploits applications similarities to obtain an effective mapping. We also introduce a run-time mapper that is able to introduce new applications that were not known at design-time, preserving the mapping of the original system. We apply our design flow to a real-world multimedia case study and to a set of synthetic benchmarks, showing that it is actually able to extract similarities among the applications, as it achieves an average improvement of 29% in terms of reconfiguration latency with respect to a communication-oriented approach, while preserving the same communication performance.

Journal ArticleDOI
TL;DR: In this paper, a new FSM watermarking scheme is proposed by making the authorship information a non-redundant property of the FSM to overcome the vulnerability to state removal attack and minimize the design overhead.
Abstract: Finite state machines (FSMs) are the backbone of sequential circuit design. In this paper, a new FSM watermarking scheme is proposed by making the authorship information a non-redundant property of the FSM. To overcome the vulnerability to state removal attack and minimize the design overhead, the watermark bits are seamlessly interwoven into the outputs of the existing and free transitions of state transition graph (STG). Unlike other transition-based STG watermarking, pseudo input variables have been reduced and made functionally indiscernible by the notion of reserved free literal. The assignment of reserved literals is exploited to minimize the overhead of watermarking and make the watermarked FSM fallible upon removal of any pseudo input variable. A direct and convenient detection scheme is also proposed to allow the watermark on the FSM to be publicly detectable. Experimental results on the watermarked circuits from the ISCAS'89 and IWLS'93 benchmark sets show lower or acceptably low overheads with higher tamper resilience and stronger authorship proof in comparison with related watermarking schemes for sequential functions.

Journal ArticleDOI
TL;DR: This work proposes a high performance hotspot detection methodology consisting of: 1) a fast layout analyzer; 2) powerful hotspot pattern identifiers; and 3) a generic and efficient flow with successive performance refinements. The method achieves higher prediction accuracy for hotspots that are not previously characterized.
Abstract: Under the real and evolving manufacturing conditions, lithography hotspot detection faces many challenges. First, real hotspots become hard to identify at early design stages and hard to fix at post-layout stages. Second, false alarms must be kept low to avoid excessive and expensive post-processing hotspot removal. Third, full chip physical verification and optimization require very fast turn-around time. Last but not least, rapid technology advancement favors generic hotspot detection methodologies to avoid exhaustive pattern enumeration and excessive development/update as technology evolves. To address the above issues, we propose a high performance hotspot detection methodology consisting of: 1) a fast layout analyzer; 2) powerful hotspot pattern identifiers; and 3) a generic and efficient flow with successive performance refinements. We implement our algorithms with industry-strength engine under real manufacturing conditions and show that it significantly outperforms state-of-the-art algorithms in false alarms (2.4X to 2300X reduction) and runtime (5X to 237X reduction), meanwhile achieving similar or better hotspot accuracies. Compared with pattern matching, our method achieves higher prediction accuracy for hotspots that are not previously characterized, therefore, more detection generality when exhaustive pattern enumeration is too expensive to perform a priori. Such high performance hotspot detection is especially suitable for lithography-friendly physical design.

Journal ArticleDOI
TL;DR: This is the first piece of work that can handle symmetry constraint, common centroid constraint, and other general placement constraints simultaneously.
Abstract: In today's system-on-chip designs, both digital and analog parts of a circuit will be implemented on the same chip. Parasitic mismatch induced by layout will affect circuit performance significantly for analog designs. Consideration of symmetry and common centroid constraints during placement can help to reduce these errors. Besides these two specific types of placement constraints, other constraints, such as alignment, abutment, preplace, and maximum separation, are also essential in circuit placement. In this paper, we will present a placement methodology that can handle all these constraints at the same time. To the best of our knowledge, this is the first piece of work that can handle symmetry constraint, common centroid constraint, and other general placement constraints, simultaneously. Experimental results do confirm the effectiveness and scalability of our approach in solving this mixed constraint-driven placement problem.

Journal ArticleDOI
TL;DR: Compared with available methods with the best solution quality, MMLDE can obtain comparable results, and has approximately a tenfold improvement in computational efficiency, which makes the computational time for optimized component synthesis acceptable.
Abstract: State-of-the-art synthesis methods for microwave passive components suffer from the following drawbacks. They either have good efficiency but highly depend on the accuracy of the equivalent circuit models, which may fail the synthesis when the frequency is high, or they fully depend on electromagnetic (EM) simulations, with a high solution quality but are too time consuming. To address the problem of combining high solution quality and good efficiency, a new method, called memetic machine learning-based differential evolution (MMLDE), is presented. The key idea of MMLDE is the proposed online surrogate model-based memetic evolutionary optimization mechanism, whose training data are generated adaptively in the optimization process. In particular, by using the differential evolution algorithm as the optimization kernel and EM simulation as the performance evaluation method, high-quality solutions can be obtained. By using Gaussian process and artificial neural network in the proposed search mechanism, surrogate models are constructed online to predict the performances, saving a lot of expensive EM simulations. Compared with available methods with the best solution quality, MMLDE can obtain comparable results, and has approximately a tenfold improvement in computational efficiency, which makes the computational time for optimized component synthesis acceptable. Moreover, unlike many available methods, MMLDE does not need any equivalent circuit models or any coarse-mesh EM models. Experiments of 60 GHz syntheses and comparisons with the state-of-art methods provide evidence of the important advantages of MMLDE.
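
The MMLDE abstract above uses differential evolution as its optimization kernel. The sketch below shows only a plain DE/rand/1/bin loop, without the surrogate models or memetic search that the paper adds; the parameter names, default values, and the sum-of-squares stand-in objective are assumptions for illustration.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, generations=200,
                           f_scale=0.5, cross_rate=0.9, seed=0):
    """Plain DE/rand/1/bin minimizer; returns (best_vector, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cross_rate:
                    v = pop[a][d] + f_scale * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    v = min(max(v, lo), hi)  # clip to the search box
                else:
                    v = pop[i][d]
                trial.append(v)
            # Expensive step: a surrogate-assisted flow would try to avoid most of these calls.
            trial_cost = objective(trial)
            if trial_cost <= cost[i]:  # greedy one-to-one selection
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]

# Stand-in objective (sum of squares); a real flow would call an EM simulator here.
best_x, best_cost = differential_evolution(lambda x: sum(v * v for v in x),
                                           [(-5.0, 5.0), (-5.0, 5.0)])
```

Every call to `objective` corresponds to one simulation in a synthesis flow, so the number of such calls is the cost that an online surrogate model is meant to reduce.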

Journal ArticleDOI
TL;DR: The proposed encoding scheme exploits the wormhole switching techniques and works on an end-to-end basis, showing that it is possible to reduce the power contribution of both the self-switching activity and the coupling switching activity in inter-routers links.
Abstract: An ever more significant fraction of the overall power dissipation of a network-on-chip (NoC) based system-on-chip (SoC) is due to the interconnection system. In fact, as technology shrinks, the power contribution of NoC links starts to compete with that of NoC routers. In this paper, we propose the use of data encoding techniques as a viable way to reduce both power dissipation and energy consumption of NoC links. The proposed encoding scheme exploits the wormhole switching techniques and works on an end-to-end basis. That is, flits are encoded by the network interface (NI) before they are injected in the network and are decoded by the destination NI. This makes the scheme transparent to the underlying network since the encoder and decoder logic is integrated in the NI and no modification of the router architecture is required. We assess the proposed encoding scheme on a set of representative data streams (both synthetic and extracted from real applications) showing that it is possible to reduce the power contribution of both the self-switching activity and the coupling switching activity in inter-router links. As a result, we obtain a reduction in total power dissipation and energy consumption of up to 37% and 18%, respectively, without any significant degradation in terms of both performance and silicon area.
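
The abstract above does not spell out its encoding, so the following sketch substitutes classic bus-invert coding to illustrate how an end-to-end encoder can cut self-switching activity on a link. It is not the paper's scheme, and it does not model coupling switching. The function names and the 8-bit flit payloads are hypothetical.

```python
def self_switching(words):
    """Total bit transitions on a link carrying `words` one after another."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def bus_invert_encode(words, width):
    """Bus-invert coding: send the complement when it toggles fewer wires.

    Returns a list of (payload, invert_flag); the flag travels on one extra wire.
    """
    mask = (1 << width) - 1
    prev, encoded = 0, []
    for w in words:
        toggles = bin((w ^ prev) & mask).count("1")
        if toggles > width // 2:
            w, flag = (~w) & mask, 1  # inverted payload toggles fewer wires
        else:
            flag = 0
        encoded.append((w, flag))
        prev = w  # the wires now hold the payload actually sent
    return encoded

def bus_invert_decode(encoded, width):
    """Undo the inversion at the destination using the flag wire."""
    mask = (1 << width) - 1
    return [(~w) & mask if flag else w for w, flag in encoded]

flits = [0x00, 0xFF, 0x01, 0xFE]  # hypothetical 8-bit flit payloads
coded = bus_invert_encode(flits, 8)
raw_toggles = self_switching(flits)
coded_toggles = (self_switching([w for w, _ in coded])
                 + self_switching([f for _, f in coded]))  # include the flag wire
```

As in the abstract, encoder and decoder sit at the two ends of the path, so intermediate routers forward the encoded payload unchanged.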

Journal ArticleDOI
TL;DR: A semi-automated design flow for 3-D NoCs is presented, including a defect-tolerance scheme to increase the global yield of 3-D stacked chips and a fault-tolerance scheme for TSV-based multi-bit links.
Abstract: Through silicon vias (TSVs) provide an efficient way to support vertical communication among different layers of a vertically stacked chip, enabling scalable 3-D networks-on-chip (NoC) architectures. Unfortunately, low TSV yields significantly impact the feasibility of high-bandwidth vertical connectivity. In this paper, we present a semi-automated design flow for 3-D NoCs including a defect-tolerance scheme to increase the global yield of 3-D stacked chips. Starting from an accurate physical and geometrical model of TSVs: 1) we extract a circuit-level model for vertical interconnections; 2) we use it to evaluate the design implications of extending switch architectures with ports in the vertical direction; moreover, 3) we present a defect-tolerance technique for TSV-based multi-bit links through an effective use of redundancy; and finally, 4) we present a design flow allowing for post-layout simulation of NoCs with links in all three physical dimensions. Experimental results show that a 3-D NoC implementation yields around 10% frequency improvement over a 2-D one, thanks to the propagation delay advantage of TSVs and the shorter links. In addition, the adopted fault tolerance scheme demonstrates a significant yield improvement, ranging from 66% to 98%, with a low area cost (20.9% on a vertical link in a NoC switch, which leads to a modest 2.1% increase in the total switch area) in 130 nm technology, with minimal impact on very large-scale integrated design and test flows.
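
To see why redundancy helps a multi-bit TSV link, a simple binomial model is enough. The sketch below assumes independent TSV defects and spares that can replace any failed TSV in the link; both are simplifying assumptions, and the link width and per-TSV yield are made-up values rather than figures from the paper.

```python
from math import comb

def link_yield(n_signals, n_spares, tsv_yield):
    """Probability that at least n_signals of (n_signals + n_spares) TSVs work.

    Assumes independent TSV defects and spares that can replace any failed TSV.
    """
    total = n_signals + n_spares
    fail = 1.0 - tsv_yield
    # The link survives when the number of failed TSVs does not exceed the spares.
    return sum(comb(total, k) * fail**k * tsv_yield**(total - k)
               for k in range(n_spares + 1))

no_spare = link_yield(32, 0, 0.99)   # hypothetical 32-bit link, 99% per-TSV yield
one_spare = link_yield(32, 1, 0.99)  # same link with a single spare TSV
```

Under this model even one spare per link raises the link yield noticeably, which is the kind of trade-off against area cost that the abstract quantifies.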

Journal ArticleDOI
TL;DR: This paper proposes a new technique, referred to as virtual probe (VP), to efficiently measure, characterize, and monitor spatially-correlated inter-die and/or intra-die variations in nanoscale manufacturing process, thereby reducing the cost of silicon characterization.
Abstract: In this paper, we propose a new technique, referred to as virtual probe (VP), to efficiently measure, characterize, and monitor spatially-correlated inter-die and/or intra-die variations in nanoscale manufacturing process. VP exploits recent breakthroughs in compressed sensing to accurately predict spatial variations from an exceptionally small set of measurement data, thereby reducing the cost of silicon characterization. By exploring the underlying sparse pattern in spatial frequency domain, VP achieves substantially lower sampling frequency than the well-known Nyquist rate. In addition, VP is formulated as a linear programming problem and, therefore, can be solved both robustly and efficiently. Our industrial measurement data demonstrate the superior accuracy of VP over several traditional methods, including 2-D interpolation, Kriging prediction, and k-LSE estimation.

Journal ArticleDOI
TL;DR: This paper presents the first design automation flow that considers the cross-contamination problems on pin-constrained biochips, and proposes early crossing minimization algorithms during placement and systematic wash droplet scheduling and routing that require only one extra control pin and zero assay completion time overhead for practical bioassays.
Abstract: Digital microfluidic biochips have emerged as a popular alternative for laboratory experiments. Pin-count reduction and cross-contamination avoidance are key design considerations for practical applications with different droplets being transported and manipulated on highly integrated biochips. This paper presents the first design automation flow that considers the cross-contamination problems on pin-constrained biochips. The factors that make the problems harder on pin-constrained biochips are explored. To cope with these cross contaminations, this paper proposes: 1) early crossing minimization algorithms during placement, and 2) systematic wash droplet scheduling and routing that require only one extra control pin and zero assay completion time overhead for practical bioassays. Experimental results show the effectiveness and scalability of our algorithms for practical bioassays.

Journal ArticleDOI
TL;DR: This paper addresses test architecture optimization for 3-D stacked ICs implemented using TSVs and shows that shorter test lengths are generally achieved with the larger, more complex dies lower in the stack.
Abstract: Through-silicon via (TSV)-based 3-D stacked ICs (SICs) are becoming increasingly important in the semiconductor industry. In this paper, we address test architecture optimization for 3-D stacked ICs implemented using TSVs. We consider two cases, namely 3-D SICs with die-level test architectures that are either fixed or still need to be designed. We next present mathematical programming techniques to derive optimal solutions for the architecture optimization problem for both cases. Experimental results for three handcrafted 3-D SICs comprising various systems-on-a-chip (SoCs) from the ITC'02 SoC test benchmarks show that compared to the baseline method of sequentially testing all dies, the proposed solutions can achieve significant reduction in test length. This is achieved through optimal test schedules enabled by the test architecture. We also show that increasing the number of test pins typically provides a greater reduction in test length compared to an increase in the number of test TSVs. Furthermore, we show that shorter test lengths are generally achieved with the larger, more complex dies lower in the stack. This is because test data must pass through every die lower in a stack in order to reach its target die, and with the larger dies lower in the stack, more test bandwidth may be provided to these dies using fewer routing resources.

Journal ArticleDOI
TL;DR: A new asynchronous interconnection network is introduced for globally-asynchronous locally-synchronous (GALS) chip multiprocessors that eliminates the need for global clock distribution, and can interface multiple synchronous timing domains operating at unrelated clock rates.
Abstract: A new asynchronous interconnection network is introduced for globally-asynchronous locally-synchronous (GALS) chip multiprocessors. The network eliminates the need for global clock distribution, and can interface multiple synchronous timing domains operating at unrelated clock rates. In particular, two new highly-concurrent asynchronous components are introduced which provide simple routing and arbitration/merge functions. Post-layout simulations in identical commercial 90 nm technology indicate that comparable recent synchronous router nodes have 5.6-10.7× more energy per packet and 2.8-6.4× greater area than the new asynchronous nodes. Under random traffic, the network provides significantly lower latency and identical throughput over the entire operating range of the 800 MHz network and through mid-range traffic rates for the 1.36 GHz network, but with degradation at higher traffic rates. Preliminary evaluations are also presented for a mixed-timing (GALS) network in a shared-memory parallel architecture, running both random traffic and parallel benchmark kernels, as well as directions for further improvement.

Journal ArticleDOI
TL;DR: It is shown that the new approach generates more placement rules and can lead to better circuit performance and parametric yield according to post-layout simulation.
Abstract: This paper presents a new method to automatically generate hierarchical placement rules, which are crucial for a successful analog placement. The method is based on a novel symmetry computation method, introducing the structural signal flow graph. Five types of proximity, matching and symmetry constraints are determined. According to the priority of the constraint types, a constraint requirement graph and a hierarchical partitioning of the circuit into matching, proximity and symmetry groups is then automatically computed. Based on experimental results with a state-of-the-art placement tool, we show that the new approach generates more placement rules and can lead to better circuit performance and parametric yield according to post-layout simulation.

Journal ArticleDOI
TL;DR: A flow is presented for the automatic synthesis of an analog circuit layout based on a schematic and a list of circuit design parameter values, integrated with a deterministic nonlinear optimization algorithm to perform layout-driven circuit sizing.
Abstract: A flow is presented for the automatic synthesis of an analog circuit layout based on a schematic and a list of circuit design parameter values. The flow is driven by design, placement, and routing constraints; no layout template is necessary. Every possible layout for each device in the circuit is investigated; the layouts with the best geometric features and smallest quantization error (due to manufacturing grid alignment) are kept. For circuit placement, a complete enumeration of possible circuit placements, limited only by usual constraints of symmetry, proximity, and common centroid, is performed. Out of this enumeration a final circuit placement is selected and routed. The new flow is integrated with a deterministic nonlinear optimization algorithm to perform layout-driven circuit sizing; layouts are synthesized during both gradient approximation and next step determination. Layout-driven circuit sizing was applied to two example circuits. Sizing of the first circuit example took 8× the amount of CPU time needed for traditional circuit sizing, but remained feasible at 2.1 h of wall clock time on a contemporary workstation.

Journal ArticleDOI
TL;DR: Test simulations of a 1-D metal-oxide semiconductor diode demonstrate that the DG approach discretized using the new, second-order differential (SOD) scheme can be accurately calibrated against Schrödinger-Poisson calculations exhibiting lower discretization error than the previous schemes when using coarse grids.
Abstract: An efficient implementation of the density-gradient (DG) approach for the finite element and finite difference methods and its application in drift-diffusion (D-D) simulations is described in detail. The new, second-order differential (SOD) scheme is compatible with relatively coarse grids even for large density variations thus applicable to device simulations with complex 3-D geometries. Test simulations of a 1-D metal-oxide semiconductor diode demonstrate that the DG approach discretized using our SOD scheme can be accurately calibrated against Schrödinger-Poisson calculations exhibiting lower discretization error than the previous schemes when using coarse grids and the same results for very fine meshes. 3-D test D-D simulations using the finite element method are performed on two devices: a 10 nm gate length double gate metal-oxide-semiconductor field-effect transistor (MOSFET) and a 40 nm gate length Tri-Gate fin field-effect transistor (FinFET). In 3-D D-D simulations, the SOD scheme is able to converge to physical solutions at high voltages even if the previous schemes fail when using the same mesh and equivalent conditions. The quantum corrected D-D simulations using the SOD scheme also converge with an atomistic mesh used for the 10 nm double gate MOSFET saving computational resources and can be accurately calibrated against the results from non-equilibrium Green's functions approach. Finally, the simulated ID-VG characteristics for the 40 nm gate length Tri-Gate are in excellent agreement with experimental data.

Journal ArticleDOI
TL;DR: Novel techniques for synthesizing combinational logic that transforms source probabilities into different target probabilities are demonstrated, showing that for any integer n ≥ 2, there exists a single probability that can be transformed into arbitrary base-n fractional probabilities.
Abstract: Schemes for probabilistic computation can exploit physical sources to generate random values in the form of bit streams. Generally, each source has a fixed bias and so provides bits with a specific probability of being one. If many different probability values are required, it can be expensive to generate all of these directly from physical sources. This paper demonstrates novel techniques for synthesizing combinational logic that transforms source probabilities into different target probabilities. We consider three scenarios in terms of whether the source probabilities are specified and whether they can be duplicated. In the case that the source probabilities are not specified and can be duplicated, we provide a specific choice, the set {0.4, 0.5}; we show how to synthesize logic that transforms probabilities from this set into arbitrary decimal probabilities. Further, we show that for any integer n ≥ 2, there exists a single probability that can be transformed into arbitrary base-n fractional probabilities. In the case that the source probabilities are specified and cannot be duplicated, we provide two methods for synthesizing logic to transform them into target probabilities. In the case that the source probabilities are not specified, but once chosen cannot be duplicated, we provide an optimal choice.
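To make the idea of transforming source probabilities with combinational logic concrete, the following is a minimal sketch, not the synthesis procedure from the paper. It assumes the input bits are statistically independent and simply enumerates a gate's truth table to compute the probability that the output is one. The helper name `output_probability` and the example gates are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def output_probability(gate, input_probs):
    """Probability that `gate` outputs 1 when fed independent random bits.

    input_probs[i] is the probability that input i equals 1.
    (Hypothetical helper for illustration only.)
    """
    total = Fraction(0)
    for bits in product((0, 1), repeat=len(input_probs)):
        # Weight of this input combination under independence.
        weight = Fraction(1)
        for bit, p in zip(bits, input_probs):
            weight *= p if bit else 1 - p
        if gate(*bits):
            total += weight
    return total


P04 = Fraction(2, 5)  # source with probability 0.4
P05 = Fraction(1, 2)  # source with probability 0.5

# AND of the two sources: 0.4 * 0.5 = 0.2
p_and = output_probability(lambda a, b: a & b, [P04, P05])

# (NOT a) AND b: (1 - 0.4) * 0.5 = 0.3
p_not_and = output_probability(lambda a, b: (1 - a) & b, [P04, P05])

# Inverting the previous gate: 1 - 0.3 = 0.7
p_nand = output_probability(lambda a, b: 1 - ((1 - a) & b), [P04, P05])

# Two independent copies of the 0.4 source (duplication): 0.4 * 0.4 = 0.16
p_dup = output_probability(lambda a, b: a & b, [P04, P04])
```

Exact fractions are used so that the resulting decimal probabilities can be compared without floating-point error.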

Journal ArticleDOI
TL;DR: A novel design-time/run-time thermal management strategy for improving energy efficiency in 3-D MPSoCs through liquid cooling management and dynamic voltage and frequency scaling (DVFS).
Abstract: 3-D stacked systems reduce communication delay in multiprocessor system-on-chips (MPSoCs) and enable heterogeneous integration of cores, memories, sensors, and RF devices. However, vertical integration of layers exacerbates temperature-induced problems such as reliability degradation. Liquid cooling is a highly efficient solution to overcome the accelerated thermal problems in 3-D architectures; however, it brings new challenges in modeling and run-time management for such 3-D MPSoCs with multitier liquid cooling. This paper proposes a novel design-time/run-time thermal management strategy. The design-time phase involves a rigorous thermal impact analysis of various thermal control variables. We then utilize this analysis to design a run-time fuzzy controller for improving energy efficiency in 3-D MPSoCs through liquid cooling management and dynamic voltage and frequency scaling (DVFS). The fuzzy controller adjusts the liquid flow rate dynamically to match the cooling demand of the chip for preventing overcooling and for maintaining a stable thermal profile. The DVFS decisions increase chip-level energy savings and help balance the temperature across the system. Our controller is used in conjunction with temperature-aware load balancing and dynamic power management strategies. Experimental results on 2-tier and 4-tier 3-D MPSoCs show that our strategy prevents the system from exceeding the given threshold temperature. At the same time, we reduce cooling energy by up to 63% and system-level energy by up to 21% in comparison to statically setting the flow rate to handle worst-case temperatures.

Journal ArticleDOI
TL;DR: A novel algorithm is proposed to construct a linear-sized obstacle-avoiding spanning graph that is guaranteed to contain a rectilinear minimum spanning tree if there is no obstacle.
Abstract: In this paper, we present an algorithm called FOARS for obstacle-avoiding rectilinear Steiner minimal tree (OARSMT) construction. FOARS applies a top-down approach which first partitions the set of pins into several subsets uncluttered by obstacles. Then an obstacle-avoiding Steiner tree is generated for each subset by an obstacle-aware version of the rectilinear Steiner minimal tree algorithm FLUTE. Finally, the trees are merged and refined to form the OARSMT. To guide the partitioning of pins, we propose a novel algorithm to construct a linear-sized obstacle-avoiding spanning graph that is guaranteed to contain a rectilinear minimum spanning tree if there is no obstacle. Experimental results show that FOARS is among the best algorithms in terms of both wirelength and runtime for testcases both with and without obstacles.
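For readers unfamiliar with the rectilinear minimum spanning tree mentioned in the abstract, here is a minimal sketch of the obstacle-free case only. It is not FOARS and does not build the paper's spanning graph; it runs the textbook Prim algorithm on the complete graph under the Manhattan metric, which takes quadratic time. The function name `rectilinear_mst` and the sample pins are illustrative assumptions.

```python
def rectilinear_mst(pins):
    """Prim's algorithm under the Manhattan (rectilinear) metric.

    pins: list of (x, y) tuples. Returns (edges, total_length), where each
    edge is a pair of pin coordinates. Obstacles are NOT handled here.
    """
    n = len(pins)
    if n == 0:
        return [], 0

    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection cost to the tree
    parent = [-1] * n           # tree vertex that achieves best[i]
    best[0] = 0
    edges, total = [], 0

    for _ in range(n):
        # Pick the cheapest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((pins[parent[u]], pins[u]))
            total += best[u]
        # Relax connection costs of the remaining vertices.
        for v in range(n):
            if not in_tree[v]:
                d = dist(pins[u], pins[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges, total


sample_pins = [(0, 0), (2, 0), (2, 3), (5, 3)]
sample_edges, sample_length = rectilinear_mst(sample_pins)
```

A sparse spanning graph such as the one described in the abstract serves to avoid examining all pin pairs, which this quadratic sketch does not attempt.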

Journal ArticleDOI
TL;DR: This paper proposes a novel stateful logic pipeline architecture based on memristive switches, and addresses some of the issues, in particular logic representation using OR-inverter graphs, two-level optimization synthesis strategy, data synchronization with data forwarding, stall-free pipelined finite state machines, and constraints for synthesis and mapping onto the fabric.
Abstract: Recently, researchers have demonstrated that memristive switches can be used to implement logic and latches as well as memory and programmable interconnects. In this paper, we propose a novel stateful logic pipeline architecture based on memristive switches. The proposed architecture mapped to the field programmable nanowire interconnect fabric produces a field programmable stateful logic array, in which general-purpose computation functions can be implemented by configuring only nonvolatile nanowire crossbar switches. CMOS control switches are used to isolate stateful logic units so that multiple operations can be executed in parallel. Since basic operation of the stateful logic, namely, material implication, cannot fan out, a new basic AND operation which can duplicate output is proposed. The basic unit of the proposed architecture is designed to execute multiple basic operations concurrently in a step so that each basic unit implements a large fan-in OR or NOR gate. The fine-grain ultradeep constant-throughput pipeline properties pose new design automation problems. We address some of the issues, in particular logic representation using OR-inverter graphs, two-level optimization synthesis strategy, data synchronization with data forwarding, stall-free pipelined finite state machines, and constraints for synthesis and mapping onto the fabric.
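The material implication operation named in the abstract can be illustrated with a short truth-table sketch. It relies on the standard fact that implication combined with an unconditional FALSE (reset) operation is sufficient to build NAND. The code models logic values only, not memristor devices, and it does not model the new AND operation proposed in the paper. The names `imply`, `nand_via_imply`, and the working variable `s` are illustrative assumptions.

```python
def imply(p, q):
    """Material implication p -> q, i.e. (NOT p) OR q, on bits 0/1.

    In stateful logic the result overwrites the switch holding q.
    """
    return (1 - p) | q


def nand_via_imply(p, q):
    """NAND built from one FALSE operation and two implications."""
    s = 0             # FALSE: clear the working switch
    s = imply(p, s)   # s = NOT p
    s = imply(q, s)   # s = (NOT q) OR (NOT p) = NAND(p, q)
    return s


# Truth table as (p, q, nand) rows.
truth_table = [(p, q, nand_via_imply(p, q)) for p in (0, 1) for q in (0, 1)]
```

Because each implication overwrites its target switch, the result lives in a single location. This is consistent with the abstract's remark that the basic operation cannot fan out.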

Journal ArticleDOI
TL;DR: A novel on-chip router architecture is developed to support dynamic self-reconfiguration of the bidirectional traffic flow; it exhibits a consistent and significant performance advantage over a conventional NoC equipped with hard-wired unidirectional channels.
Abstract: A bidirectional channel network-on-chip (BiNoC) architecture is proposed to enhance the performance of on-chip communication. In a BiNoC, each communication channel can be dynamically self-reconfigured to transmit flits in either direction. This added flexibility promises better bandwidth utilization, lower packet delivery latency, and higher packet consumption rate. A novel on-chip router architecture is developed to support dynamic self-reconfiguration of the bidirectional traffic flow. This area-efficient BiNoC router delivers better performance and requires a smaller buffer size than that of a conventional network-on-chip (NoC). The flow direction at each channel is controlled by a channel direction control (CDC) algorithm. Implemented with a pair of finite state machines, this CDC algorithm is shown to be high performance, free of deadlock, and free of starvation. Extensive cycle-accurate simulations using synthetic and real-world traffic patterns have been conducted to evaluate the performance of the BiNoC. These results exhibit a consistent and significant performance advantage over a conventional NoC equipped with hard-wired unidirectional channels.

Journal ArticleDOI
TL;DR: A hardware/software co-simulation environment capable of running a full-fledged OS at the early stage of the electronic system level design flow at an acceptable simulation speed is proposed and a virtual platform constructed using the proposed CA-ISS as the processor model can be used to estimate the performance of a target system from system perspective.
Abstract: In this paper, we present a fast cycle-accurate instruction set simulator (CA-ISS) for system-on-chip development based on QEMU and SystemC. Even though most state-of-the-art commercial tools have tried very hard to provide all the levels of details to satisfy the different requirements of the software designer, the hardware designer, and even the system architect, the hardware/software co-simulation speed is dramatically slow when co-simulating the hardware models at the register-transfer level (RTL) with a full-fledged operating system (OS). Our experimental results show that the combination of QEMU and SystemC can make the co-simulation at the CA level much faster than the conventional RTL simulation, even with a full-fledged operating system up and running. Furthermore, the statistics indicate that with every instruction executed and every memory accessed since power-on traced at the CA level, it takes 28m15.804s on average to boot up a full-fledged Linux kernel, even on a personal computer. Compared to the kernel boot time reported by Xilinx and SiCortex, the proposed CA-ISS is about 6.09 times faster than “SystemC without trace” of Xilinx and about 30.32 times faster than “SystemC models converted from RTL” of SiCortex. The main contributions of this paper are threefold: 1) a hardware/software co-simulation environment capable of running a full-fledged OS at the early stage of the electronic system level design flow at an acceptable simulation speed is proposed; 2) a virtual platform constructed using the proposed CA-ISS as the processor model can be used to estimate the performance of a target system from system perspective, which all the previous works, such as QEMU-SystemC, do not provide; and 3) such a virtual platform also provides the modeling capability from the transaction level down to the CA level or the other way around.