
Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2001"


Journal Article•DOI•
TL;DR: A new method for determining component values and transistor dimensions for CMOS operational amplifiers (op-amps) is described, showing in detail how the method can be used to size robust designs, i.e., designs guaranteed to meet the specifications for a variety of process conditions and parameters.
Abstract: We describe a new method for determining component values and transistor dimensions for CMOS operational amplifiers (op-amps). We observe that a wide variety of design objectives and constraints have a special form, i.e., they are posynomial functions of the design variables. As a result, the amplifier design problem can be expressed as a special form of optimization problem called geometric programming, for which very efficient global optimization methods have been developed. As a consequence, we can efficiently determine globally optimal amplifier designs or globally optimal tradeoffs among competing performance measures such as power, open-loop gain, and bandwidth. Our method, therefore, yields completely automated sizing of (globally) optimal CMOS amplifiers, directly from specifications. In this paper, we apply this method to a specific widely used operational amplifier architecture, showing in detail how to formulate the design problem as a geometric program. We compute globally optimal tradeoff curves relating performance measures such as power dissipation, unity-gain bandwidth, and open-loop gain. We show how the method can be used to size robust designs, i.e., designs guaranteed to meet the specifications for a variety of process conditions and parameters.

540 citations
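
The core observation is easy to make concrete: a posynomial becomes convex under the change of variables x = e^y, so a geometric program can be solved to global optimality by a generic convex solver. A minimal sketch with an illustrative toy objective and constraint (not the paper's op-amp model), assuming NumPy/SciPy:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def log_posynomial(y, coeffs, exponents):
    """log f(exp(y)) for posynomial f(x) = sum_k c_k * prod_i x_i^a_ki; convex in y."""
    return logsumexp(np.log(coeffs) + exponents @ y)

# Toy GP: minimize 1/(x1*x2) subject to x1 + x2 <= 1 (both posynomial).
objective = lambda y: -(y[0] + y[1])               # log(1/(x1*x2))
con_c = np.array([1.0, 1.0])                       # coefficients of x1 + x2
con_a = np.array([[1.0, 0.0], [0.0, 1.0]])         # exponent rows per term

res = minimize(objective, x0=np.array([-1.0, -1.0]), method="SLSQP",
               constraints=[{"type": "ineq",       # log g(x) <= 0, i.e. g <= 1
                             "fun": lambda y: -log_posynomial(y, con_c, con_a)}])
print("globally optimal x:", np.exp(res.x))        # -> approx. [0.5, 0.5]
```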


Journal Article•DOI•
TL;DR: The theory of latency-insensitive design is presented as the foundation of a new correct-by-construction methodology for designing complex systems, such as large digital integrated circuits in deep-submicrometer technologies, by assembling intellectual property components.
Abstract: The theory of latency-insensitive design is presented as the foundation of a new correct-by-construction methodology to design complex systems by assembling intellectual property components. Latency-insensitive designs are synchronous distributed systems and are realized by composing functional modules that exchange data on communication channels according to an appropriate protocol. The protocol works on the assumption that the modules are stallable, a weak condition to ask them to obey. The goal of the protocol is to guarantee that latency-insensitive designs composed of functionally correct modules behave correctly independently of the channel latencies. This allows us to increase the robustness of a design implementation because any delay variations of a channel can be "recovered" by changing the channel latency while the overall system functionality remains unaffected. As a consequence, an important application of the proposed theory is represented by the latency-insensitive methodology to design large digital integrated circuits by using deep submicrometer technologies.

435 citations
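
The protocol's key property, that functionality is independent of channel latencies as long as modules stall on missing (void) tokens, can be illustrated with a toy simulation. The adder, channels, and token values below are illustrative stand-ins, not the paper's formal relay-station protocol:

```python
from collections import deque

def simulate(lat_a, lat_b, n=5, cycles=30):
    ch_a = deque([None] * lat_a)     # channel pipelines; None = void token
    ch_b = deque([None] * lat_b)
    src_a = deque(range(10, 10 + n)) # finite producers
    src_b = deque(range(20, 20 + n))
    in_a, in_b, out = deque(), deque(), []
    for _ in range(cycles):
        # producers push one valid token per cycle while they have data
        ch_a.append(src_a.popleft() if src_a else None)
        ch_b.append(src_b.popleft() if src_b else None)
        # tokens emerge from the channels after their latency
        ta, tb = ch_a.popleft(), ch_b.popleft()
        if ta is not None: in_a.append(ta)
        if tb is not None: in_b.append(tb)
        # the stallable adder fires only when both inputs hold valid data
        if in_a and in_b:
            out.append(in_a.popleft() + in_b.popleft())
    return out

# same functional behavior regardless of channel latencies:
assert simulate(1, 1) == simulate(3, 7) == simulate(9, 2)
print(simulate(3, 7))
```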


Journal Article•DOI•
TL;DR: A new test-data compression method and decompression architecture based on variable-to-variable-length Golomb codes is presented that is especially suitable for encoding precomputed test sets for embedded cores in a system-on-a-chip (SoC).
Abstract: We present a new test-data compression method and decompression architecture based on variable-to-variable-length Golomb codes. The proposed method is especially suitable for encoding precomputed test sets for embedded cores in a system-on-a-chip (SoC). The major advantages of Golomb coding of test data include very high compression, analytically predictable compression results, and a low-cost and scalable on-chip decoder. In addition, the novel interleaving decompression architecture allows multiple cores in an SoC to be tested concurrently using a single automatic test equipment input-output channel. We demonstrate the effectiveness of the proposed approach by applying it to the International Symposium on Circuits and Systems' benchmark circuits and to two industrial production circuits. We also use analytical and experimental means to highlight the superiority of Golomb codes over run-length codes.

379 citations
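
The encoding itself is simple to sketch: runs of 0s terminated by a 1 are split by a group size m (a power of two here) into a unary quotient and a fixed-width binary remainder. A minimal encoder with a made-up bit stream; the paper's on-chip decoder and interleaving architecture are not modeled:

```python
def golomb_encode(bits, m):
    k = m.bit_length() - 1            # m = 2**k, so the remainder uses k bits
    out, run = [], 0
    for b in bits:
        if b == 0:
            run += 1
        else:                         # a run of 0s ended by a 1
            q, r = divmod(run, m)
            out.append("1" * q + "0" + format(r, f"0{k}b"))
            run = 0
    return "".join(out)

tdata = [0,0,0,0,0,1, 0,0,1, 0,0,0,0,0,0,0,1]
code = golomb_encode(tdata, m=4)
print(code, f"({len(tdata)} bits -> {len(code)} bits)")
```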


Journal Article•DOI•
TL;DR: Watermarking-based IP protection traces unauthorized reuse and makes untraceable unauthorized reuse as difficult as recreating given pieces of IP from scratch; a watermark is a mechanism for identification that is nearly invisible to human and machine inspection, difficult to remove, and permanently embedded as an integral part of the design.
Abstract: Digital system designs are the product of valuable effort and know-how. Their embodiments, from software and hardware description language program down to device-level netlist and mask data, represent carefully guarded intellectual property (IP). Hence, design methodologies based on IP reuse require new mechanisms to protect the rights of IP producers and owners. This paper establishes principles of watermarking-based IP protection, where a watermark is a mechanism for identification that is: (1) nearly invisible to human and machine inspection; (2) difficult to remove; and (3) permanently embedded as an integral part of the design. Watermarking addresses IP protection by tracing unauthorized reuse and making untraceable unauthorized reuse as difficult as recreating given pieces of IP from scratch. We survey related work in cryptography and design methodology, then develop desiderata, metrics, and concrete protocols for constraint-based watermarking at various stages of the very large scale integration (VLSI) design process. In particular, we propose a new preprocessing approach that embeds watermarks as constraints into the input of a black-box design tool and a new postprocessing approach that embeds watermarks as constraints into the output of a black-box design tool. To demonstrate that our protocols can be transparently integrated into existing design flows, we use a testbed of commercial tools for VLSI physical design and embed watermarks into real-world industrial designs. We show that the implementation overhead is low, both in terms of central processing unit time and such standard physical design metrics as wirelength, layout area, number of vias, and routing congestion. We empirically show that the placement and routing applications considered in our methods achieve strong proofs of authorship, are resistant to tampering, and do not adversely influence timing.

220 citations
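
The preprocessing idea can be sketched generically: hash an authorship signature into rare extra constraints and merge them into the input of a black-box tool. The stand-in "design problem" below is graph coloring rather than placement or routing, and all names are illustrative:

```python
import hashlib

def byte_stream(signature):
    """Endless pseudorandom bytes derived only from the signature."""
    i = 0
    while True:
        for b in hashlib.sha256(f"{signature}:{i}".encode()).digest():
            yield b
        i += 1

def watermark_constraints(signature, n_vertices, n_constraints):
    stream = byte_stream(signature)
    pairs = []
    while len(pairs) < n_constraints:
        u, v = next(stream) % n_vertices, next(stream) % n_vertices
        if u != v and (u, v) not in pairs:
            pairs.append((u, v))   # extra edge: u and v must get different colors
    return pairs

# merged into the coloring instance before the black-box tool runs; a legal
# coloring of the augmented graph then satisfies these rare extra constraints,
# which serves as evidence of authorship
print(watermark_constraints("(c) 2001 Alice Designs", 100, 8))
```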


Journal Article•DOI•
TL;DR: A new software-based self-testing methodology for processors, which uses a software tester embedded in the processor memory as a vehicle for applying structural tests and demonstrates its significant cost/fault coverage benefits and its ability to apply at-speed test while alleviating the need for high-speed testers.
Abstract: At-speed testing of gigahertz processors using external testers may not be technically and economically feasible. Hence, there is an emerging need for low-cost high-quality self-test methodologies that can be used by processors to test themselves at-speed. Currently, built-in self-test (BIST) is the primary self-test methodology available. While memory BIST is commonly used for testing embedded memory cores, complex logic designs such as microprocessors are rarely tested with logic BIST. In this paper, we first analyze the issues associated with current hardware-based logic-BIST methodologies by applying a commercial logic-BIST tool to two processor cores. We then propose a new software-based self-testing methodology for processors, which uses a software tester embedded in the processor memory as a vehicle for applying structural tests. The software tester consists of programs for test generation and test application. Prior to the test, structural tests are prepared for processor components in the form of self-test signatures. During the process of self-test, the test generation program expands the self-test signatures into test sets and the test application program applies the tests to the components under test at the speed of the processor. Application of the novel software-based self-test method demonstrates its significant cost/fault coverage benefits and its ability to apply at-speed test while alleviating the need for high-speed testers.

217 citations
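
A toy rendering of the software-tester structure: a compact self-test "signature" is expanded in software into operand pairs, applied at speed to a component (an ALU model here), and compacted into a response signature. The linear congruential expansion and component are assumptions for illustration, not the paper's programs:

```python
def expand(seed, count, mask=0xFFFF):
    x, tests = seed, []
    for _ in range(count):            # software test-generation program
        x = (1103515245 * x + 12345) & 0x7FFFFFFF   # simple LCG
        tests.append(((x >> 16) & mask, x & mask))
    return tests

def alu_add(a, b):                    # component under test
    return (a + b) & 0xFFFF

def self_test(seed, count):
    signature = 0
    for a, b in expand(seed, count):  # software test-application program
        signature = (signature * 31 + alu_add(a, b)) & 0xFFFFFFFF
    return signature

print(hex(self_test(seed=0xBEEF, count=100)))  # compare against a golden value
```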


Journal Article•DOI•
TL;DR: In this article, the authors present two new approaches, based on renewal theory and a time-indexed semi-Markov decision process (TISMDP), that better model system behavior for general user request distributions.
Abstract: Energy consumption of electronic devices has become a serious concern in recent years. Power management (PM) algorithms aim at reducing energy consumption at the system-level by selectively placing components into low-power states. Formerly, two classes of heuristic algorithms have been proposed for PM: timeout and predictive. Later, a category of algorithms based on stochastic control was proposed for PM. These algorithms guarantee optimal results as long as the system that is power managed can be modeled well with exponential distributions. We show that there is a large mismatch between measurements and simulation results if the exponential distribution is used to model all user request arrivals. We develop two new approaches that better model system behavior for general user request distributions. Our approaches are event-driven and give optimal results verified by measurements. The first approach we present is based on renewal theory. This model assumes that the decision to transition to low-power state can be made in only one state. Another method we developed is based on the time-indexed semi-Markov decision process (TISMDP) model. This model has wider applicability because it assumes that a decision to transition into a lower-power state can be made upon each event occurrence from any number of states. This model allows for transitions into low-power states from any state, but it is also more complex than our other approach. It is important to note that the results obtained by renewal model are guaranteed to match results obtained by TISMDP model, as both approaches give globally optimal solutions. We implemented our PM algorithms on two different classes of devices: two different hard disks and client-server wireless local area network systems such as the SmartBadge or a laptop. The measurement results show power savings ranging from a factor of 1.7 up to 5.0 with insignificant variation in performance.

189 citations
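
The paper's starting point, that exponential models mismatch real workloads, is easy to see by evaluating a policy directly against an empirical idle-time distribution. The sketch below scores simple timeout policies by renewal-style averaging over sampled cycles; all power and energy numbers are invented, and this is not the paper's renewal or TISMDP solution:

```python
import numpy as np

P_ON, P_SLEEP = 2.5, 0.1       # watts (illustrative)
E_TRANS, T_TRANS = 1.5, 0.5    # wakeup energy (J) and latency (s) (illustrative)

def avg_power(idle, timeout):
    slept = idle > timeout
    energy = np.where(slept,
                      P_ON * timeout + E_TRANS + P_SLEEP * (idle - timeout),
                      P_ON * idle)
    time = idle + T_TRANS * slept            # wakeup latency lengthens the cycle
    return energy.sum() / time.sum()         # energy per unit time over cycles

rng = np.random.default_rng(0)
idle = rng.pareto(2.0, 10_000) * 3.0         # heavy-tailed, clearly non-exponential
best = min(np.linspace(0.1, 10.0, 100), key=lambda T: avg_power(idle, T))
print(f"best timeout ~ {best:.2f} s at {avg_power(idle, best):.3f} W average")
```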


Journal Article•DOI•
TL;DR: Numerical experiments show that the solutions to the sequence of convex programs converge to the same design point for widely varying initial guesses, suggesting that the approach is capable of determining the globally optimal solution to the CMOS op-amp circuit sizing problem.
Abstract: The problem of CMOS op-amp circuit sizing is addressed here. Given a circuit and its performance specifications, the goal is to automatically determine the device sizes in order to meet the given performance specifications while minimizing a cost function, such as a weighted sum of the active area and power dissipation. The approach is based on the observation that the first-order behavior of a MOS transistor in the saturation region is such that the cost and the constraint functions for this optimization problem can be modeled as posynomial functions of the design variables. The problem is then solved efficiently as a convex optimization problem. Second-order effects are then handled by formulating the problem as one of solving a sequence of convex programs. Numerical experiments show that the solutions to the sequence of convex programs converge to the same design point for widely varying initial guesses. This strongly suggests that the approach is capable of determining the globally optimal solution to the problem. Accuracy of performance prediction in the sizing program (implemented in MATLAB) is maintained by using a newly proposed MOS transistor model and verified against detailed SPICE simulation.

187 citations


Journal Article•DOI•
TL;DR: This paper presents pipeline vectorization, a method for synthesizing hardware pipelines based on software vectorizing compilers that improves efficiency and ease of development of hardware designs, particularly for users with little electronics design experience.
Abstract: This paper presents pipeline vectorization, a method for synthesizing hardware pipelines based on software vectorizing compilers. The method improves efficiency and ease of development of hardware designs, particularly for users with little electronics design experience. We propose several loop transformations to customize pipelines to meet hardware resource constraints while maximizing available parallelism. For runtime reconfigurable systems, we apply hardware specialization to increase circuit utilization. Our approach is especially effective for highly repetitive computations in digital signal processor (DSP) and multimedia applications. Case studies using field-programmable gate array (FPGA)-based platforms are presented to demonstrate the benefits of our approach and to evaluate tradeoffs between alternative implementations. For instance, the loop-tiling transformation has been found to improve vectorization performance 30-40 times over a PC-based software implementation, depending on whether runtime reconfiguration (RTR) is used.

185 citations
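
Loop tiling, one of the transformations mentioned, is easy to show in software form: here a toy FIR filter is processed in fixed-size tiles (plus a halo) the way a pipeline with bounded on-chip buffering would consume them. Purely illustrative:

```python
def fir_naive(x, h):
    n, k = len(x), len(h)
    return [sum(h[j] * x[i + j] for j in range(k)) for i in range(n - k + 1)]

def fir_tiled(x, h, block=64):
    n, k = len(x), len(h)
    out = []
    for base in range(0, n - k + 1, block):      # one "tile" per pass
        hi = min(base + block, n - k + 1)
        tile = x[base:hi + k - 1]                # tile plus halo of k-1 samples
        out.extend(sum(h[j] * tile[i + j] for j in range(k))
                   for i in range(hi - base))
    return out

x = list(range(300)); h = [1, -2, 1]
assert fir_naive(x, h) == fir_tiled(x, h)        # same results, bounded working set
```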


Journal Article•DOI•
TL;DR: A novel system-level performance analysis technique is presented to support the design of custom communication architectures for system-on-chip integrated circuits; it achieves accuracy comparable to complete system simulation while being over two orders of magnitude faster.
Abstract: This paper presents a novel system-level performance analysis technique to support the design of custom communication architectures for system-on-chip integrated circuits. Our technique fills a gap in existing techniques for system-level performance analysis, which are either too slow to use in an iterative communication architecture design framework (e.g., simulation of the complete system) or are not accurate enough to drive the design of the communication architecture (e.g., techniques that perform a "static" analysis of the system performance). Our technique is based on a hybrid trace-based performance-analysis methodology in which an initial cosimulation of the system is performed with the communication described in an abstract manner (e.g., as events or abstract data transfers). An abstract set of traces is extracted from the initial cosimulation containing necessary and sufficient information about the computations and communications of the system components. The system designer then specifies a communication architecture by: 1) selecting a topology consisting of dedicated as well as shared communication channels (shared buses) interconnected by bridges; 2) mapping the abstract communications to paths in the communication architecture; and 3) customizing the protocol used for each channel. The traces extracted in the initial step are represented as a communication analysis graph (CAG) and an analysis of the CAG provides an estimate of the system performance as well as various statistics about the components and their communication. Experimental results indicate that our performance-analysis technique achieves accuracy comparable to complete system simulation (an average error of 1.88%) while being over two orders of magnitude faster.

184 citations
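
A drastically simplified flavor of the trace-replay step: nodes carry compute times or bus transfers, and replaying them with a shared bus that serializes communications yields a performance estimate without re-simulation. The graph, costs, and single-bus architecture below are illustrative assumptions:

```python
# nodes: (kind, cost, dependencies); "comm" nodes contend for the single bus
CAG = {
    "c1": ("comp", 5.0, []),
    "c2": ("comp", 4.0, []),
    "m1": ("comm", 2.0, ["c1"]),        # transfer produced by c1
    "m2": ("comm", 3.0, ["c2"]),
    "c3": ("comp", 6.0, ["m1", "m2"]),  # consumes both transfers
}

def replay(cag):
    finish, bus_free = {}, 0.0
    for name, (kind, cost, deps) in cag.items():   # assumed topological order
        ready = max((finish[d] for d in deps), default=0.0)
        if kind == "comm":                         # serialize on the shared bus
            start = max(ready, bus_free)
            bus_free = start + cost
        else:
            start = ready
        finish[name] = start + cost
    return max(finish.values())

print("estimated makespan:", replay(CAG))
```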


Journal Article•DOI•
TL;DR: A combined WL optimization and high-level synthesis algorithm is developed not only to minimize the hardware implementation cost but also to reduce the optimization time significantly.
Abstract: Conventional approaches for fixed-point implementation of digital signal processing algorithms require the scaling and word-length (WL) optimization at the algorithm level and the high-level synthesis for functional unit sharing at the architecture level. However, the algorithm-level WL optimization has a few limitations because it can neither utilize the functional unit sharing information for signal grouping nor estimate the hardware cost for each operation accurately. In this study, we develop a combined WL optimization and high-level synthesis algorithm not only to minimize the hardware implementation cost, but also to reduce the optimization time significantly. This software initially finds the WL sensitivity or minimum WL of each signal throughout fixed-point simulations of a signal flow graph, performs the WL conscious high-level synthesis where signals having similar WL sensitivities are assigned to the same functional unit, and then conducts the final WL optimization by iteratively modifying the WLs of the synthesized hardware model. A list-scheduling-based algorithm and an integer linear-programming-based algorithm are developed for the WL conscious high-level synthesis. The hardware cost function to minimize is generated by using a synthesized hardware model. Since fixed-point simulation is used to measure the performance, this method can be applied to general, including nonlinear and time-varying, digital signal processing systems. A fourth-order infinite-impulse response filter, a fifth-order elliptic filter, and a 12th-order adaptive least mean square filter are implemented using this software.

181 citations
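
The simulation-based inner step can be sketched as follows: quantize each signal to a candidate word length, re-run the fixed-point simulation, and keep the smallest WL meeting an output SNR spec. The toy FIR filter, quantizer, and 50 dB spec are illustrative, not the paper's systems:

```python
import numpy as np

def quantize(x, wl, frac_bits):       # two's-complement rounding model
    scale = 2.0 ** frac_bits
    lo, hi = -2.0 ** (wl - 1 - frac_bits), 2.0 ** (wl - 1 - frac_bits)
    return np.clip(np.round(x * scale) / scale, lo, hi - 1 / scale)

def snr_db(ref, out):
    return 10 * np.log10(np.sum(ref**2) / np.sum((ref - out) ** 2))

rng = np.random.default_rng(1)
x = rng.standard_normal(4096)
h = np.array([0.1, 0.4, 0.4, 0.1])    # toy "signal flow graph": one FIR filter
ref = np.convolve(x, h)               # double-precision reference output

def min_wl(spec_db=50.0):
    for wl in range(4, 24):           # grow WL until the SNR spec is met
        out = np.convolve(quantize(x, wl, wl - 3), quantize(h, wl, wl - 1))
        if snr_db(ref, out) >= spec_db:
            return wl
    return 24

print("minimum WL meeting the 50 dB spec:", min_wl())
```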


Journal Article•DOI•
TL;DR: A retargetable framework for ASIP design based on machine descriptions in the LISA language is presented, from which software development tools, including a high-level language C compiler, assembler, linker, simulator, and debugger frontend, can be generated automatically.
Abstract: The development of application-specific instruction-set processors (ASIP) is currently the exclusive domain of the semiconductor houses and core vendors. This is due to the fact that building such an architecture is a difficult task that requires expertise in different domains: application software development tools, processor hardware implementation, and system integration and verification. This paper presents a retargetable framework for ASIP design which is based on machine descriptions in the LISA language. From that, software development tools can be generated automatically including high-level language C compiler, assembler, linker, simulator, and debugger frontend. Moreover, for architecture implementation, synthesizable hardware description language code can be derived, which can then be processed by standard synthesis tools. Implementation results for a low-power ASIP for digital video broadcasting terrestrial acquisition and tracking algorithms designed with the presented methodology are given. To show the quality of the generated software development tools, they are compared in speed and functionality with commercially available tools of state-of-the-art digital signal processor and µC architectures.
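
The generate-tools-from-one-description idea in miniature: a tiny machine description drives both an "assembler" and a "simulator" for a two-instruction accumulator machine. This is purely conceptual; LISA descriptions and the generated tool chain are far richer:

```python
# hypothetical mini machine description: mnemonic -> encoding and semantics
MACHINE = {
    "ADDI": {"opcode": 0x1, "exec": lambda s, imm: s.__setitem__("acc", s["acc"] + imm)},
    "SHL":  {"opcode": 0x2, "exec": lambda s, imm: s.__setitem__("acc", s["acc"] << imm)},
}

def assemble(program):                # "assembler" derived from the description
    words = []
    for line in program:
        op, imm = line.split()
        words.append((MACHINE[op]["opcode"] << 8) | (int(imm) & 0xFF))
    return words

DECODE = {d["opcode"]: d["exec"] for d in MACHINE.values()}

def simulate(words):                  # "simulator" derived from the same description
    state = {"acc": 0}
    for w in words:
        DECODE[w >> 8](state, w & 0xFF)
    return state["acc"]

print(simulate(assemble(["ADDI 3", "SHL 2", "ADDI 1"])))  # (3 << 2) + 1 = 13
```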

Journal Article•DOI•
TL;DR: A synthesis environment for analog integrated circuits is presented that is able to drastically increase design and layout productivity for analog blocks and shows the productiveness and efficiency of the environment for the synthesis and process tuning of frequently used analog cells.
Abstract: A synthesis environment for analog integrated circuits is presented that is able to drastically increase design and layout productivity for analog blocks. The system covers the complete design flow from specification through topology selection and optimal circuit sizing down to automatic layout generation and performance characterization. It follows a hierarchical refinement strategy for more complex cells and is process independent. The sizing is based on an improved equation-based optimization approach, where the circuit behavior is characterized by declarative models that are then converted into a sequential design plan. Supporting tools have been developed to reduce the total effort to set up a new circuit topology in the system's database. The performance-driven layout generation tool guarantees layouts that satisfy all performance constraints. Redesign support is included in the design flow management to perform backtracking in case of design problems. The experimental results illustrate the productiveness and efficiency of the environment for the synthesis and process tuning of frequently used analog cells.

Journal Article•DOI•
TL;DR: Evidence is presented that no known algorithms for circuit manipulation can be used to efficiently remove or change the watermark and that the process is immune to a variety of other attacks.
Abstract: We present a methodology for the watermarking of synchronous sequential circuits that makes it possible to identify the authorship of designs by imposing a digital watermark on the state transition graph (STG) of the circuit. The methodology is applicable to sequential designs that are made available as firm intellectual property, the designation commonly used to characterize designs specified as structural hardware description languages or circuit netlists. The watermarking is obtained by manipulating the STG of the design in such a way as to make it exhibit a chosen property that is extremely rare in nonwatermarked circuits while, at the same time, not changing the functionality of the circuit. This manipulation is performed without ever actually computing this graph in either implicit or explicit form. Instead, the digital watermark is obtained by direct manipulation of the circuit description. We present evidence that no known algorithms for circuit manipulation can be used to efficiently remove or change the watermark and that the process is immune to a variety of other attacks. We present both theoretical and experimental results that show that the watermarking can be created and verified efficiently. We also test possible attack strategies and verify that they are inapplicable to realistic designs of medium to large complexity.

Journal Article•DOI•
TL;DR: This work proposes two new RC delay metrics called delay via two moments (D2M) and effective capacitance metric (ECM), which are virtually as simple and fast as the Elmore metric, but more accurate.
Abstract: For performance optimization tasks such as floorplanning, placement, buffer insertion, wire sizing, and global routing, the Elmore resistance-capacitance (RC) delay metric remains popular due to its simple closed form expression, fast computation speed, and fidelity with respect to simulation. More accurate delay computation methods are typically central processing unit intensive and/or difficult to implement. To bridge this gap between accuracy and efficiency/simplicity, we propose two new RC delay metrics called delay via two moments (D2M) and effective capacitance metric (ECM), which are virtually as simple and fast as the Elmore metric, but more accurate. D2M uses two moments of the impulse response in a simple formula that has high accuracy at the far end of RC lines. ECM captures resistive shielding effects by modeling the downstream capacitance by an "effective capacitance." In contrast, the Elmore metric models this as a lumped capacitance, thereby ignoring resistive shielding. Although not as accurate as D2M, ECM yields consistent performance and may be well-suited to optimization due to its Elmore-like recursive construction.
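
The moment computation behind D2M is compact: with m_0 = 1, the recursion m_{p+1}(i) = Σ_j R_ij · C_j · m_p(j), where R_ij is the resistance shared by the root-to-i and root-to-j paths, gives the Elmore delay as m_1, and the paper's D2M formula is ln(2) · m_1² / m_2. A sketch on a toy RC ladder (segment values invented):

```python
import math

R = [10.0] * 10             # segment resistances (ohms), root to far end
C = [5e-12] * 10            # node capacitances (farads)
n = len(R)
Rpath = [sum(R[: i + 1]) for i in range(n)]           # root-to-node resistance
Rshared = [[Rpath[min(i, j)] for j in range(n)] for i in range(n)]

# moment recursion: m1 is the Elmore delay, m2 the second moment
m1 = [sum(Rshared[i][j] * C[j] for j in range(n)) for i in range(n)]
m2 = [sum(Rshared[i][j] * C[j] * m1[j] for j in range(n)) for i in range(n)]

far = n - 1
elmore = m1[far]
d2m = math.log(2) * m1[far] ** 2 / m2[far]            # accurate at the far end
print(f"Elmore: {elmore*1e9:.3f} ns   D2M: {d2m*1e9:.3f} ns")
```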

Journal Article•DOI•
TL;DR: A two-step procedure of global density assignment followed by local insertion is proposed to solve the dummy feature placement problem in the fixed-dissection regime with both single-layer and multiple-layer considerations.
Abstract: Chemical-mechanical polishing (CMP) is an enabling technique used in deep-submicrometer VLSI manufacturing to achieve long range oxide planarization. Post-CMP oxide topography is highly related to local pattern density in the layout. To change local pattern density and, thus, ensure post-CMP planarization, dummy features are placed in the layout. Based on models that accurately describe the relation between local pattern density and post-CMP planarization by Stine et al. (1997), Ouma et al. (1998), and Yu et al. (1999), a two-step procedure of global density assignment followed by local insertion is proposed to solve the dummy feature placement problem in the fixed-dissection regime with both single-layer and multiple-layer considerations. Two experiments conducted with real design layouts gave excellent results by reducing simulated post-CMP topography variation from 767 Å to 152 Å in the single-layer formulation and by avoiding the cumulative effect in the multiple-layer formulation. The simulation result from the single-layer formulation compares very favorably both to the rule-based approach widely used in industry and to the algorithm by Kahng et al. (1999). The multiple-layer formulation has no previously published work.
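
The global density-assignment step can be pictured with a greedy per-tile stand-in: compute each tile's shortfall against a target density and fill it, capped by the free space available. The paper's actual two-step formulation optimizes densities over windows of tiles; this sketch, with invented numbers, only illustrates the data flow:

```python
import numpy as np

rng = np.random.default_rng(2)
density = rng.uniform(0.05, 0.6, size=(8, 8))   # pattern density per tile
free = 0.9 - density                            # area still open to dummy fill
target = 0.35                                   # uniformity target for post-CMP

fill = np.clip(target - density, 0.0, None)     # shortfall per tile
fill = np.minimum(fill, free)                   # respect the available space
after = density + fill

print("density spread before:", round(float(density.max() - density.min()), 3))
print("density spread after: ", round(float(after.max() - after.min()), 3))
```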

Journal Article•DOI•
Christoph Albrecht•
TL;DR: It is shown that not only is the maximum relative congestion minimized, but the congestion of the edges is also distributed equally, such that the solution is optimal in a well-defined sense.
Abstract: We show how the new approximation algorithms by Garg and Konemann, with extensions due to Fleischer, for the multicommodity flow problem can be modified to solve the linear programming relaxation of the global routing problem. Implementation issues that improve performance are discussed, such as the choice of functions for the dual variables and the use of Newton's method as an additional optimization step. It is shown that not only is the maximum relative congestion minimized, but the congestion of the edges is also distributed equally, such that the solution is optimal in a well-defined sense: the vector of the relative congestion of the edges, sorted in nonincreasing order, is minimal by lexicographic order. This is an important step toward improving signal integrity by extra spacing between wires. Finally, we show how the weighted netlength can be minimized. Our computational results with recent IBM processor chips show that this approach can be used in practice even for large chips and that it is superior on difficult instances where ripup and reroute algorithms fail.
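
The multiplicative-weights mechanism at the heart of the Garg-Konemann scheme is small enough to sketch: each net repeatedly takes its cheapest candidate path under the current edge lengths, and every used edge's length grows exponentially in its relative load. The candidate paths, capacities, and fixed round count are illustrative simplifications of the real algorithm, which routes on the full grid graph:

```python
EPS, ROUNDS = 0.1, 200
cap = {"e1": 2.0, "e2": 2.0, "e3": 1.0}
paths = {                          # per net: alternative edge sets
    "netA": [("e1",), ("e2", "e3")],
    "netB": [("e2",), ("e1", "e3")],
}
length = {e: 1.0 for e in cap}     # dual variables, one per edge
flow = {e: 0.0 for e in cap}

for _ in range(ROUNDS):
    for net, cands in paths.items():
        best = min(cands, key=lambda p: sum(length[e] for e in p))
        for e in best:
            flow[e] += 1.0 / ROUNDS                      # route a small fraction
            length[e] *= 1.0 + EPS * (1.0 / ROUNDS) / cap[e]

congestion = {e: flow[e] / cap[e] for e in cap}
print(sorted(congestion.items()))  # relative congestion per edge
```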

Journal Article•DOI•
TL;DR: Experimental results show that the power management method based on a Markov decision process outperforms heuristic methods by as much as 44% in terms of power dissipation savings for a given level of system performance.
Abstract: The goal of a dynamic power management policy is to reduce the power consumption of an electronic system by putting system components into different states, each representing a certain performance and power consumption level. The policy determines the type and timing of these transitions based on the system history, workload, and performance constraints. In this paper we propose a new abstract model of a power-managed electronic system. We formulate the problem of system-level power management as a controlled optimization problem based on the theories of continuous-time Markov decision processes and stochastic networks. This problem is solved exactly using linear programming or heuristically using "policy iteration." Our method is compared with existing heuristic methods for different workload statistics. Experimental results show that the power management method based on a Markov decision process outperforms heuristic methods by as much as 44% in terms of power dissipation savings for a given level of system performance.
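
A deliberately simplified, discrete-time stand-in for the formulation (the paper works in continuous time and solves the model exactly via linear programming or policy iteration): value iteration over a three-state power-managed system. All probabilities and costs are invented, and the performance constraint is omitted:

```python
STATES = ["busy", "idle", "sleep"]
ACTIONS = {"busy": ["work"], "idle": ["stay_on", "go_sleep"], "sleep": ["wake"]}
POWER = {"busy": 2.5, "idle": 1.5, "sleep": 0.2}        # cost per step
P = {  # P[state][action][next_state]
    "busy":  {"work":     {"busy": 0.7, "idle": 0.3}},
    "idle":  {"stay_on":  {"busy": 0.4, "idle": 0.6},
              "go_sleep": {"sleep": 1.0}},
    "sleep": {"wake":     {"busy": 0.4, "sleep": 0.6}},
}
WAKE_PENALTY = {"wake": 1.0}        # extra energy for the state transition
GAMMA = 0.95

V = {s: 0.0 for s in STATES}
for _ in range(500):                # value iteration to convergence
    V = {s: min(POWER[s] + WAKE_PENALTY.get(a, 0.0)
                + GAMMA * sum(p * V[t] for t, p in P[s][a].items())
                for a in ACTIONS[s])
         for s in STATES}

policy = {s: min(ACTIONS[s], key=lambda a: POWER[s] + WAKE_PENALTY.get(a, 0.0)
                 + GAMMA * sum(p * V[t] for t, p in P[s][a].items()))
          for s in STATES}
print(policy, {s: round(v, 2) for s, v in V.items()})
```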

Journal Article•DOI•
Jaewon Oh, Massoud Pedram•
TL;DR: This paper constructs a clock-tree topology based on the locations and the activation frequencies of the modules, while the locations of the internal nodes of the clock tree are determined using a dynamic programming approach followed by a gate reduction heuristic.
Abstract: This paper presents a zero-skew gated clock routing technique for VLSI circuits. Gated clock trees include masking gates at the internal nodes of the clock tree, which are selectively turned on and off by the gate control signals during the active and idle times of the circuit modules to reduce the switched capacitance of the clock tree. We construct a clock-tree topology based on the locations and the activation frequencies of the modules, while the locations of the internal nodes of the clock tree (and, hence, the masking gates) are determined using a dynamic programming approach followed by a gate reduction heuristic. This work assumes that the gates are turned on/off by a centralized controller. Therefore, the additional power and routing area incurred by the controller and the gate control signal routing are examined. Various tradeoffs between power and area for different design options and module activities are discussed and detailed experimental results are presented. Finally, good design practices for implementing the gated clocks are suggested.

Journal Article•DOI•
TL;DR: This work presents the first technique that leverages the unique characteristics of field-programmable gate arrays (FPGAs) to protect commercial investment in intellectual property through fingerprinting.
Abstract: As current computer-aided design (CAD) tool and very large scale integration technology capabilities create a new market of reusable digital designs, the economic viability of this new core-based design paradigm is pending on the development of techniques for intellectual property protection. This work presents the first technique that leverages the unique characteristics of field-programmable gate arrays (FPGAs) to protect commercial investment in intellectual property through fingerprinting. A hidden encrypted mark is embedded into the physical layout of a digital circuit when it is placed and routed onto the FPGA. This mark uniquely identifies both the circuit origin and original circuit recipient, yet is difficult to detect and/or remove, even via recipient collusion. While this approach imposes additional constraints on the backend CAD tools for circuit place and route, experiments indicate that the performance and area impacts are minimal.

Journal Article•DOI•
TL;DR: An activity-driven clock gate insertion problem is proposed to minimize the system's power consumption by constructing an activity-driven clock tree.
Abstract: In this paper, we investigate reducing the power consumption of a synchronous digital system by minimizing the total power consumed by the clock signals. We construct activity-driven clock trees wherein sections of the clock tree are turned off by gating the clock signals. Since gating the clock signal implies that additional control signals and gates are needed, there exists a tradeoff between the amount of clock tree gating and the total power consumption of the clock tree. We exploit similarities in the switching activity of the clocked modules to reduce the number of clock gates. Assuming a given switching activity of the modules, we propose three novel activity-driven problems: a clock tree construction problem, a clock gate insertion problem, and a zero-skew clock gate insertion problem. The objective of these problems is to minimize the system's power consumption by constructing an activity-driven clock tree. We propose an approximation algorithm based on recursive matching to solve the clock tree construction problem. We also propose an exact algorithm employing the dynamic programming paradigm to solve the gate insertion problems. Finally, we present experimental results that verify the effectiveness of our approach. This paper is a step in understanding how high-level decisions (e.g., behavioral design) can affect a low-level design (e.g., clock design).
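
The recursive-matching flavor can be sketched greedily: repeatedly merge the two "closest" subtrees, where closeness mixes physical distance with activity-profile similarity, so modules that idle together end up under a common gate. The cost weights and the greedy pairing below are illustrative simplifications of the matching algorithm:

```python
import itertools

# each leaf: (x, y, activity bit-vector over control steps)
leaves = [
    (0, 0, (1, 1, 0, 0)), (1, 0, (1, 1, 0, 0)),
    (8, 8, (0, 0, 1, 1)), (9, 8, (0, 0, 1, 1)),
]
nodes = [{"pos": (x, y), "act": a, "sub": (x, y)} for x, y, a in leaves]

def cost(u, v):
    dist = abs(u["pos"][0] - v["pos"][0]) + abs(u["pos"][1] - v["pos"][1])
    mismatch = sum(a != b for a, b in zip(u["act"], v["act"]))
    return dist + 4 * mismatch       # the weight here is an arbitrary choice

while len(nodes) > 1:
    u, v = min(itertools.combinations(nodes, 2), key=lambda p: cost(*p))
    nodes.remove(u); nodes.remove(v)
    nodes.append({
        "pos": tuple((a + b) / 2 for a, b in zip(u["pos"], v["pos"])),
        "act": tuple(a | b for a, b in zip(u["act"], v["act"])),  # gate is on
        "sub": (u["sub"], v["sub"]),                              # if either child is
    })

print(nodes[0]["sub"])   # pairs modules with matching activity profiles first
```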

Journal Article•DOI•
TL;DR: It is shown that a full-wave partial element equivalent circuit method, which includes the delays among the partial elements, leads to an efficient solver enabling the analysis of large, meaningful problems and yielding helpful insights for realistic very large scale integration wiring.
Abstract: With the advances in the speed of high-performance chips, inductance effects in some on-chip interconnects have become significant. Specific networks such as clock distributions and other highly optimized circuits are especially impacted by inductance. Several difficult aspects have to be overcome to obtain valid waveforms for problems where inductances contribute significantly. Mainly, the geometries are very complex and the interactions between the capacitive and inductive currents have to be taken into account simultaneously. In this paper, we show that a full-wave partial element equivalent circuit method, which includes the delays among the partial elements, leads to an efficient solver enabling the analysis of large meaningful problems. Applying this method to several examples leads to helpful insights for realistic very large scale integration wiring problems. It is shown in this paper that the impact of overshoot, reflections, and inductive coupling is critical for the design of critical on-chip interconnects.
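
PEEC models decompose conductors into partial elements. A widely used closed-form estimate (Ruehli/Grover style) for the partial self-inductance of a thin rectangular bar is Lp ≈ (µ0/2π) · l · [ln(2l/(w+t)) + 1/2 + 0.2235(w+t)/l]; the snippet below only evaluates that textbook approximation, assumed here for illustration, and is in no way a full-wave solver:

```python
import math

MU0_OVER_2PI = 2e-7   # H/m

def partial_self_inductance(l, w, t):
    """Approximate partial self-inductance of a rectangular bar; dims in meters."""
    u = (w + t) / l
    return MU0_OVER_2PI * l * (math.log(2.0 / u) + 0.5 + 0.2235 * u)

# a 1 mm long, 1 um x 0.5 um on-chip wire segment:
L = partial_self_inductance(1e-3, 1e-6, 0.5e-6)
print(f"Lp ~ {L*1e9:.2f} nH")         # roughly the 1 nH/mm rule of thumb
```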

Journal Article•DOI•
TL;DR: This work uses integer linear programming to minimize the processing time by automatically extracting parallelism from a biochemical assay and applies the optimization method to the polymerase chain reaction, an important step in many lab-on-a-chip biochemical applications.
Abstract: We present an architectural design and optimization methodology for performing biochemical reactions using two-dimensional (2-D) electrowetting arrays. We define a set of basic microfluidic operations and leverage electronic design automation principles for system partitioning, resource allocation, and operation scheduling. Fluidic operations are carried out through the electrostatic configuration of a set of grid points. While concurrency is desirable to minimize processing time, the size of the 2-D array limits the number of concurrent operations of any type. Furthermore, functional dependencies between the operations also limit concurrency. We use integer linear programming to minimize the processing time by automatically extracting parallelism from a biochemical assay. As a case study, we apply our optimization method to the polymerase chain reaction, which is an important step in many lab-on-a-chip biochemical applications.
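
To make the scheduling constraints concrete, here is the same kind of problem instance solved with a simple greedy list scheduler instead of the paper's integer linear program: operations have durations and functional dependencies, and array area caps concurrency. The tiny assay is illustrative, not the PCR case study:

```python
OPS = {  # op: (duration, dependencies)
    "dispense_A": (1, []),
    "dispense_B": (1, []),
    "mix_AB":     (3, ["dispense_A", "dispense_B"]),
    "dispense_C": (1, []),
    "mix_ABC":    (3, ["mix_AB", "dispense_C"]),
    "detect":     (2, ["mix_ABC"]),
}
MAX_CONCURRENT = 2        # array area limits concurrent operations

t, running, done, start = 0, {}, set(), {}
while len(done) < len(OPS):
    for op, end in list(running.items()):      # retire finished operations
        if end <= t:
            done.add(op); del running[op]
    for op, (dur, deps) in OPS.items():        # greedily launch ready operations
        if (op not in done and op not in running
                and all(d in done for d in deps)
                and len(running) < MAX_CONCURRENT):
            running[op] = t + dur
            start[op] = t
    t += 1

print(start, "makespan:", max(t0 + OPS[o][0] for o, t0 in start.items()))
```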

Journal Article•DOI•
TL;DR: This paper proposes a new pattern generation technique for delay testing and dynamic timing analysis that can take into account the impact of the power supply noise on the signal propagation delays and shows that the new patterns produce significantly longer delays on the selected paths.
Abstract: Noise effects such as power supply and crosstalk noise can significantly impact the performance of deep submicrometer designs. Existing delay testing and timing analysis techniques cannot capture the effects of noise on the signal/cell delays. Therefore, these techniques cannot capture the worst case timing scenarios and the predicted circuit performance might not reflect the worst case circuit delay. More accurate and efficient timing analysis and delay testing strategies need to be developed to predict and guarantee the performance of deep submicrometer designs. In this paper, we propose a new pattern generation technique for delay testing and dynamic timing analysis that can take into account the impact of the power supply noise on the signal propagation delays. In addition to sensitizing the selected paths, the new patterns also cause high power supply noise on the nodes in these paths. Thus, they also cause longer propagation delays for the nodes along the paths. Our experimental results on benchmark circuits show that the new patterns produce significantly longer delays on the selected paths compared to the patterns derived using existing pattern generation methods.

Journal Article•DOI•
TL;DR: Experimental results show that exploiting integer bitwidth substantially reduces the gate count of PICO-synthesized hardware accelerators across a range of applications.
Abstract: Program-in chip-out (PICO) is a system for automatically synthesizing embedded hardware accelerators from loop nests specified in the C programming language. A key issue confronted when designing such accelerators is the optimization of hardware by exploiting information that is known about the varying number of bits required to represent and process operands. In this paper, we describe the handling and exploitation of integer bitwidth in PICO. A bitwidth analysis procedure is used to determine bitwidth requirements for all integer variables and operations in a C application. Given known bitwidths for all variables, complex problems arise when determining a program schedule that specifies on which function unit (FU) and at what time each operation executes. If operations are assigned to FUs with no knowledge of bitwidth, bitwidth-related cost benefit is lost when each unit is built to accommodate the widest operation assigned. By carefully placing operations of similar width on the same unit, hardware costs are decreased. This problem is addressed using a preliminary clustering of operations that is based jointly on width and implementation cost. These clusters are then honored during resource allocation and operation scheduling to create an efficient width-conscious design. Experimental results show that exploiting integer bitwidth substantially reduces the gate count of PICO-synthesized hardware accelerators across a range of applications.
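
The width-clustering step can be illustrated with a small dynamic program: operations sorted by bitwidth are partitioned among function units, each unit paying for its widest assigned operation, with a per-unit capacity standing in for schedule constraints. The cost model and numbers are illustrative, not PICO's:

```python
import functools

widths = sorted([4, 6, 8, 8, 9, 16, 17, 24, 32, 32])  # operation bitwidths
K, CAP = 3, 4         # function units, max operations schedulable on one unit

@functools.lru_cache(maxsize=None)
def best(i, k):
    """Min total cost of packing widths[i:] into k units (cost = unit's max width)."""
    if i == len(widths):
        return 0 if k == 0 else float("inf")
    if k == 0:
        return float("inf")
    return min(widths[j - 1] + best(j, k - 1)          # sorted, so max = last
               for j in range(i + 1, min(i + CAP, len(widths)) + 1))

print("width-conscious cost:", best(0, K))        # 6 + 16 + 32 = 54
print("width-oblivious cost:", K * max(widths))   # every unit built 32 bits wide
```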

Journal Article•DOI•
TL;DR: A deterministic floorplanning algorithm utilizing the structure of the O tree is developed, showing promising performance with an average 16% improvement in wire length and 1% less dead space over a previous central processing unit (CPU)-intensive cluster refinement method.
Abstract: We present an ordered tree (O tree) structure to represent nonslicing floorplans. The O tree uses only n(2 + ⌈lg n⌉) bits for a floorplan of n rectangular blocks. We define an admissible placement as a compacted placement in both x and y directions. For each admissible placement, we can find an O-tree representation. We show that the number of possible O-tree combinations is O(n! 2^(2n-2) / n^1.5). This is very concise compared to a sequence pair representation that has O((n!)^2) combinations. The approximate ratio of sequence pair to O-tree combinations is O(n^2 (n/4e)^n). The complexity of the O tree is even smaller than that of a binary tree structure for slicing floorplans, which has O(n! 2^(5n-3) / n^1.5) combinations. Given an O tree, it takes only linear time to construct the placement and its constraint graph. We have developed a deterministic floorplanning algorithm utilizing the structure of the O tree. Empirical results on MCNC (www.mcnc.org) benchmarks show promising performance with an average 16% improvement in wire length and 1% less dead space over a previous central processing unit (CPU)-intensive cluster refinement method.
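
Decoding the representation into an admissible placement is straightforward to sketch: a DFS bit string ('0' descend, '1' return) plus a block permutation give each block x(child) = x(parent) + width(parent), with y found by compacting against already-placed blocks (a simple O(n^2) stand-in for the contour that enables linear-time construction). The example blocks are arbitrary:

```python
def decode(bits, perm, sizes):
    placed = {}                        # name -> (x, y, w, h)
    stack = [("ROOT", 0.0)]            # (name, right edge); root wall at x = 0
    names = iter(perm)
    for b in bits:
        if b == "1":                   # '1' = climb back toward the root
            stack.pop()
            continue
        name = next(names)             # '0' = descend: place the next block
        w, h = sizes[name]
        x = stack[-1][1]               # abuts its parent on the right
        # lowest y not overlapping any placed block sharing this x-span
        y = max((py + ph for px, py, pw, ph in placed.values()
                 if px < x + w and x < px + pw), default=0.0)
        placed[name] = (x, y, w, h)
        stack.append((name, x + w))
    return placed

sizes = {"a": (4, 3), "b": (3, 2), "c": (2, 2), "d": (3, 3)}
print(decode("00110011", "abcd", sizes))
# -> a at (0,0), b at (4,0), c at (0,3), d at (2,3): a compacted placement
```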

Journal Article•DOI•
TL;DR: A low-overhead scheme for achieving complete (100%) fault coverage during built-in self test of circuits with scan is presented and experimental results indicate that complete fault coverage can be obtained with low hardware overhead.
Abstract: A low-overhead scheme for achieving complete (100%) fault coverage during built-in self test of circuits with scan is presented. It does not require modifying the function logic and does not degrade system performance (beyond using scan). Deterministic test cubes that detect the random-pattern-resistant (r.p.r.) faults are embedded in a pseudorandom sequence of bits generated by a linear feedback shift register (LFSR). This is accomplished by altering the pseudorandom sequence by adding logic at the LFSR's serial output to "fix" certain bits. A procedure for synthesizing the bit-fixing logic for embedding the test cubes is described. Experimental results indicate that complete fault coverage can be obtained with low hardware overhead. Further reduction in overhead is possible by using a special correlating automatic test pattern generation procedure that is described for finding test cubes for the r.p.r. faults in a way that maximizes bitwise correlation.
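
A toy rendering of the bit-fixing idea: an LFSR feeds the scan chain as usual, and small fixing logic forces selected bit positions of a selected pattern so that a deterministic cube for a random-pattern-resistant fault gets embedded in the pseudorandom stream. The polynomial, cube, and trigger index are illustrative; synthesizing minimal fixing logic is the paper's actual subject:

```python
def lfsr_patterns(seed, taps, width, count):
    """Generate `count` scan patterns of `width` bits from a 16-bit LFSR."""
    state, patterns = seed, []
    for _ in range(count):
        bits = []
        for _ in range(width):
            bits.append(state & 1)                 # serial output bit
            fb = 0
            for t in taps:                         # XOR feedback taps
                fb ^= (state >> t) & 1
            state = ((state << 1) | fb) & 0xFFFF
        patterns.append(bits)
    return patterns

CUBE = {0: 1, 3: 0, 7: 1, 11: 1}   # deterministic cube for an r.p.r. fault
FIX_AT = 5                         # pattern index where the cube is embedded

def with_bit_fixing(patterns):
    for i, p in enumerate(patterns):
        if i == FIX_AT:            # the added logic "fixes" these serial bits
            for pos, val in CUBE.items():
                p[pos] = val
        yield p

for i, p in enumerate(with_bit_fixing(lfsr_patterns(0xACE1, (15, 13, 12, 10), 16, 8))):
    print(i, "".join(map(str, p)))
```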

Journal Article•DOI•
TL;DR: An algorithm is presented for generating test patterns automatically from functional register-transfer level (RTL) circuits, targeting detection of stuck-at faults in the circuit at the logic level, using a data structure named the assignment decision diagram that has been proposed previously in the field of high-level synthesis.
Abstract: In this paper, we present an algorithm for generating test patterns automatically from functional register-transfer level (RTL) circuits that target detection of stuck-at faults in the circuit at the logic level. In order to do this, we utilize a data structure named assignment decision diagram that has been proposed previously in the field of high-level synthesis. With the advent of RTL synthesis tools, functional RTL designs are now widely used in the industry to cut design turn around time. This paper addresses the problem of test pattern generation directly at this level due to a number of advantages inherent at the RTL. Since the number of primitive elements at the RTL is usually less than the logic level, the problem size is reduced leading to a reduction in the test-generation time over logic-level automatic test pattern generation (ATPG). Also, a reduction in the number of backtracks can lead to improved fault coverage and reduced test application time over logic-level techniques. The test patterns thus generated can also be used to perform RTL-RTL and RTL-logic validation. The algorithm is very versatile and can tackle almost any type of single-clock design, although performance varies according to the design style. It gracefully degrades to an inefficient logic-level ATPG algorithm if it is applied to a logic-level circuit. Experimental results demonstrate that over 1000 times reduction in test-generation time can be achieved by this algorithm on certain types of RTL circuits without any compromise in fault coverage.

Journal Article•DOI•
TL;DR: This paper presents a very simple and efficient O(n^2) algorithm to solve the sequence pair evaluation problem and shows that, using a more sophisticated data structure, the algorithm can be implemented to run in O(n log log n) time.
Abstract: Murata et al. (1996) introduced an elegant representation of block placement called the sequence pair. All block-placement algorithms that are based on sequence pairs use simulated annealing, where the generation and evaluation of a large number of sequence pairs is required. Therefore, a fast algorithm is needed to evaluate each generated sequence pair, i.e., to translate the sequence pair to its corresponding block placement. This paper presents a new approach to evaluate a sequence pair based on computing the longest common subsequence in a pair of weighted sequences. We present a very simple and efficient O(n^2) algorithm to solve the sequence pair evaluation problem. We also show that, using a more sophisticated data structure, the algorithm can be implemented to run in O(n log log n) time. Both implementations of our algorithm are significantly faster than the previous O(n^2) graph-based algorithm. For example, we achieve 60x speedup over the previous algorithm when input size n = 128. As a result, we can examine a million sequence pairs within one minute for typical input sizes of placement problems. For all MCNC benchmark block-placement problems, we have obtained the best results ever reported in the literature (including those reported by algorithms based on O tree and B* tree) with significantly less runtime. For example, the best known result for ami49 (368 mm^2) was obtained by a B*-tree-based algorithm using 4752 s, and we obtained a better result (365 mm^2) in 31 s.
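
The O(n^2) evaluation is compact enough to show in full: in a sequence pair, block b lies left of a iff b precedes a in both sequences, and below a iff b follows a in the first sequence but precedes it in the second; each coordinate is then a longest weighted path. A sketch with illustrative blocks (the paper's O(n log log n) variant requires the more sophisticated data structure):

```python
def evaluate(s_plus, s_minus, w, h):
    pp = {b: i for i, b in enumerate(s_plus)}
    pm = {b: i for i, b in enumerate(s_minus)}
    x = {b: 0.0 for b in s_plus}
    y = {b: 0.0 for b in s_plus}
    for a in sorted(s_plus, key=pm.get):         # a topological order for both
        for b in s_plus:
            if b == a:
                continue
            if pp[b] < pp[a] and pm[b] < pm[a]:  # b is left of a
                x[a] = max(x[a], x[b] + w[b])
            if pp[b] > pp[a] and pm[b] < pm[a]:  # b is below a
                y[a] = max(y[a], y[b] + h[b])
    return x, y

w = {"a": 4, "b": 3, "c": 2}; h = {"a": 3, "b": 2, "c": 2}
x, y = evaluate(("a", "b", "c"), ("b", "a", "c"), w, h)
print(x, y)   # chip width/height = max over blocks of (x + w, y + h)
```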

Journal Article•DOI•
TL;DR: This paper presents a multilayer gridless detailed routing system for deep submicrometer physical designs that features an efficient point-to-point gridless routing algorithm using an implicit representation of a nonuniform grid graph and a coarse grid-based wire-planning algorithm that uses exact gridless design rules to accurately estimate the routing resources and distribute nets into routing regions.
Abstract: Advances of very large scale integration technologies present two challenges for routing problems: (1) the higher integration of transistors due to shrinking feature size and (2) the requirement for off-grid routing due to the variable-width variable-spacing design rules imposed by optimization techniques. In this paper, we present a multilayer gridless detailed routing system for deep submicrometer physical designs. Our detailed routing system uses a hybrid approach consisting of two parts: (1) an efficient variable-width variable-spacing detailed routing engine and (2) a wire-planning algorithm providing high-level guidance as well as ripup and reroute capabilities. Our gridless routing engine is based on an efficient point-to-point gridless routing algorithm using an implicit representation of a nonuniform grid graph. We proved that such a graph guarantees a gridless connection of the minimum cost in the multilayer variable-width and variable-spacing routing problem. A novel data structure using a two-level slit tree plus an interval tree in combination with a cache structure is developed to support efficient queries into the connection graph. Our experiments show that this data structure is very efficient in memory usage while very fast in answering maze expansion related queries. Our detailed routing system also features a coarse grid-based wire-planning algorithm that uses exact gridless design rules (variable-width and variable-spacing) to accurately estimate the routing resources and distribute nets into routing regions. The wire-planning method also enables efficient ripup and reroute in gridless routing. Unlike previous approaches for gridless routing that explore alternatives of blocked nets by gradually tightening the design rules, our planning-based approach can take the exact gridless rules and resolve the congestion and blockage at a higher level. Our experimental results show that using the wire-planning algorithm in our detailed routing system can improve the routability and also speed up the runtime by 3 to 17 times.

Journal Article•DOI•
TL;DR: These models are extremely efficient, yet provide a high degree of accuracy; they have been tested on a wide range of parameters and shown to have over 90% accuracy on average compared to running the best available interconnect layout optimization algorithms directly.
Abstract: This paper presents a set of interconnect performance estimation models for design planning with consideration of various effective interconnect layout optimization techniques, including optimal wire sizing, simultaneous driver and wire sizing, and simultaneous buffer insertion/sizing and wire sizing. These models are extremely efficient, yet provide a high degree of accuracy. They have been tested on a wide range of parameters and shown to have over 90% accuracy on average compared to running the best available interconnect layout optimization algorithms directly. As a result, these fast yet accurate models can be used efficiently during high-level design space exploration, interconnect-driven design planning/synthesis, and timing-driven placement to ensure design convergence for deep submicrometer designs.
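
As a flavor of this kind of closed-form estimation (not the paper's fitted models), the classical Bakoglu-style formulas estimate the optimal repeater count, repeater size, and resulting delay of a buffered line directly from wire and buffer parameters. All device and wire numbers below are illustrative:

```python
import math

r, c = 40e3, 200e-12          # wire resistance/capacitance per meter (illustrative)
R_b, C_b = 1e3, 10e-15        # buffer output resistance / input capacitance
L = 5e-3                      # a 5 mm global wire

R_int, C_int = r * L, c * L
k = math.sqrt(0.4 * R_int * C_int / (0.7 * R_b * C_b))   # optimal buffer count
h = math.sqrt(R_b * C_int / (R_int * C_b))               # optimal buffer size
t50 = 2.5 * math.sqrt(R_b * C_b * R_int * C_int)         # delay estimate

print(f"buffers ~ {k:.1f}, size ~ {h:.1f}x minimum, delay ~ {t50*1e12:.0f} ps")
```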
Abstract: This paper presents a set of interconnect performance estimation models for design planning with consideration of various effective interconnect layout optimization techniques, including optimal wire sizing, simultaneous driver and wire sizing, and simultaneous buffer insertion/sizing and wire sizing. These models are extremely efficient, yet provide high degree of accuracy. They have been tested on a wide range of parameters and shown to have over 90% accuracy on average compared to running best-available interconnect layout optimization algorithms directly. As a result, these fast yet accurate models can be used efficiently during high-level design space exploration, interconnect-driven design planning/synthesis, and timing-driven placement to ensure design convergence for deep submicrometer designs.