
Showing papers in "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems in 2004"


Journal ArticleDOI
TL;DR: This paper presents a novel test-data volume-compression methodology called the embedded deterministic test (EDT), which reduces manufacturing test cost by providing one to two orders of magnitude reduction in scan test data volume and scan test time.
Abstract: This paper presents a novel test-data volume-compression methodology called the embedded deterministic test (EDT), which reduces manufacturing test cost by providing one to two orders of magnitude reduction in scan test data volume and scan test time. The presented scheme is widely applicable and easy to deploy because it is based on the standard scan/ATPG methodology and adopts a very simple flow. It is nonintrusive as it does not require any modifications to the core logic such as the insertion of test points or logic bounding unknown states. The EDT scheme consists of logic embedded on a chip and a new deterministic test-pattern generation technique. The main contributions of the paper are test-stimuli compression schemes that allow us to deliver test data to the on-chip continuous-flow decompressor. In particular, this can be done by repeating certain patterns at rates adjusted to the requirements of the test cubes. Experimental results show that for industrial circuits with test cubes with very low fill rates, ranging from 3% to 0.2%, these schemes result in compression ratios of 30 to 500 times. A comprehensive analysis of the encoding efficiency of the proposed compression schemes is also provided.
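
A back-of-the-envelope illustration (not from the paper) of how such compression ratios relate to test-cube fill rate, assuming the compressed stream need only carry roughly the specified bits divided by an encoding efficiency; the Python below is a hedged sketch with illustrative numbers.

```python
# Back-of-the-envelope estimate of EDT-style stimulus compression (hedged sketch).
# Assumption: the compressed stream only needs to carry roughly the specified
# (care) bits of each test cube, divided by the encoding efficiency of the
# decompressor; the numbers below are illustrative, not from the paper.

def compression_ratio(fill_rate, encoding_efficiency=1.0):
    """Estimated ratio of uncompressed to compressed test-data volume.

    fill_rate           -- fraction of specified bits in the test cubes (e.g. 0.03)
    encoding_efficiency -- specified bits encoded per injected bit (<= 1 in practice)
    """
    return encoding_efficiency / fill_rate

for fill in (0.03, 0.01, 0.002):
    print(f"fill rate {fill:5.1%} -> ~{compression_ratio(fill):.0f}x compression")
# With 3% and 0.2% fill rates this simple model gives ~33x and ~500x, in line
# with the 30x-500x range reported in the abstract.
```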

529 citations


Journal ArticleDOI
TL;DR: An approach for generating accurate geometrically parameterized integrated circuit interconnect models that are efficient enough for use in interconnect synthesis is described, based on a multiparameter moment matching model-reduction algorithm.
Abstract: In this paper, we describe an approach for generating accurate geometrically parameterized integrated circuit interconnect models that are efficient enough for use in interconnect synthesis. The model-generation approach presented is automatic, and is based on a multiparameter moment matching model-reduction algorithm. A proof of the moment-matching theorem for the algorithm is derived, along with a complexity analysis of the model-order growth. The effectiveness of the technique is tested using a capacitance extraction example, where the plate spacing is considered as the geometric parameter, and a multiline bus example, where both wire spacing and wire width are considered as geometric parameters. Experimental results demonstrate that the generated models accurately predict capacitance values for the capacitor example, and both delay and cross-talk effects over a reasonably wide range of spacing and width variation for the multiline bus example.

362 citations


Journal ArticleDOI
Subhasish Mitra1, Kee Sup Kim1
TL;DR: X-Compact is an X-tolerant test response compaction technique that enables up to exponential reduction in the test response data volume and the number of pins required to collect test response from a chip.
Abstract: X-Compact is an X-tolerant test response compaction technique. It enables up to exponential reduction in the test response data volume and the number of pins required to collect test response from a chip. The compaction hardware requires negligible area, does not add any extra delay during normal operation, guarantees detection of defective chips even in the presence of unknown logic values (often referred to as X's), and preserves diagnosis capabilities for most practical scenarios. The technique has minimum impact on current design and test flows, and can be used to reduce test time, test-data volume, test-input/output pins and tester channels, and also to improve test quality.
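
A minimal sketch of the matrix-based, X-tolerant space compaction idea: each scan chain feeds several XOR trees according to a binary compactor matrix, and an unknown value only obscures the compactor outputs that chain drives. The 4-by-3 matrix and slice values below are illustrative, not the construction from the paper.

```python
# Minimal sketch of X-tolerant space compaction in the spirit of X-Compact.
# Each scan-chain output drives several XOR trees according to a binary
# compactor matrix; an unknown (X) on a chain makes exactly those compactor
# outputs unknown. The 4-chain-to-3-output matrix below is illustrative only.

COMPACTOR = [  # rows: scan chains, columns: compactor outputs
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]

def compact(slice_bits):
    """Compact one scan-out slice; bits are 0, 1, or 'X'."""
    outputs = []
    for col in range(len(COMPACTOR[0])):
        val = 0
        for row, bit in enumerate(slice_bits):
            if COMPACTOR[row][col]:
                if bit == 'X':        # an X propagates to this output
                    val = 'X'
                    break
                val ^= bit
        outputs.append(val)
    return outputs

print(compact([1, 0, 1, 1]))   # fully known slice -> definite signature bits
print(compact([1, 'X', 1, 1])) # an X obscures only the outputs chain 1 feeds
```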

240 citations


Journal ArticleDOI
TL;DR: It is shown that any Boolean function can be realized as a reversible network in terms of this new approach by giving the theoretical method of finding such a network.
Abstract: The problem of minimizing the number of garbage outputs is an important issue in reversible logic design. We start with the analysis of the number of garbage outputs that must be added to a multiple output function to make it reversible. We give a precise formula for the theoretical minimum of the required number of garbage outputs. For some benchmark functions, we calculate the garbage required by some proposed reversible design methods and compare it to the theoretical minimum. Based on the information about minimal garbage, we suggest a new reversible design method that uses the minimum number of garbage outputs. We show that any Boolean function can be realized as a reversible network in terms of this new approach by giving the theoretical method of finding such a network. Using a heuristic synthesis approach, we create a program and run it to compare results of our synthesis to the previously reported synthesis results for the benchmark functions with up to ten variables. Finally, we show that the synthesis for the proposed model can be accomplished with lower cost than the synthesis of EXOR programmable logic arrays.
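
A hedged sketch of the lower bound the abstract refers to, assuming it is the standard result that at least ⌈log2 M⌉ garbage outputs are needed, where M is the largest number of input patterns mapping to the same output pattern.

```python
# Hedged sketch: minimum garbage outputs needed to make a multiple-output
# Boolean function reversible, assuming the standard bound ceil(log2 M),
# where M is the largest number of input patterns mapping to one output pattern.

from math import ceil, log2
from collections import Counter

def min_garbage(truth_table):
    """truth_table: list of output tuples, one entry per input pattern."""
    multiplicity = Counter(truth_table)
    m = max(multiplicity.values())
    return ceil(log2(m)) if m > 1 else 0

# Example: 2-input AND has outputs (0, 0, 0, 1); the output 0 repeats 3 times,
# so at least ceil(log2 3) = 2 garbage outputs are required.
and_gate = [(0,), (0,), (0,), (1,)]
print(min_garbage(and_gate))  # -> 2
```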

204 citations


Journal ArticleDOI
TL;DR: The implications of exponentially increasing repeater and clocked repeater counts on the algorithms and methodologies used for physical synthesis and full-chip assembly are studied, showing that mere capacity scaling of current algorithms and methods is insufficient to handle the new challenges.
Abstract: We study scaling in the context of typical block-level wiring distributions, and identify its impact on the design process. In particular, we study the implications of exponentially increasing repeater and clocked repeater counts on the algorithms and methodologies used for physical synthesis and full-chip assembly, showing that mere capacity scaling of current algorithms and methodologies is insufficient to handle the new challenges. Finally, we suggest a few approaches to tackle these challenges by constructing a case for abstract fabrics.

199 citations


Journal ArticleDOI
TL;DR: A scan architecture with mutually exclusive scan segment activation which overcomes the shortcomings of previous approaches and achieves both shift and capture-power reduction with no impact on the performance of the design, and with minimal impact on area and testing time.
Abstract: Power dissipation during scan testing is becoming an important concern as design sizes and gate densities increase. While several approaches have been recently proposed for reducing power dissipation during the shift cycle (minimum-transition don't care fill, special scan cells, and scan chain partitioning), limited work has been carried out toward reducing the peak power during test response capture and the few existing approaches for reducing capture power rely on complex automatic test pattern generation (ATPG) algorithms. This paper proposes a scan architecture with mutually exclusive scan segment activation which overcomes the shortcomings of previous approaches. The proposed architecture achieves both shift and capture-power reduction with no impact on the performance of the design, and with minimal impact on area and testing time (typically 2%-3%). An algorithmic procedure for assigning flip-flops to scan segments enables reuse of test patterns generated by standard ATPG tools. An implementation of the proposed method has been integrated into an automated design flow using commercial synthesis and simulation tools, and applied to a wide range of benchmark designs. Reductions up to 57% in average power, and up to 44% and 34% in peak-power dissipation during shift and capture cycles, respectively, were obtained when using two scan segments. Increasing the number of scan segments to six leads to reductions of 96% in average power and 80% in the maximum number of simultaneous transitions.

196 citations


Journal ArticleDOI
TL;DR: A novel method of circuit-compatible modeling of single-walled semiconducting CNFETs in their ultimate performance limit is presented; for the first time, both the I-V and the C-V characteristics of the device have been efficiently modeled for circuit simulations.
Abstract: Carbon nanotube field-effect transistors (CNFETs) are being extensively studied as possible successors to CMOS. Novel device structures have been fabricated and device simulators have been developed to estimate their performance in a sub-10-nm transistor era. This paper presents a novel method of circuit-compatible modeling of single-walled semiconducting CNFETs in their ultimate performance limit. For the first time, both the I-V and the C-V characteristics of the device have been efficiently modeled for circuit simulations. The model so developed has been used to simulate arithmetic and logic blocks using HSPICE.

193 citations


Journal ArticleDOI
TL;DR: In this article, the authors introduce process constructors, which cleanly separate the computation part of a process from the synchronization and communication part, and use the characteristic function for each process type to define semantic preserving and design decision transformations.
Abstract: The scope of the formal system design (ForSyDe) methodology is high-level modeling and refinement of systems-on-a-chip and embedded systems. Starting with a formal specification model that captures the functionality of the system at a high abstraction level, it provides formal design-transformation methods for a transparent refinement process of the system model into an implementation model that is optimized for synthesis. The main contribution of this paper is the ForSyDe modeling technique and the formal treatment of transformational design refinement. We introduce process constructors, which cleanly separate the computation part of a process from the synchronization and communication part. We develop the characteristic function for each process type and use it to define semantic preserving and design decision transformations. In a study of a digital equalizer example, we illustrate the modeling and refinement process and focus in particular on refinement of the clock domain, communication refinement, and resource sharing.
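
An illustrative Python analogue (ForSyDe itself is built on Haskell) of how a process constructor separates a pure computation function from the synchronous communication pattern; the constructor names below are hypothetical, not the ForSyDe API.

```python
# Illustrative analogue of ForSyDe-style process constructors for the
# synchronous model of computation. A constructor turns a pure function
# (the computation) into a process over signals, keeping the communication /
# synchronization pattern out of the user-supplied function.

def map_sy(f):
    """Combinational process constructor: apply f to every event of a signal."""
    return lambda signal: [f(x) for x in signal]

def mealy_sy(next_state, output, initial_state):
    """Sequential process constructor: a Mealy machine over a signal."""
    def process(signal):
        state, out = initial_state, []
        for x in signal:
            out.append(output(state, x))
            state = next_state(state, x)
        return out
    return process

scale = map_sy(lambda x: 2 * x)              # computation: doubling
accumulate = mealy_sy(lambda s, x: s + x,    # computation: running sum
                      lambda s, x: s + x, 0)

print(scale([1, 2, 3]))        # -> [2, 4, 6]
print(accumulate([1, 2, 3]))   # -> [1, 3, 6]
```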

174 citations


Journal ArticleDOI
TL;DR: The approach, which incorporates passivity constraints via convex optimization algorithms, is guaranteed to produce a passive-system model that is optimal in the sense of having minimum error in the frequency band of interest over all models with a prescribed set of system poles.
Abstract: In this paper, we present a methodology for generating guaranteed passive time-domain models of subsystems described by tabulated frequency-domain data obtained through measurement or through physical simulation. Such descriptions are commonly used to represent on- and off-chip interconnect effects, package parasitics, and passive devices common in high-frequency integrated circuit applications. The approach, which incorporates passivity constraints via convex optimization algorithms, is guaranteed to produce a passive-system model that is optimal in the sense of having minimum error in the frequency band of interest over all models with a prescribed set of system poles. We demonstrate that this algorithm is computationally practical for generating accurate high-order models of data sets representing realistic, complicated multiinput, multioutput systems.
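
A minimal numpy sketch of the "prescribed poles" idea: once the system poles are fixed, fitting the residues and direct term of a pole-residue model to tabulated frequency data is a linear least-squares problem. The convex passivity constraints that the paper adds are omitted here, and all values are illustrative.

```python
# Minimal sketch: with a prescribed set of stable poles, fitting a pole-residue
# transfer function H(s) = d + sum_k r_k / (s - p_k) to tabulated frequency
# data is linear in (d, r_k) and can be solved by least squares. The convex
# passivity constraints the paper adds on top are omitted in this sketch.

import numpy as np

poles = np.array([-1.0, -10.0, -100.0])              # prescribed real poles (illustrative)
freqs = np.logspace(-1, 3, 200)                      # tabulated frequency points (rad/s)
s = 1j * freqs

# "Measured" data: a system that really has these poles, plus a direct term.
true_res = np.array([2.0, -5.0, 30.0])
H_data = 0.5 + (true_res / (s[:, None] - poles)).sum(axis=1)

# Build the least-squares system: one column for d and one per residue.
A = np.hstack([np.ones((len(s), 1)), 1.0 / (s[:, None] - poles)])
A_ri = np.vstack([A.real, A.imag])                   # stack real/imag parts
b_ri = np.concatenate([H_data.real, H_data.imag])

coeffs, *_ = np.linalg.lstsq(A_ri, b_ri, rcond=None)
print("d =", round(coeffs[0], 3))                    # ~0.5
print("r =", np.round(coeffs[1:], 3))                # ~[2, -5, 30]
```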

161 citations


Journal ArticleDOI
TL;DR: This paper addresses the problem of mapping a system's communication requirements to a given communication architecture template, and describes an exploration methodology that uses efficient algorithms to help automate the process of mapping the system communications to the selected template.
Abstract: Rapid growth in the complexity of system-on-chips is being accompanied by increasing volume and diversity of on-chip communication traffic, which, in turn, is driving the development of advanced system-level communication architectures. While these architectures have the potential to improve system performance, they pose significant new challenges to the system designer, owing to the complex design space defined by the availability of numerous network topologies, communication protocols, and mapping alternatives for system communications. In this paper, we address the problem of mapping a system's communication requirements to a given communication architecture template. We illustrate the nature of the communication architecture design space, and describe an exploration methodology that uses efficient algorithms to help automate the process of mapping the system communications to the selected template. In addition, we demonstrate the importance of simultaneously optimizing the on-chip communication protocols in order to maximize system performance. Experiments conducted on example systems, including a cell forwarding unit of an ATM switch, indicate that the proposed techniques aid in automatically constructing communication architectures that have high performance. For the systems we considered, the solutions generated using our methodology had 53% superior performance (on average) over those based on conventional architectures and mapping approaches. The algorithms used in the proposed methodology are computationally efficient, and scale well with increasing communication architecture complexity.

134 citations


Journal ArticleDOI
TL;DR: It is shown that any test set that detects all single stuck-at faults in a reversible circuit also detects all multiple stuck- at faults, and a practical test-set generation algorithm is given, based on an integer linear programming formulation, that yields test sets approximately half the size of those produced by conventional automatic test pattern generation.
Abstract: Applications of reversible circuits can be found in the fields of low-power computation, cryptography, communications, digital signal processing, and the emerging field of quantum computation. Furthermore, prototype circuits for low-power applications are already being fabricated in CMOS. Regardless of the eventual technology adopted, testing is sure to be an important component in any robust implementation. We consider the test-set generation problem. Reversibility affects the testing problem in fundamental ways, making it significantly simpler than for the irreversible case. For example, we show that any test set that detects all single stuck-at faults in a reversible circuit also detects all multiple stuck-at faults. We present efficient test-set constructions for the standard stuck-at fault model, as well as the usually intractable cell-fault model. We also give a practical test-set generation algorithm, based on an integer linear programming formulation, that yields test sets approximately half the size of those produced by conventional automatic test pattern generation.
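
A hedged sketch of the test-generation view taken here: stuck-at test-set construction for a reversible circuit reduces to covering every fault with at least one detecting input vector. The paper formulates the minimal cover as an integer linear program; the toy example below uses exhaustive fault simulation of a hand-made three-wire CNOT circuit and a simple greedy cover instead.

```python
# Hedged sketch: stuck-at test generation for a small reversible circuit as a
# covering problem. The paper derives minimal test sets with ILP; a simple
# greedy cover over exhaustively simulated faults illustrates the idea.
from itertools import product

GATES = [("NOT", 0), ("CNOT", 0, 1), ("CNOT", 1, 2)]   # toy 3-wire circuit
WIRES = 3

def simulate(vec, fault=None):
    """fault = (level, wire, stuck_value); level L is the segment before gate L."""
    v = list(vec)
    for level in range(len(GATES) + 1):
        if fault and fault[0] == level:
            v[fault[1]] = fault[2]
        if level < len(GATES):
            g = GATES[level]
            if g[0] == "NOT":
                v[g[1]] ^= 1
            else:                      # CNOT: flip target if control is 1
                v[g[2]] ^= v[g[1]]
    return tuple(v)

inputs = list(product([0, 1], repeat=WIRES))
faults = [(l, w, s) for l in range(len(GATES) + 1)
                    for w in range(WIRES) for s in (0, 1)]
detects = {t: {f for f in faults if simulate(t, f) != simulate(t)} for t in inputs}

tests, uncovered = [], set(faults)
while uncovered:                        # greedy set cover over the fault list
    best = max(inputs, key=lambda t: len(detects[t] & uncovered))
    tests.append(best)
    uncovered -= detects[best]
print("greedy test set:", tests)
```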

Journal ArticleDOI
TL;DR: It is shown that after applying the transition-function-preserving transformations in a certain order, the resultant circuits feature a significantly reduced number of levels of XOR logic, minimized internal fanouts, and simplified circuit layout and routing, as compared to previous schemes.
Abstract: This paper presents a novel methodology of designing generators and compactors of test data. The essence of the proposed approach is to use a set of transformations, which alters the structure of the conventional linear feedback shift registers (LFSRs) while preserving the transition function of the original circuits. It is shown that after applying the transition-function-preserving transformations in a certain order, the resultant circuits feature a significantly reduced number of levels of XOR logic, minimized internal fanouts, and simplified circuit layout and routing, as compared to previous schemes based on external feedback LFSRs, internal feedback LFSRs, and cellular automata, all implementing the same characteristic polynomial. Consequently, the proposed devices can operate at higher speeds than those of conventional solutions and become highly modular structures.
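
A small sketch of the underlying degree of freedom: structurally different linear feedback shift registers can realize the same characteristic polynomial. Below, a 4-bit external-feedback (Fibonacci) register and an internal-feedback (Galois) register for the primitive polynomial x^4 + x + 1 both cycle through all 15 nonzero states; the paper's transition-function-preserving transformations themselves are not reproduced.

```python
# Two structurally different 4-bit LFSRs implementing the same primitive
# characteristic polynomial x^4 + x + 1. Both have maximal period 15, but the
# feedback XORs sit in different places -- the kind of structural freedom the
# paper's transformations exploit (its ring-generator forms are not shown here).

def period(step, seed=1):
    state, count = step(seed), 1
    while state != seed:
        state, count = step(state), count + 1
    return count

def fibonacci_step(state):                    # external-feedback (Fibonacci) form
    new_bit = (state ^ (state >> 1)) & 1      # s(n+4) = s(n) ^ s(n+1)
    return (state >> 1) | (new_bit << 3)

def galois_step(state):                       # internal-feedback (Galois) form
    lsb = state & 1
    state >>= 1
    return state ^ (0b1001 if lsb else 0)     # tap mask from x^4 + x + 1

print(period(fibonacci_step), period(galois_step))   # both registers print 15
```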

Journal ArticleDOI
TL;DR: It is demonstrated that the number of custom instruction candidates grows rapidly with program size, leading to a large design space, and that the quality (speedup) of custom instructions varies significantly across this space, motivating the need for the proposed flow.
Abstract: Efficiency and flexibility are critical, but often conflicting, design goals in embedded system design. The recent emergence of extensible processors promises a favorable tradeoff between efficiency and flexibility, while keeping design turnaround times short. Current extensible processor design flows automate several tedious tasks, but typically require designers to manually select the parts of the program that are to be implemented as custom instructions. In this work, we describe an automatic methodology to select custom instructions to augment an extensible processor, in order to maximize its efficiency for a given application program. We demonstrate that the number of custom instruction candidates grows rapidly with program size, leading to a large design space, and that the quality (speedup) of custom instructions varies significantly across this space, motivating the need for the proposed flow. Our methodology features cost functions to guide the custom instruction selection process, as well as static and dynamic pruning techniques to eliminate inferior parts of the design space from consideration. Furthermore, we employ a two-stage process, wherein a limited number of promising instruction candidates are first short-listed using efficient selection criteria, and then evaluated in more detail through cycle-accurate instruction set simulation and synthesis of the corresponding hardware, to identify the custom instruction combinations that result in the highest program speedup or maximize speedup under a given area constraint. We have evaluated the proposed techniques using a state-of-the-art extensible processor platform, in the context of a commercial design flow. Experiments with several benchmark programs indicate that custom processors synthesized using automatic custom instruction selection can result in large improvements in performance (up to 5.4×, an average of 3.4×), energy (up to 4.5×, an average of 3.2×), and energy-delay products (up to 24.2×, an average of 12.6×), while speeding up the design process significantly.
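
A hedged sketch of the selection step: rank custom-instruction candidates by a simple cost function (estimated cycles saved per unit of added area) and choose greedily under an area budget. The candidate figures are hypothetical, and the paper's actual cost functions, pruning, and simulation/synthesis loop are considerably richer.

```python
# Hedged sketch of custom-instruction selection under an area budget: rank
# candidates by estimated cycles saved per unit of added area and pick greedily.
# All candidate numbers are hypothetical.

candidates = [
    # (name, cycles saved per invocation, invocations, area cost)
    ("mac3",      6,  120_000, 1.8),
    ("sad4",     14,   40_000, 3.5),
    ("bitrev",    3,  200_000, 0.9),
    ("crc_step",  9,   15_000, 2.2),
]

def select(candidates, area_budget):
    ranked = sorted(candidates,
                    key=lambda c: (c[1] * c[2]) / c[3],  # saved cycles per area
                    reverse=True)
    chosen, used = [], 0.0
    for name, saved, calls, area in ranked:
        if used + area <= area_budget:
            chosen.append(name)
            used += area
    return chosen, used

chosen, used = select(candidates, area_budget=5.0)
print(chosen, f"area used: {used:.1f}")
```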

Journal ArticleDOI
TL;DR: A method for identifying the X inputs of test vectors in a given test set by using fault simulation and procedures similar to implication and justification of automatic test pattern generation (ATPG) algorithms is proposed.
Abstract: Given a test set for stuck-at faults of a combinational circuit or a full-scan sequential circuit, some of the primary input values may be changed to the opposite logic values without losing fault coverage. We can regard such input values as don't care (X). In this paper, we propose a method for identifying the X inputs of test vectors in a given test set. While there are many combinations of X inputs in the test set generally, the proposed method finds one including as many X inputs as possible, by using fault simulation and procedures similar to implication and justification of automatic test pattern generation (ATPG) algorithms. Experimental results for ISCAS benchmark circuits show that approximately 69% of the inputs of uncompacted test sets could be X on the average. Even for highly compacted test sets, the method found that approximately 48% of inputs are X.

Journal ArticleDOI
TL;DR: It is demonstrated that the droplet-based MEFS provides higher performance, as well as lower design and integration complexity, than the continuous-flow systems.
Abstract: Composite microsystems that incorporate microelectromechanical and microelectrofluidic devices are emerging as the next generation of system-on-a-chip (SOC). We present a performance comparison between two types of microelectrofluidic systems (MEFS): continuous-flow systems and droplet-based systems. The comparison is based on a specific microelectrofluidic application: a polymerase chain reaction (PCR) system. The behavioral modeling, simulation, and performance evaluation are based on a SystemC design environment. The performance comparison includes the system throughput, system-correction capacity, system-processing capacity, and system-design complexity. Using our system-performance evaluation environment, we demonstrate that the droplet-based MEFS provides higher performance, as well as lower design and integration complexity, than the continuous-flow system.

Journal ArticleDOI
Leendert M. Huisman1
TL;DR: A new form of logic diagnosis is described that is suitable for diagnosing fails in combinational logic and can diagnose failures caused by bridges and opens as well as fails caused by regular stuck-at faults.
Abstract: A new form of logic diagnosis is described that is suitable for diagnosing fails in combinational logic. It can diagnose defects that can affect arbitrarily many elements in the integrated circuit. It operates by first identifying patterns during which only one element is affected by the defect, and then diagnosing the fails observed during the application of such patterns, one pattern at a time. Single stuck-at faults are used for this purpose, and the aggregate of stuck-at fault locations thus identified is then further analyzed to obtain the most accurate estimate of the identities of those elements that can be affected by the defect. This approach to logic diagnosis is as effective as that of classical stuck-at fault-based diagnosis, when the latter applies, but is far more general. In particular, it can diagnose fails caused by bridges and opens as well as fails caused by regular stuck-at faults.

Journal ArticleDOI
TL;DR: This work addresses the problem of finding porosity-aware buffering solutions by constructing a "smart Steiner tree" to pass to van Ginneken's topology-based algorithm, and shows that significant improvements on timing closure are obtained when this approach is integrated into a physical synthesis system.
Abstract: In order to achieve timing closure on increasingly complex IC designs, buffer insertion needs to be performed on thousands of nets within an integrated physical synthesis system. Modern designs may contain large blocks which severely constrain the buffer locations. Even when there may appear to be space for buffers in the alleys between large blocks, these regions are often densely packed or may be needed later to fix critical paths. Therefore, within physical synthesis, a buffer insertion scheme needs to be aware of the porosity of the existing layout to be able to decide when to insert buffers in dense regions to achieve critical performance improvement and when to utilize the sparser regions of the chip. This work addresses the problem of finding porosity-aware buffering solutions by constructing a "smart Steiner tree" to pass to van Ginneken's topology-based algorithm. This flow allows one to fully integrate the algorithm into a physical synthesis system without paying an exorbitant runtime penalty. We show that significant improvements on timing closure are obtained when this approach is integrated into a physical synthesis system.
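
A compact sketch of the van Ginneken-style dynamic program the flow builds on, restricted to a single driver-to-sink path: candidates are (load capacitance, required time) pairs propagated from the sink toward the driver, with dominated candidates pruned and an optional buffer considered at each legal site. Elmore delay is used and all electrical values are illustrative; the porosity-aware Steiner construction itself is not modeled.

```python
# Compact sketch of van Ginneken-style buffer insertion along a single
# driver-to-sink path (Elmore delay, pruning of dominated candidates). The
# porosity-aware Steiner construction from the paper is not modeled; all
# electrical numbers are illustrative.

RB, CB, DB = 10.0, 1.0, 5.0            # buffer: output resistance, input cap, intrinsic delay
RD = 20.0                              # driver output resistance

def add_wire(cands, rw, cw):
    """Propagate (load_cap, required_time) candidates across one wire segment."""
    return [(c + cw, q - rw * (cw / 2.0 + c)) for c, q in cands]

def add_buffer_option(cands):
    """At a legal buffer site, keep both the buffered and unbuffered choices."""
    best_q = max(q - RB * c - DB for c, q in cands)
    return prune(cands + [(CB, best_q)])

def prune(cands):
    """Drop candidates dominated in both capacitance and required time."""
    kept, best_q = [], float("-inf")
    for c, q in sorted(cands):         # increasing load capacitance
        if q > best_q:
            kept.append((c, q))
            best_q = q
    return kept

# Sink: load 4.0, required arrival time 300.0; three wire segments with a
# possible buffer site after each of the first two (walking sink -> driver).
cands = [(4.0, 300.0)]
for rw, cw, site in [(10.0, 2.0, True), (10.0, 2.0, True), (10.0, 2.0, False)]:
    cands = add_wire(cands, rw, cw)
    if site:
        cands = add_buffer_option(cands)
best = max(q - RD * c for c, q in cands)
print(f"best required time at driver: {best:.1f}")
# Here the buffered candidates win; the unbuffered solution misses timing.
```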

Journal ArticleDOI
TL;DR: This paper provides theoretical analysis to demonstrate that the new path-selection problem consists of two computationally intractable subproblems, and discusses practical heuristics and their performance with respect to each subproblem.
Abstract: Critical path selection is an indispensable step for testing of small-size delay defects. Historically, this step relies on the construction of a set of worst-case paths, where the timing lengths of the paths are calculated based upon discrete-valued timing models. The assumption of discrete-valued timing models may become invalid for modeling delay effects in the deep submicron domain, where the effects of timing defects and process variations are often statistical in nature. This paper studies the problem of critical path selection for testing small-size delay defects, assuming that circuit delays are statistical. We provide theoretical analysis to demonstrate that the new path-selection problem consists of two computationally intractable subproblems. Then, we discuss practical heuristics and their performance with respect to each subproblem. Using a statistical defect injection and timing-simulation framework, we present experimental results to support our theoretical analysis.

Journal ArticleDOI
TL;DR: This paper proposes a new efficient O(n log n) connectivity-based bottom-up clustering algorithm called edge separability-based clustering (ESC), which exploits more global connectivity information using edge separability to guide the clustering process, while carefully monitoring cluster area balance.
Abstract: In this paper, we propose a new efficient O(n log n) connectivity-based bottom-up clustering algorithm called edge separability-based clustering (ESC). Unlike existing bottom-up algorithms that are based on local connectivity information of the netlist, ESC exploits more global connectivity information using edge separability to guide the clustering process, while carefully monitoring cluster area balance. Exact computation of the edge separability λ(e) for a given edge e=(x,y) in an edge-weighted undirected graph G is equivalent to finding the maximum flow between x and y. Since the currently best known time bound for solving the maximum-flow problem is O(mn log(n²/m)), due to Goldberg and Tarjan (1988), the computation of λ(e) for all edges in G requires O(m²n log(n²/m)) time. However, we show that a simple and efficient algorithm, CAPFOREST (Nagamochi and Ibaraki, 1992), can be used to provide a good approximation of edge separability (within a 9.1% empirical error bound) for all edges in G without using any network flow computation in O(n log n) time. Our experimental results based on large-scale benchmark circuits demonstrate the effectiveness of using edge separability in the context of a multilevel partitioning framework for cutsize minimization. We observe that exploiting edge separability yields better-quality partitioning solutions compared to existing clustering algorithms (Sun and Sechen, 1993), (Cong and Smith, 1993), (Huang and Kahng, 1995), (Ng et al., 1987), (Wei and Cheng, 1991), (Shin and Kim, 1993), (Schuler and Ulrich, 1972), (Karypis et al., 1997) that rely on local connectivity information. In addition, our ESC-based iterative-improvement multilevel partitioning algorithm LR/ESC-PM provides comparable results to the state-of-the-art hMetis package (Karypis et al., 1997), (Karypis and Kumar, 1999).
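
For concreteness, a small networkx sketch of the exact definition: the edge separability λ(x, y) is the maximum flow (minimum cut) between the endpoints of the edge, which ESC approximates with CAPFOREST to avoid flow computations. The graph below is illustrative.

```python
# Sketch: exact edge separability lambda(e) of an edge (x, y) is the maximum
# flow (minimum cut) between x and y in the edge-weighted graph. ESC avoids
# this cost by approximating lambda(e) with CAPFOREST; the exact computation
# below (networkx max-flow) is shown only to make the definition concrete.

import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([                # illustrative netlist-like graph
    ("a", "b", 2), ("b", "c", 2), ("a", "c", 1),
    ("c", "d", 1), ("d", "e", 3), ("e", "f", 3), ("d", "f", 2),
], weight="capacity")

D = G.to_directed()                        # model each edge as two equal-capacity arcs
for x, y in G.edges():
    lam = nx.maximum_flow_value(D, x, y, capacity="capacity")
    print(f"lambda({x},{y}) = {lam}")
# Edges inside the tightly connected clusters (a-b-c and d-e-f) get high
# separability; the bridge-like edge (c, d) gets the lowest, so a
# separability-guided clustering would avoid merging across it.
```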

Journal ArticleDOI
TL;DR: New techniques are described for model checking in the counterexample-guided abstraction-refinement framework, using a combination of 0-1 integer linear programming and machine learning techniques to refine the abstraction based on the counterexample.
Abstract: We describe new techniques for model checking in the counterexample-guided abstraction-refinement framework. The abstraction phase "hides" the logic of various variables, hence considering them as inputs. This type of abstraction may lead to "spurious" counterexamples, i.e., traces that cannot be simulated on the original (concrete) machine. We check whether a counterexample is real or spurious with a satisfiability (SAT) checker. We then use a combination of 0-1 integer linear programming and machine learning techniques for refining the abstraction based on the counterexample. The process is repeated until either a real counterexample is found or the property is verified. We have implemented these techniques on top of the model checker NuSMV and the SAT solver Chaff. Experimental results prove the viability of these new techniques.

Journal ArticleDOI
TL;DR: This paper proposes the first generic fingerprinting technique that can be applied to an arbitrary synthesis (optimization or decision) or compilation problem and, therefore, to hardware and software IPs and generates a uniquely fingerprinted new solution.
Abstract: Fingerprinting is an approach that assigns a unique and invisible ID to each sold instance of the intellectual property (IP). One of the key advantages fingerprinting-based intellectual property protection (IPP) has over watermarking-based IPP is that it enables tracing of stolen hardware or software. Fingerprinting schemes have been widely and effectively used to achieve this goal; however, their application domain has been restricted only to static artifacts, such as image and audio, where distinct copies can be obtained easily. In this paper, we propose the first generic fingerprinting technique that can be applied to an arbitrary synthesis (optimization or decision) or compilation problem and, therefore, to hardware and software IPs. The key problem with design IP fingerprinting is that there is a need to generate a large number of structurally unique but functionally and timing identical designs. To reduce the cost of generating such distinct copies, we apply iterative optimization in an incremental fashion to solve a fingerprinted instance. Therefore, we leverage on the optimization effort already spent in obtaining previous solutions, yet we generate a uniquely fingerprinted new solution. This generic approach is the basis for developing specific fingerprinting techniques for four important problems in VLSI CAD: partitioning, graph coloring, satisfiability, and standard-cell placement. We demonstrate the effectiveness of the new fingerprinting-based IPP techniques on a number of standard benchmarks.

Journal ArticleDOI
TL;DR: A new algorithm based on efficient nonlinear programming techniques is presented to solve the area minimization of the power network for very large-scale integration designs; it achieves the objective of minimizing the area of the power network in a short runtime.
Abstract: This paper deals with area minimization of the power network for very large-scale integration designs. A new algorithm based on efficient nonlinear programming techniques is presented to solve this problem. During the optimization, a penalty method, the conjugate gradient method, circuit sensitivity analysis, and the merging of adjoint networks are applied, which enables the algorithm to optimize large circuits. The experimental results show that the algorithm is robust and achieves the objective of minimizing the area of the power network in a short runtime.
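
A toy scipy sketch of the penalty-plus-conjugate-gradient idea: minimize the metal area of a single power rail feeding several tap currents, with the IR-drop limit enforced through a quadratic penalty term. The paper's circuit sensitivity analysis and adjoint-network merging are not modeled, and all numbers are illustrative.

```python
# Toy sketch of power-rail area minimization with a penalty method and
# conjugate-gradient optimization. A single rail feeds N equal tap currents;
# segment widths are the variables; the IR-drop limit is a quadratic penalty.
# All numbers are illustrative.

import numpy as np
from scipy.optimize import minimize

N, L, RHO, I_TAP, VMAX, MU = 8, 100.0, 0.02, 1e-3, 0.05, 1e7

seg_current = I_TAP * np.arange(N, 0, -1)       # current carried by each segment

def widths(x):
    return np.exp(x)                            # keep widths strictly positive

def ir_drop(w):
    return np.sum(RHO * L / w * seg_current)    # voltage drop at the farthest tap

def objective(x):
    w = widths(x)
    area = np.sum(L * w)
    violation = max(0.0, ir_drop(w) - VMAX)
    return area + MU * violation ** 2           # penalty formulation

res = minimize(objective, x0=np.zeros(N), method="CG")
w_opt = widths(res.x)
print("widths:", np.round(w_opt, 2))
print(f"area = {np.sum(L * w_opt):.1f}, IR drop = {ir_drop(w_opt)*1e3:.1f} mV")
```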

Journal ArticleDOI
TL;DR: A reconfigurable interconnection network (RIN) is placed between the outputs of a pseudorandom pattern generator and the scan inputs of the circuit under test (CUT) to reduce correlation between the test data bits that are fed into the scan chains.
Abstract: We present a new approach for deterministic built-in self-test (BIST) in which a reconfigurable interconnection network (RIN) is placed between the outputs of a pseudorandom pattern generator and the scan inputs of the circuit under test (CUT). The RIN, which consists only of multiplexer switches, replaces the phase shifter that is typically used in pseudorandom BIST to reduce correlation between the test data bits that are fed into the scan chains. The connections between the linear-feedback shift-register (LFSR) and the scan chains can be dynamically changed (reconfigured) during a test session. In this way, the RIN is used to match the LFSR outputs to the test cubes in a deterministic test set. The control data bits used for reconfiguration ensure that all the deterministic test cubes are embedded in the test patterns applied to the CUT. The proposed approach requires very little hardware overhead, only a modest amount of CPU time, and fewer control bits compared to the storage required for reseeding techniques or for hybrid BIST. Moreover, as a nonintrusive BIST solution, it does not require any circuit redesign and has minimal impact on circuit performance.
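
A minimal sketch of the matching question at the core of this scheme: since the multiplexer network can route any LFSR stage to a scan chain, a deterministic test cube for that chain is coverable if at least one stage's bit stream agrees with all of the cube's care bits. The LFSR taps, seed, and cubes below are illustrative.

```python
# Sketch of the matching step behind an RIN-style BIST: a test cube for a scan
# chain is coverable if some LFSR stage's bit stream agrees with every care bit
# of the cube. LFSR polynomial, seed, and test cubes are illustrative only.

def lfsr_streams(taps, seed, length, cycles):
    """Per-stage output streams of a Fibonacci LFSR over `cycles` shifts."""
    state = list(seed)
    streams = [[] for _ in range(length)]
    for _ in range(cycles):
        for i, bit in enumerate(state):
            streams[i].append(bit)
        fb = 0
        for t in taps:
            fb ^= state[t]
        state = [fb] + state[:-1]
    return streams

def compatible_stage(streams, cube):
    """Return an LFSR stage whose stream matches every care bit of the cube."""
    for stage, stream in enumerate(streams):
        if all(c == 'x' or int(c) == b for c, b in zip(cube, stream)):
            return stage
    return None

streams = lfsr_streams(taps=[0, 3], seed=[1, 0, 0, 1], length=4, cycles=6)
for cube in ["1xx0x1", "x0x1xx", "111111"]:   # scan-chain test cubes (x = don't care)
    print(cube, "->", compatible_stage(streams, cube))
# The first two cubes are coverable by some stage; the last is not with this seed.
```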

Journal ArticleDOI
TL;DR: The results from a pipelined processor example show that an operation-centric framework offers a significant reduction in design time, while achieving comparable implementation quality as traditional register-transfer-level design flows.
Abstract: The operation-centric hardware abstraction is useful for describing systems whose behavior exhibits a high degree of concurrency. In the operation-centric style, the behavior of a system is described as a collection of operations on a set of state elements. Each operation is specified as a predicate and a set of simultaneous state-element updates, which may only take effect in case the predicate is true on the current state values. The effect of an operation's state updates is atomic, that is, the legal behaviors of the system constitute some sequential interleaving of the operations. This atomic and sequential execution semantics permits each operation to be formulated as if the rest of the system were frozen and thus simplifies the description of concurrent systems. This paper presents an approach to synthesize an efficient synchronous digital implementation from an operation-centric hardware-design description. The resulting implementation carries out multiple operations per clock cycle and yet maintains the semantics that is consistent with the atomic and sequential execution of operations. The paper defines, and then gives algorithms to identify, conflict-free and sequentially composable operations that can be performed in the same clock cycle. The paper further gives an algorithm to generate the hardwired arbitration logic to coordinate the concurrent execution of conflict-free and sequentially composable operations. Lastly, the paper evaluates synthesis results based on the TRAC compiler for the TRSPEC operation-centric hardware-description language. The results from a pipelined processor example show that an operation-centric framework offers a significant reduction in design time, while achieving comparable implementation quality as traditional register-transfer-level design flows.

Journal ArticleDOI
TL;DR: A compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that includes an optimization suite that uses loop and data transformations, an on-chip memory partitioning step, and a code-rewriting phase that collectively transform an input code automatically to take advantage of the on-chip SPM.
Abstract: Optimizations aimed at improving the efficiency of on-chip memories in embedded systems are extremely important. Using a suitable combination of program transformations and memory design space exploration aimed at enhancing data locality enables significant reductions in effective memory access latencies. While numerous compiler optimizations have been proposed to improve cache performance, there are relatively few techniques that focus on software-managed on-chip memories. It is well-known that software-managed memories are important in real-time embedded environments with hard deadlines as they allow one to accurately predict the amount of time a given code segment will take. In this paper, we propose and evaluate a compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework. Our framework includes an optimization suite that uses loop and data transformations, an on-chip memory partitioning step, and a code-rewriting phase that collectively transform an input code automatically to take advantage of the on-chip SPM. Compared with previous work, the proposed scheme is dynamic, and allows the contents of the SPM to change during the course of execution, depending on the changes in the data access pattern. Experimental results from our implementation using a source-to-source translator and a generic cost model indicate significant reductions in data transfer activity between the SPM and off-chip memory.

Journal ArticleDOI
TL;DR: A novel exploration technique for analog placement operating on a subset of tree representations of the layout, called symmetric-feasible, where the typical presence of an arbitrary number of symmetry groups of devices is directly taken into account during the search of the solution space.
Abstract: The traditional way of approaching placement problems in computer-aided design (CAD) tools for analog layout is to explore an extremely large search space of feasible or unfeasible placement configurations, where the cells are moved in the chip plane (being even allowed to overlap in possibly illegal ways) by a stochastic optimizer. This paper presents a novel exploration technique for analog placement operating on a subset of tree representations of the layout, called symmetric-feasible, where the typical presence of an arbitrary number of symmetry groups of devices is directly taken into account during the search of the solution space. The computation times exhibited by this novel approach are significantly better than those of the algorithms using the traditional exploration strategy. This superior efficiency is partly due to the use of segment trees, a data structure introduced by Bentley, mainly used in computational geometry.

Journal ArticleDOI
S. Odanaka1
TL;DR: The discretization method provides numerical stability and accuracy for carrier transport simulations with quantum confinement effects in ultrasmall MOSFET structures.
Abstract: This paper describes a new approach to construct a multidimensional discretization scheme of quantum drift-diffusion (QDD) model (or density gradient model) arising in MOSFET structures. The discretization is performed for the stationary QDD equations replaced by an equivalent form, employing an exponential transformation of variables. A multidimensional discretization scheme is constructed by making use of an exponential-fitting method in a class of conservative difference schemes, applying the finite-volume method, which leads to a consistent generalization of the Scharfetter-Gummel expression to the nonlinear Sturm-Liouville type equation. The discretization method is evaluated in a variety of MOSFET structures, including a double-gate MOSFET with thin body layer. The discretization method provides numerical stability and accuracy for carrier transport simulations with quantum confinement effects in ultrasmall MOSFET structures.
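
For reference, a hedged sketch of the classical Scharfetter-Gummel flux that the paper's scheme generalizes to the quantum drift-diffusion setting, written with the Bernoulli function B(x) = x/(e^x - 1) and a series fallback near zero; parameter values are illustrative.

```python
# Sketch of the classical Scharfetter-Gummel discretization that the paper's
# scheme generalizes to the quantum drift-diffusion (density-gradient) model.
# Electron flux between nodes i and i+1 of a 1-D mesh (spacing h, thermal
# voltage vt, mobility mu), written with the Bernoulli function B(x).

import math

def bernoulli(x):
    """B(x) = x / (exp(x) - 1), with a series fallback near x = 0."""
    if abs(x) < 1e-6:
        return 1.0 - x / 2.0 + x * x / 12.0
    return x / math.expm1(x)

def sg_electron_flux(n_i, n_ip1, psi_i, psi_ip1, h, mu=1.0, vt=0.0259):
    """Scharfetter-Gummel flux J_{i+1/2} for electron density n and potential psi."""
    dpsi = (psi_ip1 - psi_i) / vt
    return (mu * vt / h) * (bernoulli(dpsi) * n_ip1 - bernoulli(-dpsi) * n_i)

# With zero field the expression reduces to a pure diffusion (finite-difference) flux:
print(sg_electron_flux(1e16, 2e16, 0.0, 0.0, h=1e-7))
# A 0.3 V potential step between the nodes makes the drift term dominate the flux:
print(sg_electron_flux(1e16, 2e16, 0.0, -0.3, h=1e-7))
```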

Journal ArticleDOI
TL;DR: This paper proposes a regular distributed register (RDR) microarchitecture, which offers high regularity and direct support of multicycle on-chip communication and demonstrates promising results on a number of real-life examples.
Abstract: For multigigahertz designs in nanometer technologies, data transfers on global interconnects take multiple clock cycles. In this paper, we propose a regular distributed register (RDR) microarchitecture, which offers high regularity and direct support of multicycle on-chip communication. The RDR microarchitecture divides the entire chip into an array of islands so that all local computation and communication within an island can be performed in a single clock cycle. Each island contains a cluster of computational elements, local registers, and a local controller. On top of the RDR microarchitecture, novel layout-driven architectural synthesis algorithms have been developed for multicycle communication, including scheduling-driven placement, placement-driven simultaneous scheduling with rebinding, and distributed control generation, etc. The experimentation on a number of real-life examples demonstrates promising results. For data flow intensive examples, we obtain a 44% improvement on average in terms of the clock period and a 37% improvement on average in terms of the final latency, over the traditional flow. For designs with control flow, our approach achieves a 28% clock-period reduction and a 23% latency reduction on average.

Journal ArticleDOI
TL;DR: A simultaneous buffer insertion/sizing and wire-sizing algorithm which guarantees zero skew and minimizes delay and power in polynomial time and can be used to achieve useful clock skew to facilitate timing convergence and to incrementally adjust the clock tree for design convergence.
Abstract: Clock distribution is crucial for timing and design convergence in high-performance very large scale integration designs. Minimum-delay/power zero skew buffer insertion/sizing and wire-sizing problems have long been considered intractable. In this paper, we present ClockTune, a simultaneous buffer insertion/sizing and wire-sizing algorithm which guarantees zero skew and minimizes delay and power in polynomial time. Extensive experimental results show that our algorithm executes very efficiently. For example, ClockTune achieves 45× delay improvement for buffering and sizing an industrial clock tree with 3101 sink nodes on a 1.2-GHz Pentium IV PC in 16 min, compared with the initial routing. Our algorithm can also be used to achieve useful clock skew to facilitate timing convergence and to incrementally adjust the clock tree for design convergence and explore delay-power tradeoffs during design cycles. ClockTune is available on the web (http://vlsi.ece.wisc.edu/Tools.htm).

Journal ArticleDOI
TL;DR: This work presents a technique to derive fully testable circuits under the stuck-at fault model (SAFM) and the path-delay fault model (PDFM); starting from a function description as a binary decision diagram, the netlist is generated by a linear-time mapping algorithm.
Abstract: We present a technique to derive fully testable circuits under the stuck-at fault model (SAFM) and the path-delay fault model (PDFM). Starting from a function description as a binary decision diagram, the netlist is generated by a linear-time mapping algorithm. Only one additional input and one inverter are needed to achieve 100% testable circuits under SAFM and PDFM. Experiments are given to show the advantages of the technique.
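
A small sketch of the mapping idea: each BDD node becomes a 2:1 multiplexer selected by the node's variable, with the low and high cofactors as data inputs, so the netlist size is linear in the BDD. The extra test input and inverter that the paper adds for full SAFM/PDFM testability are not modeled; the BDD below is hand-built for f = (a AND b) OR c.

```python
# Sketch of the BDD-to-netlist mapping idea: each BDD node becomes a 2:1 mux
# whose select is the node's variable and whose data inputs are the cofactor
# sub-circuits. The additional test input and inverter from the paper are not
# modeled. The BDD below is built by hand for f = (a AND b) OR c.

# node: (variable, low_child, high_child); leaves are the constants 0 and 1
BDD = {
    "n_c": ("c", 0, 1),
    "n_b": ("b", "n_c", 1),
    "n_a": ("a", "n_c", "n_b"),
}
ROOT = "n_a"

def emit_netlist(bdd, root):
    """Emit one mux per BDD node (linear in the number of nodes)."""
    netlist = [f"MUX {name}: out = {high} if {var} else {low}"
               for name, (var, low, high) in bdd.items()]
    netlist.append(f"OUTPUT f = {root}")
    return netlist

def evaluate(bdd, node, assignment):
    """Reference evaluation of the BDD, mirroring what the mux network computes."""
    while node not in (0, 1):
        var, low, high = bdd[node]
        node = high if assignment[var] else low
    return node

print("\n".join(emit_netlist(BDD, ROOT)))
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert evaluate(BDD, ROOT, dict(a=a, b=b, c=c)) == ((a and b) or c)
```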