
Showing papers in "IEEE Transactions on Very Large Scale Integration (VLSI) Systems" in 1994


Journal ArticleDOI
TL;DR: A power analysis technique, applied to two commercial microprocessors, is developed that can be employed to evaluate the power cost of embedded software and to help verify whether a design meets its specified power constraints.
Abstract: Embedded computer systems are characterized by the presence of a dedicated processor and the software that runs on it. Power constraints are increasingly becoming the critical component of the design specification of these systems. At present, however, power analysis tools can only be applied at the lower levels of the design-the circuit or gate level. It is either impractical or impossible to use the lower level tools to estimate the power cost of the software component of the system. This paper describes the first systematic attempt to model this power cost. A power analysis technique is developed that has been applied to two commercial microprocessors-the Intel 486DX2 and the Fujitsu SPARClite 934. This technique can be employed to evaluate the power cost of embedded software. This can help in verifying whether a design meets its specified power constraints. Further, it can also be used to search the design space in software power optimization. Examples with power reductions of up to 40%, obtained by rewriting code using the information provided by the instruction-level power model, illustrate the potential of this idea.
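To make the instruction-level idea concrete, the sketch below sums an assumed base current per opcode plus an assumed inter-instruction overhead over a trace; the opcode names, currents, and cycle counts are invented for illustration and are not the measured 486DX2 or SPARClite data.

    # Hedged sketch of an instruction-level energy estimate; all numbers are assumed.
    VDD_V = 3.3
    CYCLE_NS = 25.0  # assumed clock period

    BASE_CURRENT_mA = {"mov": 300.0, "add": 310.0, "mul": 330.0, "load": 410.0}
    PAIR_OVERHEAD_mA = {("mov", "add"): 15.0, ("add", "mul"): 25.0}  # circuit-state overhead
    CYCLES = {"mov": 1, "add": 1, "mul": 2, "load": 2}

    def program_energy_pJ(trace):
        """E = sum over the trace of Vdd * (base + overhead) * cycles * cycle time."""
        total, prev = 0.0, None
        for op in trace:
            current = BASE_CURRENT_mA[op] + PAIR_OVERHEAD_mA.get((prev, op), 0.0)
            total += VDD_V * current * CYCLES[op] * CYCLE_NS  # V * mA * ns = pJ
            prev = op
        return total

    print(program_energy_pJ(["load", "mov", "add", "mul"]))

Rewriting a program then amounts to searching for a trace with the same behavior but a lower sum, which is how power reductions such as the reported 40% can be obtained.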

1,055 citations


Journal ArticleDOI
TL;DR: A review of the power estimation techniques that have recently been proposed for very large scale integrated (VLSI) circuits is presented.
Abstract: With the advent of portable and high-density microelectronic devices, the power dissipation of very large scale integrated (VLSI) circuits is becoming a critical concern. Accurate and efficient power estimation during the design phase is required in order to meet the power specifications without a costly redesign process. In this paper, we present a review of the power estimation techniques that have recently been proposed.
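Most of the techniques reviewed ultimately evaluate the standard dynamic power expression from per-node switching activities; the sketch below applies it with assumed node capacitances and activities.

    # Dynamic power from switching activity (assumed capacitances and activities).
    VDD = 3.3      # volts
    FREQ = 50e6    # clock frequency, hertz

    nodes = [      # (node capacitance in farads, expected transitions per cycle)
        (20e-15, 0.10),
        (35e-15, 0.25),
        (50e-15, 0.04),
    ]

    # P = sum over nodes of 0.5 * alpha * C * Vdd^2 * f
    power_W = sum(0.5 * alpha * c * VDD**2 * FREQ for c, alpha in nodes)
    print(f"estimated dynamic power: {power_W * 1e6:.2f} uW")

The difficult part, and the subject of the review, is estimating the activities alpha accurately without exhaustive simulation.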

696 citations


Journal ArticleDOI
TL;DR: The dissipation of the adiabatic amplifier is compared to that of conventional switching circuits, both for the case of a fixed voltage swing and the case when the voltage swing can be scaled to reduce power dissipation.
Abstract: Adiabatic switching is an approach to low-power digital circuits that differs fundamentally from other practical low-power techniques. When adiabatic switching is used, the signal energies stored on circuit capacitances may be recycled instead of dissipated as heat. We describe the fundamental adiabatic amplifier circuit and analyze its performance. The dissipation of the adiabatic amplifier is compared to that of conventional switching circuits, both for the case of a fixed voltage swing and the case when the voltage swing can be scaled to reduce power dissipation. We show how combinational and sequential adiabatic-switching logic circuits may be constructed and describe the timing restrictions required for adiabatic operation. Small chip-building experiments have been performed to validate the techniques and to analyze the associated circuit overhead.
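The quantitative point is that a conventional switch dissipates a fixed (1/2)CV^2 per transition, while an adiabatic, ramped charge transfer dissipates roughly (RC/T)CV^2, which shrinks as the ramp time T grows. A numerical sketch with assumed component values:

    # Energy per charging event (R, C, V values are assumed for illustration).
    R = 1e3        # ohms, effective switch resistance
    C = 100e-15    # farads, load capacitance
    V = 3.3        # volts, final voltage swing

    E_conventional = 0.5 * C * V**2          # dissipated regardless of switching speed

    def E_adiabatic(T):
        """Dissipation for a ramped charge transfer of duration T (valid for T >> RC)."""
        return (R * C / T) * C * V**2

    for T in (1e-9, 10e-9, 100e-9):
        print(f"T = {T*1e9:5.0f} ns  conventional = {E_conventional*1e15:.1f} fJ  "
              f"adiabatic = {E_adiabatic(T)*1e15:.2f} fJ")

Slower ramps trade speed for energy, which is the fundamental tradeoff behind the timing restrictions the abstract mentions.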

609 citations


Journal ArticleDOI
TL;DR: This work presents a powerful sequential logic optimization method based on selectively precomputing the output logic values of the circuit one clock cycle before they are required, and using the precomputed values to reduce internal switching activity in the succeeding clock cycle.
Abstract: We address the problem of optimizing logic-level sequential circuits for low power. We present a powerful sequential logic optimization method that is based on selectively precomputing the output logic values of the circuit one clock cycle before they are required, and using the precomputed values to reduce internal switching activity in the succeeding clock cycle. We present two different precomputation architectures which exploit this observation. The primary optimization step is the synthesis of the precomputation logic, which computes the output values for a subset of input conditions. If the output values can be precomputed, the original logic circuit can be "turned off" in the next clock cycle and will have substantially reduced switching activity. The size of the precomputation logic determines the power dissipation reduction, area increase, and delay increase relative to the original circuit. Given a logic-level sequential circuit, we present an automatic method of synthesizing precomputation logic so as to achieve maximal reductions in power dissipation. We present experimental results on various sequential circuits. Up to 75% reductions in average switching activity and power dissipation are possible with marginal increases in circuit area and delay.
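A commonly cited illustration of the precomputation idea (used here only as an example; the paper's benchmark circuits are not reproduced) is an n-bit comparator whose result is fixed by the most significant bits whenever they differ, allowing the rest of the circuit to be disabled for that cycle.

    # Sketch: precomputation for an n-bit comparator C = (A > B).
    # If the MSBs differ, the result is known one cycle early and the registers
    # feeding the full comparator need not be clocked in the next cycle.
    def precompute_msb(a_msb: int, b_msb: int):
        """Return (known, value): the output is decided by the MSBs alone when they differ."""
        if a_msb != b_msb:
            return True, a_msb > b_msb
        return False, None

    def compare_with_gating(a: int, b: int, nbits: int = 8):
        a_msb, b_msb = (a >> (nbits - 1)) & 1, (b >> (nbits - 1)) & 1
        known, value = precompute_msb(a_msb, b_msb)
        if known:
            return value, "full comparator disabled this cycle"
        return a > b, "full comparator active"

    print(compare_with_gating(0b10010011, 0b01110000))  # decided by the MSBs alone
    print(compare_with_gating(0b00010011, 0b00110000))  # needs the full comparator

For uniformly random inputs the MSBs differ half the time, so the large comparator is idle in roughly 50% of cycles; that idleness is where the switching-activity savings come from.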

326 citations


Journal ArticleDOI
TL;DR: The combination of supply-voltage scaling and self-timed circuitry, which has some unique advantages, is described, together with a thorough analysis of the power savings that are possible using this technique.
Abstract: Recent research has demonstrated that for certain types of applications like sampled audio systems, self-timed circuits can achieve very low power consumption, because unused circuit parts automatically turn into a stand-by mode. Additional savings may be obtained by combining the self-timed circuits with a mechanism that adaptively adjusts the supply voltage to the lowest value possible while maintaining the performance requirements. This paper describes such a mechanism, analyzes the possible power savings, and presents a demonstrator chip that has been fabricated and tested. The idea of voltage scaling has been used previously in synchronous circuits, and the contributions of the present paper are: 1) the combination of supply scaling and self-timed circuitry, which has some unique advantages, and 2) the thorough analysis of the power savings that are possible using this technique.
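The mechanism can be caricatured as a feedback loop: if the self-timed core misses its throughput target, raise the supply; if it has slack, lower it. The delay-versus-voltage model and the step size below are assumptions for illustration only, not the regulator used on the demonstrator chip.

    # Toy feedback loop for adaptive supply-voltage scaling (all numbers assumed).
    VT = 0.7                        # threshold voltage, volts

    def stage_delay_ns(vdd):
        """Crude delay-vs-supply model: delay grows sharply as Vdd approaches VT."""
        return 10.0 * vdd / (vdd - VT) ** 2

    REQUIRED_DELAY_NS = 40.0        # performance target per operation
    vdd, step = 3.3, 0.05
    for _ in range(100):
        if stage_delay_ns(vdd) > REQUIRED_DELAY_NS:   # too slow: raise the supply
            vdd += step
        else:                                         # meeting the target: try lowering it
            vdd -= step
    print(f"supply settles near {vdd:.2f} V, delay {stage_delay_ns(vdd):.1f} ns")

Because the circuit is self-timed, running at whatever speed the lowered supply allows is functionally safe, which is the unique advantage the abstract points to.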

255 citations


Journal ArticleDOI
TL;DR: A novel way of implementing the leading zero detector (LZD) circuit is presented based on an algorithmic approach resulting in a modular and scalable circuit for any number of bits.
Abstract: A novel way of implementing the leading zero detector (LZD) circuit is presented. The implementation is based on an algorithmic approach, resulting in a modular and scalable circuit for any number of bits. We designed 32- and 64-bit leading zero detector circuits in CMOS and ECL technology. The CMOS version was designed using both logic synthesis and the algorithmic approach. The algorithmic implementation is compared with the results obtained using modern logic synthesis tools in the same 0.6 μm CMOS technology. The implementation based on the algorithmic approach showed an advantage compared to the results produced by logic synthesis. The ECL implementation of the 64-bit LZD circuit was simulated to perform in under 200 ps at nominal speed.
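The modular, width-scalable structure follows from a simple recurrence: the leading-zero count of a 2k-bit word is derived from the counts of its two k-bit halves. A behavioral sketch of that recurrence (not the gate-level circuit):

    def lzd(bits):
        """Leading-zero count of a bit string, built by pairwise combination.

        Each level combines two half-word results: if the upper half is all zeros,
        add its width to the lower half's count; otherwise keep the upper half's
        count. This mirrors a modular, width-scalable circuit structure."""
        n = len(bits)
        if n == 1:
            return 0 if bits == "1" else 1
        upper, lower = bits[: n // 2], bits[n // 2:]
        u = lzd(upper)
        return u if u < len(upper) else len(upper) + lzd(lower)

    print(lzd("00000000000000000000000101100000"))  # -> 23 for this 32-bit word

Each level of the recursion corresponds to one layer of identical combining blocks, which is what makes the circuit scale cleanly to any word width.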

167 citations


Journal ArticleDOI
TL;DR: The results are extremely encouraging, indicating that double edge triggered flip-flops are capable of significant energy savings, for only a small overhead in complexity.
Abstract: In this paper we study the power savings possible using double edge triggered (DET) instead of conventional single edge triggered (SET) flip-flops. We begin the paper by introducing a set of novel D-type double edge triggered flip-flops which can be implemented with fewer transistors than any previous design. The power dissipation in these flip-flops and single edge triggered flip-flops is compared via architectural level studies, analytical considerations and simulations. The analysis includes an implementation independent study on the effect of input sequences on the energy dissipation of single and double edge triggered flip-flops. The system level energy savings possible by using registers consisting of double edge triggered flip-flops, instead of single edge triggered flip-flops, are subsequently explored. The results are extremely encouraging, indicating that double edge triggered flip-flops are capable of significant energy savings, for only a small overhead in complexity.
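The dominant system-level effect is on the clock network: a DET register captures data on both edges, so the clock can run at half the frequency for the same data throughput. A back-of-the-envelope sketch with assumed clock-load capacitances:

    # Clock-network power for SET vs DET registers (all values assumed for illustration).
    VDD = 3.3
    DATA_RATE = 100e6                 # words per second required
    C_CLOCK_SET = 10e-12              # clock load of SET register + distribution, farads
    C_CLOCK_DET = 12e-12              # DET cells typically load the clock somewhat more

    # The clock toggles twice per period, so P_clk = C * Vdd^2 * f_clk.
    p_set = C_CLOCK_SET * VDD**2 * DATA_RATE          # SET needs f_clk equal to the data rate
    p_det = C_CLOCK_DET * VDD**2 * (DATA_RATE / 2)    # DET needs only half the clock rate

    print(f"SET clock power {p_set*1e3:.2f} mW, DET clock power {p_det*1e3:.2f} mW")

With these assumed numbers the slightly larger DET clock load is more than offset by halving the clock rate; the paper quantifies this tradeoff in detail at the flip-flop and register level.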

149 citations


Journal ArticleDOI
TL;DR: A polynomial time optimal algorithm is developed for computing an area-minimum mapping solution without node duplication for a K-bounded general Boolean network, which makes a significant step towards complete understanding of the general area minimization problem in FPGA technology mapping.
Abstract: In this paper, we study the area and depth trade-off in lookup-table (LUT) based FPGA technology mapping. Starting from a depth-optimal mapping solution, we perform a sequence of depth relaxation operations and area-minimizing mapping procedures to produce a set of mapping solutions for a given design with smooth area and depth trade-off. As the core of the area minimization step, we have developed a polynomial time optimal algorithm for computing an area-minimum mapping solution without node duplication for a K-bounded general Boolean network, which makes a significant step towards complete understanding of the general area minimization problem in FPGA technology mapping. The experimental results on MCNC benchmark circuits show that our solution sets outperform the solutions produced by most existing mapping algorithms in terms of both area and depth minimization.

144 citations


Journal ArticleDOI
TL;DR: Three schemes for concurrent error detection in multilevel circuits are proposed, with which all single stuck-at faults in the circuit can be detected concurrently.
Abstract: Conventional logic synthesis systems are targeted towards reducing the area required by a logic block, as measured by the literal count or gate count; or improving the performance in terms of gate delays; or improving the testability of the synthesized circuit, as measured by the irredundancy of the resultant circuit. In this paper, we address the problem of developing reliability driven logic synthesis algorithms for multilevel logic circuits, which are integrated within the MIS synthesis system. Our procedures are based on concurrent error detection techniques that have been proposed in the past for two-level circuits, and on adapting those techniques to multilevel logic synthesis algorithms. Three schemes for concurrent error detection in a multilevel circuit are proposed in this paper, with which all single stuck-at faults in the circuit can be detected concurrently. The first scheme uses duplication of a given multilevel circuit with the addition of a totally self-checking comparator. The second scheme proposes a procedure to generate the multilevel circuit from a two-level representation under some constraint such that the Berger code of the output vector can be used to detect any single fault inside the circuit, except at the inputs. A constrained technology mapping procedure is also presented in this paper. The third scheme is based on parity codes on the outputs. The outputs are partitioned using a novel partitioning algorithm, and each partition is implemented using a multilevel circuit. Some additional parity coded outputs are generated. In all three schemes, all the necessary checkers are generated automatically and the whole circuit is placed and routed using the Timberwolf layout package. The area overheads for several benchmark examples are reported in this paper. The entire procedure is integrated into a new system called RSYN.
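The second scheme's check relies on the Berger code, whose check symbol is simply the number of 0's in the information word; any unidirectional error changes that count and is therefore caught. A small sketch of the encoding and the check (the synthesis constraint that internal faults cause only unidirectional output errors is what makes this check sufficient):

    def berger_check_bits(word_bits: str) -> int:
        """Berger check symbol: the number of 0's in the information word."""
        return word_bits.count("0")

    def berger_ok(word_bits: str, check: int) -> bool:
        return berger_check_bits(word_bits) == check

    outputs = "10110100"                        # circuit outputs (information bits)
    check = berger_check_bits(outputs)          # generated alongside the outputs
    print(berger_ok(outputs, check))            # True: fault-free case

    faulty = "10111100"                         # a unidirectional 0 -> 1 error
    print(berger_ok(faulty, check))             # False: the error is detected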

116 citations


Journal ArticleDOI
TL;DR: This paper studies the simultaneous driver and wire sizing (SDWS) problem under two objective functions: i) delay minimization only, or ii) combined delay and power dissipation minimization, and presents efficient algorithms for computing optimal SDWS solutions under the two objectives.
Abstract: In this paper, we study the simultaneous driver and wire sizing (SDWS) problem under two objective functions: i) delay minimization only, or ii) combined delay and power dissipation minimization. We present general formulations of the SDWS problem under these two objectives based on the distributed Elmore delay model with consideration of both capacitive power dissipation and short-circuit power dissipation. We show several interesting properties of the optimal SDWS solutions under the two objectives, including an important result which reveals the relationship between driver sizing and optimal wire sizing. These results lead to polynomial time algorithms for computing the lower and upper bounds of optimal SDWS solutions under the two objectives, and efficient algorithms for computing optimal SDWS solutions under the two objectives. We have implemented these algorithms and compared them with existing design methods for driver sizing only or independent driver and wire sizing. Accurate SPICE simulation shows that our methods reduce delay by 12%-49% and power dissipation by 26%-63% compared with existing design methods.
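The quantity being optimized is the distributed Elmore delay of the driver-plus-wire stage. The sketch below evaluates that delay for a uniform-width wire as driver size and wire width vary; the unit resistances and capacitances are assumed values, and the actual formulation also allows per-segment wire widths.

    # Elmore delay of a driver plus a uniform wire driving a load (assumed unit values).
    R_DRV_UNIT = 10e3       # ohms for a minimum-size driver
    R_WIRE_SQ = 0.08        # ohms per square of wire
    C_WIRE_AREA = 0.06e-15  # farads per square micron of wire
    C_LOAD = 50e-15         # receiver load, farads
    LENGTH_UM = 2000.0

    def elmore_delay(driver_size, wire_width_um):
        r_d = R_DRV_UNIT / driver_size
        r_w = R_WIRE_SQ * LENGTH_UM / wire_width_um
        c_w = C_WIRE_AREA * LENGTH_UM * wire_width_um
        # The driver resistance sees all downstream capacitance; the distributed
        # wire resistance sees half its own capacitance plus the load.
        return r_d * (c_w + C_LOAD) + r_w * (0.5 * c_w + C_LOAD)

    for size, width in [(1, 1.0), (8, 1.0), (8, 3.0), (32, 3.0)]:
        print(f"driver x{size:2d}, width {width:.1f} um -> {elmore_delay(size, width)*1e9:.2f} ns")

The interplay visible even in this toy model (a wider wire helps only once the driver is large enough to drive the added capacitance) is the kind of driver-wire relationship the paper characterizes.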

114 citations


Journal ArticleDOI
TL;DR: An assessment of the strengths and weaknesses of using FPGA's for floating-point arithmetic is presented.
Abstract: We present empirical results describing the implementation of an IEEE Standard 754 compliant floating-point adder/multiplier using field programmable gate arrays. The use of FPGA's permits fast and accurate quantitative evaluation of a variety of circuit design tradeoffs for addition and multiplication. FPGA's also permit accurate assessments of the area and time costs associated with various features of the IEEE floating-point standard, including rounding and gradual underflow. These costs are analyzed, along with the effects of architectural correlation, a phenomenon that occurs when the cost of combining architectural features exceeds the sum of the costs of their separate implementations. We conclude with an assessment of the strengths and weaknesses of using FPGA's for floating-point arithmetic.

Journal ArticleDOI
TL;DR: The inability to contain faults within single cells and the need for fast reconfiguration are identified as the key obstacles to obtaining a significant increase in yield.
Abstract: The fine granularity and reconfigurable nature of field-programmable gate arrays (FPGA's) suggest that defect-tolerant methods can be readily applied to these devices in order to increase their maximum economic sizes through increased yield. This paper identifies the inability to contain faults within single cells and the need for fast reconfiguration as the key obstacles to obtaining a significant increase in yield. Monte Carlo defect modeling of the photolithographic layers of VLSI FPGA's is used as a foundation for the yield modeling of various defect-tolerant architectures. Results suggest that a medium-grain architecture is the best solution, offering a substantial increase in size without significant side effects. This architecture is shown to produce greater gate densities than the alternative approach of realizing ultralarge-scale FPGA's: multichip modules.
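The yield-modeling flow can be imitated in miniature: scatter random defects over an array of cells and ask whether the spare resources of a given defect-tolerant architecture can repair the chip. The array size, defect rate, and row-sparing scheme below are arbitrary assumptions, not the layer-by-layer photolithographic model used in the paper.

    import math
    import random

    # Monte Carlo yield of a cell array with one spare row (all parameters assumed).
    ROWS = 16
    MEAN_DEFECTS = 1.5          # average defects per chip
    SPARE_ROWS = 1
    TRIALS = 20000

    def poisson(lam):
        """Draw a Poisson-distributed defect count (Knuth's method, no dependencies)."""
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= limit:
                return k
            k += 1

    def chip_is_repairable():
        defects = poisson(MEAN_DEFECTS)
        bad_rows = {random.randrange(ROWS) for _ in range(defects)}  # row hit by each defect
        return len(bad_rows) <= SPARE_ROWS      # repairable if the spares cover the damage

    random.seed(1)
    estimate = sum(chip_is_repairable() for _ in range(TRIALS)) / TRIALS
    print(f"estimated yield with row sparing: {estimate:.3f}")

Changing the sparing granularity (cell, row, or column spares) and re-running the same loop is the miniature analogue of comparing the defect-tolerant architectures studied in the paper.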

Journal ArticleDOI
TL;DR: This paper provides the first in-depth formal analysis of the structure of the constraints in high-level synthesis, and shows how to exploit that structure in a well-designed ILP formulation by adding new valid inequalities.
Abstract: In integer linear programming (ILP), formulating a "good" model is of crucial importance to solving that model. In this paper, we begin with a mathematical analysis of the structure of the assignment, timing, and resource constraints in high-level synthesis, and then evaluate the structure of the scheduling polytope described by these constraints. We then show how the structure of the constraints can be exploited to develop a well-structured ILP formulation, which can serve as a solid theoretical foundation for future improvement. As a start in that direction, we also present two methods to further tighten the formulation. The contribution of this paper is twofold: 1) it provides the first in-depth formal analysis of the structure of the constraints, and it shows how to exploit that structure in a well-designed ILP formulation, and 2) it shows how to further improve a well-structured formulation by adding new valid inequalities.
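For orientation, the assignment, timing, and resource constraints in a time-indexed scheduling model typically take the following generic form (a standard formulation from the high-level synthesis literature; the paper analyzes and tightens the polytope of such a formulation):

    \sum_{t} x_{i,t} = 1 \quad \forall i \qquad \text{(assignment: each operation starts exactly once)}

    \sum_{t} t\,x_{j,t} - \sum_{t} t\,x_{i,t} \ge d_i \quad \forall\, i \to j \qquad \text{(timing/precedence)}

    \sum_{i \in \mathrm{ops}(k)} \; \sum_{\tau = t-d_i+1}^{t} x_{i,\tau} \le M_k \quad \forall k, t \qquad \text{(resources)}

    x_{i,t} \in \{0, 1\}

Here x_{i,t} = 1 when operation i starts in control step t, d_i is its latency, and M_k is the number of functional units of type k. The valid inequalities mentioned in the abstract are extra constraints that cut off fractional solutions of the LP relaxation without removing any integer schedule.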

Journal ArticleDOI
TL;DR: Analysis of the reliability model indicates that, for some circuits, the reliability obtained with majority voting techniques is significantly greater than predicted by any previous model.
Abstract: The effect of compensating module faults on the reliability of majority voting based VLSI fault-tolerant circuits is investigated using a fault injection simulation method. This simulation method facilitates consideration of multiple faults in the replicated circuit modules as well as the majority voting circuits to account for the fact that, in VLSI implementations, the majority voting circuits are constructed from components of the same reliability as those used to construct the circuit modules. From the fault injection simulation, a survivability distribution is obtained which, when combined with an area overhead expression, leads to a more accurate reliability model for majority voting based VLSI fault-tolerant circuits. The new model is extended to facilitate the calculation of reliability of fault-tolerant circuits which have sustained faults but continue to operate properly. Analysis of the reliability model indicates that, for some circuits, the reliability obtained with majority voting techniques is significantly greater than predicted by any previous model.
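The baseline being refined is the classical triple modular redundancy (TMR) expression with an imperfect voter; compensating faults matter because two faulty modules do not necessarily defeat the vote. A sketch of the classical model and of a crude compensation-aware variant (the compensation probability here is an assumed parameter, not the paper's simulation-derived survivability distribution):

    # Reliability of triple modular redundancy with an imperfect voter.
    def r_tmr(r_module, r_voter):
        """Classical model: the voter works and at most one module has failed."""
        return r_voter * (3 * r_module**2 - 2 * r_module**3)

    def r_tmr_compensating(r_module, r_voter, p_comp=0.3):
        """Variant: with probability p_comp (assumed), two faulty modules err on
        disjoint outputs and the majority vote still succeeds."""
        p_two_faulty = 3 * r_module * (1 - r_module) ** 2
        return r_tmr(r_module, r_voter) + r_voter * p_comp * p_two_faulty

    for rm in (0.99, 0.95, 0.90):
        print(rm, round(r_tmr(rm, 0.995), 4), round(r_tmr_compensating(rm, 0.995), 4))

The compensation term is exactly the effect that makes the measured reliability higher than the classical model predicts for some circuits.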

Journal ArticleDOI
TL;DR: This paper describes a new high-level synthesis system based on the hierarchical production based specification (PBS); the system automatically constructs a controlling machine from the PBS, and this process is not impacted by the possibly exponentially larger deterministic state space of the designs.
Abstract: This paper describes a new high-level synthesis system based on the hierarchical production based specification (PBS). Advantages of this form of specification are that the designer does not describe the control flow in terms of explicit states or control variables, and that the designer does not describe a particular form of implementation. The production-based specification also separates the specification of the control aspects and data-flow aspects of the design. The control is implicitly described via the production hierarchy, while the data-flow is described as action computations. This approach is a hardware analog of popular software engineering techniques. The Clairvoyant system automatically constructs a controlling machine from the PBS and this process is not impacted by the possibly exponentially larger deterministic state space of the designs. The encodings generated by the constructions compare favorably to encodings derived using graph-based state encoding techniques in terms of logic complexity and logic depth. These construction techniques utilize recent advances in BDD techniques.

Journal ArticleDOI
TL;DR: A compact analog synapse cell, not biased in the subthreshold region, is presented for fully-parallel operation; it can approximate a Gaussian function with accuracy around 98% in the ideal case.
Abstract: Back-propagation neural networks with Gaussian function synapses have better convergence properties than those with linear-multiplying synapses. In digital simulation, more computing time is spent on Gaussian function evaluation. We present a compact analog synapse cell which is not biased in the subthreshold region for fully-parallel operation. This cell can approximate a Gaussian function with accuracy around 98% in the ideal case. Device mismatch induced by the fabrication process will cause some degradation of this approximation. The Gaussian synapse cell can also be used in unsupervised learning. Programmability of the proposed Gaussian synapse cell is achieved by changing the stored synapse weight W_ji, the reference current, and the sizes of transistors in the differential pair.

Journal ArticleDOI
Sandip Kundu1
TL;DR: A diagnosis system that can diagnose faults in a scan chain is described, so that the manufacturing process or physical design can be fixed to improve yield.
Abstract: Testing screens for good chips. However, when test fallout is high (low yield), it becomes necessary to diagnose faults so that the manufacturing process or physical design can be fixed to improve yield. Several scan based diagnostic schemes are used in industry. They work when the scan chain itself is fault free. In this paper we describe a diagnosis system that can diagnose faults in a scan chain.

Journal ArticleDOI
TL;DR: It is shown that by sizing transistors judiciously it is possible to gain significant speed improvements at the cost of only a slight increase in power and hence a better power-delay product.
Abstract: An approach to designing CMOS adders for both high speed and low power is presented by analyzing the performance of three types of adders: linear time adders, log N time adders, and constant time adders. The representative adders used are a ripple carry adder, a blocked carry lookahead adder, and several signed-digit adders, respectively. Some of the tradeoffs that are possible during the logic design of an adder to improve its power-delay product are identified. An effective way of improving the speed of a circuit is by transistor sizing, which unfortunately increases power dissipation to a large extent. It is shown that by sizing transistors judiciously it is possible to gain significant speed improvements at the cost of only a slight increase in power and hence a better power-delay product. Perflex, an in-house performance driven layout generator, is used to systematically generate sized layouts.

Journal ArticleDOI
TL;DR: An Integer Linear Programming model for the self-recovering microarchitecture synthesis problem is presented; the resulting ILP formulation can minimize either the number of voters or the overall hardware, subject to constraints on the number of clock cycles, the retry period, and the number of checkpoints.
Abstract: The growing trend towards VLSI implementation of crucial tasks in critical applications has increased both the demand for and the scope of fault-tolerant VLSI systems. In this paper, we present a self-recovering microarchitecture synthesis system. In a self-recovering microarchitecture, intermediate results are compared at regular intervals and, if correct, saved in registers (checkpointing). On the other hand, on detecting a fault, the self-recovering microarchitecture rolls back to a previous checkpoint and retries. The proposed synthesis system comprises a heuristic and an optimal subsystem. The heuristic synthesis subsystem has two components. Whereas the checkpoint insertion algorithm identifies good checkpoints by successively eliminating clock cycle boundaries that either have a high checkpoint overhead or violate the retry period constraint, the novel edge-based scheduler assigns edges to clock cycle boundaries, in addition to scheduling nodes to clock cycles. Also, checkpoint insertion and edge-based scheduling are intertwined using a flexible synthesis methodology. We additionally show an Integer Linear Programming model for the self-recovering microarchitecture synthesis problem. The resulting ILP formulation can minimize either the number of voters or the overall hardware, subject to constraints on the number of clock cycles, the retry period, and the number of checkpoints.

Journal ArticleDOI
TL;DR: This paper describes a formal method for the diagnosis and correction of logic design errors in an incorrect gate-level implementation, which is robust and covers all simple design errors described by Abadir et al. (1988).
Abstract: Logic verification tools are often used to verify a gate-level implementation of a digital system in terms of its functional specification. If the implementation is found not to be functionally equivalent to the specification, it is important to correct the implementation automatically. This paper describes a formal method for the diagnosis and correction of logic design errors in an incorrect gate-level implementation. We use Boolean equation techniques to search for potential error locations. An efficient search and pruning algorithm is developed by introducing the notion of an immediate dominator set. Two correction procedures are proposed. Gate correction corrects errors such as wrong gate type, missing inverters, etc.; line correction corrects errors such as missing wires and wrong connections. Our method is robust and covers all simple design errors described by Abadir et al. (1988). Experimental results for a set of ISCAS and MCNC benchmark circuits demonstrate the effectiveness of the proposed techniques.

Journal ArticleDOI
Jacob Savir1, S. Patil1
TL;DR: This paper concentrates on the generation of broad-side delay test vectors, shows the results of experiments conducted on the ISCAS sequential benchmarks, and discusses some concerns of the broad-side delay test strategy.
Abstract: A broad-side delay test is a form of a scan-based delay test, where the first vector of the pair is scanned into the chain, and the second vector of the pair is the combinational circuit's response to this first vector. This delay test form is called "broad-side" since the second vector of the delay test pair is provided in a broad-side fashion, namely through the logic. This paper concentrates on the generation of broad-side delay test vectors, shows the results of experiments conducted on the ISCAS sequential benchmarks, and discusses some concerns of the broad-side delay test strategy.
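The defining property of a broad-side pair is that the second vector is never scanned in: it is whatever the combinational logic produces from the first vector after the launch clock. A tiny sketch for an assumed 3-bit next-state function:

    # Broad-side (launch-off-capture) delay test pair for a toy circuit.
    def next_state(bits):
        """Assumed combinational next-state function of a 3-bit machine."""
        a, b, c = bits
        return (a ^ b, b & c, a | c)

    v1 = (0, 1, 1)          # scanned into the flip-flops through the scan chain
    v2 = next_state(v1)     # applied "through the logic" on the next functional clock
    print("launch vector:", v1, "second vector:", v2)

Only pairs of the form (v, F(v)) are applicable, which is the main limitation on achievable fault coverage compared with an arbitrary two-vector delay test.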

Journal ArticleDOI
TL;DR: A new, parallel, nearest-neighbor (NN) pattern classifier, based on a 2D Cellular Automaton (CA) architecture, is presented, which produces piece-wise linear discriminant curves between clusters of points of complex shape (nonlinearly separable).
Abstract: A new, parallel, nearest-neighbor (NN) pattern classifier, based on a 2D Cellular Automaton (CA) architecture, is presented in this paper. The proposed classifier is both time and space efficient, when compared with already existing NN classifiers, since it does not require complex distance calculations and ordering of distances, and storage requirements are kept minimal since each cell stores information only about its nearest neighborhood. The proposed classifier produces piece-wise linear discriminant curves between clusters of points of complex shape (nonlinearly separable) using the computational geometry concept known as the Voronoi diagram, which is established through CA evolution. These curves are established during an "off-line" operation and, thus, the subsequent classification of unknown patterns is achieved very fast. The VLSI design and implementation of a nearest neighborhood processor of the proposed 2D CA architecture is also presented in this paper.
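The classification mechanism can be mimicked in a few lines: labeled seed cells expand one cell per CA step, each empty cell adopts the label of the first wave that reaches it, and the resulting regions approximate a (city-block) Voronoi partition. The grid size and seeds below are arbitrary examples.

    from collections import deque

    # Cellular-automaton-style wave expansion that carves a grid into nearest-neighbor
    # regions (a city-block Voronoi partition); the seeds are assumed training points.
    N = 12
    seeds = {(2, 2): "A", (9, 3): "B", (5, 9): "C"}     # training points with class labels

    label = {pos: cls for pos, cls in seeds.items()}
    frontier = deque(seeds)
    while frontier:                                     # breadth-first pass == successive CA steps
        x, y = frontier.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < N and 0 <= ny < N and (nx, ny) not in label:
                label[(nx, ny)] = label[(x, y)]         # the first wave to arrive wins
                frontier.append((nx, ny))

    print(label[(8, 8)])   # an unknown pattern is classified by the region it lands in

Because the region map is built once, off-line, classifying a new pattern afterwards is a single table lookup, which is the speed advantage the abstract describes.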

Journal ArticleDOI
TL;DR: The commonly employed models, most notably the large area clustering negative binomial distribution, do not provide a sufficiently good match for these large area VLSI IC's; only the recently proposed medium size clustering model is close enough to the empirical distribution.
Abstract: Defect maps of 57 wafers containing large area VLSI IC's were analyzed in order to find a good match between the empirical distribution of defects and a theoretical model. Our main result is that the commonly employed models, most notably, the large area clustering negative binomial distribution, do not provide a sufficiently good match for these large area IC's. Only the recently proposed medium size clustering model is close enough to the empirical distribution. An even better match can be obtained either by combining two theoretical distributions or by a "censoring" procedure in which the worst chips are ignored. Another goal of the study was to find out whether certain portions of either the chip or the wafer had more defects than the others.
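The distinction being tested is essentially which yield formula fits the data: the Poisson model assumes independent defects, while the negative binomial model adds a clustering parameter alpha. A sketch comparing the two for assumed chip area and defect density (the wafer data themselves are of course not reproduced here):

    import math

    # Yield models: Poisson vs. clustered negative binomial (all parameters assumed).
    def yield_poisson(area_cm2, d0_per_cm2):
        return math.exp(-area_cm2 * d0_per_cm2)

    def yield_neg_binomial(area_cm2, d0_per_cm2, alpha):
        """alpha -> infinity recovers the Poisson model; small alpha means heavy clustering."""
        return (1.0 + area_cm2 * d0_per_cm2 / alpha) ** (-alpha)

    AREA, D0 = 2.0, 0.8          # cm^2 per chip, defects per cm^2 (assumed)
    print("Poisson            :", round(yield_poisson(AREA, D0), 3))
    for alpha in (0.5, 2.0, 10.0):
        print(f"neg. binomial a={alpha:4.1f}:", round(yield_neg_binomial(AREA, D0, alpha), 3))

Fitting alpha (and the size of the region over which defects cluster) against the empirical defect maps is what separates the large-area and medium-size clustering models compared in the paper.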

Journal ArticleDOI
TL;DR: This paper examines the transition delay of a circuit and provides a procedure for directly calculating the transition delay, which outputs a vector sequence that may be timing simulated to certify static timing verification.
Abstract: Most research in timing verification has implicitly assumed a single vector floating mode computation of delay which is an approximation of the multivector transition delay. In this paper we examine the transition delay of a circuit and demonstrate that the transition delay of a circuit can differ from the floating delay of a circuit. We then provide a procedure for directly calculating the transition delay of a circuit. The most practical benefit of this procedure is the fact that it not only results in a delay calculation but outputs a vector sequence that may be timing simulated to certify static timing verification.

Journal ArticleDOI
TL;DR: This paper presents methods for scheduling and partitioning behavioral descriptions (e.g., CDFG's) in order to synthesize application-specific multiprocessor systems, and proposes an iterative partitioning heuristic for large applications.
Abstract: In this paper, we present methods for scheduling and partitioning behavioral descriptions (e.g., CDFG's) in order to synthesize application-specific multiprocessor systems. Our target application domain is digital signal processing (DSP). In order to meet the user given constraints (such as timing), maximizing the system throughput and minimizing the amount of communication between processors are important. A model of a target processor and the communication device (i.e., bus, FIFO and delay element) is defined as a basis for the synthesis. We use an integer linear programming formulation to solve the partitioning and scheduling problems simultaneously. The optimization complexity for large applications can be reduced by using a simplified formulation. For even larger applications, we propose an iterative partitioning heuristic. Finally, the formulations are extended to take into account conditional branches, loops, and critical signals.

Journal ArticleDOI
TL;DR: This paper presents two novel sorting network-based architectures for computing high sample rate nonrecursive rank order filters that are based on bubble-sort and Batcher's odd-even merge sort.
Abstract: This paper presents two novel sorting network-based architectures for computing high sample rate nonrecursive rank order filters. The proposed architectures consist of significantly fewer comparators than existing sorting network-based architectures that are based on bubble-sort and Batcher's odd-even merge sort. The reduction in the number of comparators is obtained by sorting the columns of the window only once, and by merging the sorted columns in a way such that the number of candidate elements for the output is very small. The number of comparators per output is reduced even further by processing a block of outputs at a time. Block processing procedures that exploit the computational overlap between consecutive windows are developed for both the proposed networks.
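The comparator savings come from the observation that adjacent windows of a sliding rank order filter share columns, so each column need only be sorted once and the sorted columns merged per window. A behavioral sketch of a 3x3 median filter organized this way (Python's sort stands in for the comparator network):

    # Median (rank-order) filtering along one image row using shared sorted columns.
    # Each 3-pixel column is sorted once and reused by every window that contains it.
    image = [
        [12, 200,  14,  13,  90,  11],
        [10,  15,  16, 180,  17,  12],
        [ 9,  14, 220,  15,  16,  13],
    ]

    def sorted_column(j):
        return sorted(image[r][j] for r in range(3))

    columns = [sorted_column(j) for j in range(len(image[0]))]   # sort each column only once

    def median_at(j):
        window = columns[j - 1] + columns[j] + columns[j + 1]    # merge three sorted columns
        return sorted(window)[4]                                 # rank 5 of 9 = the median

    print([median_at(j) for j in range(1, len(image[0]) - 1)])

In hardware the per-window merge is done by a small comparator network rather than a full sort, and block processing amortizes even that merge over several adjacent outputs.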

Journal ArticleDOI
TL;DR: This paper describes a VLSI architecture for a real-time dynamic code book generator and encoder of 512×512 images at 30 frames/s for image compression applications.
Abstract: Image compression applications use vector quantization (VQ) for its high compression ratio and image quality. The current VQ hardware employs static instead of dynamic code book generation, as the latter demands intensive computation and correspondingly expensive hardware even though it offers better image quality. This paper describes a VLSI architecture for a real-time dynamic code book generator and encoder of 512×512 images at 30 frames/s. The four-chip 0.8 μm CMOS design implements a tree of Kohonen self-organizing maps, and consists of two VQ processors and two image buffer memory chips. The pipelined VQ processor contains a computational core for both code book generation and encoding, and is scalable to processing larger frames.
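The encoding half of such an architecture maps each small image block to the index of its nearest codeword; only the index is stored or transmitted. A minimal sketch with an assumed 2x2-pixel block size and a hand-made code book (the real design generates the code book dynamically with a tree of Kohonen self-organizing maps):

    # Vector quantization encoding: map each image block to its nearest codeword index.
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    codebook = [            # assumed 4-entry code book of flattened 2x2 blocks
        (0, 0, 0, 0),
        (255, 255, 255, 255),
        (0, 0, 255, 255),
        (255, 0, 255, 0),
    ]

    def encode(block):
        """Return the index of the closest codeword (the value actually transmitted)."""
        return min(range(len(codebook)), key=lambda i: sq_dist(block, codebook[i]))

    blocks = [(10, 5, 0, 20), (250, 10, 240, 3), (30, 12, 230, 244)]
    print([encode(b) for b in blocks])   # compression: one small index per block

The distance computations are what the systolic VQ processor parallelizes; dynamic code book generation additionally updates the codewords as new image statistics arrive.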

Journal ArticleDOI
TL;DR: Simulation results show that this high-speed image compression VLSI processor based on the systolic architecture of difference-codebook binary tree-searched vector quantization is applicable to many types of image data and capable of producing good reconstructed data quality at high compression ratios.
Abstract: A high-speed image compression VLSI processor based on the systolic architecture of difference-codebook binary tree-searched vector quantization has been developed to meet the increasing demands on large-volume data communication and storage requirements. Simulation results show that this design is applicable to many types of image data and capable of producing good reconstructed data quality at high compression ratios. Various design aspects of the binary tree-searched vector quantizer including the algorithm, architecture, and detailed functional design are thoroughly investigated for VLSI implementation. An 8-level difference-codebook binary tree-searched vector quantizer can be implemented on a custom VLSI chip that includes a systolic array of eight identical processors and a hierarchical memory of eight subcodebook memory banks. The total transistor count is about 300,000 and the die size is about 8.67 × 7.72 mm² in a 1.0 μm CMOS technology. The throughput rate of this high-speed VLSI compression system is approximately 25 Mpixels per second and its equivalent computation power is 600 million instructions per second.

Journal ArticleDOI
TL;DR: This research breaks new ground by guaranteeing globally optimal architectures for multichip systems for a specific objective function, and supporting interchip communication delay, interchip bus allocation, and other complex interface constraints.
Abstract: An optimization approach to the high level synthesis of VLSI multichip architectures is presented in this paper. This research is important for industry since it is well known that these early high level decisions have the greatest impact on the final VLSI implementation. Optimal application-specific architectures are synthesized here to minimize latency given constraints on chip area, I/O pin count, and interchip communication delays. A mathematical integer programming (IP) model for simultaneously partitioning, scheduling, and allocating hardware (functional units, I/O pins, and interchip busses) is formulated. By exploiting the problem structure, using polyhedral theory, the size of the search space is decreased and a new variable selection strategy is introduced based on the branch and bound algorithm. Multichip optimal architectures for several examples are synthesized in practical CPU times. Execution times are comparable to previous heuristic approaches; however, there are significant improvements in optimal schedules and allocations of multichips. This research breaks new ground by 1) simultaneously partitioning, scheduling, and allocating in practical CPU times, 2) guaranteeing globally optimal architectures for multichip systems for a specific objective function, and 3) supporting interchip communication delay, interchip bus allocation, and other complex interface constraints.

Journal ArticleDOI
TL;DR: An optimal and a heuristic approach are presented to solve the binding problem which occurs in high-level synthesis of digital systems; the heuristic is based on a network flow model and also considers floorplanning during the design process to minimize the interconnection area.
Abstract: In this paper we present an optimal and a heuristic approach to solve the binding problem which occurs in high-level synthesis of digital systems. The optimal approach is based on an integer linear programming formulation. Given that such an approach is not practical for large problems, we then derive a heuristic from the ILP formulation which produces very good solutions in a matter of seconds. The heuristic is based on a network flow model and also considers floorplanning during the design process to minimize the interconnection area.