scispace - formally typeset
Search or ask a question

Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2004"


Journal Article•DOI•
TL;DR: Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000 e-8 bg560 device in non-feedback modes, which is faster and 79% more efficient in terms of equivalent throughput/slice than the fastest previous FPGA implementation known to date.
Abstract: This paper presents novel high-speed architectures for the hardware implementation of the Advanced Encryption Standard (AES) algorithm. Unlike previous works which rely on look-up tables to implement the SubBytes and InvSubBytes transformations of the AES algorithm, the proposed design employs combinational logic only. As a direct consequence, the unbreakable delay incurred by look-up tables in the conventional approaches is eliminated, and the advantage of subpipelining can be further explored. Furthermore, composite field arithmetic is employed to reduce the area requirements, and different implementations for the inversion in subfield GF(2/sup 4/) are compared. In addition, an efficient key expansion architecture suitable for the subpipelined round units is also presented. Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000 e-8 bg560 device in non-feedback modes, which is faster and is 79% more efficient in terms of equivalent throughput/slice than the fastest previous FPGA implementation known to date.

450 citations


Journal Article•DOI•
TL;DR: Two runtime mechanisms for reducing the leakage current of a CMOS circuit are described and a design technique for applying the minimum leakage input to a sequential circuit is presented, which shows that it is possible to reduce the leakage by an average of 25% with practically no delay penalty.
Abstract: The first part of this paper describes two runtime mechanisms for reducing the leakage current of a CMOS circuit. In both cases, it is assumed that the system or environment produces a "sleep" signal that can be used to indicate that the circuit is in a standby mode. In the first method, the "sleep" signal is used to shift in a new set of external inputs and pre-selected internal signals into the circuit with the goal of setting the logic values of all of the internal signals so as to minimize the total leakage current in the circuit. This minimization is possible because the leakage current of a CMOS gate is strongly dependent on the input combination applied to its inputs. In the second method, nMOS and pMOS transistors are added to some of the gates in the circuit to increase the controllability of the internal signals of the circuit and decrease the leakage current of the gates using the "stack effect". This is, however, done carefully so that the minimum leakage is achieved subject to a delay constraint for all input-output paths in the circuit. In both cases, Boolean satisfiability is used to formulate the problems, which are subsequently solved by employing a highly efficient SAT solver. Experimental results on the combinational circuits in the MCNC91 benchmark suite demonstrate that it is possible to reduce the leakage current in combinational circuits by an average of 25% with only a 5% delay penalty. The second part of this paper presents a design technique for applying the minimum leakage input to a sequential circuit. The proposed method uses the built-in scan-chains in a VLSI circuit to drive it with the minimum leakage vector when it enters the sleep mode. The use of these scan registers eliminates the area and delay overhead of the additional circuitry that would otherwise be needed to apply the minimum leakage vector to the circuit. Experimental results on the sequential circuits in the MCNC91 benchmark suit show that, by using the proposed method, it is possible to reduce the leakage by an average of 25% with practically no delay penalty.

293 citations


Journal Article•DOI•
TL;DR: DLS saves 20% to 80% of power consumption of the backlight systems while keeping a reasonable amount of image quality degradation, which fulfills a large variety of user preferences in power-aware multimedia applications.
Abstract: Backlight systems dominate the power requirements of battery-operated hand-held devices with color thin-film transistor (TFT), liquid crystal displays (LCDs). We introduce dynamic luminance scaling of the backlight with appropriate image compensation. Dynamic backlight luminance scaling (DLS) keeps the perceived intensity or contrast of the image as close as possible to the original while achieving significant power reduction. DLS compromises quality of image between power consumption, which fulfills a large variety of user preferences in power-aware multimedia applications. DLS saves 20% to 80% of power consumption of the backlight systems while keeping a reasonable amount of image quality degradation.

219 citations


Journal Article•DOI•
TL;DR: The objective of this paper is to introduce self-similarity as a fundamental property exhibited by the bursty traffic between on-chip modules in typical MPEG-2 video applications.
Abstract: The objective of this paper is to introduce self-similarity as a fundamental property exhibited by the bursty traffic between on-chip modules in typical MPEG-2 video applications. Statistical tests performed on relevant traces extracted from common video clips establish unequivocally the existence of self-similarity in video traffic. Using a generic tile-based communication architecture, we discuss the implications of our findings on on-chip buffer space allocation and present quantitative evaluations for typical video streams. We also describe a technique for synthetically generating traces having statistical properties similar to those obtained from real video clips. Our proposed technique speeds up buffer simulations, allows media system designers to explore architectures rapidly and use large media data benchmarks more efficiently. We believe that our findings open new directions of research with deep implications on some fundamental issues in on-chip networks design for multimedia applications.

210 citations


Journal Article•DOI•
TL;DR: In this paper, high-performance flip-flops are analyzed and classified into two categories: the conditional precharge and the conditional capture technologies, based on how to prevent or reduce the redundant internal switching activities.
Abstract: In this paper, high-performance flip-flops are analyzed and classified into two categories: the conditional precharge and the conditional capture technologies. This classification is based on how to prevent or reduce the redundant internal switching activities. A new flip-flop is introduced: the conditional discharge flip-flop (CDFF). It is based on a new technology, known as the conditional discharge technology. This CDFF not only reduces the internal switching activities, but also generates less glitches at the output, while maintaining the negative setup time and small D-to-Q delay characteristics. With a data-switching activity of 37.5%, the proposed flip-flop can save up to 39% of the energy with the same speed as that for the fastest pulsed flip-flops.

204 citations


Journal Article•DOI•
TL;DR: The resulting soft digital signal processing system achieves up to 60% and 44% energy savings with no loss in the signal-to-noise ratio (SNR) for receive filtering in a QPSK system and the butterfly of fast Fourier transform in a WLAN OFDM system.
Abstract: In this paper, we present a novel algorithmic noise-tolerance (ANT) technique referred to as reduced precision redundancy (RPR). RPR requires a reduced precision replica whose output can be employed as the corrected output in case the original system computes erroneously. When combined with voltage overscaling (VOS), the resulting soft digital signal processing system achieves up to 60% and 44% energy savings with no loss in the signal-to-noise ratio (SNR) for receive filtering in a QPSK system and the butterfly of fast Fourier transform (FFT) in a WLAN OFDM system, respectively. These energy savings are with respect to optimally scaled (i.e., the supply voltage equals the critical voltage V/sub dd-crit/) present day systems. Further, we show that the RPR technique is able to maintain the output SNR for error rates of up to 0.09/sample and 0.06/sample in an finite impulse response filter and a FFT block, respectively.

196 citations


Journal Article•DOI•
TL;DR: A novel technique called LECTOR is proposed for designing CMOS gates which significantly cuts down the leakage current without increasing the dynamic power dissipation, resulting in better leakage reduction compared to other techniques.
Abstract: In CMOS circuits, the reduction of the threshold voltage due to voltage scaling leads to increase in subthreshold leakage current and hence static power dissipation. We propose a novel technique called LECTOR for designing CMOS gates which significantly cuts down the leakage current without increasing the dynamic power dissipation. In the proposed technique, we introduce two leakage control transistors (a p-type and a n-type) within the logic gate for which the gate terminal of each leakage control transistor (LCT) is controlled by the source of the other. In this arrangement, one of the LCTs is always "near its cutoff voltage" for any input combination. This increases the resistance of the path from V/sub dd/ to ground, leading to significant decrease in leakage currents. The gate-level netlist of the given circuit is first converted into a static CMOS complex gate implementation and then LCTs are introduced to obtain a leakage-controlled circuit. The significant feature of LECTOR is that it works effectively in both active and idle states of the circuit, resulting in better leakage reduction compared to other techniques. Further, the proposed technique overcomes the limitations posed by other existing methods for leakage reduction. Experimental results indicate an average leakage reduction of 79.4% for MCNC'91 benchmark circuits.

194 citations


Journal Article•DOI•
TL;DR: The results show that data and instruction caches require different control strategies for efficient execution, and a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode is proposed.
Abstract: On-chip caches represent a sizable fraction of the total power consumption of microprocessors. As feature sizes shrink, the dominant component of this power consumption will be leakage. However, during a fixed period of time, the activity in a data cache is only centered on a small subset of the lines. This behavior can be exploited to cut the leakage power of large data caches by putting the cold cache lines into a state preserving, low-power drowsey mode. In this paper, we investigate policies and circuit techniques for implementing drowsy data caches. We show that with simple microarchitectural techniques, about 80%-90% of the data cache lines can be maintained in a drowsy state without affecting performance by more than 0.6%, even though moving lines into and out of a drowsy state incurs a slight performance loss. According to our projections, in a 70-nm complementary metal-oxide-semiconductor process, drowsy data caches will be able to reduce the total leakage energy consumed in the caches by 60%-75%. In addition, we extend the drowsy cache concept to reduce leakage power of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. In order to enable drowsy instruction caches, we propose a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode. This prediction technique reduces the negative performance impact by 78% compared with the no-prediction policy. Our technique works well even with small predictor sizes and enables a 75% reduction of leakage energy in a 32-kB instruction cache.

189 citations


Journal Article•DOI•
TL;DR: An analytical expression is derived to estimate the probability density function of the leakage current for stacked devices found in CMOS gates and an approach is presented to account for both the inter- and intra-die gate length variations to ensure that the circuit leakage PDF correctly models both types of variation.
Abstract: We develop a method to estimate the variation of leakage current due to both intra-die and inter-die gate length process variability. We derive an analytical expression to estimate the probability density function (PDF) of the leakage current for stacked devices found in CMOS gates. These distributions of individual gate leakage currents are then combined to obtain the mean and variance of the leakage current for an entire circuit. We also present an approach to account for both the inter- and intra-die gate length variations to ensure that the circuit leakage PDF correctly models both types of variation. The proposed methods were implemented and tested on a number of benchmark circuits. Comparison to Monte Carlo simulation validates the accuracy of the proposed method and demonstrates the efficiency of the proposed analysis method. Comparison with traditional deterministic leakage current analysis demonstrates the need for statistical methods for leakage current analysis.

187 citations


Journal Article•DOI•
TL;DR: Chimaera is described, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host processor itself and enables the creation of multi-operand instructions and a speculative execution model key to high-performance, general-purpose reconfiguring computing.
Abstract: By strictly separating reconfigurable logic from the host processor, current custom computing systems suffer from a significant communication bottleneck. In this paper, we describe Chimaera, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host processor itself. With direct access to the host processor's register file, the system enables the creation of multi-operand instructions and a speculative execution model key to high-performance, general-purpose reconfigurable computing. Chimaera also supports multi-output functions and utilizes partial run-time reconfiguration to reduce reconfiguration time. Combined, the system can provide speedups of a factor of two or more for general-purpose computing, and speedups of 160 or more are possible for hand-mapped applications.

179 citations


Journal Article•DOI•
TL;DR: In this article, a half-latch level converter and a precharged level converter are used in a flip-flop to minimize energy, delay, and area penalties due to level conversion.
Abstract: Dual-supply voltage design using a clustered voltage scaling (CVS) scheme is an effective approach to reduce chip power. The optimal CVS design relies on a level converter implemented in a flip-flop to minimize energy, delay, and area penalties due to level conversion. Additionally, circuit robustness against supply bounce is a key property that differentiates good level converter design. Novel flip-flops presented in this paper incorporate a half-latch level converter and a precharged level converter. These flip-flops are optimized in the energy-delay design space to achieve over 30% reduction of energy-delay product and about 10% savings of total power in a CVS design as compared to the conventional flip-flop. These benefits are accompanied by 24% flip-flop robustness improvement leading to 13% delay spread reduction in a CVS critical path. The proposed flip-flops also show 18% layout area reduction. Advantages of level conversion in a flip-flop over asynchronous level conversion in combinational logic are also discussed in terms of delay penalty and its sensitivity to supply bounce.

Journal Article•DOI•
TL;DR: This paper presents several low-latency mixed-timing FIFO (first-in-first-out) interfaces designs that interface systems on a chip working at different speeds and initial simulations for both latency and throughput are promising.
Abstract: This paper presents several low-latency mixed-timing FIFO (first-in-first-out) interfaces designs that interface systems on a chip working at different speeds. The connected systems can be either synchronous or asynchronous. The designs are then adapted to work between systems with very long interconnect delays, by migrating a single-clock solution by Carloni et al. (1999, 2000, and 2001) (for "latency-insensitive" protocols) to mixed-timing domains. The new designs can be made arbitrarily robust with regard to metastability and interface operating speeds. Initial simulations for both latency and throughput are promising.

Journal Article•DOI•
TL;DR: By simulations, it is shown that quantization error can be reduced up to 50% by the proposed error compensation method compared with the existing method with approximately the same hardware overhead in the bias generation circuit.
Abstract: This paper presents an error compensation method for a modified Booth fixed-width multiplier that receives a W-bit input and produces a W-bit product. To efficiently compensate for the quantization error, Booth encoder outputs (not multiplier coefficients) are used for the generation of error compensation bias. The truncated bits are divided into two groups depending upon their effects on the quantization error. Then, different error compensation methods are applied to each group. By simulations, it is shown that quantization error can be reduced up to 50% by the proposed error compensation method compared with the existing method with approximately the same hardware overhead in the bias generation circuit. It is also shown that the proposed method leads to up to 35% reduction in area and power consumption of a multiplier compared with the ideal multiplier.

Journal Article•DOI•
TL;DR: A fast approach to analyze the total leakage power of a large circuit block, considering both I/sub gate/ and subthreshold leakage (I/sub sub/), and proposes the use of pin reordering as a means to reduce I/ sub gate/.
Abstract: In this paper we address the growing issue of gate oxide leakage current (I/sub gate/) at the circuit level. Specifically, we develop a fast approach to analyze the total leakage power of a large circuit block, considering both I/sub gate/ and subthreshold leakage (I/sub sub/). The interaction between I/sub sub/ and I/sub gate/ complicates analysis in arbitrary CMOS topologies and we propose simple and accurate heuristics based on lookup tables to quickly estimate the state-dependent total leakage current for arbitrary circuit topologies. We apply this method to a number of benchmark circuits using a projected 100-nm technology and demonstrate accuracy within 0.09% of SPICE on average with a four order of magnitude speedup. We then make several observations on the impact of I/sub gate/ in designs that are standby power limited, including the role of device ordering within a stack and the differing state dependencies for NOR versus NAND topologies. Based on these observations, we propose the use of pin reordering as a means to reduce I/sub gate/. We find that for technologies with appreciable I/sub gate/, this technique is more effective at reducing total leakage current in standby mode than state assignment, which is often used for I/sub sub/ reduction.

Journal Article•DOI•
TL;DR: This paper proposes a novel distributed sleep transistor network (DSTN), and shows that DSTN is intrinsically better than the cluster-based design in terms of the sleep transistor area and circuit performance.
Abstract: Sleep transistors are effective to reduce leakage power during standby modes. The cluster-based design was proposed to save sleep transistor area by clustering gates to minimize the simultaneous switching current per cluster and inserting a sleep transistor per cluster. In this paper, we propose a novel distributed sleep transistor network (DSTN), and show that DSTN is intrinsically better than the cluster-based design in terms of the sleep transistor area and circuit performance. We reveal properties of optimal DSTN designs, and then develop an efficient algorithm for gate level DSTN synthesis. The algorithm obtains DSTN designs with up to 70.7% sleep transistor area reduction compared to cluster-based designs. Furthermore, we present custom layout designs to verify the area reduction by DSTN.

Journal Article•DOI•
TL;DR: A novel technique is developed which reduces the silicon area requirements by two orders of magnitude, as well as enables the measurement device to be synthesized from a register transfer level (RTL) description.
Abstract: Jitter characterization has become significantly more important for systems running at multigigahertz data rates. Time and frequency domain characterization of jitter is thus a crucial element for system specification testing. Time domain jitter measurement on a data signal with subgate timing resolution can be achieved using two delay chains feeding into the clock and datalines of a series of D-latches known as a Vernier delay line (VDL). An important drawback to the VDL structure is that its measurement accuracy depends on the matching of the various delay elements. Although careful layout techniques can help to minimize these mismatches, it cannot eliminate them completely. As well, due to the nature of the design, a relatively large silicon area is required for silicon implementation. In this paper, a novel technique is developed which reduces the silicon area requirements by two orders of magnitude, as well enables the measurement device to be synthesized from a register transfer level (RTL) description. A custom IC was designed and fabricated in a 0.18-/spl mu/m CMOS process as a first proof of concept. The design requires a silicon area of 0.12 mm/sup 2/ and measured results indicate a timing resolution of 19 ps. The synthesizable nature of the design is demonstrated using an field-programmable gate-array implementation. As test time is an important consideration for a production test, an extension to the component-invariant VDL technique is provided that reduces test time at the expense of more hardware. Finally, a method for obtaining the frequency domain characteristics of the jitter using the VDL will also be given.

Journal Article•DOI•
TL;DR: This paper presents a novel group matching scheme to reduce the Chien search hardware complexity by 60% for BCH(2047, 1926, 23) code as opposed to only 26% if directly applying the iterative matching algorithm.
Abstract: To implement parallel BCH (Bose-Chaudhuri-Hochquenghem) decoders in an area-efficient manner, this paper presents a novel group matching scheme to reduce the Chien search hardware complexity by 60% for BCH(2047, 1926, 23) code as opposed to only 26% if directly applying the iterative matching algorithm. The proposed scheme exploits the substructure sharing within a finite field multiplier (FFM) and among groups of FFMs.

Journal Article•DOI•
TL;DR: Experimental results show a higher performance for the new latch architectures compared to the conventional CML latch circuit at ultrahigh-frequencies, and why CML buffers are better than CMOS inverters in high-speed low-voltage applications.
Abstract: A comprehensive study of ultrahigh-speed current-mode logic (CML) buffers along with the design of novel regenerative CML latches will be illustrated. First, a new design procedure to systematically design a chain of tapered CML buffers is proposed. Next, two new high-speed regenerative latch circuits capable of operating at ultrahigh-speed data rates will be introduced. Experimental results show a higher performance for the new latch architectures compared to the conventional CML latch circuit at ultrahigh-frequencies. It is also shown, both through the experiments and by using efficient analytical models, why CML buffers are better than CMOS inverters in high-speed low-voltage applications.

Journal Article•DOI•
TL;DR: A high-speed AES IP-core is presented, which runs at 880 MHz on a 0.13-/spl mu/m CMOS standard cell library, and which achieves over 10-Gbps throughput in all encryption modes, including cipher block chaining (CBC) mode.
Abstract: In this brief, we present a high-speed AES IP-core, which runs at 880 MHz on a 0.13-/spl mu/m CMOS standard cell library, and which achieves over 10-Gbps throughput in all encryption modes, including cipher block chaining (CBC) mode. Although the CBC mode is the most widely used and important, achieving such high throughput was difficult because pipelining and/or loop unrolling techniques cannot be applied. To reduce the propagation delays of the S-Box, the slowest function block, we developed a special circuit architecture that we call twisted-binary decision diagram (BDD), where the fanout of signals is distributed in the S-Box circuit. Our S-Box is 1.5 to 2 times faster than the conventional S-Box implementations. The T-Box algorithm, which merges the S-Box and another primitive function (MixColumns) into a single function, is also used for an additional speedup.

Journal Article•DOI•
TL;DR: It is demonstrated through analysis and simulation that using the proposed method the noise tolerance of dynamic logic gates can be improved beyond the level of static CMOS logic gates while the performance advantage of dynamic circuits is still retained.
Abstract: Dynamic CMOS logic circuits are widely employed in high-performance VLSI chips in pursuing very high system performance. However, dynamic CMOS gates are inherently less resistant to noises than static CMOS gates. With the increasing stringent noise requirement due to aggressive technology scaling, the noise tolerance of dynamic circuits has to be first improved for the overall reliable operation of VLSI chips designed using deep submicron process technology. In the literature, a number of design techniques have been proposed to enhance the noise tolerance of dynamic logic gates. An overview and classification of these techniques are first presented in this paper. Then, we introduce a novel noise-tolerant design technique using circuitry exhibiting a negative differential resistance effect. We have demonstrated through analysis and simulation that using the proposed method the noise tolerance of dynamic logic gates can be improved beyond the level of static CMOS logic gates while the performance advantage of dynamic circuits is still retained. Simulation results on large fan-in dynamic CMOS logic gates have shown that, at a supply voltage of 1.6 V, the input noise immunity level can be increased to 0.8 V for about 10% delay overhead and to 1.0 V for only about 20% delay overhead.

Journal Article•DOI•
A. Blotti1, Roberto Saletti1•
TL;DR: This brief shows that a conventional semi-custom design-flow based on a positive feedback adiabatic logic (PFAL) cell library allows any VLSI designer to design and verify complex adiAbatic systems in a short time and easy way, thus, enjoying the energy reduction benefits of adiABatic logic.
Abstract: This brief shows that a conventional semi-custom design-flow based on a positive feedback adiabatic logic (PFAL) cell library allows any VLSI designer to design and verify complex adiabatic systems (e.g., arithmetic units) in a short time and easy way, thus, enjoying the energy reduction benefits of adiabatic logic. A family of semi-custom PFAL carry lookahead adders and parallel multipliers were designed in a 0.6-/spl mu/m CMOS technology and verified. Post-layout simulations show that semi-custom adiabatic arithmetic units can save energy a factor 17 at 10 MHz and about 7 at 100 MHz, as compared to a logically equivalent static CMOS implementation. The energy saving obtained is also better if compared to other custom adiabatic circuit realizations and maintains high values (3/spl divide/6) even when the losses in power-clock generation are considered.

Journal Article•DOI•
TL;DR: A fast optimization method is introduced that runs multiple orders of magnitude faster than the previous optimization approaches while still having the same accuracy in obtaining the power management control.
Abstract: In this paper, we present a new methodology for managing power consumption of networks-on-chips (NOCs). A power management problem is formulated for the first time using closed-loop control concepts. We introduce an estimator and a controller that implement our power management methodology. The estimator is capable of very fast and accurate tracking of changes in the system parameters. Parameters estimated are used to form the system model. Our system model combines node and network centric power management decisions. Node centric power management assumes no a priori knowledge of requests coming in from outside the core. Thus, it implements a more traditional dynamic voltage scaling and power management control algorithms. Network-centric power management utilizes interaction with the other system cores regarding the power and the quality of service (QoS) needs. The overall system model is based on Renewal theory and, thus, guarantees globally optimal results. We introduce a fast optimization method that runs multiple orders of magnitude faster than the previous optimization approaches while still having the same accuracy in obtaining the power management control. Finally, our controller implements the results of optimization in either hardware or software. The new methodology for power management of NOCs is tested on a system consisting of four satellite units, each implementing an estimator and a controller capable of both node and network centric power management. Our results show large savings in power with good QoS.

Journal Article•DOI•
TL;DR: This paper describes new level converting circuits that provide 10%-61% lower energy consumption at equivalent or better speeds compared to those available in the literature and makes the argument that level converters should be evaluated largely by their maximum speed.
Abstract: Multi-V/sub DD/ design is an effective way to reduce power consumption, but the need for level conversion imposes delay and energy penalties that limit the potential gains. In this paper, we describe new level converting circuits that provide 10%-61% lower energy consumption at equivalent or better speeds compared to those available in the literature. Furthermore, we make the argument that level converters should be evaluated largely by their maximum speed since slower level converters consume valuable timing slack that can be used to reduce the energy of other gates in the circuit. Based on this criterion, we find the new structures to offer up to a 25% speed improvement over conventional level converters. Using an efficient dual V/sub DD/ voltage assignment algorithm, we show that this speed improvement can yield a reduction of up to 7.3% in total circuit power in small benchmark circuits. We also propose embedding the functionality of logic gates into the level converting circuits. For typical values of the second supply voltage, this technique can reduce delay by 15% at constant energy or lower energy by up to 30% at fixed delay.

Journal Article•DOI•
TL;DR: This paper considers early prediction of net activity and interconnect capacitance in field-programmable gate array (FPGA) designs and develops empirical prediction models for these parameters, suitable for use in power-aware layout synthesis, early power estimation/planning, and other applications.
Abstract: The dynamic power consumed by a digital CMOS circuit is directly proportional to both switching activity and interconnect capacitance. In this paper, we consider early prediction of net activity and interconnect capacitance in field-programmable gate array (FPGA) designs. We develop empirical prediction models for these parameters, suitable for use in power-aware layout synthesis, early power estimation/planning, and other applications. We examine how switching activity on a net changes when delays are zero (zero delay activity) versus when logic delays are considered (logic delay activity) versus when both logic and routing delays are considered (routed delay activity). We then describe a novel approach for prelayout activity prediction that estimates a net's routed delay activity using only zero or logic delay activity values, along with structural and functional circuit properties. For capacitance prediction, we show that prediction accuracy is improved by considering aspects of the FPGA interconnect architecture in addition to generic parameters, such as net fanout and bounding box perimeter length. We also demonstrate that there is an inherent variability (noise) in the switching activity and capacitance of nets that limits the accuracy attainable in prediction. Experimental results show the proposed prediction models work well given the noise limitations.

Journal Article•DOI•
TL;DR: A new scalable single-chip communication architecture for heterogeneous resources, adaptive system-on-a-chip (aSOC) and supporting software for application mapping that exhibits hardware simplicity and optimized support for compile-time scheduled communication is described.
Abstract: A dramatic increase in single chip capacity has led to a revolution in on-chip integration. Design reuse and ease of implementation have became important aspects of the design process. This paper describes a new scalable single-chip communication architecture for heterogeneous resources, adaptive system-on-a-chip (aSOC) and supporting software for application mapping. This architecture exhibits hardware simplicity and optimized support for compile-time scheduled communication. To illustrate the benefits of the architecture, four high-bandwidth signal processing applications including an MPEG-2 video encoder and a Doppler radar processor have been mapped to a prototype aSOC device using our design mapping technology. Through experimentation it is shown that aSOC communication outperforms a hierarchical bus-based system-on-chip (SoC) approach by up to a factor of five. A VLSI implementation of the communication architecture indicates clock rates of 400 MHz in 0.18-/spl mu/m technology for sustained on-chip communication. In comparison to previously-published results for an MPEG-2 decoder, our on-chip interconnect shows a runtime improvement of over a factor of four.

Journal Article•DOI•
TL;DR: In this article, the scaling behavior of the inductive and resistance voltage drops across the on-chip power distribution networks is investigated, where the authors review and extend the existing work on power distribution noise scaling to include the scaling behaviour of inductance of the onchip global power distribution network in high-performance flip-chip packaged integrated circuits, and show that ideal interconnect scaling of the global power grid mitigates the unfavorable scaling of inductive noise but exacerbates the scaling of resistive noise by a factor of S.
Abstract: The design of power distribution networks in high-performance integrated circuits has become significantly more challenging with recent advances in process technologies. As on-chip currents exceed tens of amperes and circuit clock periods are reduced well below a nanosecond, the signal integrity of on-chip power supply has become a primary concern in the integrated circuit design. The scaling behavior of the inductive and resistance voltage drops across the on-chip power distribution networks is the subject of this paper. The existing work on power distribution noise scaling is reviewed and extended to include the scaling behavior of the inductance of the on-chip global power distribution networks in high-performance flip-chip packaged integrated circuits. As the dimensions of the on-chip devices are scaled by S, where S>1, the resistive voltage drop across the power grids remains constant and the inductive voltage drop increases by S, if the metal thickness is maintained constant. Consequently, the signal-to-noise ratio decreases by S in the case of resistive noise and by S/sup 2/ in the case of inductive noise. As compared to the constant metal thickness scenario, ideal interconnect scaling of the global power grid mitigates the unfavorable scaling of the inductive noise but exacerbates the scaling of resistive noise by a factor of S. On-chip inductive noise will, therefore, become of greater significance with technology scaling. Careful tradeoffs between the resistance and inductance of the power distribution networks will be necessary in nanometer technologies to achieve minimum power supply noise.

Journal Article•DOI•
TL;DR: A rule based RL circuit model is proposed in this paper that is realizable and predicts skin and proximity effects accurately in the frequency range of interest.
Abstract: On-chip conductors such as clock- and power-distribution networks require accurately modeling skin and proximity effects. Furthermore, to incorporate skin and proximity effects in the existing generic simulation tools such as SPICE, simple-frequency independent-lumped element-circuit models are needed. A rule based RL circuit model is proposed in this paper that is realizable and predicts skin and proximity effects accurately in the frequency range of interest. With this circuit model, wires are characterized by a few parallel branches of resistors and inductors while proximity effect is captured by mutual inductance between inductors in different RL circuits.

Journal Article•DOI•
TL;DR: The energy overhead of the circuit technique is low, justifying the activation of the proposed sleep scheme by providing a net savings in total power consumption during short idle periods.
Abstract: A circuit technique is presented for reducing the subthreshold leakage energy consumption of domino logic circuits. Sleep switch transistors are proposed to place an idle dual threshold voltage domino logic circuit into a low leakage state. The circuit technique enhances the effectiveness of a dual threshold voltage CMOS technology to reduce the subthreshold leakage current by strongly turning off all of the high threshold voltage transistors. The sleep switch circuit technique significantly reduces the subthreshold leakage energy as compared to both standard low-threshold voltage and dual threshold voltage domino logic circuits. A domino adder enters and leaves a low leakage sleep mode within a single clock cycle. The energy overhead of the circuit technique is low, justifying the activation of the proposed sleep scheme by providing a net savings in total power consumption during short idle periods.

Journal Article•DOI•
TL;DR: A new interconnect delay model called fitted Elmore delay (FED), generated by approximating HSPICE delay data using a curve fitting technique, which has a closed-form expression as simple as the ED model and is extremely efficient to compute.
Abstract: In this brief, we present a new interconnect delay model called fitted Elmore delay (FED). FED is generated by approximating HSPICE delay data using a curve fitting technique. The functional form used in curve fitting is derived based on the Elmore delay (ED) model. Thus, our model has all the advantages of the ED model. It has a closed-form expression as simple as the ED model and is extremely efficient to compute. Interconnect optimization with respect to design parameters can also be done as easily as in the ED model. In fact, most previous algorithms and programs based on ED model can use our model without much change. Most importantly, FED is significantly more accurate than the ED model. The maximum error in delay estimation is at most 2% for our model, compared to 8.5% for the scaled ED model. The average error is less than 0.8%. We also show that FED can be more than 10 times more accurate than the ED model when applied to wire sizing.

Journal Article•DOI•
TL;DR: This work proposes to make system-level interconnects more robust using encoding that simultaneously addresses error-correction requirements and crosstalk noise avoidance, and gives algorithms for obtaining optimal encodings and a practical class of codes called boundary-shift codes.
Abstract: Aggressive process scaling and increasing clock rates have made crosstalk noise an important issue in VLSI design. Switching on long, adjacent bus wires can lead to timing violations and logic faults. At the same time, system-level interconnects have also become more susceptible to other less predictable forms of interference such as noise induced by power grid fluctuations, electromagnetic interference, and alpha-particle radiation. Previous work has treated these systematic and nonsystematic forms of noise separately. We propose to make system-level interconnects more robust using encoding that simultaneously addresses error-correction requirements and crosstalk noise avoidance. This is more efficient than satisfying these requirements separately. We give algorithms for obtaining optimal encodings and present a practical class of codes called boundary-shift codes. We evaluate the overhead of our method, and make comparisons to using error-correction with simple shielding.