
Showing papers by "Hai Zhou published in 2006"


Proceedings Article•DOI•
Serkan Ozdemir1, Debjit Sinha1, Gokhan Memik1, Jonathan Adams1, Hai Zhou1 •
09 Dec 2006
TL;DR: Four yield-aware microarchitecture schemes for data caches are developed, including a variable-latency cache architecture that allows different load accesses to be completed with varying latencies; as a result, chips that would otherwise be tossed away due to parametric yield loss can be saved.
Abstract: One of the major issues faced by the semiconductor industry today is that of reducing chip yields. As the process technologies have scaled to smaller feature sizes, chip yields have dropped to around 50% or less. This figure is expected to decrease even further in future technologies. To attack this growing problem, we develop four yield-aware microarchitecture schemes for data caches. The first one is called Yield-Aware Power-Down (YAPD). YAPD turns off cache ways that cause delay violation and/or have excessive leakage. We also modify this approach to achieve better yields. This new method is called Horizontal YAPD (HYAPD), which turns off horizontal regions of the cache instead of ways. A third approach targets delay violation in data caches. Particularly, we develop a VAriable-latency Cache Architecture (VACA). VACA allows different load accesses to be completed with varying latencies. This is enabled by augmenting the functional units with special buffers that allow the dependants of a load operation to stall for a cycle if the load operation is delayed. As a result, if some accesses take longer than the predefined number of cycles, the execution can still be performed correctly, albeit with some performance degradation. A fourth scheme we devise is called the Hybrid mechanism, which combines the YAPD and the VACA. As a result of these schemes, chips that may be tossed away due to parametric yield loss can be saved. Experimental results demonstrate that the yield losses can be reduced by 68.1% and 72.4% with YAPD and HYAPD schemes and by 33.3% and 81.1% with VACA and Hybrid mechanisms, respectively, improving the overall yield to as much as 97.0%.

116 citations


Proceedings Article•DOI•
27 Mar 2006
TL;DR: This paper quantifies the approximation error in Clark's approach to computing the maximum (max) of Gaussian random variables, a fundamental operation in statistical timing, and shows that a finite look-up table can be used to store these errors.
Abstract: This paper quantifies the approximation error in Clark's approach, presented in C. E. Clark (1961), to computing the maximum (max) of Gaussian random variables, a fundamental operation in statistical timing. We show that a finite look-up table can be used to store these errors. Based on the error computations, approaches to different orderings for pair-wise max operations on a set of Gaussians are proposed. Experiments show accuracy improvements in the computation of the max of multiple Gaussians by up to 50% in comparison to the traditional approach. To the best of our knowledge, this is the first work addressing the mentioned issues.
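The pairwise max operation the paper analyzes can be sketched with Clark's (1961) moment-matching formulas; the sketch below is an illustrative Python implementation (function names are my own), not the paper's code, and it omits the paper's contributions (the error look-up table and the ordering schemes):

```python
import math

def phi(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clark_max(mu1, sig1, mu2, sig2, rho=0.0):
    """Clark's moment-matching approximation of max(X, Y) for
    correlated Gaussians X ~ N(mu1, sig1^2), Y ~ N(mu2, sig2^2)
    with correlation rho; returns (mean, std) of the matched
    Gaussian (the true max is not Gaussian, hence the error the
    paper quantifies)."""
    a = math.sqrt(max(sig1 ** 2 + sig2 ** 2 - 2.0 * rho * sig1 * sig2, 1e-12))
    alpha = (mu1 - mu2) / a
    mean = mu1 * Phi(alpha) + mu2 * Phi(-alpha) + a * phi(alpha)
    second = ((mu1 ** 2 + sig1 ** 2) * Phi(alpha)
              + (mu2 ** 2 + sig2 ** 2) * Phi(-alpha)
              + (mu1 + mu2) * a * phi(alpha))
    var = max(second - mean ** 2, 0.0)
    return mean, math.sqrt(var)
```

For two independent standard normals, the matched mean is 1/sqrt(pi), the known expected value of their maximum.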

31 citations


Journal Article•DOI•
TL;DR: An insight into the statistical properties of gate delays for a commercial 0.13-µm technology library is presented, which intuitively provides one reason why statistical timing-driven optimization does better than deterministic timing-driven optimization.
Abstract: In this paper, we propose a statistical gate sizing approach to maximize the timing yield of a given circuit under area constraints. Our approach involves statistical gate delay modeling, statistical static timing analysis, and gate sizing. Experiments performed in an industrial framework on combinational International Symposium on Circuits and Systems (ISCAS'85) and Microelectronics Center of North Carolina (MCNC) benchmarks show absolute timing yield gains of 30% on average over deterministic timing optimization, for at most a 10% area penalty. It is further shown that circuits optimized using our metric have larger timing yields than the same circuits optimized using a worst-case metric, for iso-area solutions. Finally, we present an insight into the statistical properties of gate delays for a commercial 0.13-µm technology library, which intuitively provides one reason why statistical timing-driven optimization does better than deterministic timing-driven optimization.

30 citations


Proceedings Article•DOI•
05 Nov 2006
TL;DR: The problem is not convex and its optimal solution cannot be obtained by solving its Lagrangian dual problem; a modified convex formulation is therefore proposed and solved using a min-cost flow technique and a trust-region method.
Abstract: With the advent of the deep sub-micron (DSM) era, floorplanning has become increasingly important in the physical design process. In this paper we clarify a misunderstanding in using Lagrangian relaxation for the minimum area floorplanning problem. We show that the problem is not convex and its optimal solution cannot be obtained by solving its Lagrangian dual problem. We then propose a modified convex formulation and solve it using a min-cost flow technique and a trust-region method. Experimental results under a module aspect ratio bound of [0.5, 2.0] show that the running time of our floorplanner scales well with problem size on the MCNC benchmarks. Compared with the floorplanner in the work of Young et al. (2001), our floorplanner is 9.5× faster for the largest case, "ami49". It also generates floorplans with smaller deadspace for almost all test cases. In addition, since the generated floorplan has an aspect ratio closer to 1, it is friendlier to packaging. Our floorplanner is also amenable to including interconnect cost and other physical design metrics.

24 citations


Proceedings Article•DOI•
Prasad Narayana1, Ruiming Chen1, Yao Zhao1, Yan Chen1, Zhi Fu2, Hai Zhou1 •
12 Nov 2006
TL;DR: This paper proposes the use of TLA+ to automatically check the DoS vulnerability of network protocols with a completeness guarantee, and develops new schemes to avoid state-space explosion in property checking and to model attackers' capabilities for finding realistic attacks.
Abstract: Vulnerability analysis is an indispensable first step towards securing a network protocol, but it currently remains a mostly best-effort manual process with no completeness guarantee. Formal methods have been proposed for vulnerability analysis, and most existing work focuses on security properties such as perfect forward secrecy and correctness of authentication. However, it remains unclear how to apply these methods to analyze more subtle vulnerabilities such as denial-of-service (DoS) attacks. To address this challenge, in this paper we propose the use of TLA+ to automatically check the DoS vulnerability of network protocols with a completeness guarantee. In particular, we develop new schemes to avoid state-space explosion in property checking and to model attackers' capabilities for finding realistic attacks. As a case study, we successfully identify threats to the IEEE 802.16 air interface protocols.

22 citations


Journal Article•DOI•
TL;DR: Two algorithms are proposed for a statistical check of the structural conditions for correct clocking of high-performance integrated-circuit designs; the second can conservatively estimate timing yields.
Abstract: High-performance integrated-circuit designs need to verify their clock schedules, as they usually use level-sensitive latches for speed. With process variations, the verification needs to compute the probability of correct clocking. Because of complex statistical correlations and accumulated inaccuracy of statistical operations, traditional iterative approaches have difficulty getting accurate results. A statistical check of the structural conditions for correct clocking is proposed instead, where the central problem is to compute the probability of having a positive cycle in a graph with random edge weights. The authors propose two algorithms to handle this. The proposed algorithms traverse the graph only a few times to reduce the correlations among iterations, and they consider not only data-delay variations but also clock-skew variations. Although the first algorithm is a heuristic that may overestimate timing yields, experimental results show that it has an error of 0.16% on average in comparison with Monte Carlo (MC) simulation. Based on a cycle-breaking technique, the second heuristic algorithm can conservatively estimate timing yields. Both algorithms are much more efficient than MC simulation.
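The central computation, the probability of a positive cycle in a graph with random edge weights, can be grounded with the Monte Carlo baseline the authors compare against. This Python sketch (names illustrative, independent Gaussian edge weights assumed) detects a positive cycle by running Bellman-Ford on negated weights:

```python
import random

def has_positive_cycle(n, edges):
    """A positive cycle under weights w is a negative cycle under -w.
    Bellman-Ford from a virtual source (dist 0 to all n nodes):
    if an edge can still relax after n-1 passes, a cycle exists.
    edges: list of (u, v, w)."""
    dist = [0.0] * n
    for _ in range(n - 1):
        for u, v, w in edges:
            if dist[u] - w < dist[v]:
                dist[v] = dist[u] - w
    return any(dist[u] - w < dist[v] for u, v, w in edges)

def prob_positive_cycle(n, rand_edges, trials=2000, seed=1):
    """Monte Carlo estimate of P(positive cycle) when each edge
    weight is an independent Gaussian; rand_edges: (u, v, mu, sigma)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [(u, v, rng.gauss(mu, sigma))
                  for u, v, mu, sigma in rand_edges]
        hits += has_positive_cycle(n, sample)
    return hits / trials
```

The paper's algorithms avoid exactly this per-sample graph traversal; the sketch only fixes the quantity being estimated.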

21 citations


Proceedings Article•DOI•
Chuan Lin1, Hai Zhou1•
24 Jul 2006
TL;DR: A new efficient algorithm for retiming sequential circuits with edge-triggered registers under both setup and hold constraints is presented, which solves the same problem in O(|V|²|E|) time.
Abstract: In this paper, we present a new efficient algorithm for retiming sequential circuits with edge-triggered registers under both setup and hold constraints. Compared with the previous work (Papaefthymiou, 1998), which computed the minimum clock period in O(|V|³|E| lg |V|) time, our algorithm solves the same problem in O(|V|²|E|) time. Experimental results validate the efficiency of our algorithm.
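For context, the classic setup-only feasibility test that retiming algorithms build on can be sketched as below (a textbook Leiserson-Saxe-style check with assumed inputs; the paper's algorithm, which also handles hold constraints and achieves O(|V|²|E|), is more involved):

```python
def retiming_feasible(n, delay, edges, period):
    """Setup-constraint feasibility of clock period `period`.
    delay[v]: gate delay of node v; edges: (u, v, w) with w
    registers on wire u->v."""
    INF = float("inf")
    # W[u][v]: min registers on any u->v path; D[u][v]: max total
    # delay among those register-minimal paths (lexicographic
    # shortest paths under weight (w(e), -d(v))).
    W = [[INF] * n for _ in range(n)]
    D = [[-INF] * n for _ in range(n)]
    for v in range(n):
        W[v][v], D[v][v] = 0, delay[v]
    for u, v, w in edges:
        if (w, -(delay[u] + delay[v])) < (W[u][v], -D[u][v]):
            W[u][v], D[u][v] = w, delay[u] + delay[v]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if W[i][k] == INF or W[k][j] == INF:
                    continue
                w = W[i][k] + W[k][j]
                d = D[i][k] + D[k][j] - delay[k]  # k counted twice
                if (w, -d) < (W[i][j], -D[i][j]):
                    W[i][j], D[i][j] = w, d
    # Difference constraints r(u) - r(v) <= c become graph edges
    # (v, u, c); feasible iff no negative cycle (Bellman-Ford).
    cons = [(v, u, w) for u, v, w in edges]            # r(u)-r(v) <= w(e)
    for u in range(n):
        for v in range(n):
            if W[u][v] < INF and D[u][v] > period:
                cons.append((v, u, W[u][v] - 1))       # r(u)-r(v) <= W-1
    dist = [0] * n
    for _ in range(n - 1):
        for a, b, c in cons:
            if dist[a] + c < dist[b]:
                dist[b] = dist[a] + c
    return all(dist[a] + c >= dist[b] for a, b, c in cons)
```

For a two-gate ring (delay 3 each, one register per wire), period 3 is achievable but period 2 is not, since the cycle itself carries delay 3 per register.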

19 citations


Journal Article•DOI•
TL;DR: The authors establish a theoretical framework for statistical timing analysis with coupling and prove the convergence of their proposed iterative approach and discuss implementation issues under the assumption of a Gaussian distribution for the parameters of variation.
Abstract: As technology scales to smaller dimensions, increasing process variations and coupling-induced delay variations make timing verification extremely challenging. In this paper, the authors establish a theoretical framework for statistical timing analysis with coupling. They prove the convergence of their proposed iterative approach and discuss implementation issues under the assumption of a Gaussian distribution for the parameters of variation. A statistical timer based on their proposed approach is developed and experimental results are presented for the International Symposium on Circuits and Systems benchmarks. They compare their timer with a single-pass, noniterative statistical timer that does not consider the mutual dependence of coupling with timing, and another statistical timer that handles coupling deterministically. Monte Carlo simulations reveal a distinct gain (up to 24%) in accuracy by their approach in comparison to the others mentioned.

17 citations


Proceedings Article•DOI•
05 Nov 2006
TL;DR: In this article, the authors proposed a timing-dependent dynamic power estimation framework that considers the impact of coupling and glitches, and showed that relative switching activities and times of coupled nets significantly affect dynamic power consumption, and neither should be ignored during power estimation.
Abstract: In this paper, we propose a timing-dependent dynamic power estimation framework that considers the impact of coupling and glitches. We show that relative switching activities and times of coupled nets significantly affect dynamic power consumption, and neither should be ignored during power estimation. To capture the timing dependence, an approach to efficient representation and propagation of switching-window distributions through a circuit, considering coupling-induced delay variations, is developed. Based on the propagated switching-window distributions, power consumption in charging or discharging coupling capacitances is calculated, and accounted for in the total power. Experimental results for the ISCAS'85 benchmarks demonstrate that ignoring the impact of timing-dependent coupling on power can cause up to 59% error in coupling power estimation (up to 25% error in total power estimation).

15 citations


Proceedings Article•DOI•
Debasish Das1, Ahmed Shebaita1, Hai Zhou1, Yehea Ismail1, Kip Killpack2 •
01 Oct 2006
TL;DR: A novel and accurate coupling delay model is proposed, and techniques to increase the convergence rate of timing analysis when complex coupling models are employed are presented.
Abstract: This paper presents a framework for fast and accurate static timing analysis considering coupling. With technology scaling to smaller dimensions, the impact of coupling induced delay variations can no longer be ignored. Timing analysis considering coupling is iterative, and can have considerably larger run-times than a single pass approach. We propose a novel and accurate coupling delay model, and present techniques to increase the convergence rate of timing analysis when complex coupling models are employed. Experimental results obtained for the ISCAS benchmarks show promising accuracy improvements using our coupling model while an efficient iteration scheme shows significant speedup (up to 62.1%) in comparison to traditional approaches.

9 citations


Journal Article•DOI•
Chuan Lin1, Hai Zhou1•
TL;DR: A new algorithm is presented that solves the optimal wire retiming problem with polynomial-time worst-case complexity and is essentially incremental, which gives it the potential of being combined with other optimization techniques.
Abstract: The problem of retiming over a netlist of macroblocks to achieve minimal clock period, where block internal structures may not be changed and flip-flops may not be inserted on some wire segments, is called the optimal wire retiming problem. This paper presents a new algorithm that solves the optimal wire retiming problem with polynomial-time worst-case complexity. Since the new algorithm avoids binary search and is essentially incremental, it has the potential of being combined with other optimization techniques. Experimental results show that the new algorithm is very efficient in practice.

Journal Article•DOI•
TL;DR: A gate-sizing algorithm for coupling-noise reduction that optimizes the area or power consumption of a circuit while ensuring that its timing constraints are met; the algorithm combines the solutions of two subproblems of gate-size optimization under noise and timing constraints, respectively.
Abstract: This paper presents a gate-sizing algorithm for coupling-noise reduction, which optimizes the area or power consumption (represented as a weighted sum of gate sizes) of a circuit while ensuring that its timing constraints are met. A problem for gate-size optimization under coupling-noise and timing constraints is formulated, and is broken down into two subproblems of gate-size optimization under noise and timing constraints, respectively. The subproblem of gate-size optimization under noise constraints is solved as a fixpoint computation problem on a complete lattice. The proposed algorithm to solve this problem is guaranteed to yield the optimal solution, provided it exists. The subproblem for circuit optimization under timing constraints is treated as a geometric programming problem. The solutions to the two problems are finally combined to solve the original problem in a Lagrangian relaxation (LR) framework. Experimental results demonstrating the effectiveness of the algorithms are reported for the International Symposium on Circuits and Systems (ISCAS) benchmarks and larger circuits. The obtained results are compared to an approach where successive iterations of gate sizing are performed for timing and for noise reduction independently; this alternative design approach is driven by the algorithms used to solve the two subproblems, respectively.
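The fixpoint view of the noise subproblem can be illustrated with a generic Kleene iteration; the toy "noise rule" below is hypothetical (not the paper's model) and only shows why a monotone requirement function on a finite lattice of gate sizes has a least fixpoint reachable by iteration:

```python
def least_fixpoint(f, bottom, max_iter=1000):
    """Kleene iteration: for a monotone f on a finite-height
    lattice, iterating x <- f(x) from the bottom element reaches
    the least fixpoint, i.e. the minimal sizing meeting all
    constraints (if one exists below the lattice top)."""
    x = bottom
    for _ in range(max_iter):
        nx = f(x)
        if nx == x:
            return x
        x = nx
    raise RuntimeError("no convergence within max_iter")

def f(sizes):
    # Toy monotone noise rule (hypothetical): gate 0 needs size >= 5
    # to meet its noise constraint, and each gate must be within one
    # size step of its neighbor; sizes live on the lattice {1..8}^2.
    a, b = sizes
    return (min(8, max(a, 5, b - 1)), min(8, max(b, a - 1)))
```

Starting from the bottom (1, 1), the iteration settles at (5, 4), the smallest sizing satisfying both toy constraints.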

Journal Article•DOI•
TL;DR: The authors propose in this paper a more efficient data structure for the merge operations, with parameters that adjust adaptively, that works better than Shi's in all cases: unbalanced, balanced, and mixed sizes.
Abstract: Dynamic programming is a useful technique for handling slicing floorplan, technology mapping, and buffering problems, where many max-plus merge operations of solution lists are needed. Shi proposed an efficient O(n log n)-time algorithm to speed up the merge operation. Based on balanced binary search trees, his algorithm showed superb performance for the most unbalanced sizes of merging solution lists. The authors propose in this paper a more efficient data structure for the merge operations. With parameters that adjust adaptively, their algorithm works better than Shi's in all cases: unbalanced, balanced, and mixed sizes. Their data structure is also simpler.
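The semantics of the max-plus merge can be pinned down with a brute-force reference implementation (quadratic, with assumed (cost, height) solution pairs; Shi's balanced-BST algorithm and the authors' adaptive structure compute the same nondominated list far faster):

```python
def maxplus_merge(A, B):
    """Max-plus merge of two solution lists of (cost, height)
    pairs: combine every pair as (costA + costB, max(hA, hB))
    and keep only the nondominated solutions, sorted by cost.
    Brute force, for reference only."""
    cand = [(ac + bc, max(ah, bh)) for ac, ah in A for bc, bh in B]
    cand.sort()  # by cost, then height
    out = []
    best_h = float("inf")
    for c, h in cand:
        if h < best_h:        # strictly better height => nondominated
            out.append((c, h))
            best_h = h
    return out
```

This is the operation performed at every internal node of a slicing-floorplan or buffering DP; the papers' contribution is doing it without enumerating all pairs.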

Proceedings Article•DOI•
27 Mar 2006
TL;DR: The problem of floorplanning for processing rate optimization is formulated and solved; it is shown that the minimal ratio of the flip-flop number to the delay on any cycle is an upper bound on the processing rate.
Abstract: The performance of a sequential system is usually measured by its frequency. However, with the appearance of global interconnects that require multiple clock periods to communicate, throughput is usually traded off for higher frequency (for example, through wire pipelining or latency-insensitive design). Therefore, we propose to use the processing rate, defined as the amount of processed inputs per unit time, as the performance measure. We show that the minimal ratio of the flip-flop number to the delay on any cycle is an upper bound on the processing rate. Since the processing rate of a sequential system is mainly decided by its floorplan when interconnect delays are dominant, the problem of floorplanning for processing rate optimization is formulated and solved. We optimize the processing rate bound directly in a floorplanner by applying Howard's algorithm incrementally. Experimental results confirm the effectiveness of our approach.

Journal Article•DOI•
TL;DR: This paper formulates the problem of processing rate optimization as seeking an optimal clustering with the minimal maximum cycle ratio in a general graph, and presents an iterative algorithm to solve it.
Abstract: Clustering (or partitioning) is a crucial step between logic synthesis and physical design in the layout of a large scale design. A design verified at the logic synthesis level may have timing closure problems at post-layout stages due to the emergence of multiple-clock-period interconnects. Consequently, a tradeoff between clock frequency and throughput may be needed to meet the design requirements. In this paper, we find that the processing rate, defined as the product of frequency and throughput, of a sequential system is upper bounded by the reciprocal of its maximum cycle ratio, which is only dependent on the clustering. We formulate the problem of processing rate optimization as seeking an optimal clustering with the minimal maximum cycle ratio in a general graph, and present an iterative algorithm to solve it. Experimental results validate the efficiency of our algorithm.
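The maximum cycle ratio that bounds the processing rate can be computed, for small graphs, with Lawler's textbook binary-search method sketched below (the paper's iterative clustering algorithm, and Howard's algorithm used in the companion floorplanning work, are different; this only illustrates the quantity itself):

```python
def max_cycle_ratio(n, edges, tol=1e-6):
    """Lawler's binary search for max over cycles of
    sum(w) / sum(t); edges: (u, v, w, t) with t > 0."""
    def has_cycle_above(r):
        # A cycle with sum(w - r*t) > 0 is a strictly negative
        # cycle under weights r*t - w; detect via Bellman-Ford
        # from a virtual source (all distances start at 0).
        dist = [0.0] * n
        for _ in range(n):
            for u, v, w, t in edges:
                nv = dist[u] + r * t - w
                if nv < dist[v] - 1e-12:
                    dist[v] = nv
        return any(dist[u] + r * t - w < dist[v] - 1e-12
                   for u, v, w, t in edges)

    # Any cycle ratio is at most the largest single-edge w/t.
    lo, hi = 0.0, max(w / t for _, _, w, t in edges)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if has_cycle_above(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

On a two-node cycle of ratio 0.5 plus a self-loop of ratio 1.0, the maximum cycle ratio is 1.0; the paper's processing rate bound is its reciprocal.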

Proceedings Article•DOI•
06 Mar 2006
TL;DR: In this paper, the authors proposed an algorithm for optimal bit width precision for two variables and a greedy heuristic which works for any number of variables for low power in a SystemC design environment.
Abstract: The modern era of embedded system design is geared towards the design of low-power systems. One way to reduce power in an ASIC implementation is to reduce the bit-width precision of its computation units. This paper describes algorithms to optimize the bit-widths of fixed-point variables for low power in a SystemC design environment. We propose an algorithm for optimal bit-width precision for two variables and a greedy heuristic which works for any number of variables. The algorithms are used in the automation of converting floating-point SystemC programs into ASIC-synthesizable SystemC programs. Expected inputs are profiled to estimate errors in the finite precision conversions. Experimental results on the trade-offs between quantization error, power consumption, and hardware resources used are reported on a set of four SystemC benchmarks that are mapped onto a 0.18-micron ASIC cell library from Artisan Components. We demonstrate that it is possible to reduce the power consumption by 50% on average by allowing round-off errors to increase from 0.5% to 1%.
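A greedy heuristic of the kind described can be sketched as follows; the error and power models passed in below are toy stand-ins (the paper profiles expected inputs to estimate quantization error, and its exact cost models are not reproduced here):

```python
def greedy_bitwidths(names, error_of, power_of, err_budget,
                     min_bits=4, max_bits=32):
    """Greedy bit-width reduction: start every fixed-point variable
    at max_bits and repeatedly trim one bit from the variable whose
    reduction saves the most power while keeping total quantization
    error within err_budget; stop when no trim is feasible.
    error_of/power_of map a {name: bits} dict to a scalar."""
    widths = {v: max_bits for v in names}
    while True:
        best = None
        for v in names:
            if widths[v] <= min_bits:
                continue
            trial = dict(widths)
            trial[v] -= 1
            if error_of(trial) <= err_budget:
                saving = power_of(widths) - power_of(trial)
                if best is None or saving > best[0]:
                    best = (saving, v)
        if best is None:
            return widths
        widths[best[1]] -= 1
```

With a toy error model of 2^-bits per variable and power proportional to total bits, an error budget of 0.125 trims two variables from 8 bits each down to 4 bits each.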

Proceedings Article•DOI•
Jia Wang1, Hai Zhou1•
24 Jul 2006
TL;DR: An optimal algorithm for jumper insertion under the ratio upper-bound is presented, which handles Steiner trees with obstacles and works on free trees.
Abstract: The antenna effect may damage gate oxides during plasma-based fabrication processes. The antenna ratio of total exposed antenna area to total gate oxide area is directly related to the amount of damage. Jumper insertion is a common technique applied at the routing and post-layout stages to avoid and to fix the problems caused by the antenna effect. This paper presents an optimal algorithm for jumper insertion under a ratio upper bound. It handles Steiner trees with obstacles. The algorithm is based on dynamic programming and works on free trees. The time complexity is O(α|V|²) and the space complexity is O(|V|²), where |V| is the number of nodes in the routing tree and α is a factor depending on how to find a non-blocked position on a wire for a jumper.

Proceedings Article•
30 Apr 2006
TL;DR: The 16th edition of the Great Lakes Symposium on VLSI (GLSVLSI'06) was held in Philadelphia, United States.
Abstract: Welcome to the 16th edition of the Great Lakes Symposium on VLSI (GLSVLSI'06) and the city of Philadelphia. Since its first meeting in March 1991 at Kalamazoo, Michigan, GLSVLSI has traveled beyond the Great Lakes and become an international conference with submissions from all over the United States and the world. It has emerged as a premier conference for publishing innovations in VLSI. This year, 219 papers were submitted, of which 82 (a 20.1% acceptance rate for full papers and 37.4% overall) were accepted for presentation at the symposium and publication in the proceedings. The final technical program consists of 44 full papers in 12 sessions and 38 poster papers in 2 poster sessions. Congratulations to Garrett S. Rose, Adam C. Cabe, Nadine Gergel-Hackett, Nabanita Majumdar, Mircea R. Stan, John C. Bean, Lloyd R. Harriott, Yuxing Yao, and James M. Tour for winning the GLSVLSI 2006 Best Student Paper Award sponsored by Intel. Their paper "Design Approaches for Hybrid CMOS/Molecular Memory based on Experimental Device Data" will be the first presentation of the symposium. They will also receive the prize from Intel at Monday's dinner banquet. This year's tutorial, "DFM: Swimming Upstream", will be conducted by Dan Page, Jamil Kawa, and Charles Chiang of Synopsys. The tutorial is free to all attendees and local universities thanks to the generous donations of our corporate supporters. The keynote speaker at Monday's dinner banquet is Jeff Parkhurst, Intel's academic research programs manager. The talk title is: "From single core to multi-core to many core: Are we ready for a new exponential?"


01 Jan 2006
TL;DR: This research investigates the essential problems of timing verification, power estimation, and circuit (area or power) optimization under crosstalk and variability, and shows that a circuit optimization problem under constraints on the maximal induced noise on each wire is equivalent to a fixpoint computation problem in a complete lattice.
Abstract: With very large scale integrated (VLSI) circuit fabrication entering the deep sub-micron era, devices are scaled down to finer geometries, clocks are run at higher frequencies, and more functionality is integrated into one chip. All these bring a great promise of "system-on-a-chip", but also introduce challenging new issues in the design process. As a result of the increasing frequency and density, coupling effects, or crosstalk, between neighboring wires are increased. These effects can cause functionality and timing failures in a circuit. The dynamic power consumption in charging or discharging coupling capacitances is timing dependent, and contributes significantly to a circuit's power consumption. In addition, manufacturing process variations (e.g. VT, Le) and environmental variations (e.g. Vdd, temperature) contribute to uncertainties that deeply impact the timing characteristics of a circuit. This variability makes timing verification, and consequently timing-driven circuit optimization, extremely difficult. Although worst-case analyses for circuit optimization are simpler, they are not desirable since they severely over-constrain the optimization problem and result in designs that have excessive penalties in terms of area or power consumption. In this research, we investigate the essential problems of timing verification, power estimation, and circuit (area or power) optimization under crosstalk and variability. We show that a circuit optimization problem under constraints on the maximal induced noise on each wire is equivalent to a fixpoint computation problem in a complete lattice. An optimal algorithm for solving this problem is developed, and is extended to handle variations. Under explicit timing constraints, we solve this problem in a Lagrangian relaxation framework. We present a timing yield driven circuit optimization algorithm that considers variability and is based on statistical timing methodologies.
Approaches to fast and approximation-error-aware statistical timing analysis are developed that also consider effects due to coupling as well as variability. Multiple-input switching effects are considered for improved timing accuracy. We highlight the importance of the timing dependence of dynamic power consumption in coupling capacitances, and develop an algorithm for accurate and efficient power estimation. Experimental results validate our approaches and are promising.

Hai Zhou1, Chuan Lin1•
01 Jan 2006
TL;DR: It is shown that the trade-off between a level-sensitive latch and an edge-triggered flop can be leveraged in a sequential circuit design with crosstalk, so that the clock period is minimized by selecting a configuration of mixed latches and flops.
Abstract: With the advent of the deep sub-micron (DSM) era, the "system-on-chip" (SOC) has become mainstream in the IC industry. Semiconductor devices based on smaller feature sizes offer the promise of faster and more highly integrated designs, but also present a number of new challenges. In SOCs, a large amount of communication time is spent on global multi-clock-period interconnects, which present themselves as the main performance-limiting factor. How to handle global interconnects for performance optimization becomes an urgent issue. Another challenge is the increasing coupling effect (also known as crosstalk) between neighboring interconnects. Besides introducing noise on quiet interconnects, crosstalk can change interconnect delays and cause timing violations in the circuit. In this dissertation, we investigate and propose solutions to a few problems involving global interconnects and crosstalk. To handle global interconnects, we propose techniques at different stages of the design flow. At the physical layout stage, we propose to pipeline global interconnects by relocating flip-flops (an operation also known as retiming). Three efficient algorithms are designed to find an optimal retiming with the minimal clock period. We then solve the problem of retiming under both setup and hold constraints more efficiently than the best-known algorithm in the literature. We also consider clock skew scheduling for prescribed skew domains and give an optimal polynomial-time algorithm to minimize the clock period with possible delay padding. At the clustering stage, we propose an iterative algorithm that finds an optimal clustering with the minimal maximum cycle ratio. At the register transfer level (RTL), we use delay relaxation to do interconnect planning. For crosstalk, we propose a circular time representation under which coupling detection is easier and more efficient than state-of-the-art approaches.
Using the circular time representation, clock schedule verification with crosstalk is more efficient. We show that the trade-off between a level-sensitive latch and an edge-triggered flop can be leveraged in a sequential circuit design with crosstalk, so that the clock period is minimized by selecting a configuration of mixed latches and flops. We design an effective and efficient algorithm to solve this problem.