
Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2006"


Journal ArticleDOI
TL;DR: The HotSpot compact thermal modeling approach is especially well suited for pre-register transfer level (RTL) and presynthesis thermal analysis and is able to provide detailed static and transient temperature information across the die and the package while remaining computationally efficient.
Abstract: This paper presents HotSpot, a modeling methodology for developing compact thermal models based on the popular stacked-layer packaging scheme in modern very large-scale integration systems. In addition to modeling silicon and packaging layers, HotSpot includes a high-level on-chip interconnect self-heating power and thermal model such that the thermal impacts on interconnects can also be considered during early design stages. The HotSpot compact thermal modeling approach is especially well suited for pre-register transfer level (RTL) and presynthesis thermal analysis and is able to provide detailed static and transient temperature information across the die and the package while remaining computationally efficient.

985 citations
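
As a rough illustration of what a compact thermal model computes, here is a minimal two-node RC-network sketch in Python; the node count, conductance and capacitance values, and the forward-Euler solver are assumptions for illustration, not HotSpot's actual model:

```python
import numpy as np

def transient_step(T, P, G, C, T_amb, dt):
    # Forward-Euler step of the node equations C * dT/dt = P - G @ (T - T_amb).
    return T + dt * (P - G @ (T - T_amb)) / C

# Two-node toy stack: a silicon block coupled to a heat spreader,
# which in turn conducts to ambient.
G = np.array([[ 2.0, -2.0],
              [-2.0,  6.0]])    # W/K; off-diagonals couple the two layers
C = np.array([0.01, 0.5])      # J/K; the die heats up much faster than the package
T = np.array([318.0, 318.0])   # start at 45 degrees C (in kelvin)
P = np.array([3.0, 0.0])       # 3 W dissipated in the silicon block

for _ in range(5000):          # 5 s of simulated time at dt = 1 ms
    T = transient_step(T, P, G, C, 318.0, 1e-3)
print("die %.2f K, spreader %.2f K" % (T[0], T[1]))
```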


Journal ArticleDOI
TL;DR: The proposed full adder is energy efficient and outperforms several standard full adders without trading off driving capability and reliability and is based on a novel xor-xnor circuit that generates xor and xnor full-swing outputs simultaneously.
Abstract: We present a new design for a 1-b full adder featuring a hybrid-CMOS design style. The quest for good drivability, noise robustness, and low-energy operation in deep submicrometer technology guided our research to explore hybrid-CMOS design. The hybrid-CMOS design style utilizes various CMOS logic style circuits to build new full adders with the desired performance. This provides the designer with a higher degree of design freedom to target a wide range of applications, thus significantly reducing design efforts. We also classify hybrid-CMOS full adders into three broad categories based upon their structure. Using this categorization, many full-adder designs can be conceived. We present a new full-adder design belonging to one of the proposed categories. The new full adder is based on a novel xor-xnor circuit that generates xor and xnor full-swing outputs simultaneously. This circuit outperforms its counterparts, showing 5%-37% improvement in the power-delay product (PDP). A novel hybrid-CMOS output stage that exploits the simultaneous xor-xnor signals is also proposed. This output stage provides good driving capability, enabling cascading of adders without the need for buffer insertion between cascaded stages. There is approximately a 40% reduction in PDP when compared to its best counterpart. During our experiments, we found that many previously reported adders suffered from low swing and high noise when operated at low supply voltages. The proposed full adder is energy efficient and outperforms several standard full adders without trading off driving capability and reliability. The new full-adder circuit successfully operates at low voltages with excellent signal integrity and driving capability. To evaluate the performance of the new full adder in a real circuit, we embedded it in 4- and 8-b, 4-operand carry-save array adders with a final carry-propagate adder. The new adder displayed better performance as compared to the standard full adders.

399 citations
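
A behavioral Python sketch of the xor-xnor-plus-mux structure described above; the logic-level model is an assumption for illustration and says nothing about the transistor-level circuit or its PDP:

```python
# Behavioral sketch of a hybrid full adder built around a simultaneous
# XOR/XNOR first stage. Gate names and the mux-style output stage are
# illustrative assumptions, not the paper's transistor-level circuit.

def xor_xnor(a, b):
    # First stage: full-swing XOR and XNOR of A and B, generated together.
    x = a ^ b
    return x, 1 - x          # both rails drive the output stage in parallel

def hybrid_full_adder(a, b, cin):
    x, xn = xor_xnor(a, b)
    s = cin ^ x              # SUM = A xor B xor Cin
    # Output stage as a pass-gate mux steered by the xor/xnor pair:
    # if A == B (xnor rail active) the carry-out is A, else it is Cin.
    cout = cin if x else a
    return s, cout

# Exhaustive check against the arithmetic definition a + b + cin.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = hybrid_full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
print("full adder verified for all 8 input combinations")
```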


Journal ArticleDOI
TL;DR: Novel linear-programming-based techniques are presented for the synthesis of custom NoC architectures that minimize power as the primary goal and the number of routers (area) as a secondary goal.
Abstract: Application-specific system-on-chip (SoC) design offers the opportunity for incorporating custom network-on-chip (NoC) architectures that are more suitable for a particular application, and do not necessarily conform to regular topologies. This paper presents novel mixed integer linear programming (MILP) formulations for synthesis of custom NoC architectures. The optimization objective of the techniques is to minimize the power consumption subject to the performance constraints. We present a two-stage approach for solving the custom NoC synthesis problem. The power consumption of the NoC architecture is determined by both the physical links and routers. The power consumption of a physical link is dependent upon the length of the link, which, in turn, is governed by the layout of the SoC. Therefore, in the first stage, we address the floorplanning problem that determines the locations of the various cores and the routers. In the second stage, we utilize the floorplan from the first stage to generate the topology of the NoC and the routes for the various traffic traces. We also present a clustering-based heuristic technique for the second stage to reduce the run times of the MILP formulation. We analyze the quality of the results and solution times of the proposed techniques by extensive experimentation with realistic benchmarks and comparisons with regular mesh-based NoC architectures.

287 citations
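
To make the second synthesis stage concrete, here is a toy Python sketch that exhaustively picks a core-to-router assignment minimizing a simple power model; the energy coefficients, single-hop topology, and brute-force search are assumptions standing in for the paper's MILP formulation and clustering heuristic:

```python
from itertools import product

cores = {"cpu": (0, 0), "dsp": (4, 0), "mem": (0, 3), "io": (4, 3)}
traffic = {("cpu", "mem"): 200, ("dsp", "mem"): 150, ("cpu", "io"): 50}  # MB/s
routers = [(1, 1), (3, 2)]    # candidate router locations from the floorplan

E_LINK = 1.0    # energy cost per MB/s per unit wire length (assumed)
E_ROUTER = 60   # fixed cost per instantiated router (assumed)

def dist(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])   # Manhattan wire length

best = None
for assign in product(range(len(routers)), repeat=len(cores)):
    core_router = dict(zip(cores, assign))
    power = E_ROUTER * len(set(assign))          # only count routers in use
    for (s, d), bw in traffic.items():
        rs, rd = routers[core_router[s]], routers[core_router[d]]
        length = dist(cores[s], rs) + dist(rs, rd) + dist(rd, cores[d])
        power += E_LINK * bw * length
    if best is None or power < best[0]:
        best = (power, core_router)

print("min power %.0f with assignment %s" % (best[0], best[1]))
```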


Journal ArticleDOI
TL;DR: The proposed high-level model, which relies on online current and voltage measurements, correctly accounts for the temperature and cycle aging effects and has a maximum of 5% error between simulated and predicted data.
Abstract: Predicting the residual energy of the battery source that powers a portable electronic device is imperative in designing and applying an effective dynamic power management policy for the device. This paper starts by showing that a 30% error in predicting the battery capacity of a lithium-ion battery can result in up to 20% performance degradation for a dynamic voltage and frequency scaling algorithm. Next, this paper presents a closed-form analytical expression for predicting the remaining capacity of a lithium-ion battery. The proposed high-level model, which relies on online current and voltage measurements, correctly accounts for the temperature and cycle aging effects. The accuracy of the high-level model is validated by comparing it with DUALFOIL simulation results, demonstrating a maximum of 5% error between simulated and predicted data.

271 citations
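
For flavor, a hypothetical closed-form residual-capacity estimate in Python; the functional form and every coefficient below are invented for illustration and are not the paper's expression:

```python
# Illustrative closed-form residual-capacity estimate for a Li-ion cell.
# The paper derives its own expression from online current/voltage
# measurements and validates it against DUALFOIL to within 5%.

def residual_capacity(q_full_mAh, i_mA, t_min, temp_C, cycles):
    rate_factor = (i_mA / 500.0) ** 0.05           # mild rate dependence (Peukert-like)
    temp_factor = 1.0 - 0.006 * (25.0 - temp_C)    # capacity drops when cold
    aging_factor = 1.0 - 0.0002 * cycles           # linear cycle fade
    usable = q_full_mAh * temp_factor * aging_factor / rate_factor
    return max(usable - i_mA * t_min / 60.0, 0.0)  # subtract charge drawn so far

# 1000 mAh cell, 300 mA load for 90 minutes at 15 C after 200 cycles:
print("%.0f mAh remaining" % residual_capacity(1000, 300, 90, 15, 200))
```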


Journal ArticleDOI
TL;DR: The presented error-correcting latch and flip-flop designs are power efficient, introduce minimal speed penalty, and employ reuse of on-chip scan design- for-testability and design-for-debug resources to minimize area overheads.
Abstract: This paper presents a built-in soft error resilience (BISER) technique for correcting radiation-induced soft errors in latches and flip-flops. The presented error-correcting latch and flip-flop designs are power efficient, introduce minimal speed penalty, and employ reuse of on-chip scan design-for-testability and design-for-debug resources to minimize area overheads. Circuit simulations using a sub-90-nm technology show that the presented designs achieve more than a 20-fold reduction in cell-level soft error rate (SER). Fault injection experiments conducted on a microprocessor model further demonstrate that chip-level SER improvement is tunable by selective placement of the presented error-correcting designs. When coupled with error correction code to protect in-pipeline memories, the BISER flip-flop design improves chip-level SER by 10 times over an unprotected pipeline, with the flip-flops contributing an extra 7-10.5% in power. When only soft errors in flip-flops are considered, the BISER technique improves chip-level SER by 10 times with a power increase of 10.3%. The error correction mechanism is configurable (i.e., can be turned on or off), which enables the use of the presented techniques for designs that can target multiple applications with a wide range of reliability requirements.

226 citations
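
A small Monte Carlo sketch of why duplicated state plus a C-element cuts the soft error rate; the per-cycle upset probability is an assumed toy value, not a measured figure:

```python
# With a redundant latch pair feeding a C-element, the output is corrupted
# only when both copies are struck in the same cycle. The paper reports
# >20x cell-level SER reduction for its actual designs.

import random

P_STRIKE = 1e-2           # per-cycle upset probability of one latch (assumed)
CYCLES = 1_000_000
random.seed(1)

single_errors = redundant_errors = 0
for _ in range(CYCLES):
    hit_a = random.random() < P_STRIKE
    hit_b = random.random() < P_STRIKE
    single_errors += hit_a                # unprotected latch fails on any hit
    redundant_errors += hit_a and hit_b   # C-element keeps state unless both flip

print("unprotected SER:  %.2e / cycle" % (single_errors / CYCLES))
print("BISER-style SER:  %.2e / cycle" % (redundant_errors / CYCLES))
```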


Journal ArticleDOI
TL;DR: A new via-configurable routing architecture which shows much better throughput and performance than the previous structures is described, and an efficient white-space allocation scheme is suggested, which provides a fast design convergence and early prediction of the circuit mappability to a given fabric.
Abstract: In this paper, we describe a new via-configurable routing architecture which shows much better throughput and performance than previous structures. We demonstrate how to construct a single-via-mask fabric to reduce the mask cost further, and we analyze the penalties which it incurs. To solve the routability problem commonly found in fabric-based designs, an efficient white-space allocation and an incremental cell movement scheme are suggested, which help to provide fast design convergence and early prediction of the circuit's mappability to a given fabric.

171 citations


Journal ArticleDOI
Kangmin Lee, Se-Joong Lee, Hoi-Jun Yoo
TL;DR: An energy-efficient network-on-chip (NoC) is presented, which incorporates heterogeneous intellectual properties such as multiple RISCs and SRAMs, a reconfigurable logic array, an off-chip gateway, and a 1.6-GHz phase-locked loop (PLL) to achieve the power-efficient on-chip communications.
Abstract: An energy-efficient network-on-chip (NoC) is presented for possible application to high-performance system-on-chip (SoC) design. It incorporates heterogeneous intellectual properties (IPs) such as multiple RISCs and SRAMs, a reconfigurable logic array, an off-chip gateway, and a 1.6-GHz phase-locked loop (PLL). Its hierarchically star-connected on-chip network provides the integrated IPs, which operate at different clock frequencies, with a packet-switched serial-communication infrastructure. Various low-power techniques such as low-swing signaling, a partially activated crossbar, serial link coding, and clock frequency scaling are devised and applied to achieve power-efficient on-chip communications. The 5 × 5 mm² chip containing all the above features was fabricated in a 0.18-µm CMOS process and successfully measured and demonstrated on a system evaluation board running multimedia applications. The fabricated chip can deliver 11.2-GB/s aggregate bandwidth at a 1.6-GHz signaling frequency. The chip consumes 160 mW, and the on-chip network dissipates less than 51 mW.

156 citations


Journal ArticleDOI
TL;DR: The sleepy stack technique achieves the lowest leakage power consumption among known state-saving leakage reduction techniques, thus providing circuit designers with new choices to handle the leakage power problem.
Abstract: Leakage power consumption of current CMOS technology is already a great challenge. The International Technology Roadmap for Semiconductors projects that leakage power consumption may come to dominate total chip power consumption as the technology feature size shrinks. Leakage is a serious problem particularly for CMOS circuits in nanoscale technology. We propose a novel ultra-low-leakage CMOS circuit structure which we call the "sleepy stack". Unlike many other previous approaches, the sleepy stack can retain logic state during sleep mode while achieving ultra-low leakage power consumption. We apply the sleepy stack to generic logic circuits. Although the sleepy stack incurs some delay and area overhead, the sleepy stack technique achieves the lowest leakage power consumption among known state-saving leakage reduction techniques, thus providing circuit designers with new choices to handle the leakage power problem.

132 citations
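
The stack effect underlying this technique can be illustrated numerically; the subthreshold model and device constants below are generic textbook assumptions, not the paper's technology:

```python
import math

VT = 0.026      # thermal voltage (V)
N = 1.4         # subthreshold slope factor
VTH0 = 0.20     # nominal threshold voltage (V)
GAMMA = 0.15    # crude body-effect coefficient
ETA = 0.08      # crude DIBL coefficient
I0 = 1e-7       # leakage scale current (A)

def i_sub(vgs, vsb, vds):
    # Subthreshold current with body effect, DIBL, and drain-saturation factor.
    vth = VTH0 + GAMMA * vsb - ETA * vds
    return I0 * math.exp((vgs - vth) / (N * VT)) * (1 - math.exp(-vds / VT))

vdd = 1.0
single = i_sub(0.0, 0.0, vdd)   # one off device holding the full supply

# Two stacked off devices: the internal node settles at the Vx where the
# top device (Vgs = -Vx, Vsb = Vx, Vds = Vdd - Vx) and the bottom device
# (Vgs = 0, Vds = Vx) carry equal current.
vx = min((abs(i_sub(-v, v, vdd - v) - i_sub(0.0, 0.0, v)), v)
         for v in (i * 0.001 for i in range(1, 500)))[1]
stacked = i_sub(0.0, 0.0, vx)

print("single off device: %.2e A" % single)
print("two-device stack:  %.2e A  (%.1fx lower)" % (stacked, single / stacked))
```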


Journal ArticleDOI
TL;DR: It is shown that the proposed technique, referred to as algorithmic soft error-tolerance (ASET), employs low-complexity estimators of a main DSP block to achieve reliable operation in the presence of soft errors.
Abstract: In this paper, we present energy-efficient soft error-tolerant techniques for digital signal processing (DSP) systems. The proposed technique, referred to as algorithmic soft error-tolerance (ASET), employs low-complexity estimators of a main DSP block to achieve reliable operation in the presence of soft errors. Three distinct ASET techniques, spatial, temporal, and spatio-temporal, are presented. For frequency-selective finite-impulse response (FIR) filtering, it is shown that the proposed techniques provide robustness in the presence of soft error rates of up to P_er = 10^-2 and P_er = 10^-3 in a single-event upset scenario. The power dissipation of the proposed techniques ranges from 1.1× to 1.7× (spatial ASET) and 1.05× to 1.17× (spatio-temporal and temporal ASET) when the desired signal-to-noise ratio SNR_des = 25 dB. In comparison, the power dissipation of the commonly employed triple modular redundancy technique is 2.9×.

118 citations
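
A Python sketch of the spatial-ASET idea for FIR filtering: a cheap estimator shadows the main filter and its output is substituted when the two diverge. Taps, threshold, and the error-injection model are illustrative assumptions:

```python
import random

MAIN_TAPS = [0.05, 0.12, 0.20, 0.26, 0.20, 0.12, 0.05]  # 7-tap low-pass filter
EST_TAPS = [0.25, 0.50, 0.25]                           # cheap 3-tap estimator
THRESHOLD = 1.0                                         # disagreement bound

def fir(taps, window):
    return sum(t * x for t, x in zip(taps, window))

random.seed(7)
signal = [random.uniform(-1.0, 1.0) for _ in range(256)]

substitutions = 0
for n in range(len(MAIN_TAPS), len(signal)):
    window = signal[n - len(MAIN_TAPS):n]
    y_main = fir(MAIN_TAPS, window)
    if random.random() < 0.05:            # inject a soft error into the main filter
        y_main += random.choice([-4.0, 4.0])
    y_est = fir(EST_TAPS, window[-3:])    # estimator sees the latest 3 samples
    if abs(y_main - y_est) > THRESHOLD:   # likely an upset: fall back on the estimate
        substitutions += 1
        y_out = y_est
    else:
        y_out = y_main

print("estimator output substituted on %d samples" % substitutions)
```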


Journal ArticleDOI
TL;DR: A novel redundant mechanism for embedded memories is proposed and it is shown that the complexity of the redundancy allocation problem is NP-complete and an extended local repair-most (ELRM) algorithm suitable for built-in implementation is proposed.
Abstract: A novel redundant mechanism is proposed for embedded memories in this paper. Redundant rows and columns are added into the memory array as in the conventional approaches. However, the redundant rows and columns are divided into row blocks and column blocks, respectively. The reconfiguration is performed at the row (column) block level instead of the conventional row (column) level. Based on the proposed redundant mechanism, we first show that the complexity of the redundancy allocation problem is NP-complete. Thereafter, an extended local repair-most (ELRM) algorithm suitable for built-in implementation is proposed. The complexity of the ELRM algorithm is O(N), where N denotes the number of memory cells. According to the simulation results, the hardware overhead for implementing this algorithm is below 0.17% for a 1024 × 2048-b SRAM. Due to the efficient usage of the redundant elements, the manufacturing yield, repair rate, and reliability can be improved significantly.

103 citations
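
A toy repair-most allocation at the row/column block level, in the spirit of (but much simpler than) ELRM; block sizes, spare counts, and the fault map are invented:

```python
ROWS_PER_BLOCK, COLS_PER_BLOCK = 4, 4
spare_r, spare_c = 2, 2                 # spare row blocks / column blocks
faults = {(0, 1), (0, 5), (3, 1), (9, 1), (9, 14), (12, 7)}  # (row, col) fails

remaining = set(faults)
while remaining and (spare_r or spare_c):
    # Count uncovered faults in each row block and column block.
    rcount, ccount = {}, {}
    for r, c in remaining:
        rcount[r // ROWS_PER_BLOCK] = rcount.get(r // ROWS_PER_BLOCK, 0) + 1
        ccount[c // COLS_PER_BLOCK] = ccount.get(c // COLS_PER_BLOCK, 0) + 1
    rb, rn = max(rcount.items(), key=lambda kv: kv[1]) if spare_r else (None, 0)
    cb, cn = max(ccount.items(), key=lambda kv: kv[1]) if spare_c else (None, 0)
    if rn >= cn:   # greedily repair the block line covering the most faults
        remaining = {f for f in remaining if f[0] // ROWS_PER_BLOCK != rb}
        spare_r -= 1
    else:
        remaining = {f for f in remaining if f[1] // COLS_PER_BLOCK != cb}
        spare_c -= 1

print("repaired" if not remaining else "unrepairable: %s" % remaining)
```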


Journal ArticleDOI
TL;DR: This work presents an exact approach for hardware-software (HW-SW) partitioning that guarantees correctness of implementation by considering placement implications as an integral aspect of HW-SW partitioning and presents a physically aware HW-SW partitioning heuristic that simultaneously partitions, schedules, and does linear placement of tasks on such devices.
Abstract: Partial dynamic reconfiguration is a key feature of modern reconfigurable architectures such as the Xilinx Virtex series of devices. However, this capability imposes strict placement constraints such that even exact system-level partitioning (and scheduling) formulations are not guaranteed to be physically realizable due to placement infeasibility. We first present an exact approach for hardware-software (HW-SW) partitioning that guarantees correctness of implementation by considering placement implications as an integral aspect of HW-SW partitioning. Our exact approach is based on integer linear programming (ILP) and considers key issues such as configuration prefetch for minimizing schedule length on the target single-context device. Next, we present a physically aware HW-SW partitioning heuristic that simultaneously partitions, schedules, and does linear placement of tasks on such devices. With the exact formulation, we confirm the necessity of physically aware HW-SW partitioning for the target architecture. We demonstrate that our heuristic generates high-quality schedules by comparing the results with the exact formulation for small tests and with a popular, but placement-unaware, scheduling heuristic for a large set of over a hundred tests. Our final set of experiments is a case study of JPEG encoding: we demonstrate that our focus on physical considerations, along with our consideration of multiple task implementation points, enables our approach to be easily extended to handle heterogeneous architectures (with specialized resources distributed between general-purpose programmable logic columns). The execution time of our heuristic is very reasonable: task graphs with hundreds of nodes are processed (partitioned, scheduled, and placed) in a couple of minutes.

Journal ArticleDOI
TL;DR: A repeater insertion methodology is presented for achieving the minimum power in an RC interconnect while satisfying delay and bandwidth constraints, and the effects of inductance on the delay, bandwidth, and power of an RLC interconnect with repeaters are analyzed.
Abstract: Interconnect plays an increasingly important role in deep-submicrometer very large scale integrated technologies. Multiple design criteria are considered in interconnect design, such as delay, power, and bandwidth. In this paper, a repeater insertion methodology is presented for achieving the minimum power in an RC interconnect while satisfying delay and bandwidth constraints. These constraints determine a design space for the number and size of the repeaters. The minimum power is shown to occur at the edge of the design space. With delay constraints, closed form solutions for the minimum power are developed, where the average error is 7% as compared with SPICE. With bandwidth constraints, the minimum power can be achieved with minimum-sized repeaters. The effects of inductance on the delay, bandwidth, and power of an RLC interconnect with repeaters are also analyzed. By including inductance, the minimum interconnect power under a delay or bandwidth constraint decreases as compared with an RC interconnect.
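
For context, the classical delay-optimal repeater expressions can be evaluated directly; the paper's point is that the power-optimal design sits elsewhere, at the edge of the delay/bandwidth-feasible region, so treat this Python sketch (with illustrative numbers) as the baseline the paper improves on:

```python
# Bakoglu's delay-optimal repeater insertion: k = sqrt(0.4*R*C / (0.7*Rb*Cb))
# repeaters, each h = sqrt(Rb*C / (R*Cb)) times minimum size. These are
# standard textbook formulas, not the paper's power-optimal solution.

import math

R, C = 2000.0, 400e-15    # total line resistance (ohm) and capacitance (F)
Rb, Cb = 5000.0, 2e-15    # min-size repeater output resistance / input cap

k = math.sqrt(0.4 * R * C / (0.7 * Rb * Cb))   # optimal repeater count
h = math.sqrt(Rb * C / (R * Cb))               # optimal size (min-size units)

print("insert ~%d repeaters of ~%dx min size" % (round(k), round(h)))
```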

Journal ArticleDOI
TL;DR: A divide-and-conquer approach is presented that integrates gate replacement, an optimal MLV searching algorithm for tree circuits, and a genetic algorithm to connect the tree circuits to overcome the limitation of internal gates at high logic levels.
Abstract: Input vector control (IVC) is a popular technique for leakage power reduction. It utilizes the transistor stack effect in CMOS gates by applying a minimum leakage vector (MLV) to the primary inputs of combinational circuits during the standby mode. However, the IVC technique becomes less effective for circuits of large logic depth because the input vector at primary inputs has little impact on leakage of internal gates at high logic levels. In this paper, we propose a technique to overcome this limitation by replacing those internal gates in their worst leakage states by other library gates while maintaining the circuit's correct functionality during the active mode. This modification of the circuit does not require changes of the design flow, but it opens the door for further leakage reduction when the MLV is not effective. We then present a divide-and-conquer approach that integrates gate replacement, an optimal MLV searching algorithm for tree circuits, and a genetic algorithm to connect the tree circuits. Our experimental results on all the MCNC91 benchmark circuits reveal that 1) the gate replacement technique alone can achieve 10% leakage current reduction over the best known IVC methods with no delay penalty and little area increase; 2) the divide-and-conquer approach outperforms the best pure IVC method by 24% and the existing control point insertion method by 12%; and 3) compared with the leakage achieved by optimal MLV in small circuits, the gate replacement heuristic and the divide-and-conquer approach can reduce on average 13% and 17% leakage, respectively.
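
A minimal genetic-algorithm MLV search over a toy leakage model; the per-state leakage table and GA parameters are invented, and the paper's full method additionally applies gate replacement and an exact tree algorithm:

```python
import random

random.seed(3)
N_INPUTS = 8

def circuit_leakage(vec):
    # Toy leakage model: NAND gates over adjacent input pairs; each input
    # state has a different (assumed) leakage, reflecting the stack effect.
    table = {(0, 0): 1.0, (0, 1): 2.3, (1, 0): 1.8, (1, 1): 4.1}  # nA
    return sum(table[(vec[i], vec[i + 1])] for i in range(len(vec) - 1))

def evolve(pop_size=20, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_INPUTS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=circuit_leakage)          # lowest leakage first
        survivors = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_INPUTS)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:            # mutation
                child[random.randrange(N_INPUTS)] ^= 1
            children.append(child)
        pop = survivors + children
    return min(pop, key=circuit_leakage)

mlv = evolve()
print("MLV %s leaks %.1f nA" % ("".join(map(str, mlv)), circuit_leakage(mlv)))
```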

Journal ArticleDOI
TL;DR: The proposed technique can thwart several common software and physical attacks, facilitating secure program execution with minimal overheads, and presents properties that capture permissible program behavior at different levels of granularity, namely inter-procedural control flow, intra-procedural control flow, and instruction-stream integrity.
Abstract: Embedded system security is often compromised when "trusted" software is subverted to result in unintended behavior, such as leakage of sensitive data or execution of malicious code. Several countermeasures have been proposed in the literature to counteract these intrusions. A common underlying theme in most of them is to define security policies at the system level in an application-independent manner and check for security violations either statically or at run time. In this paper, we present a methodology that addresses this issue from a different perspective. It defines correct execution as synonymous with the way the program was intended to run and employs a dedicated hardware monitor to detect and prevent unintended program behavior. Specifically, we extract properties of an embedded program through static program analysis and use them as the bases for enforcing permissible program behavior at run time. The processor architecture is augmented with a hardware monitor that observes the program's dynamic execution trace, checks whether it falls within the allowed program behavior, and flags any deviations from expected behavior to trigger appropriate response mechanisms. We present properties that capture permissible program behavior at different levels of granularity, namely inter-procedural control flow, intra-procedural control flow, and instruction-stream integrity. We outline a systematic methodology to design application-specific hardware monitors for any given embedded program. Hardware implementations using a commercial design flow and cycle-accurate performance simulations indicate that the proposed technique can thwart several common software and physical attacks, facilitating secure program execution with minimal overheads.
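
The inter-procedural control-flow check can be modeled in a few lines of Python: static analysis yields the legal edge set, and the monitor flags any runtime edge outside it. Block names and traces here are invented for illustration:

```python
ALLOWED = {            # edges extracted by static program analysis
    "entry": {"check_pin"},
    "check_pin": {"grant", "deny"},
    "grant": {"exit"},
    "deny": {"exit"},
}

def monitor(trace):
    # Walk consecutive pairs of the dynamic trace and validate each edge.
    for src, dst in zip(trace, trace[1:]):
        if dst not in ALLOWED.get(src, set()):
            return "ALARM: illegal edge %s -> %s" % (src, dst)
    return "trace OK"

print(monitor(["entry", "check_pin", "deny", "exit"]))   # legal run
print(monitor(["entry", "grant", "exit"]))               # attack skips the check
```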

Journal ArticleDOI
TL;DR: A variable strength keeper that is optimally programmed based on the die leakage enables 10% faster performance, 35% reduction in delay variation, and 5× reduction in the number of robustness-failing dies, compared to conventional designs.
Abstract: This paper describes a process compensating dynamic (PCD) circuit technique for maintaining the performance benefit of dynamic circuits and reducing the variation in delay and robustness. A variable strength keeper that is optimally programmed based on the die leakage enables 10% faster performance, 35% reduction in delay variation, and 5× reduction in the number of robustness-failing dies, compared to conventional designs. A new leakage current sensor design is also presented that can detect leakage variation and generate the keeper control signals for the PCD technique. Results based on measured leakage data show 1.9-10.2× higher signal-to-noise ratio (SNR) and reduced sensitivity to supply and p-n skew variations compared to prior leakage sensor designs.

Journal ArticleDOI
TL;DR: It is experimentally shown that the multibit routing architecture can achieve 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area savings of 10%.
Abstract: As the logic capacity of field-programmable gate arrays (FPGAs) increases, they are increasingly being used to implement large arithmetic-intensive applications, which often contain a large proportion of datapath circuits. Since datapath circuits usually consist of regularly structured components (called bit-slices) which are connected together by regularly structured signals (called buses), it is possible to utilize datapath regularity in order to achieve significant area savings through FPGA architectural innovations. This paper describes such an FPGA routing architecture, called the multibit routing architecture, which employs bus-based connections in order to exploit datapath regularity. It is experimentally shown that, compared to conventional FPGA routing architectures, the multibit routing architecture can achieve 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area savings of 10%. This paper also empirically determines the best values of several important architectural parameters for the new routing architecture, including the most area-efficient granularity values and the most area-efficient proportion of bus-based connections.

Journal ArticleDOI
TL;DR: This paper presents a tool for accurate soft-error tolerance analysis of nanometer circuits (ASERTA) that can be used to estimate the soft-error tolerance of nanometer combinational circuits, and a tool for soft-error tolerance optimization of nanometer circuits (SERTOPT) that uses the tolerance estimates generated by ASERTA.
Abstract: Nanometer circuits are becoming increasingly susceptible to soft errors due to alpha-particle and atmospheric neutron strikes as device scaling reduces node capacitances and supply/threshold voltage scaling reduces noise margins. It is becoming crucial to add soft-error tolerance estimation and optimization to the design flow to handle the increasing susceptibility. The first part of this paper presents a tool for accurate soft-error tolerance analysis of nanometer circuits (ASERTA) that can be used to estimate the soft-error tolerance of nanometer combinational circuits. The tolerance estimates generated by the tool match SPICE-generated estimates closely while taking orders of magnitude less computation time. The second part of the paper presents a tool for soft-error tolerance optimization of nanometer circuits (SERTOPT), which uses the tolerance estimates generated by ASERTA. The number of errors propagated to the primary outputs (POs) is minimized by adding optimal amounts of capacitive loading to the POs of the logic circuit. Using a novel delay-assignment-variation-based optimization methodology, the sizes, supply voltages, and threshold voltages of internal gates (not primary outputs) are chosen to minimize the energy and delay overhead due to the added capacitive loads. Experiments on ISCAS'85 benchmarks show that 79.3% soft-error reduction can be obtained on average with a modest increase in circuit delay and energy. Comparison with other techniques shows that our approach has a significantly better energy-delay-reliability tradeoff compared with others.

Journal ArticleDOI
Chen Kong Teh, M. Hamada, T. Fujita, Hiroyuki Hara, N. Ikumi, Y. Oowaki
TL;DR: A new family of low-power and high-performance flip-flops, namely conditional data mapping flip-flops (CDMFFs), is introduced, which reduce their dynamic power by mapping their inputs to a configuration that eliminates redundant internal transitions.
Abstract: This paper introduces a new family of low-power and high-performance flip-flops, namely conditional data mapping flip-flops (CDMFFs), which reduce their dynamic power by mapping their inputs to a configuration that eliminates redundant internal transitions. We present two CDMFFs, having differential and single-ended structures, respectively, and compare them to state-of-the-art flip-flops. The results indicate that both CDMFFs have the best power-delay product in their respective groups. In the aspect of power dissipation, the single-ended and differential CDMFFs consume the least power at data activities below 50%, consuming 31% and 26% less power than the conditional capture flip-flops at 25% data activity, respectively. In the aspect of performance, CDMFFs achieve small data-to-output delays, comparable to those of the transmission-gate pulsed latch and the modified-sense-amplifier flip-flop. In the aspect of timing reliability, CDMFFs have the best internal race immunity among pulse-triggered flip-flops. A post-layout case study is demonstrated with comparison to a transmission-gate flip-flop. The results indicate the single-ended CDMFF has 34% less data-to-output delay and 28% less power at 25% data activity, in spite of a 34% increase in size.

Journal ArticleDOI
TL;DR: It is demonstrated through analytical and experimental studies that it is possible to achieve both higher transient fault-tolerance and less energy using a combination of information and time redundancy when compared with using time redundancy alone.
Abstract: Recently, the tradeoff between energy consumption and fault-tolerance in real-time systems has been highlighted. These works have focused on dynamic voltage scaling (DVS) to reduce dynamic energy dissipation and on time redundancy to achieve transient-fault tolerance. While the time redundancy technique exploits the available slack-time to increase fault-tolerance by performing recovery executions, DVS exploits slack-time to save energy. Therefore, we believe there is a resource conflict between the time-redundancy technique and DVS. The first aim of this paper is to propose the use of information redundancy to solve this problem. We demonstrate through analytical and experimental studies that it is possible to achieve both higher transient fault-tolerance [tolerance to single event upsets (SEUs)] and less energy using a combination of information and time redundancy when compared with using time redundancy alone. The second aim of this paper is to analyze the interplay of transient-fault tolerance (SEU-tolerance) and adaptive body biasing (ABB) used to reduce static leakage energy, which has not been addressed in previous studies. We show that the same technique (i.e., the combination of time and information redundancy) is applicable to ABB-enabled systems and provides more advantages than time redundancy alone.
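
A back-of-envelope sketch of the resource conflict: reserving slack for re-execution forces a higher DVS voltage, while a (here assumed 15%) information-redundancy overhead does not. The task numbers and the E ∝ V² model are illustrative:

```python
DEADLINE = 10.0   # ms
WORK = 4.0        # ms of computation at full speed (normalized V = f = 1)

def energy(work_at_full_speed, speed):
    # Dynamic energy ~ V^2 per operation, with frequency assumed ~ V,
    # so running slower at reduced voltage saves quadratically.
    return work_at_full_speed * speed ** 2

# Pure time redundancy: reserve slack for one full re-execution, so both
# runs must fit in the deadline: speed >= 2 * WORK / DEADLINE.
speed_tr = 2 * WORK / DEADLINE
e_tr = energy(WORK, speed_tr)

# Information redundancy: an assumed 15% encode/check overhead, but no
# reserved re-execution slot: speed >= 1.15 * WORK / DEADLINE.
speed_ir = 1.15 * WORK / DEADLINE
e_ir = energy(1.15 * WORK, speed_ir)

print("time redundancy: speed %.2f, energy %.2f" % (speed_tr, e_tr))
print("info redundancy: speed %.2f, energy %.2f" % (speed_ir, e_ir))
```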

Journal ArticleDOI
TL;DR: It is shown that the delay sensitivity to supply variations will increase in the next technology nodes, thus, it is expected that controlling the supply variation will be an increasingly important issue in the design of the next generation VLSI circuits.
Abstract: In this paper, some of the most practically interesting full adder topologies are analyzed in terms of their delay dependence on supply voltage fluctuations, which are a major contribution to the delay uncertainty that limits the speed performance of current VLSI circuits. Analytical models of the delay sensitivity with respect to supply variations are derived by following a simplified circuit analysis, and the resulting expressions are simple enough to afford a deeper insight into the impact of supply voltage variations on each topology. The models are shown to be sufficiently accurate through simulations with CMOS technologies having a minimum feature size ranging from 90 nm to 0.35 µm. Several interesting properties and design considerations are derived from these models, and the effect of supply voltage scaling, technology scaling, transistor sizing, and input transition time is discussed. Strategies to evaluate the delay sensitivity from the early design phases (e.g., from ring oscillator measurements) are also introduced. As a fundamental result, it is shown that the delay sensitivity to supply variations will increase in the next technology nodes; thus, it is expected that controlling the supply variations will be an increasingly important issue in the design of next-generation VLSI circuits. The proposed methodology is also analyzed in the case of more general digital circuits, and is used to estimate the impact of inter-die threshold voltage variations on the delay of the considered full adder topologies.
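
A worked example using the standard alpha-power-law delay model t_d = K·Vdd/(Vdd − Vth)^α, whose normalized sensitivity S = (dT/T)/(dV/V) = 1 − α·Vdd/(Vdd − Vth) grows in magnitude as Vdd scales down; this is a textbook model chosen for illustration, not necessarily the paper's exact derivation:

```python
# |S| rising as Vdd approaches Vth matches the paper's conclusion that
# future technology nodes are more supply-sensitive. Parameters assumed.

ALPHA, VTH = 1.3, 0.35   # velocity-saturation exponent, threshold voltage (V)

def sensitivity(vdd):
    return 1.0 - ALPHA * vdd / (vdd - VTH)

for vdd in (1.8, 1.2, 1.0, 0.8):
    print("Vdd = %.1f V -> S = %+.2f" % (vdd, sensitivity(vdd)))
```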

Journal ArticleDOI
TL;DR: The proposed backlight scaling technique is capable of efficiently computing the flickering effect online and subsequently using a measure of the temporal distortion to appropriately adjust the slack on the intra-frame spatial distortion, thereby, achieving a good balance between the two sources of distortion while maximizing the backlight dimming-driven energy saving in the display system and meeting an overall video quality figure of merit.
Abstract: Liquid crystal displays (LCDs) have appeared in applications ranging from medical equipment to automobiles, gas pumps, laptops, and handheld portable computers. These display components present a cascaded energy attenuator to the battery of the handheld device and are responsible for about half of the energy drain at maximum display intensity. As such, the display components become the main focus of every effort to maximize an embedded system's battery lifetime. This paper proposes an approach for pixel transformation of the displayed image to increase the potential energy saving of the backlight scaling method. The proposed approach takes advantage of human visual system (HVS) characteristics and tries to minimize distortion between the perceived brightness values of the individual pixels in the original image and those of the backlight-scaled image. This is in contrast to previous backlight scaling approaches, which simply match the luminance values of the individual pixels in the original and backlight-scaled images. Furthermore, this paper proposes a temporally aware backlight scaling technique for video streams. The goal is to maximize energy saving in the display system by means of dynamic backlight dimming subject to a video distortion tolerance. The video distortion comprises: 1) an intra-frame (spatial) distortion component due to frame-sensitive backlight scaling and transmittance function tuning and 2) an inter-frame (temporal) distortion component due to large-step backlight dimming across frames modulated by the psychophysical characteristics of the human visual system. The proposed backlight scaling technique is capable of efficiently computing the flickering effect online and subsequently using a measure of the temporal distortion to appropriately adjust the slack on the intra-frame spatial distortion, thereby achieving a good balance between the two sources of distortion while maximizing the backlight dimming-driven energy saving in the display system and meeting an overall video quality figure of merit. The proposed dynamic backlight scaling approach is amenable to highly efficient hardware realization and has been implemented on the Apollo Testbed II. Actual current measurements demonstrate the effectiveness of the proposed technique compared to previous backlight dimming techniques, which have ignored the temporal distortion effect.
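
A minimal Python sketch of brightness-compensated backlight scaling: dim the backlight by a factor b, boost pixel values by 1/b with clipping, and pick the strongest dimming whose clipping distortion stays under a tolerance. The linear brightness model (backlight × pixel) is a simplifying assumption; the paper uses an HVS model and also tracks inter-frame flicker:

```python
def distortion(pixels, b):
    # Brightness error caused by pixels that clip at 255 after the 1/b boost.
    err = 0.0
    for p in pixels:
        boosted = min(p / b, 255.0)
        err += abs(b * boosted - p)      # perceived vs original brightness
    return err / (255.0 * len(pixels))   # normalized mean error

def best_backlight(pixels, tolerance=0.01):
    b = 1.0
    while b > 0.1 and distortion(pixels, b - 0.05) <= tolerance:
        b -= 0.05                        # dim further while distortion allows
    return b

frame = [30, 60, 90, 120, 180, 200, 240, 250]   # toy 8-pixel "frame"
print("backlight dimmed to %.0f%% of full" % (100 * best_backlight(frame)))
```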

Journal ArticleDOI
TL;DR: This paper presents an approach to design and verify SystemC models at the transaction level and presents a genetic algorithm to enhance the assertions coverage and ensures the soundness of the approach by proving the correctness of the SystemC-to-AsmL and AsmL- to-SystemC transformations.
Abstract: Transaction-level modeling allows exploring several SoC design architectures, leading to better performance and easier verification of the final product. In this paper, we present an approach to design and verify SystemC models at the transaction level.

Journal ArticleDOI
TL;DR: A more realistic model for the channel is developed here that takes into account the effects of crosstalk, jitter, reflection, inter-symbol interference, and AWGN; interestingly, the proposed signaling schemes are significantly less sensitive to such interference.
Abstract: Increasing demand for high-speed interchip interconnects requires faster links that consume less power. The Shannon limit for the capacity of these links is at least an order of magnitude higher than the data rate of the current state-of-the-art designs. Channel coding can be used to approach the theoretical Shannon limit. Although there are numerous capacity-approaching codes in the literature, the complexity of these codes prohibits their use in high-speed interchip applications. This work studies several suitable coding schemes for chip-to-chip communication and backplane application. These coding schemes achieve 3-dB coding gain in the case of an additive white Gaussian noise (AWGN) model for the channel. In addition, a more realistic model for the channel is developed here that takes into account the effect of crosstalk, jitter, reflection, inter-symbol interference (ISI), and AWGN. Interestingly, the proposed signaling schemes are significantly less sensitive to such interference. Simulation results show coding gains of 5-8 dB for these methods with three typical channel models. In addition, low-complexity decoding architectures for implementation of these schemes are presented. Finally, circuit simulation results confirm that the high-speed implementations of these methods are feasible.

Journal ArticleDOI
TL;DR: This work presents the Tiny-shim language for such systems and its semantics, demonstrates how to implement it in hardware and software, and discusses how it can be used to model a real-world system.
Abstract: Typical embedded hardware/software systems are implemented using a combination of C and an HDL such as Verilog. While each is well-behaved in isolation, combining the two gives a nondeterministic model of computation whose ultimate behavior must be validated through expensive (cycle-accurate) simulation. We propose an alternative for describing such systems. Our software/hardware integration medium (shim) model, effectively Kahn networks with rendezvous communication, provides deterministic concurrency. We present the Tiny-shim language for such systems and its semantics, demonstrate how to implement it in hardware and software, and discuss how it can be used to model a real-world system. By providing a powerful, deterministic formalism for expressing systems, shim should make designing systems and verifying their correctness easier.

Journal ArticleDOI
TL;DR: This paper presents scratchpad overlay techniques which analyze the application and insert instructions to dynamically copy both variables and code segments onto the scratchpad at runtime, and presents optimal and near-optimal approaches for solving the scratchpad overlay problem.
Abstract: Energy consumption is one of the important parameters to be optimized during the design of portable embedded systems. Thus, most of the contemporary portable devices feature low-power processors coupled with on-chip memories (e.g., caches, scratchpads). Scratchpads are better than traditional caches in terms of power, performance, area, and predictability. However, unlike caches, they depend upon software allocation techniques for their utilization. In this paper, we present scratchpad overlay techniques which analyze the application and insert instructions to dynamically copy both variables and code segments onto the scratchpad at runtime. We demonstrate that the problem of overlaying the scratchpad is an extension of the Global Register Allocation problem. We present optimal and near-optimal approaches for solving the scratchpad overlay problem. The near-optimal scratchpad overlay approach achieves close to the optimal results and is significantly faster than the optimal approach. Our approaches improve upon the previously known static allocation technique for assigning both variables and code segments onto the scratchpad. The evaluation of the approaches for the ARM7 processor reports average energy and execution time reductions of 26% and 14% over the static approach, respectively. Additional experiments comparing the overlayed scratchpads against unified caches of the same size report average energy and execution time savings of 20% and 10%, respectively. We also report data memory energy reductions of 45%-57% due to the insertion of a 1024-byte scratchpad memory in the memory hierarchy of a digital signal processor (DSP).
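
The static baseline that the overlay approach improves on is essentially a 0/1 knapsack over memory objects, as in the Python sketch below; object sizes, access counts, and per-access energies are invented for illustration (the overlay technique additionally copies objects in and out as the program moves between regions):

```python
E_MAIN, E_SPM = 5.0, 1.0   # nJ per access: main memory vs scratchpad (assumed)
CAPACITY = 1024            # scratchpad bytes

objects = [                # (name, size_bytes, accesses)
    ("buf_fir", 512, 40000), ("coeffs", 128, 40000),
    ("state", 256, 9000), ("lut", 768, 15000),
]

# Classic knapsack DP: maximize energy saved subject to scratchpad capacity.
best = [[0.0] * (CAPACITY + 1) for _ in range(len(objects) + 1)]
for i, (_, size, acc) in enumerate(objects, 1):
    save = acc * (E_MAIN - E_SPM)        # energy saved if object lives in SPM
    for cap in range(CAPACITY + 1):
        best[i][cap] = best[i - 1][cap]
        if size <= cap:
            best[i][cap] = max(best[i][cap], best[i - 1][cap - size] + save)

print("best static allocation saves %.0f nJ" % best[-1][CAPACITY])
```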

Journal ArticleDOI
TL;DR: A new two-stage hardware architecture is proposed that combines the features of both parallel dictionary LZW (PDLZW) and an approximated adaptive Huffman (AH) algorithm, and it is shown that it not only outperforms the AH algorithm while using only one-fourth of the hardware resources but is also competitive with the performance of the LZW algorithm (compress).
Abstract: In this paper, we propose a new two-stage hardware architecture that combines the features of both parallel dictionary LZW (PDLZW) and an approximated adaptive Huffman (AH) algorithm. In this architecture, an ordered list instead of the tree-based structure is used in the AH algorithm to speed up the compression data rate. The resulting architecture not only outperforms the AH algorithm while using only one-fourth of the hardware resources but is also competitive with the performance of the LZW algorithm (compress). In addition, both compression and decompression rates of the proposed architecture are greater than those of the AH algorithm, even in the case where the latter is realized in software.
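
A Python sketch of the ordered-list replacement for the adaptive Huffman tree: emit the symbol's current list index and bubble it forward by swaps as counts grow. The update rule here is an illustrative approximation, and the index-to-codeword mapping is omitted:

```python
def encode(stream, alphabet):
    order = list(alphabet)             # initial symbol order
    counts = {s: 0 for s in alphabet}
    out = []
    for sym in stream:
        idx = order.index(sym)
        out.append(idx)                # a real coder maps idx to a short VLC
        counts[sym] += 1
        # Swap forward while this symbol now outranks its neighbor, keeping
        # the list approximately frequency-sorted with O(1)-style updates.
        while idx > 0 and counts[order[idx]] > counts[order[idx - 1]]:
            order[idx - 1], order[idx] = order[idx], order[idx - 1]
            idx -= 1
    return out

# Frequent symbols migrate to the front, so their indices shrink over time.
print(encode("abracadabra", "abcdr"))
```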

Journal ArticleDOI
TL;DR: LOTTERYBUS is presented, a high-performance SoC communication architecture based on new randomized on-chip communication protocols that addresses the shortcomings mentioned above and provides fine-grained control over bandwidth allocation.
Abstract: On-chip communication architectures play an important role in determining the overall performance of System-on-Chip (SoC) designs. Communication architectures should be flexible so as to offer high performance over a wide range of traffic characteristics. In particular, the resource sharing mechanism of the communication architecture, which determines how the often-conflicting requirements of different components are served, is of utmost importance. Conventional SoC architectures typically employ priority or time-division multiple-access (TDMA)-based communication architectures. However, these techniques are often inadequate. In the former, low-priority components may suffer from starvation, while in the latter, depending on the request profile, high-priority traffic may be subject to large latencies. This paper presents LOTTERYBUS, a high-performance SoC communication architecture based on new randomized on-chip communication protocols that addresses the shortcomings mentioned above. LOTTERYBUS provides each SoC component with a flexible, proportional, and probabilistically guaranteed share of the on-chip communication bandwidth. We present two variants of LOTTERYBUS. In the first variant, its architectural parameters are statically configured, leading to relatively low hardware overhead and design complexity. In the second variant, these parameters are allowed to vary dynamically, enabling more sophisticated use of LOTTERYBUS, at additional hardware cost. We have performed experiments to investigate the performance of LOTTERYBUS across a range of communication traffic characteristics. We have used LOTTERYBUS in designing a 4 × 4 ATM switch subsystem, and have compared its performance with conventional architectures. The results show that LOTTERYBUS provides fine-grained control over bandwidth allocation, and also provides significant reduction in average transaction latencies (up to 85%) compared to conventional architectures. Hardware implementations using a commercial 0.15-µm cell-based library indicate that the advantages provided by LOTTERYBUS are accompanied by modest hardware overheads compared to conventional architectures.
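
Lottery-based arbitration is easy to sketch in Python: each pending master enters a draw weighted by its ticket count, so bandwidth shares are proportional and probabilistic and no requester starves. Ticket values here are illustrative; the dynamic LOTTERYBUS variant would update them at runtime:

```python
import random

TICKETS = {"cpu": 4, "dma": 2, "lcd": 1}   # proportional bandwidth shares

def arbitrate(pending):
    pool = [(m, TICKETS[m]) for m in pending]
    total = sum(t for _, t in pool)
    draw = random.random() * total         # pick a point in the ticket space
    for master, t in pool:                 # walk the ticket ranges
        if draw < t:
            return master
        draw -= t

random.seed(0)
grants = {"cpu": 0, "dma": 0, "lcd": 0}
for _ in range(70000):
    grants[arbitrate(["cpu", "dma", "lcd"])] += 1
print(grants)   # expect grant counts roughly in a 4:2:1 ratio
```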

Journal ArticleDOI
TL;DR: A novel architecture for synchronizing inter-modular communications in GALS, based on locally delayed latching (LDL), is described, which replaces complex global timing constraints with simpler localized ones and supports high data rates.
Abstract: Globally asynchronous, locally synchronous (GALS) systems-on-chip (SoCs) may be prone to synchronization failures if the delay of their locally-generated clock tree is not considered. This paper presents an in-depth analysis of the problem and proposes a novel solution. The problem is analyzed considering the magnitude of clock tree delays, the cycle times of the GALS module, and the complexity of the asynchronous interface controllers using a timed signal transition graph (STG) approach. In some cases, the problem can be solved by extracting all the delays and verifying whether the system is susceptible to metastability. In other cases, when high data bandwidth is not required, matched-delay asynchronous ports may be employed. A novel architecture for synchronizing inter-modular communications in GALS, based on locally delayed latching (LDL), is described. LDL synchronization does not require pausable clocking, is insensitive to clock tree delays, and supports high data rates. It replaces complex global timing constraints with simpler localized ones. Three different LDL ports are presented. The risk of metastability in the synchronizer is analyzed in a technology-independent manner.

Journal ArticleDOI
TL;DR: A new degree computationless modified Euclid (DCME) algorithm and its dedicated architecture for a Reed-Solomon (RS) decoder are proposed, which can completely remove the degree computation and comparison circuits and provide short-latency and low-cost RS decoding.
Abstract: This paper proposes a new degree computationless modified Euclid (DCME) algorithm and its dedicated architecture for Reed-Solomon (RS) decoders. This architecture has low hardware complexity compared with conventional modified Euclid (ME) architectures, since it can completely remove the degree computation and comparison circuits. The architecture, employing a systolic array, requires a latency of only 2t clock cycles to solve the key equation, without initial latency. In addition, the DCME architecture using 3t+2 basic cells has regularity and scalability since it uses only one type of processing element. Hence, the proposed DCME architecture provides short-latency and low-cost RS decoding. The DCME architecture has been synthesized using the 0.25-µm Faraday CMOS standard cell library and operates at 200 MHz. The gate count of the DCME architecture is 21,760. Hence, an RS decoder using the proposed DCME architecture can reduce the total gate count by at least 23% and the total latency by at least 10% compared with conventional ME decoders.

Journal ArticleDOI
TL;DR: A hybrid floating-point scheme with tailored exponent datapath, and a co-optimized architecture between hybrid floating point and block floating point (BFP) to reduce memory requirements for 2-D signal processing are proposed.
Abstract: This paper presents architectures for supporting dynamic data scaling in pipeline fast Fourier transforms (FFTs), suitable when implementing large size FFTs in applications such as digital video broadcasting and digital holographic imaging. In a pipeline FFT, data is continuously streaming and must, hence, be scaled without stalling the dataflow. We propose a hybrid floating-point scheme with tailored exponent datapath, and a co-optimized architecture between hybrid floating point and block floating point (BFP) to reduce memory requirements for 2-D signal processing. The presented co-optimization generates a higher signal-to-quantization-noise ratio and requires less memory than, for instance, convergent BFP. A 2048-point pipeline FFT has been fabricated in a standard CMOS process from AMI Semiconductor (Lenart and Owall, 2003), and a field-programmable gate array prototype integrating a 2-D FFT core in a larger design shows that the architecture is suitable for image reconstruction in digital holographic imaging.
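
A minimal Python sketch of the block-floating-point normalization step of the kind used between FFT stages: one shared exponent per block, chosen from the block peak, keeps mantissas inside the fixed-point range without per-sample exponents. Word lengths are illustrative; the paper couples BFP with a hybrid floating-point scheme and a tailored exponent datapath:

```python
def bfp_normalize(block, mant_bits=10):
    peak = max(abs(x) for x in block)
    exp = 0
    # Grow the shared exponent until the peak fits in the mantissa range.
    while peak >> exp >= (1 << (mant_bits - 1)):
        exp += 1
    mantissas = [x >> exp for x in block]   # one shared right-shift per block
    return mantissas, exp

block = [100, -3000, 45, 7021, -512, 9]    # stage output with growing magnitudes
m, e = bfp_normalize(block)
print("shared exponent %d, mantissas %s" % (e, m))
```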