
Showing papers in "IEEE Transactions on Very Large Scale Integration Systems in 2006"


Journal ArticleDOI
TL;DR: The HotSpot compact thermal modeling approach is especially well suited for pre-register transfer level (RTL) and presynthesis thermal analysis and is able to provide detailed static and transient temperature information across the die and the package while remaining computationally efficient.
Abstract: This paper presents HotSpot, a modeling methodology for developing compact thermal models based on the popular stacked-layer packaging scheme in modern very large-scale integration systems. In addition to modeling silicon and packaging layers, HotSpot includes a high-level on-chip interconnect self-heating power and thermal model such that the thermal impacts on interconnects can also be considered during early design stages. The HotSpot compact thermal modeling approach is especially well suited for pre-register transfer level (RTL) and presynthesis thermal analysis and is able to provide detailed static and transient temperature information across the die and the package while remaining computationally efficient.

985 citations
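
As a rough illustration of what a compact thermal model computes, here is a minimal two-node RC-network sketch in Python; the node count, conductance and capacitance values, and the forward-Euler solver are assumptions for illustration, not HotSpot's actual model:

```python
import numpy as np

def transient_step(T, P, G, C, T_amb, dt):
    # Forward-Euler step of the node equations C * dT/dt = P - G @ (T - T_amb).
    return T + dt * (P - G @ (T - T_amb)) / C

# Two-node toy stack: a silicon block coupled to a heat spreader,
# which in turn conducts to ambient.
G = np.array([[ 2.0, -2.0],
              [-2.0,  6.0]])    # W/K; off-diagonals couple the two layers
C = np.array([0.01, 0.5])      # J/K; the die heats up much faster than the package
T = np.array([318.0, 318.0])   # start at 45 degrees C (in kelvin)
P = np.array([3.0, 0.0])       # 3 W dissipated in the silicon block

for _ in range(5000):          # 5 s of simulated time at dt = 1 ms
    T = transient_step(T, P, G, C, 318.0, 1e-3)
print("die %.2f K, spreader %.2f K" % (T[0], T[1]))
```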


Journal ArticleDOI
TL;DR: The proposed full adder is energy efficient and outperforms several standard full adders without trading off driving capability and reliability and is based on a novel xor-xnor circuit that generates xor and xnor full-swing outputs simultaneously.
Abstract: We present a new design for a 1-b full adder featuring a hybrid-CMOS design style. The quest for good drivability, noise robustness, and low-energy operation in deep submicrometer technology guided our research to explore hybrid-CMOS design. The hybrid-CMOS design style utilizes various CMOS logic style circuits to build new full adders with the desired performance. This provides the designer with a higher degree of design freedom to target a wide range of applications, thus significantly reducing design efforts. We also classify hybrid-CMOS full adders into three broad categories based upon their structure. Using this categorization, many full-adder designs can be conceived. We present a new full-adder design belonging to one of the proposed categories. The new full adder is based on a novel xor-xnor circuit that generates xor and xnor full-swing outputs simultaneously. This circuit outperforms its counterparts, showing 5%-37% improvement in the power-delay product (PDP). A novel hybrid-CMOS output stage that exploits the simultaneous xor-xnor signals is also proposed. This output stage provides good driving capability, enabling cascading of adders without the need for buffer insertion between cascaded stages. There is approximately a 40% reduction in PDP when compared to its best counterpart. During our experiments, we found that many previously reported adders suffered from low swing and high noise when operated at low supply voltages. The proposed full adder is energy efficient and outperforms several standard full adders without trading off driving capability and reliability. The new full-adder circuit successfully operates at low voltages with excellent signal integrity and driving capability. To evaluate the performance of the new full adder in a real circuit, we embedded it in 4- and 8-b, 4-operand carry-save array adders with a final carry-propagate adder. The new adder displayed better performance as compared to the standard full adders.

399 citations
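
A behavioral Python sketch of the xor-xnor-plus-mux structure described above; the logic-level model is an assumption for illustration and says nothing about the transistor-level circuit or its PDP:

```python
# Behavioral sketch of a hybrid full adder built around a simultaneous
# XOR/XNOR first stage. Gate names and the mux-style output stage are
# illustrative assumptions, not the paper's transistor-level circuit.

def xor_xnor(a, b):
    # First stage: full-swing XOR and XNOR of A and B, generated together.
    x = a ^ b
    return x, 1 - x          # both rails drive the output stage in parallel

def hybrid_full_adder(a, b, cin):
    x, xn = xor_xnor(a, b)
    s = cin ^ x              # SUM = A xor B xor Cin
    # Output stage as a pass-gate mux steered by the xor/xnor pair:
    # if A == B (xnor rail active) the carry-out is A, else it is Cin.
    cout = cin if x else a
    return s, cout

# Exhaustive check against the arithmetic definition a + b + cin.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = hybrid_full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
print("full adder verified for all 8 input combinations")
```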


Journal ArticleDOI
TL;DR: Novel linear-programming-based techniques are presented for the synthesis of custom NoC architectures that minimize power as the primary goal and the number of routers (area) as a secondary goal.
Abstract: Application-specific system-on-chip (SoC) design offers the opportunity for incorporating custom network-on-chip (NoC) architectures that are more suitable for a particular application, and do not necessarily conform to regular topologies. This paper presents novel mixed integer linear programming (MILP) formulations for synthesis of custom NoC architectures. The optimization objective of the techniques is to minimize the power consumption subject to the performance constraints. We present a two-stage approach for solving the custom NoC synthesis problem. The power consumption of the NoC architecture is determined by both the physical links and routers. The power consumption of a physical link is dependent upon the length of the link, which, in turn, is governed by the layout of the SoC. Therefore, in the first stage, we address the floorplanning problem that determines the locations of the various cores and the routers. In the second stage, we utilize the floorplan from the first stage to generate the topology of the NoC and the routes for the various traffic traces. We also present a clustering-based heuristic technique for the second stage to reduce the run times of the MILP formulation. We analyze the quality of the results and solution times of the proposed techniques by extensive experimentation with realistic benchmarks and comparisons with regular mesh-based NoC architectures.

287 citations
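
To make the second synthesis stage concrete, here is a toy Python sketch that exhaustively picks a core-to-router assignment minimizing a simple power model; the energy coefficients, single-hop topology, and brute-force search are assumptions standing in for the paper's MILP formulation and clustering heuristic:

```python
from itertools import product

cores = {"cpu": (0, 0), "dsp": (4, 0), "mem": (0, 3), "io": (4, 3)}
traffic = {("cpu", "mem"): 200, ("dsp", "mem"): 150, ("cpu", "io"): 50}  # MB/s
routers = [(1, 1), (3, 2)]    # candidate router locations from the floorplan

E_LINK = 1.0    # energy cost per MB/s per unit wire length (assumed)
E_ROUTER = 60   # fixed cost per instantiated router (assumed)

def dist(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])   # Manhattan wire length

best = None
for assign in product(range(len(routers)), repeat=len(cores)):
    core_router = dict(zip(cores, assign))
    power = E_ROUTER * len(set(assign))          # only count routers in use
    for (s, d), bw in traffic.items():
        rs, rd = routers[core_router[s]], routers[core_router[d]]
        length = dist(cores[s], rs) + dist(rs, rd) + dist(rd, cores[d])
        power += E_LINK * bw * length
    if best is None or power < best[0]:
        best = (power, core_router)

print("min power %.0f with assignment %s" % (best[0], best[1]))
```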


Journal ArticleDOI
TL;DR: The proposed high-level model, which relies on online current and voltage measurements, correctly accounts for the temperature and cycle aging effects and has a maximum of 5% error between simulated and predicted data.
Abstract: Predicting the residual energy of the battery source that powers a portable electronic device is imperative in designing and applying an effective dynamic power management policy for the device. This paper starts by showing that a 30% error in predicting the battery capacity of a lithium-ion battery can result in up to 20% performance degradation for a dynamic voltage and frequency scaling algorithm. Next, this paper presents a closed-form analytical expression for predicting the remaining capacity of a lithium-ion battery. The proposed high-level model, which relies on online current and voltage measurements, correctly accounts for the temperature and cycle aging effects. The accuracy of the high-level model is validated by comparing it with DUALFOIL simulation results, demonstrating a maximum of 5% error between simulated and predicted data.

271 citations
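
For flavor, a hypothetical closed-form residual-capacity estimate in Python; the functional form and every coefficient below are invented for illustration and are not the paper's expression:

```python
# Illustrative closed-form residual-capacity estimate for a Li-ion cell.
# The paper derives its own expression from online current/voltage
# measurements and validates it against DUALFOIL to within 5%.

def residual_capacity(q_full_mAh, i_mA, t_min, temp_C, cycles):
    rate_factor = (i_mA / 500.0) ** 0.05           # mild rate dependence (Peukert-like)
    temp_factor = 1.0 - 0.006 * (25.0 - temp_C)    # capacity drops when cold
    aging_factor = 1.0 - 0.0002 * cycles           # linear cycle fade
    usable = q_full_mAh * temp_factor * aging_factor / rate_factor
    return max(usable - i_mA * t_min / 60.0, 0.0)  # subtract charge drawn so far

# 1000 mAh cell, 300 mA load for 90 minutes at 15 C after 200 cycles:
print("%.0f mAh remaining" % residual_capacity(1000, 300, 90, 15, 200))
```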


Journal ArticleDOI
TL;DR: The presented error-correcting latch and flip-flop designs are power efficient, introduce minimal speed penalty, and employ reuse of on-chip scan design- for-testability and design-for-debug resources to minimize area overheads.
Abstract: This paper presents a built-in soft error resilience (BISER) technique for correcting radiation-induced soft errors in latches and flip-flops. The presented error-correcting latch and flip-flop designs are power efficient, introduce minimal speed penalty, and employ reuse of on-chip scan design-for-testability and design-for-debug resources to minimize area overheads. Circuit simulations using a sub-90-nm technology show that the presented designs achieve more than a 20-fold reduction in cell-level soft error rate (SER). Fault injection experiments conducted on a microprocessor model further demonstrate that chip-level SER improvement is tunable by selective placement of the presented error-correcting designs. When coupled with error correction code to protect in-pipeline memories, the BISER flip-flop design improves chip-level SER by 10 times over an unprotected pipeline, with the flip-flops contributing an extra 7-10.5% in power. When only soft errors in flip-flops are considered, the BISER technique improves chip-level SER by 10 times with a power increase of 10.3%. The error correction mechanism is configurable (i.e., can be turned on or off), which enables the use of the presented techniques for designs that can target multiple applications with a wide range of reliability requirements.

226 citations
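
A small Monte Carlo sketch of why duplicated state plus a C-element cuts the soft error rate; the per-cycle upset probability is an assumed toy value, not a measured figure:

```python
# With a redundant latch pair feeding a C-element, the output is corrupted
# only when both copies are struck in the same cycle. The paper reports
# >20x cell-level SER reduction for its actual designs.

import random

P_STRIKE = 1e-2           # per-cycle upset probability of one latch (assumed)
CYCLES = 1_000_000
random.seed(1)

single_errors = redundant_errors = 0
for _ in range(CYCLES):
    hit_a = random.random() < P_STRIKE
    hit_b = random.random() < P_STRIKE
    single_errors += hit_a                # unprotected latch fails on any hit
    redundant_errors += hit_a and hit_b   # C-element keeps state unless both flip

print("unprotected SER:  %.2e / cycle" % (single_errors / CYCLES))
print("BISER-style SER:  %.2e / cycle" % (redundant_errors / CYCLES))
```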


Journal ArticleDOI
TL;DR: A new via-configurable routing architecture which shows much better throughput and performance than the previous structures is described, and an efficient white-space allocation scheme is suggested, which provides a fast design convergence and early prediction of the circuit mappability to a given fabric.
Abstract: In this paper, we describe a new via-configurable routing architecture which shows much better throughput and performance than previous structures. We demonstrate how to construct a single-via-mask fabric to reduce the mask cost further, and we analyze the penalties which it incurs. To solve the routability problem commonly found in fabric-based designs, an efficient white-space allocation and an incremental cell movement scheme are suggested, which help to provide fast design convergence and early prediction of the circuit's mappability to a given fabric.

171 citations


Journal ArticleDOI
Kangmin Lee, Se-Joong Lee, Hoi-Jun Yoo
TL;DR: An energy-efficient network-on-chip (NoC) is presented, which incorporates heterogeneous intellectual properties such as multiple RISCs and SRAMs, a reconfigurable logic array, an off-chip gateway, and a 1.6-GHz phase-locked loop (PLL) to achieve the power-efficient on-chip communications.
Abstract: An energy-efficient network-on-chip (NoC) is presented for possible application to high-performance system-on-chip (SoC) design. It incorporates heterogeneous intellectual properties (IPs) such as multiple RISCs and SRAMs, a reconfigurable logic array, an off-chip gateway, and a 1.6-GHz phase-locked loop (PLL). Its hierarchically star-connected on-chip network provides the integrated IPs, which operate at different clock frequencies, with a packet-switched serial-communication infrastructure. Various low-power techniques such as low-swing signaling, a partially activated crossbar, serial link coding, and clock frequency scaling are devised and applied to achieve power-efficient on-chip communications. The 5 × 5 mm² chip containing all the above features was fabricated in a 0.18-µm CMOS process and successfully measured and demonstrated on a system evaluation board running multimedia applications. The fabricated chip can deliver 11.2-GB/s aggregate bandwidth at a 1.6-GHz signaling frequency. The chip consumes 160 mW, and the on-chip network dissipates less than 51 mW.

156 citations


Journal ArticleDOI
TL;DR: The sleepy stack technique achieves the lowest leakage power consumption among known state-saving leakage reduction techniques, thus providing circuit designers with new choices to handle the leakage power problem.
Abstract: Leakage power consumption of current CMOS technology is already a great challenge. The International Technology Roadmap for Semiconductors projects that leakage power consumption may come to dominate total chip power consumption as the technology feature size shrinks. Leakage is a serious problem particularly for CMOS circuits in nanoscale technology. We propose a novel ultra-low-leakage CMOS circuit structure which we call the "sleepy stack". Unlike many other previous approaches, the sleepy stack can retain logic state during sleep mode while achieving ultra-low leakage power consumption. We apply the sleepy stack to generic logic circuits. Although the sleepy stack incurs some delay and area overhead, the sleepy stack technique achieves the lowest leakage power consumption among known state-saving leakage reduction techniques, thus providing circuit designers with new choices to handle the leakage power problem.

132 citations
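
The stack effect underlying this technique can be illustrated numerically; the subthreshold model and device constants below are generic textbook assumptions, not the paper's technology:

```python
import math

VT = 0.026      # thermal voltage (V)
N = 1.4         # subthreshold slope factor
VTH0 = 0.20     # nominal threshold voltage (V)
GAMMA = 0.15    # crude body-effect coefficient
ETA = 0.08      # crude DIBL coefficient
I0 = 1e-7       # leakage scale current (A)

def i_sub(vgs, vsb, vds):
    # Subthreshold current with body effect, DIBL, and drain-saturation factor.
    vth = VTH0 + GAMMA * vsb - ETA * vds
    return I0 * math.exp((vgs - vth) / (N * VT)) * (1 - math.exp(-vds / VT))

vdd = 1.0
single = i_sub(0.0, 0.0, vdd)   # one off device holding the full supply

# Two stacked off devices: the internal node settles at the Vx where the
# top device (Vgs = -Vx, Vsb = Vx, Vds = Vdd - Vx) and the bottom device
# (Vgs = 0, Vds = Vx) carry equal current.
vx = min((abs(i_sub(-v, v, vdd - v) - i_sub(0.0, 0.0, v)), v)
         for v in (i * 0.001 for i in range(1, 500)))[1]
stacked = i_sub(0.0, 0.0, vx)

print("single off device: %.2e A" % single)
print("two-device stack:  %.2e A  (%.1fx lower)" % (stacked, single / stacked))
```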


Journal ArticleDOI
TL;DR: It is shown that the proposed technique, referred to as algorithmic soft error-tolerance (ASET), employs low-complexity estimators of a main DSP block to achieve reliable operation in the presence of soft errors.
Abstract: In this paper, we present energy-efficient soft error-tolerant techniques for digital signal processing (DSP) systems. The proposed technique, referred to as algorithmic soft error-tolerance (ASET), employs low-complexity estimators of a main DSP block to achieve reliable operation in the presence of soft errors. Three distinct ASET techniques, spatial, temporal, and spatio-temporal, are presented. For frequency-selective finite-impulse response (FIR) filtering, it is shown that the proposed techniques provide robustness in the presence of soft error rates of up to P_er = 10^-2 and P_er = 10^-3 in a single-event upset scenario. The power dissipation of the proposed techniques ranges from 1.1× to 1.7× (spatial ASET) and 1.05× to 1.17× (spatio-temporal and temporal ASET) when the desired signal-to-noise ratio SNR_des = 25 dB. In comparison, the power dissipation of the commonly employed triple modular redundancy technique is 2.9×.

118 citations
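
A Python sketch of the spatial-ASET idea for FIR filtering: a cheap estimator shadows the main filter and its output is substituted when the two diverge. Taps, threshold, and the error-injection model are illustrative assumptions:

```python
import random

MAIN_TAPS = [0.05, 0.12, 0.20, 0.26, 0.20, 0.12, 0.05]  # 7-tap low-pass filter
EST_TAPS = [0.25, 0.50, 0.25]                           # cheap 3-tap estimator
THRESHOLD = 1.0                                         # disagreement bound

def fir(taps, window):
    return sum(t * x for t, x in zip(taps, window))

random.seed(7)
signal = [random.uniform(-1.0, 1.0) for _ in range(256)]

substitutions = 0
for n in range(len(MAIN_TAPS), len(signal)):
    window = signal[n - len(MAIN_TAPS):n]
    y_main = fir(MAIN_TAPS, window)
    if random.random() < 0.05:            # inject a soft error into the main filter
        y_main += random.choice([-4.0, 4.0])
    y_est = fir(EST_TAPS, window[-3:])    # estimator sees the latest 3 samples
    if abs(y_main - y_est) > THRESHOLD:   # likely an upset: fall back on the estimate
        substitutions += 1
        y_out = y_est
    else:
        y_out = y_main

print("estimator output substituted on %d samples" % substitutions)
```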


Journal ArticleDOI
TL;DR: A novel redundant mechanism for embedded memories is proposed and it is shown that the complexity of the redundancy allocation problem is NP-complete and an extended local repair-most (ELRM) algorithm suitable for built-in implementation is proposed.
Abstract: A novel redundant mechanism is proposed for embedded memories in this paper. Redundant rows and columns are added into the memory array as in the conventional approaches. However, the redundant rows and columns are divided into row blocks and column blocks, respectively. The reconfiguration is performed at the row (column) block level instead of the conventional row (column) level. Based on the proposed redundant mechanism, we first show that the complexity of the redundancy allocation problem is NP-complete. Thereafter, an extended local repair-most (ELRM) algorithm suitable for built-in implementation is proposed. The complexity of the ELRM algorithm is O(N), where N denotes the number of memory cells. According to the simulation results, the hardware overhead for implementing this algorithm is below 0.17% for a 1024 × 2048-b SRAM. Due to the efficient usage of the redundant elements, the manufacturing yield, repair rate, and reliability can be improved significantly.

103 citations
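
A toy repair-most allocation at the row/column block level, in the spirit of (but much simpler than) ELRM; block sizes, spare counts, and the fault map are invented:

```python
ROWS_PER_BLOCK, COLS_PER_BLOCK = 4, 4
spare_r, spare_c = 2, 2                 # spare row blocks / column blocks
faults = {(0, 1), (0, 5), (3, 1), (9, 1), (9, 14), (12, 7)}  # (row, col) fails

remaining = set(faults)
while remaining and (spare_r or spare_c):
    # Count uncovered faults in each row block and column block.
    rcount, ccount = {}, {}
    for r, c in remaining:
        rcount[r // ROWS_PER_BLOCK] = rcount.get(r // ROWS_PER_BLOCK, 0) + 1
        ccount[c // COLS_PER_BLOCK] = ccount.get(c // COLS_PER_BLOCK, 0) + 1
    rb, rn = max(rcount.items(), key=lambda kv: kv[1]) if spare_r else (None, 0)
    cb, cn = max(ccount.items(), key=lambda kv: kv[1]) if spare_c else (None, 0)
    if rn >= cn:   # greedily repair the block line covering the most faults
        remaining = {f for f in remaining if f[0] // ROWS_PER_BLOCK != rb}
        spare_r -= 1
    else:
        remaining = {f for f in remaining if f[1] // COLS_PER_BLOCK != cb}
        spare_c -= 1

print("repaired" if not remaining else "unrepairable: %s" % remaining)
```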


Journal ArticleDOI
TL;DR: This work presents an exact approach for hardware-software (HW-SW) partitioning that guarantees correctness of implementation by considering placement implications as an integral aspect of HW-SW partitioning and presents a physically aware HW-SW partitioning heuristic that simultaneously partitions, schedules, and does linear placement of tasks on such devices.
Abstract: Partial dynamic reconfiguration is a key feature of modern reconfigurable architectures such as the Xilinx Virtex series of devices. However, this capability imposes strict placement constraints such that even exact system-level partitioning (and scheduling) formulations are not guaranteed to be physically realizable due to placement infeasibility. We first present an exact approach for hardware-software (HW-SW) partitioning that guarantees correctness of implementation by considering placement implications as an integral aspect of HW-SW partitioning. Our exact approach is based on integer linear programming (ILP) and considers key issues such as configuration prefetch for minimizing schedule length on the target single-context device. Next, we present a physically aware HW-SW partitioning heuristic that simultaneously partitions, schedules, and does linear placement of tasks on such devices. With the exact formulation, we confirm the necessity of physically aware HW-SW partitioning for the target architecture. We demonstrate that our heuristic generates high-quality schedules by comparing the results with the exact formulation for small tests and with a popular, but placement-unaware, scheduling heuristic for a large set of over a hundred tests. Our final set of experiments is a case study of JPEG encoding: we demonstrate that our focus on physical considerations, along with our consideration of multiple task implementation points, enables our approach to be easily extended to handle heterogeneous architectures (with specialized resources distributed between general-purpose programmable logic columns). The execution time of our heuristic is very reasonable: task graphs with hundreds of nodes are processed (partitioned, scheduled, and placed) in a couple of minutes.

Journal ArticleDOI
TL;DR: A repeater insertion methodology is presented for achieving the minimum power in an RC interconnect while satisfying delay and bandwidth constraints, and the effects of inductance on the delay, bandwidth, and power of an RLC interconnect with repeaters are analyzed.
Abstract: Interconnect plays an increasingly important role in deep-submicrometer very large scale integrated technologies. Multiple design criteria are considered in interconnect design, such as delay, power, and bandwidth. In this paper, a repeater insertion methodology is presented for achieving the minimum power in an RC interconnect while satisfying delay and bandwidth constraints. These constraints determine a design space for the number and size of the repeaters. The minimum power is shown to occur at the edge of the design space. With delay constraints, closed form solutions for the minimum power are developed, where the average error is 7% as compared with SPICE. With bandwidth constraints, the minimum power can be achieved with minimum-sized repeaters. The effects of inductance on the delay, bandwidth, and power of an RLC interconnect with repeaters are also analyzed. By including inductance, the minimum interconnect power under a delay or bandwidth constraint decreases as compared with an RC interconnect.
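
For context, the classical delay-optimal repeater expressions can be evaluated directly; the paper's point is that the power-optimal design sits elsewhere, at the edge of the delay/bandwidth-feasible region, so treat this Python sketch (with illustrative numbers) as the baseline the paper improves on:

```python
# Bakoglu's delay-optimal repeater insertion: k = sqrt(0.4*R*C / (0.7*Rb*Cb))
# repeaters, each h = sqrt(Rb*C / (R*Cb)) times minimum size. These are
# standard textbook formulas, not the paper's power-optimal solution.

import math

R, C = 2000.0, 400e-15    # total line resistance (ohm) and capacitance (F)
Rb, Cb = 5000.0, 2e-15    # min-size repeater output resistance / input cap

k = math.sqrt(0.4 * R * C / (0.7 * Rb * Cb))   # optimal repeater count
h = math.sqrt(Rb * C / (R * Cb))               # optimal size (min-size units)

print("insert ~%d repeaters of ~%dx min size" % (round(k), round(h)))
```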

Journal ArticleDOI
TL;DR: A divide-and-conquer approach is presented that integrates gate replacement, an optimal MLV searching algorithm for tree circuits, and a genetic algorithm to connect the tree circuits to overcome the limitation of internal gates at high logic levels.
Abstract: Input vector control (IVC) is a popular technique for leakage power reduction. It utilizes the transistor stack effect in CMOS gates by applying a minimum leakage vector (MLV) to the primary inputs of combinational circuits during the standby mode. However, the IVC technique becomes less effective for circuits of large logic depth because the input vector at primary inputs has little impact on leakage of internal gates at high logic levels. In this paper, we propose a technique to overcome this limitation by replacing those internal gates in their worst leakage states by other library gates while maintaining the circuit's correct functionality during the active mode. This modification of the circuit does not require changes of the design flow, but it opens the door for further leakage reduction when the MLV is not effective. We then present a divide-and-conquer approach that integrates gate replacement, an optimal MLV searching algorithm for tree circuits, and a genetic algorithm to connect the tree circuits. Our experimental results on all the MCNC91 benchmark circuits reveal that 1) the gate replacement technique alone can achieve 10% leakage current reduction over the best known IVC methods with no delay penalty and little area increase; 2) the divide-and-conquer approach outperforms the best pure IVC method by 24% and the existing control point insertion method by 12%; and 3) compared with the leakage achieved by optimal MLV in small circuits, the gate replacement heuristic and the divide-and-conquer approach can reduce on average 13% and 17% leakage, respectively.
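
A minimal genetic-algorithm MLV search over a toy leakage model; the per-state leakage table and GA parameters are invented, and the paper's full method additionally applies gate replacement and an exact tree algorithm:

```python
import random

random.seed(3)
N_INPUTS = 8

def circuit_leakage(vec):
    # Toy leakage model: NAND gates over adjacent input pairs; each input
    # state has a different (assumed) leakage, reflecting the stack effect.
    table = {(0, 0): 1.0, (0, 1): 2.3, (1, 0): 1.8, (1, 1): 4.1}  # nA
    return sum(table[(vec[i], vec[i + 1])] for i in range(len(vec) - 1))

def evolve(pop_size=20, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_INPUTS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=circuit_leakage)          # lowest leakage first
        survivors = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_INPUTS)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:            # mutation
                child[random.randrange(N_INPUTS)] ^= 1
            children.append(child)
        pop = survivors + children
    return min(pop, key=circuit_leakage)

mlv = evolve()
print("MLV %s leaks %.1f nA" % ("".join(map(str, mlv)), circuit_leakage(mlv)))
```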

Journal ArticleDOI
TL;DR: The proposed technique can thwart several common software and physical attacks, facilitating secure program execution with minimal overheads, and presents properties that capture permissible program behavior at different levels of granularity, namely inter-procedural control flow, intra-procedural control flow, and instruction-stream integrity.
Abstract: Embedded system security is often compromised when "trusted" software is subverted to result in unintended behavior, such as leakage of sensitive data or execution of malicious code. Several countermeasures have been proposed in the literature to counteract these intrusions. A common underlying theme in most of them is to define security policies at the system level in an application-independent manner and check for security violations either statically or at run time. In this paper, we present a methodology that addresses this issue from a different perspective. It defines correct execution as synonymous with the way the program was intended to run and employs a dedicated hardware monitor to detect and prevent unintended program behavior. Specifically, we extract properties of an embedded program through static program analysis and use them as the bases for enforcing permissible program behavior at run time. The processor architecture is augmented with a hardware monitor that observes the program's dynamic execution trace, checks whether it falls within the allowed program behavior, and flags any deviations from expected behavior to trigger appropriate response mechanisms. We present properties that capture permissible program behavior at different levels of granularity, namely inter-procedural control flow, intra-procedural control flow, and instruction-stream integrity. We outline a systematic methodology to design application-specific hardware monitors for any given embedded program. Hardware implementations using a commercial design flow and cycle-accurate performance simulations indicate that the proposed technique can thwart several common software and physical attacks, facilitating secure program execution with minimal overheads.
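
The inter-procedural control-flow check can be modeled in a few lines of Python: static analysis yields the legal edge set, and the monitor flags any runtime edge outside it. Block names and traces here are invented for illustration:

```python
ALLOWED = {            # edges extracted by static program analysis
    "entry": {"check_pin"},
    "check_pin": {"grant", "deny"},
    "grant": {"exit"},
    "deny": {"exit"},
}

def monitor(trace):
    # Walk consecutive pairs of the dynamic trace and validate each edge.
    for src, dst in zip(trace, trace[1:]):
        if dst not in ALLOWED.get(src, set()):
            return "ALARM: illegal edge %s -> %s" % (src, dst)
    return "trace OK"

print(monitor(["entry", "check_pin", "deny", "exit"]))   # legal run
print(monitor(["entry", "grant", "exit"]))               # attack skips the check
```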

Journal ArticleDOI
TL;DR: A variable strength keeper that is optimally programmed based on the die leakage enables 10% faster performance, 35% reduction in delay variation, and 5× reduction in the number of robustness-failing dies, compared to conventional designs.
Abstract: This paper describes a process compensating dynamic (PCD) circuit technique for maintaining the performance benefit of dynamic circuits and reducing the variation in delay and robustness. A variable strength keeper that is optimally programmed based on the die leakage enables 10% faster performance, 35% reduction in delay variation, and 5× reduction in the number of robustness-failing dies, compared to conventional designs. A new leakage current sensor design is also presented that can detect leakage variation and generate the keeper control signals for the PCD technique. Results based on measured leakage data show 1.9-10.2× higher signal-to-noise ratio (SNR) and reduced sensitivity to supply and p-n skew variations compared to prior leakage sensor designs.

Journal ArticleDOI
TL;DR: It is experimentally shown that the multibit routing architecture can achieve 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area savings of 10%.
Abstract: As the logic capacity of field-programmable gate arrays (FPGAs) increases, they are increasingly being used to implement large arithmetic-intensive applications, which often contain a large proportion of datapath circuits. Since datapath circuits usually consist of regularly structured components (called bit-slices) which are connected together by regularly structured signals (called buses), it is possible to utilize datapath regularity in order to achieve significant area savings through FPGA architectural innovations. This paper describes such an FPGA routing architecture, called the multibit routing architecture, which employs bus-based connections in order to exploit datapath regularity. It is experimentally shown that, compared to conventional FPGA routing architectures, the multibit routing architecture can achieve 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area savings of 10%. This paper also empirically determines the best values of several important architectural parameters for the new routing architecture, including the most area-efficient granularity values and the most area-efficient proportion of bus-based connections.

Journal ArticleDOI
TL;DR: This paper presents a tool for accurate soft-error tolerance analysis of nanometer circuits (ASERTA) that can be used to estimate the soft-error tolerance of nanometer combinational circuits, and a tool for soft-error tolerance optimization of nanometer circuits (SERTOPT) that uses the tolerance estimates generated by ASERTA.
Abstract: Nanometer circuits are becoming increasingly susceptible to soft errors due to alpha-particle and atmospheric neutron strikes as device scaling reduces node capacitances and supply/threshold voltage scaling reduces noise margins. It is becoming crucial to add soft-error tolerance estimation and optimization to the design flow to handle the increasing susceptibility. The first part of this paper presents a tool for accurate soft-error tolerance analysis of nanometer circuits (ASERTA) that can be used to estimate the soft-error tolerance of nanometer combinational circuits. The tolerance estimates generated by the tool match SPICE-generated estimates closely while taking orders of magnitude less computation time. The second part of the paper presents a tool for soft-error tolerance optimization of nanometer circuits (SERTOPT), which uses the tolerance estimates generated by ASERTA. The number of errors propagated to the primary outputs (POs) is minimized by adding optimal amounts of capacitive loading to the POs of the logic circuit. Using a novel delay-assignment-variation-based optimization methodology, the sizes, supply voltages, and threshold voltages of internal gates (not primary outputs) are chosen to minimize the energy and delay overhead due to the added capacitive loads. Experiments on ISCAS'85 benchmarks show that 79.3% soft-error reduction can be obtained on average with a modest increase in circuit delay and energy. Comparison with other techniques shows that our approach has a significantly better energy-delay-reliability tradeoff compared with others.

Journal ArticleDOI
Chen Kong Teh, M. Hamada, T. Fujita, Hiroyuki Hara, N. Ikumi, Y. Oowaki
TL;DR: A new family of low-power and high-performance flip-flops, namely conditional data mapping flip-flops (CDMFFs), is introduced, which reduce their dynamic power by mapping their inputs to a configuration that eliminates redundant internal transitions.
Abstract: This paper introduces a new family of low-power and high-performance flip-flops, namely conditional data mapping flip-flops (CDMFFs), which reduce their dynamic power by mapping their inputs to a configuration that eliminates redundant internal transitions. We present two CDMFFs, having differential and single-ended structures, respectively, and compare them to state-of-the-art flip-flops. The results indicate that both CDMFFs have the best power-delay product in their respective groups. In the aspect of power dissipation, the single-ended and differential CDMFFs consume the least power at data activities below 50%, consuming 31% and 26% less power than the conditional capture flip-flops at 25% data activity, respectively. In the aspect of performance, CDMFFs achieve small data-to-output delays, comparable to those of the transmission-gate pulsed latch and the modified-sense-amplifier flip-flop. In the aspect of timing reliability, CDMFFs have the best internal race immunity among pulse-triggered flip-flops. A post-layout case study is demonstrated with comparison to a transmission-gate flip-flop. The results indicate the single-ended CDMFF has 34% less data-to-output delay and 28% less power at 25% data activity, in spite of a 34% increase in size.

Journal ArticleDOI
TL;DR: It is demonstrated through analytical and experimental studies that it is possible to achieve both higher transient fault-tolerance and less energy using a combination of information and time redundancy when compared with using time redundancy alone.
Abstract: Recently, the tradeoff between energy consumption and fault-tolerance in real-time systems has been highlighted. These works have focused on dynamic voltage scaling (DVS) to reduce dynamic energy dissipation and on time redundancy to achieve transient-fault tolerance. While the time redundancy technique exploits the available slack-time to increase fault-tolerance by performing recovery executions, DVS exploits slack-time to save energy. Therefore, we believe there is a resource conflict between the time-redundancy technique and DVS. The first aim of this paper is to propose the use of information redundancy to solve this problem. We demonstrate through analytical and experimental studies that it is possible to achieve both higher transient fault-tolerance [tolerance to single event upsets (SEUs)] and less energy using a combination of information and time redundancy when compared with using time redundancy alone. The second aim of this paper is to analyze the interplay of transient-fault tolerance (SEU-tolerance) and adaptive body biasing (ABB) used to reduce static leakage energy, which has not been addressed in previous studies. We show that the same technique (i.e., the combination of time and information redundancy) is applicable to ABB-enabled systems and provides more advantages than time redundancy alone.
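
A back-of-envelope sketch of the resource conflict: reserving slack for re-execution forces a higher DVS voltage, while a (here assumed 15%) information-redundancy overhead does not. The task numbers and the E ∝ V² model are illustrative:

```python
DEADLINE = 10.0   # ms
WORK = 4.0        # ms of computation at full speed (normalized V = f = 1)

def energy(work_at_full_speed, speed):
    # Dynamic energy ~ V^2 per operation, with frequency assumed ~ V,
    # so running slower at reduced voltage saves quadratically.
    return work_at_full_speed * speed ** 2

# Pure time redundancy: reserve slack for one full re-execution, so both
# runs must fit in the deadline: speed >= 2 * WORK / DEADLINE.
speed_tr = 2 * WORK / DEADLINE
e_tr = energy(WORK, speed_tr)

# Information redundancy: an assumed 15% encode/check overhead, but no
# reserved re-execution slot: speed >= 1.15 * WORK / DEADLINE.
speed_ir = 1.15 * WORK / DEADLINE
e_ir = energy(1.15 * WORK, speed_ir)

print("time redundancy: speed %.2f, energy %.2f" % (speed_tr, e_tr))
print("info redundancy: speed %.2f, energy %.2f" % (speed_ir, e_ir))
```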

Journal ArticleDOI
TL;DR: It is shown that the delay sensitivity to supply variations will increase in the next technology nodes, thus, it is expected that controlling the supply variation will be an increasingly important issue in the design of the next generation VLSI circuits.
Abstract: In this paper, some of the most practically interesting full adder topologies are analyzed in terms of their delay dependence on supply voltage fluctuations, which are a major contribution to the delay uncertainty that limits the speed performance of current VLSI circuits. Analytical models of the delay sensitivity with respect to supply variations are derived by following a simplified circuit analysis, and the resulting expressions are simple enough to afford a deeper insight into the impact of supply voltage variations on each topology. The models are shown to be sufficiently accurate through simulations with CMOS technologies having a minimum feature size ranging from 90 nm to 0.35 µm. Several interesting properties and design considerations are derived from these models, and the effect of supply voltage scaling, technology scaling, transistor sizing, and input transition time is discussed. Strategies to evaluate the delay sensitivity from the early design phases (e.g., from ring oscillator measurements) are also introduced. As a fundamental result, it is shown that the delay sensitivity to supply variations will increase in the next technology nodes; thus, it is expected that controlling the supply variations will be an increasingly important issue in the design of next-generation VLSI circuits. The proposed methodology is also analyzed in the case of more general digital circuits, and is used to estimate the impact of inter-die threshold voltage variations on the delay of the considered full adder topologies.
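
A worked example using the standard alpha-power-law delay model t_d = K·Vdd/(Vdd − Vth)^α, whose normalized sensitivity S = (dT/T)/(dV/V) = 1 − α·Vdd/(Vdd − Vth) grows in magnitude as Vdd scales down; this is a textbook model chosen for illustration, not necessarily the paper's exact derivation:

```python
# |S| rising as Vdd approaches Vth matches the paper's conclusion that
# future technology nodes are more supply-sensitive. Parameters assumed.

ALPHA, VTH = 1.3, 0.35   # velocity-saturation exponent, threshold voltage (V)

def sensitivity(vdd):
    return 1.0 - ALPHA * vdd / (vdd - VTH)

for vdd in (1.8, 1.2, 1.0, 0.8):
    print("Vdd = %.1f V -> S = %+.2f" % (vdd, sensitivity(vdd)))
```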

Journal ArticleDOI
TL;DR: The proposed backlight scaling technique is capable of efficiently computing the flickering effect online and subsequently using a measure of the temporal distortion to appropriately adjust the slack on the intra-frame spatial distortion, thereby, achieving a good balance between the two sources of distortion while maximizing the backlight dimming-driven energy saving in the display system and meeting an overall video quality figure of merit.
Abstract: Liquid crystal displays (LCDs) have appeared in applications ranging from medical equipment to automobiles, gas pumps, laptops, and handheld portable computers. These display components present a cascaded energy attenuator to the battery of the handheld device and are responsible for about half of the energy drain at maximum display intensity. As such, the display components become the main focus of every effort to maximize an embedded system's battery lifetime. This paper proposes an approach for pixel transformation of the displayed image to increase the potential energy saving of the backlight scaling method. The proposed approach takes advantage of human visual system (HVS) characteristics and tries to minimize distortion between the perceived brightness values of the individual pixels in the original image and those of the backlight-scaled image. This is in contrast to previous backlight scaling approaches, which simply match the luminance values of the individual pixels in the original and backlight-scaled images. Furthermore, this paper proposes a temporally aware backlight scaling technique for video streams. The goal is to maximize energy saving in the display system by means of dynamic backlight dimming subject to a video distortion tolerance. The video distortion comprises: 1) an intra-frame (spatial) distortion component due to frame-sensitive backlight scaling and transmittance function tuning and 2) an inter-frame (temporal) distortion component due to large-step backlight dimming across frames modulated by the psychophysical characteristics of the human visual system. The proposed backlight scaling technique is capable of efficiently computing the flickering effect online and subsequently using a measure of the temporal distortion to appropriately adjust the slack on the intra-frame spatial distortion, thereby achieving a good balance between the two sources of distortion while maximizing the backlight dimming-driven energy saving in the display system and meeting an overall video quality figure of merit. The proposed dynamic backlight scaling approach is amenable to highly efficient hardware realization and has been implemented on the Apollo Testbed II. Actual current measurements demonstrate the effectiveness of the proposed technique compared to previous backlight dimming techniques, which have ignored the temporal distortion effect.
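
A minimal Python sketch of brightness-compensated backlight scaling: dim the backlight by a factor b, boost pixel values by 1/b with clipping, and pick the strongest dimming whose clipping distortion stays under a tolerance. The linear brightness model (backlight × pixel) is a simplifying assumption; the paper uses an HVS model and also tracks inter-frame flicker:

```python
def distortion(pixels, b):
    # Brightness error caused by pixels that clip at 255 after the 1/b boost.
    err = 0.0
    for p in pixels:
        boosted = min(p / b, 255.0)
        err += abs(b * boosted - p)      # perceived vs original brightness
    return err / (255.0 * len(pixels))   # normalized mean error

def best_backlight(pixels, tolerance=0.01):
    b = 1.0
    while b > 0.1 and distortion(pixels, b - 0.05) <= tolerance:
        b -= 0.05                        # dim further while distortion allows
    return b

frame = [30, 60, 90, 120, 180, 200, 240, 250]   # toy 8-pixel "frame"
print("backlight dimmed to %.0f%% of full" % (100 * best_backlight(frame)))
```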

Journal ArticleDOI
TL;DR: This paper presents an approach to design and verify SystemC models at the transaction level and presents a genetic algorithm to enhance the assertions coverage and ensures the soundness of the approach by proving the correctness of the SystemC-to-AsmL and AsmL- to-SystemC transformations.
Abstract: Transaction-level modeling allows exploring several SoC design architectures, leading to better performance and easier verification of the final product. In this paper, we present an approach to design and verify SystemC models at the transaction level.

Journal ArticleDOI
TL;DR: A more realistic model for the channel is developed here that takes into account the effects of crosstalk, jitter, reflection, inter-symbol interference, and AWGN; interestingly, the proposed signaling schemes are significantly less sensitive to such interference.
Abstract: Increasing demand for high-speed interchip interconnects requires faster links that consume less power. The Shannon limit for the capacity of these links is at least an order of magnitude higher than the data rate of the current state-of-the-art designs. Channel coding can be used to approach the theoretical Shannon limit. Although there are numerous capacity-approaching codes in the literature, the complexity of these codes prohibits their use in high-speed interchip applications. This work studies several suitable coding schemes for chip-to-chip communication and backplane application. These coding schemes achieve 3-dB coding gain in the case of an additive white Gaussian noise (AWGN) model for the channel. In addition, a more realistic model for the channel is developed here that takes into account the effect of crosstalk, jitter, reflection, inter-symbol interference (ISI), and AWGN. Interestingly, the proposed signaling schemes are significantly less sensitive to such interference. Simulation results show coding gains of 5-8 dB for these methods with three typical channel models. In addition, low-complexity decoding architectures for implementation of these schemes are presented. Finally, circuit simulation results confirm that the high-speed implementations of these methods are feasible.

Journal ArticleDOI
TL;DR: This work presents the Tiny-shim language for such systems and its semantics, demonstrates how to implement it in hardware and software, and discusses how it can be used to model a real-world system.
Abstract: Typical embedded hardware/software systems are implemented using a combination of C and an HDL such as Verilog. While each is well-behaved in isolation, combining the two gives a nondeterministic model of computation whose ultimate behavior must be validated through expensive (cycle-accurate) simulation. We propose an alternative for describing such systems. Our software/hardware integration medium (shim) model, effectively Kahn networks with rendezvous communication, provides deterministic concurrency. We present the Tiny-shim language for such systems and its semantics, demonstrate how to implement it in hardware and software, and discuss how it can be used to model a real-world system. By providing a powerful, deterministic formalism for expressing systems, shim should make designing systems and verifying their correctness easier.

Journal ArticleDOI
TL;DR: This paper presents scratchpad overlay techniques which analyze the application and insert instructions to dynamically copy both variables and code segments onto the scratchpad at runtime, and presents optimal and near-optimal approaches for solving the scratchpad overlay problem.
Abstract: Energy consumption is one of the important parameters to be optimized during the design of portable embedded systems. Thus, most of the contemporary portable devices feature low-power processors coupled with on-chip memories (e.g., caches, scratchpads). Scratchpads are better than traditional caches in terms of power, performance, area, and predictability. However, unlike caches, they depend upon software allocation techniques for their utilization. In this paper, we present scratchpad overlay techniques which analyze the application and insert instructions to dynamically copy both variables and code segments onto the scratchpad at runtime. We demonstrate that the problem of overlaying the scratchpad is an extension of the Global Register Allocation problem. We present optimal and near-optimal approaches for solving the scratchpad overlay problem. The near-optimal scratchpad overlay approach achieves close to the optimal results and is significantly faster than the optimal approach. Our approaches improve upon the previously known static allocation technique for assigning both variables and code segments onto the scratchpad. The evaluation of the approaches for the ARM7 processor reports average energy and execution time reductions of 26% and 14% over the static approach, respectively. Additional experiments comparing the overlayed scratchpads against unified caches of the same size report average energy and execution time savings of 20% and 10%, respectively. We also report data memory energy reductions of 45%-57% due to the insertion of a 1024-byte scratchpad memory in the memory hierarchy of a digital signal processor (DSP).
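
The static baseline that the overlay approach improves on is essentially a 0/1 knapsack over memory objects, as in the Python sketch below; object sizes, access counts, and per-access energies are invented for illustration (the overlay technique additionally copies objects in and out as the program moves between regions):

```python
E_MAIN, E_SPM = 5.0, 1.0   # nJ per access: main memory vs scratchpad (assumed)
CAPACITY = 1024            # scratchpad bytes

objects = [                # (name, size_bytes, accesses)
    ("buf_fir", 512, 40000), ("coeffs", 128, 40000),
    ("state", 256, 9000), ("lut", 768, 15000),
]

# Classic knapsack DP: maximize energy saved subject to scratchpad capacity.
best = [[0.0] * (CAPACITY + 1) for _ in range(len(objects) + 1)]
for i, (_, size, acc) in enumerate(objects, 1):
    save = acc * (E_MAIN - E_SPM)        # energy saved if object lives in SPM
    for cap in range(CAPACITY + 1):
        best[i][cap] = best[i - 1][cap]
        if size <= cap:
            best[i][cap] = max(best[i][cap], best[i - 1][cap - size] + save)

print("best static allocation saves %.0f nJ" % best[-1][CAPACITY])
```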

Journal ArticleDOI
TL;DR: A new two-stage hardware architecture is proposed that combines the features of both parallel dictionary LZW (PDLZW) and an approximated adaptive Huffman (AH) algorithm, and it is shown that it not only outperforms the AH algorithm while using only one-fourth of the hardware resources but is also competitive with the performance of the LZW algorithm (compress).
Abstract: In this paper, we propose a new two-stage hardware architecture that combines the features of both parallel dictionary LZW (PDLZW) and an approximated adaptive Huffman (AH) algorithm. In this architecture, an ordered list instead of the tree-based structure is used in the AH algorithm to speed up the compression data rate. The resulting architecture not only outperforms the AH algorithm while using only one-fourth of the hardware resources but is also competitive with the performance of the LZW algorithm (compress). In addition, both compression and decompression rates of the proposed architecture are greater than those of the AH algorithm, even in the case where the latter is realized in software.
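
A Python sketch of the ordered-list replacement for the adaptive Huffman tree: emit the symbol's current list index and bubble it forward by swaps as counts grow. The update rule here is an illustrative approximation, and the index-to-codeword mapping is omitted:

```python
def encode(stream, alphabet):
    order = list(alphabet)             # initial symbol order
    counts = {s: 0 for s in alphabet}
    out = []
    for sym in stream:
        idx = order.index(sym)
        out.append(idx)                # a real coder maps idx to a short VLC
        counts[sym] += 1
        # Swap forward while this symbol now outranks its neighbor, keeping
        # the list approximately frequency-sorted with O(1)-style updates.
        while idx > 0 and counts[order[idx]] > counts[order[idx - 1]]:
            order[idx - 1], order[idx] = order[idx], order[idx - 1]
            idx -= 1
    return out

# Frequent symbols migrate to the front, so their indices shrink over time.
print(encode("abracadabra", "abcdr"))
```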

Journal ArticleDOI
TL;DR: LOTTERYBUS is presented, a high-performance SoC communication architecture based on new randomized on-chip communication protocols that addresses the shortcomings mentioned above and provides fine-grained control over bandwidth allocation.
Abstract: On-chip communication architectures play an important role in determining the overall performance of System-on-Chip (SoC) designs. Communication architectures should be flexible so as to offer high performance over a wide range of traffic characteristics. In particular, the resource sharing mechanism of the communication architecture, which determines how the often-conflicting requirements of different components are served, is of utmost importance. Conventional SoC architectures typically employ priority or time-division multiple-access (TDMA)-based communication architectures. However, these techniques are often inadequate. In the former, low-priority components may suffer from starvation, while in the latter, depending on the request profile, high-priority traffic may be subject to large latencies. This paper presents LOTTERYBUS, a high-performance SoC communication architecture based on new randomized on-chip communication protocols that addresses the shortcomings mentioned above. LOTTERYBUS provides each SoC component with a flexible, proportional, and probabilistically guaranteed share of the on-chip communication bandwidth. We present two variants of LOTTERYBUS. In the first variant, its architectural parameters are statically configured, leading to relatively low hardware overhead and design complexity. In the second variant, these parameters are allowed to vary dynamically, enabling more sophisticated use of LOTTERYBUS, at additional hardware cost. We have performed experiments to investigate the performance of LOTTERYBUS across a range of communication traffic characteristics. We have used LOTTERYBUS in designing a 4 × 4 ATM switch subsystem, and have compared its performance with conventional architectures. The results show that LOTTERYBUS provides fine-grained control over bandwidth allocation, and also provides significant reduction in average transaction latencies (up to 85%) compared to conventional architectures. Hardware implementations using a commercial 0.15-µm cell-based library indicate that the advantages provided by LOTTERYBUS are accompanied by modest hardware overheads compared to conventional architectures.
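
Lottery-based arbitration is easy to sketch in Python: each pending master enters a draw weighted by its ticket count, so bandwidth shares are proportional and probabilistic and no requester starves. Ticket values here are illustrative; the dynamic LOTTERYBUS variant would update them at runtime:

```python
import random

TICKETS = {"cpu": 4, "dma": 2, "lcd": 1}   # proportional bandwidth shares

def arbitrate(pending):
    pool = [(m, TICKETS[m]) for m in pending]
    total = sum(t for _, t in pool)
    draw = random.random() * total         # pick a point in the ticket space
    for master, t in pool:                 # walk the ticket ranges
        if draw < t:
            return master
        draw -= t

random.seed(0)
grants = {"cpu": 0, "dma": 0, "lcd": 0}
for _ in range(70000):
    grants[arbitrate(["cpu", "dma", "lcd"])] += 1
print(grants)   # expect grant counts roughly in a 4:2:1 ratio
```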

Journal ArticleDOI
TL;DR: A novel architecture for synchronizing inter-modular communications in GALS, based on locally delayed latching (LDL), is described, which replaces complex global timing constraints with simpler localized ones and supports high data rates.
Abstract: Globally asynchronous, locally synchronous (GALS) systems-on-chip (SoCs) may be prone to synchronization failures if the delay of their locally-generated clock tree is not considered. This paper presents an in-depth analysis of the problem and proposes a novel solution. The problem is analyzed considering the magnitude of clock tree delays, the cycle times of the GALS module, and the complexity of the asynchronous interface controllers using a timed signal transition graph (STG) approach. In some cases, the problem can be solved by extracting all the delays and verifying whether the system is susceptible to metastability. In other cases, when high data bandwidth is not required, matched-delay asynchronous ports may be employed. A novel architecture for synchronizing inter-modular communications in GALS, based on locally delayed latching (LDL), is described. LDL synchronization does not require pausable clocking, is insensitive to clock tree delays, and supports high data rates. It replaces complex global timing constraints with simpler localized ones. Three different LDL ports are presented. The risk of metastability in the synchronizer is analyzed in a technology-independent manner.

Journal ArticleDOI
TL;DR: A new degree computationless modified Euclid (DCME) algorithm and its dedicated architecture for a Reed-Solomon (RS) decoder are proposed, which can completely remove the degree computation and comparison circuits and provide short-latency and low-cost RS decoding.
Abstract: This paper proposes a new degree computationless modified Euclid (DCME) algorithm and its dedicated architecture for Reed-Solomon (RS) decoders. This architecture has low hardware complexity compared with conventional modified Euclid (ME) architectures, since it can completely remove the degree computation and comparison circuits. The architecture, employing a systolic array, requires a latency of only 2t clock cycles to solve the key equation, without initial latency. In addition, the DCME architecture using 3t+2 basic cells has regularity and scalability since it uses only one type of processing element. Hence, the proposed DCME architecture provides short-latency and low-cost RS decoding. The DCME architecture has been synthesized using the 0.25-µm Faraday CMOS standard cell library and operates at 200 MHz. The gate count of the DCME architecture is 21,760. Hence, an RS decoder using the proposed DCME architecture can reduce the total gate count by at least 23% and the total latency by at least 10% compared with conventional ME decoders.

Journal ArticleDOI
TL;DR: A hybrid floating-point scheme with tailored exponent datapath, and a co-optimized architecture between hybrid floating point and block floating point (BFP) to reduce memory requirements for 2-D signal processing are proposed.
Abstract: This paper presents architectures for supporting dynamic data scaling in pipeline fast Fourier transforms (FFTs), suitable when implementing large size FFTs in applications such as digital video broadcasting and digital holographic imaging. In a pipeline FFT, data is continuously streaming and must, hence, be scaled without stalling the dataflow. We propose a hybrid floating-point scheme with tailored exponent datapath, and a co-optimized architecture between hybrid floating point and block floating point (BFP) to reduce memory requirements for 2-D signal processing. The presented co-optimization generates a higher signal-to-quantization-noise ratio and requires less memory than, for instance, convergent BFP. A 2048-point pipeline FFT has been fabricated in a standard CMOS process from AMI Semiconductor (Lenart and Owall, 2003), and a field-programmable gate array prototype integrating a 2-D FFT core in a larger design shows that the architecture is suitable for image reconstruction in digital holographic imaging.
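
A minimal Python sketch of the block-floating-point normalization step of the kind used between FFT stages: one shared exponent per block, chosen from the block peak, keeps mantissas inside the fixed-point range without per-sample exponents. Word lengths are illustrative; the paper couples BFP with a hybrid floating-point scheme and a tailored exponent datapath:

```python
def bfp_normalize(block, mant_bits=10):
    peak = max(abs(x) for x in block)
    exp = 0
    # Grow the shared exponent until the peak fits in the mantissa range.
    while peak >> exp >= (1 << (mant_bits - 1)):
        exp += 1
    mantissas = [x >> exp for x in block]   # one shared right-shift per block
    return mantissas, exp

block = [100, -3000, 45, 7021, -512, 9]    # stage output with growing magnitudes
m, e = bfp_normalize(block)
print("shared exponent %d, mantissas %s" % (e, m))
```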