Showing papers on "Very-large-scale integration" published in 2006


Proceedings ArticleDOI
28 Jun 2006
TL;DR: It is demonstrated that the introduction of a second parallel network can increase performance while improving efficiency, and different strategies for distributing traffic over the subnetworks are evaluated.
Abstract: We develop detailed area and energy models for on-chip interconnection networks and describe tradeoffs in the design of efficient networks for tiled chip multiprocessors. Using these detailed models we investigate how aspects of the network architecture including topology, channel width, routing strategy, and buffer size affect performance and impact area and energy efficiency. We simulate the performance of a variety of on-chip networks designed for tiled chip multiprocessors implemented in an advanced VLSI process and compare area and energy efficiencies estimated from our models. We demonstrate that the introduction of a second parallel network can increase performance while improving efficiency, and evaluate different strategies for distributing traffic over the subnetworks. Drawing on insights from our analysis, we present a concentrated mesh topology with replicated subnetworks and express channels which provides a 24% improvement in area efficiency and a 48% improvement in energy efficiency over other networks evaluated in this study.

547 citations


Journal ArticleDOI
25 Sep 2006
TL;DR: This paper presents a brief discussion of key sources of power dissipation and their temperature relation in CMOS VLSI circuits, and of techniques for full-chip temperature calculation, with special attention to their implications for the design of high-performance, low-power VLSI circuits.
Abstract: The growing packing density and power consumption of very large scale integration (VLSI) circuits have made thermal effects one of the most important concerns of VLSI designers. The increasing variability of key process parameters in nanometer CMOS technologies has resulted in a larger impact of the substrate and metal line temperatures on the reliability and performance of the devices and interconnections. Recent data shows that more than 50% of all integrated circuit failures are related to thermal issues. This paper presents a brief discussion of key sources of power dissipation and their temperature relation in CMOS VLSI circuits, and techniques for full-chip temperature calculation with special attention to its implications on the design of high-performance, low-power VLSI circuits. The paper is concluded with an overview of techniques to improve the full-chip thermal integrity by means of off-chip versus on-chip and static versus adaptive methods.

420 citations


Proceedings ArticleDOI
30 Oct 2006
TL;DR: An area-efficient mixed-signal implementation of synapse-based long term plasticity in a VLSI model of a spiking neural network is described; it achieves a synapse density of more than 9k synapses per mm² in a 180 nm technology.
Abstract: This paper describes an area-efficient mixed-signal implementation of synapse-based long term plasticity realized in a VLSI model of a spiking neural network. The artificial synapses are based on an implementation of spike time dependent plasticity (STDP). In the biological specimen, STDP is a mechanism acting locally in each synapse. The presented electronic implementation succeeds in maintaining this high level of parallelism and simultaneously achieves a synapse density of more than 9k synapses per mm² in a 180 nm technology. This allows the construction of neural micro-circuits close to the biological specimen while maintaining a speed several orders of magnitude faster than biological real time. The large acceleration factor enhances the possibilities to investigate key aspects of plasticity, e.g. by performing extensive parameter searches.

282 citations


Journal ArticleDOI
10 Jul 2006
TL;DR: This paper focuses on the reuse and integration issues encountered in this paradigm shift in system-on-chip (SoC) design, which includes connecting the computational units to the communication medium, which is moving from ad hoc bus-based approaches toward structured network-on-chip (NoC) architectures.
Abstract: Over the past ten years, as integrated circuits became increasingly more complex and expensive, the industry began to embrace new design and reuse methodologies that are collectively referred to as system-on-chip (SoC) design. In this paper, we focus on the reuse and integration issues encountered in this paradigm shift. The reusable components, called intellectual property (IP) blocks or cores, are typically synthesizable register-transfer level (RTL) designs (often called soft cores) or layout level designs (often called hard cores). The concept of reuse can be carried out at the block, platform, or chip levels, and involves making the IP sufficiently general, configurable, or programmable, for use in a wide range of applications. The IP integration issues include connecting the computational units to the communication medium, which is moving from ad hoc bus-based approaches toward structured network-on-chip (NoC) architectures. Design-for-test methodologies are also described, along with verification issues that must be addressed when integrating reusable components.

252 citations


Proceedings ArticleDOI
06 Mar 2006
TL;DR: This work develops the first systematic droplet routing method that can be integrated with biochip synthesis, which minimizes the number of cells used for droplet routing, while satisfying constraints imposed by throughput considerations and fluidic properties.
Abstract: Recent advances in microfluidics are expected to lead to sensor systems for high-throughput biochemical analysis. CAD tools are needed to handle increased design complexity for such systems. Analogous to classical VLSI synthesis, a top-down design automation approach can shorten the design cycle and reduce human effort. We focus here on the droplet routing problem, which is a key issue in biochip physical design automation. We develop the first systematic droplet routing method that can be integrated with biochip synthesis. The proposed approach minimizes the number of cells used for droplet routing, while satisfying constraints imposed by throughput considerations and fluidic properties. A real-life biochemical application is used to evaluate the proposed method.

228 citations


Proceedings ArticleDOI
24 Jul 2006
TL;DR: This work is the first attempt to study the performance benefits of 3D technology under the influence of thermal constraints, and it is shown that the 3D system registers large performance improvement for memory intensive applications.
Abstract: Three-dimensional (3-D) integrated circuits have emerged as promising candidates to overcome the interconnect bottlenecks of nanometer scale designs. While they offer several other advantages, it is expected that the benefits from this technology can potentially be offset by thermal considerations which impact chip performance and reliability. The work presented in this paper is the first attempt to study the performance benefits of 3-D technology under the influence of such thermal constraints. Using a processor-cache-memory system and carefully chosen applications encompassing different memory behaviors, the performance of 3-D architecture is compared with a conventional planar (2-D) design. It is found that the substantial increase in memory bus frequency and bus width contributes to a significant reduction in execution time with a 3-D design. It is also found that increasing the clock frequency translates into larger gains in system performance with 3-D designs than for planar 2-D designs in memory intensive applications. The thermal profile of the vertically stacked chip is generated taking into account the highly temperature sensitive leakage power dissipation. The maximum allowed operating frequency imposed by temperature constraint is shown to be lower for 3-D than for 2-D designs. In spite of these constraints, it is shown that the 3-D system registers large performance improvement for memory intensive applications.

215 citations


Proceedings ArticleDOI
01 Oct 2006
TL;DR: A high-throughput and low-power ECC scheme for MLC NAND flash memories that features byte-wise processing and a low complexity key equation solver using a simplified Berlekamp-Massey algorithm is presented.
Abstract: As reliability is a critical issue for new-generation multi-level cell (MLC) flash memories, there is a growing call for fast and compact error correction code (ECC) circuits with minimum impact on memory access time and chip area. This paper presents a high-throughput and low-power ECC scheme for MLC NAND flash memories. The BCH encoder and decoder architecture features byte-wise processing and a low complexity key equation solver using a simplified Berlekamp-Massey algorithm. Resource sharing and power reduction techniques are also applied. Synthesized using 0.25-µm CMOS technology at a supply voltage of 2.5 V, the proposed BCH (4148,4096) encoder/decoder achieves byte-wise processing, and it needs an estimated cell area of 0.2 mm² and an average power of 3.18 mW with 50 MB/s throughput.

188 citations


Proceedings ArticleDOI
M. Wenk, Martin Zellweger, Andreas Burg, Norbert Felber, Wolfgang Fichtner
21 May 2006
TL;DR: In this paper, a parallel implementation of the K-best algorithm for MIMO systems is presented, which achieves up to 424 Mbps throughput with an area that is almost on par with current state-of-the-art implementations.
Abstract: From an error rate performance perspective, maximum likelihood (ML) detection is the preferred detection method for multiple-input multiple-output (MIMO) communication systems. However, for high transmission rates a straightforward exhaustive search implementation suffers from prohibitive complexity. The K-best algorithm provides close-to-ML bit error rate (BER) performance, while its circuit complexity is reduced compared to an exhaustive search. In this paper, a new VLSI architecture for the implementation of the K-best algorithm is presented. Instead of the mostly sequential processing that has been applied in previous VLSI implementations of the algorithm, the presented solution takes a more parallel approach. Furthermore, the application of a simplified norm is discussed. The implementation in an ASIC achieves up to 424 Mbps throughput with an area that is almost on par with current state-of-the-art implementations.

166 citations


Journal ArticleDOI
TL;DR: Measurement-based experimental results have demonstrated that the secure digital design flow is a functional technique to thwart side-channel power analysis, and successfully protects a prototype Advanced Encryption Standard (AES) IC fabricated in a 0.18-µm CMOS process.
Abstract: Small embedded integrated circuits (ICs) such as smart cards are vulnerable to the so-called side-channel attacks (SCAs). The attacker can gain information by monitoring the power consumption, execution time, electromagnetic radiation, and other information leaked by the switching behavior of digital complementary metal-oxide-semiconductor (CMOS) gates. This paper presents a digital very large scale integrated (VLSI) design flow to create secure power-analysis-attack-resistant ICs. The design flow starts from a normal design in a hardware description language such as very-high-speed integrated circuit (VHSIC) hardware description language (VHDL) or Verilog and provides a direct path to an SCA-resistant layout. Instead of a full custom layout or an iterative design process with extensive simulations, a few key modifications are incorporated in a regular synchronous CMOS standard cell design flow. The basis for power analysis attack resistance is discussed. This paper describes how to adjust the library databases such that the regular single-ended static CMOS standard cells implement a dynamic and differential logic style and such that 20,000+ differential nets can be routed in parallel. This paper also explains how to modify the constraints and rules files for the synthesis, place, and differential route procedures. Measurement-based experimental results have demonstrated that the secure digital design flow is a functional technique to thwart side-channel power analysis. It successfully protects a prototype Advanced Encryption Standard (AES) IC fabricated in a 0.18-µm CMOS process.

159 citations


Journal ArticleDOI
TL;DR: This is the first demonstration of simultaneous nongalvanic power and data transfer between chips in a stack, aimed at reducing costs and complexity that are associated with galvanic inter-chip vias in 3-D integration.
Abstract: We report on inter-chip bidirectional communication and power transfer between two stacked chips. The experimental prototype system components were fabricated in a 0.5-µm silicon-on-sapphire CMOS technology. Bidirectional communication between the two chips is experimentally measured at 1 Hz to 15 MHz. The circuits on the floating top chip are powered with capacitively coupled energy using a charge pump. This is the first demonstration of simultaneous nongalvanic power and data transfer between chips in a stack. The potential use in 3-D VLSI is aimed at reducing costs and complexity that are associated with galvanic inter-chip vias in 3-D integration.

115 citations


Journal ArticleDOI
TL;DR: The design of a single-chip VLSI analog computer fabricated in a 0.25-µm CMOS process is described; it has been used to simulate ordinary differential equations (ODEs), partial differential equations, and stochastic differential equations with moderate accuracy, significantly faster than a modern workstation.
Abstract: The design of a single-chip VLSI analog computer fabricated in a 0.25-µm CMOS process is described. It contains 80 integrators, 336 other linear and nonlinear analog functional blocks, switches for their interconnection, and circuitry to enable the system's programming and control. The IC is controlled, programmed and measured by a PC via a data acquisition card. This arrangement has been used to simulate ordinary differential equations (ODEs), partial differential equations, and stochastic differential equations with moderate accuracy, significantly faster than a modern workstation. Techniques for using the digital computer to refine the solution from the analog computer are presented. Solutions from the analog computer have been used to accelerate a digital computer's solution of the periodic steady state of an ODE by more than 10×. The IC occupies 1 cm² and consumes 300 mW. An analysis has been done showing that the analog computer dissipates 0.02% to 1% of the energy of a general purpose digital microprocessor and about 2% to 20% of the energy of a digital signal processor, when solving the same differential equation.

Journal ArticleDOI
TL;DR: The architecture and VLSI circuit implementation of a BiCMOS potentiostat bank for monitoring neurotransmitter concentration on a screen-printed carbon electrode array is presented, and chronoamperometry dopamine concentration measurement results are given.
Abstract: We present the architecture and VLSI circuit implementation of a BiCMOS potentiostat bank for monitoring neurotransmitter concentration on a screen-printed carbon electrode array. The potentiostat performs simultaneous acquisition of bidirectional reduction-oxidation currents proportional to neurotransmitter concentration on 16 independent channels at controlled redox potentials. Programmable current gain control yields over 100-dB cross-scale dynamic range with 46-pA input-referred rms noise over 12-kHz bandwidth. The cutoff frequency of a second-order log-domain anti-aliasing filter ranges from 50 Hz to 400 kHz. Track-and-hold current integration is triggered at the sampling rate between dc and 200 kHz. A 2.25 mm × 2.25 mm prototype was fabricated in a 1.2-µm VLSI technology and dissipates 12.5 mW. Chronoamperometry dopamine concentration measurement results are given. Other types of neurotransmitters can be selected by adjusting the redox potential on the electrodes and the surface properties of the sensor coating.

Book ChapterDOI
03 Oct 2006
TL;DR: This paper presents a hardware implementation of a neural network using FPGAs; the digital system architecture is described in Very High Speed Integrated Circuits Hardware Description Language (VHDL) and implemented on an FPGA chip.
Abstract: The use of FPGAs (Field Programmable Gate Arrays) for neural network implementation provides flexibility in programmable systems. For a neural-network-based instrument prototype in a real-time application, conventional specific VLSI neural chip design is limited by time and cost. With low-precision artificial neural network design, FPGAs offer higher speed and smaller size for real-time applications than the VLSI design. In addition, artificial neural networks based on FPGAs have been applied to classification with fair success. The programmability of reconfigurable FPGAs makes fast special-purpose hardware available for a wide range of applications. This programmability could also make it possible to explore new neural network algorithms and problems of a scale that would not be feasible with a conventional processor. The goal of this work is to realize a hardware implementation of a neural network using FPGAs. The digital system architecture is described in Very High Speed Integrated Circuits Hardware Description Language (VHDL) and implemented on an FPGA chip. The design was tested on an FPGA demo board.

Proceedings ArticleDOI
05 Nov 2006
TL;DR: Novel circuit optimization techniques to mitigate soft error rates (SER) of combinational logic circuits are presented, including a gate sizing algorithm that trades off SER reduction and area overhead and an enhanced flipflop library that contains flipflops of varying temporal masking ability.
Abstract: Soft errors in logic are emerging as a significant reliability problem for VLSI designs. This paper presents novel circuit optimization techniques to mitigate soft error rates (SER) of combinational logic circuits. First, we propose a gate sizing algorithm that trades off SER reduction and area overhead. This approach first computes bounds on the maximum achievable SER reduction by resizing a gate. This bound is then used to prune the circuit graph, arriving at a smaller set of candidate gates on which we perform incremental sensitivity computations to determine the gates that are the largest contributors to circuit SER. Second, we propose a flipflop selection method that uses slack information at each primary output node to determine the flipflop configuration that produces maximum SER savings. This approach uses an enhanced flipflop library that contains flipflops of varying temporal masking ability. Third, we propose a unified, co-optimization approach combining flipflop selection with the gate sizing algorithm. The joint optimization algorithm produces larger SER reductions while incurring smaller circuit overhead than either technique taken in isolation. Experimental results on a variety of benchmarks show SER reductions of 7.9× with gate sizing, 6.6× with flipflop assignment, and 28.2× for the combined optimization approach, with no delay penalties and area overheads within 5-6%. The runtimes for the optimization algorithms are on the order of 1-3 minutes.

Patent
Xiaoping Tang, Xin Yuan
11 Apr 2006
TL;DR: A system and method for legalizing a flat or hierarchical VLSI layout to meet multiple grid constraints and conventional ground rules is presented. The system and method support multiple grid pitch constraints for hierarchical design and maintain LVS correctness while producing an on-grid solution, possibly with some spacing violations.
Abstract: A system and method are disclosed for legalizing a flat or hierarchical VLSI layout to meet multiple grid constraints and conventional ground rules. Given a set of ground rules with multiple grid constraints and a VLSI layout (either hierarchical or flat) which is layout-versus-schematic (LVS) correct but may not be ground rule correct, the system and method provide a legalized layout which meets the multiple grid constraints while maintaining LVS correctness and fixing the ground rule errors as much as possible with minimum layout perturbation from the input design. The system and method support multiple grid pitch constraints for hierarchical design, and provide for LVS correctness to be maintained while producing an on-grid solution, possibly with some spacing violations.

Journal ArticleDOI
TL;DR: It is shown that the delay sensitivity to supply variations will increase in the next technology nodes, thus, it is expected that controlling the supply variation will be an increasingly important issue in the design of the next generation VLSI circuits.
Abstract: In this paper, some of the most practically interesting full adder topologies are analyzed in terms of their delay dependence on the supply voltage fluctuations, which are a major contribution to the delay uncertainty, which in turn limits the speed performance of current VLSI circuits. Analytical models of the delay sensitivity with respect to supply variations are derived by following a simplified circuit analysis, and the resulting expressions are simple enough to afford a deeper insight into the impact of supply voltage variations on each topology. The models are shown to be sufficiently accurate through simulations with CMOS technologies having a minimum feature size ranging from 90 nm to 0.35 µm. Several interesting properties and design considerations are derived from these models, and the effect of the supply voltage scaling, technology scaling, transistor sizing, and input transition time is discussed. Strategies to evaluate the delay sensitivity from the early design phases (e.g., from ring oscillator measurements) are also introduced. As a fundamental result, it is shown that the delay sensitivity to supply variations will increase in the next technology nodes; thus, it is expected that controlling the supply variations will be an increasingly important issue in the design of the next generation VLSI circuits. The proposed methodology is also analyzed in the case of more general digital circuits, and is used to estimate the impact of the inter-die threshold voltage variations on the delay of the considered full adder topologies.

Journal ArticleDOI
TL;DR: A pipelined single-precision floating-point multiply-accumulator (FPMAC) featuring a single-cycle accumulate loop using base 32 and internal carry-save arithmetic with delayed addition is described, allowing removal of the costly normalization step from the critical accumulate loop.
Abstract: A pipelined single-precision floating-point multiply-accumulator (FPMAC) featuring a single-cycle accumulate loop using base 32 and internal carry-save arithmetic with delayed addition is described. A combination of algorithmic, logic, and circuit techniques enables multiply-accumulate operations at speeds exceeding 3 GHz with single-cycle throughput. The optimizations allow removal of the costly normalization step from the critical accumulate loop. This logic is conditionally powered down using dynamic sleep transistors on long accumulate operations, saving active and leakage power. In addition, an improved leading-zero anticipator (LZA) and overflow prediction logic applicable to carry-save format is presented. In a 90-nm seven-metal dual-VT CMOS process, the 2 mm² custom design contains 230K transistors. The fully functional first silicon achieves 6.2 GFlops of performance while dissipating 1.2 W at 3.1 GHz, 1.3-V supply.
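
As a back-of-the-envelope reading of the figures quoted in this abstract (6.2 GFlops, 1.2 W, 3.1 GHz), the following sketch derives the operations per clock cycle and the energy per operation. The function names are illustrative and not from the paper, and the calculation assumes that the quoted power and throughput refer to the same operating point.

```python
def flops_per_cycle(gflops: float, clock_ghz: float) -> float:
    """Floating-point operations completed per clock cycle."""
    return gflops / clock_ghz


def energy_per_flop_pj(power_w: float, gflops: float) -> float:
    """Average energy per floating-point operation, in picojoules."""
    joules_per_flop = power_w / (gflops * 1e9)
    return joules_per_flop * 1e12


if __name__ == "__main__":
    # Figures taken from the abstract: 6.2 GFlops, 1.2 W, 3.1 GHz.
    print(flops_per_cycle(6.2, 3.1))
    print(round(energy_per_flop_pj(1.2, 6.2), 1))
```

Under these assumptions the design completes two floating-point operations per cycle, which is consistent with one multiply and one add per single-cycle accumulate, at roughly 194 pJ per operation.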

Journal ArticleDOI
TL;DR: Results and case studies are presented with optimizations that are: on the gate level, Kasumi and International Data Encryption Algorithm encryptions; on the arithmetic level, redundant addition and multiplication function evaluation for two-dimensional rotation; and on the architecture level, Wavelet and Lempel-Ziv (LZ)-like compression.
Abstract: A stream compiler (ASC) for computing with field programmable gate arrays (FPGAs) emerges from the ambition to bridge the hardware-design productivity gap where the number of available transistors grows more rapidly than the productivity of very large scale integration (VLSI) and FPGA computer-aided-design (CAD) tools. ASC addresses this problem with a software-like programming interface to hardware design (FPGAs) while keeping the performance of hand-designed circuits at the same time. ASC improves productivity by letting the programmer optimize the implementation on the algorithm level, the architecture level, the arithmetic level, and the gate level, all within the same C++ program. The increased productivity of ASC is applied to the hardware acceleration of a wide range of applications. Traditionally, hardware accelerators are tediously handcrafted to achieve top performance. ASC simplifies design-space exploration of hardware accelerators by transforming the hardware-design task into a software-design process, using only "GNU compiler collection (GCC)" and "make" to obtain a hardware netlist. From experience, the hardware-design productivity and ease of use are close to pure software development. This paper presents results and case studies with optimizations that are: 1) on the gate level: Kasumi and International Data Encryption Algorithm (IDEA) encryptions; 2) on the arithmetic level: redundant addition and multiplication function evaluation for two-dimensional (2-D) rotation; and 3) on the architecture level: Wavelet and Lempel-Ziv (LZ)-like compression.

Journal ArticleDOI
TL;DR: In this paper, a scalable architecture for real-time speech recognizers based on word hidden Markov models (HMMs) is described, which provides high recognition accuracy for word recognition tasks.
Abstract: This paper describes a scalable architecture for real-time speech recognizers based on word hidden Markov models (HMMs) that provide high recognition accuracy for word recognition tasks. However, the size of their recognition vocabulary is small because their extremely high computational costs cause long processing times. To achieve high-speed operations, we developed a VLSI system that has a scalable architecture. The architecture effectively uses parallel computations on the word HMM structure. It can reduce processing time and/or extend the word vocabulary. To explore the practicality of our architecture, we designed and evaluated a complete system recognizer, including speech analysis and noise robustness parts, on a 0.18-µm CMOS standard cell library and field-programmable gate array. In the CMOS standard-cell implementation, the total processing time is 56.9 µs/word at an operating frequency of 80 MHz in a single system. The recognizer gives a real-time response using an 800-word vocabulary.

Journal ArticleDOI
TL;DR: A new two-stage hardware architecture is proposed that combines the features of the parallel dictionary LZW (PDLZW) and approximated adaptive Huffman (AH) algorithms; it outperforms the AH algorithm at the cost of only one-fourth the hardware resources and is competitive with the performance of the LZW algorithm (compress).
Abstract: In this paper, we propose a new two-stage hardware architecture that combines the features of both parallel dictionary LZW (PDLZW) and approximated adaptive Huffman (AH) algorithms. In this architecture, an ordered list instead of the tree-based structure is used in the AH algorithm for speeding up the compression data rate. The resulting architecture not only outperforms the AH algorithm at the cost of only one-fourth the hardware resources but is also competitive with the performance of the LZW algorithm (compress). In addition, both compression and decompression rates of the proposed architecture are greater than those of the AH algorithm even in the case realized by software.

Proceedings ArticleDOI
01 Oct 2006
TL;DR: An efficient transistor-level sizing algorithm based on a modified Lagrangian Relaxation (LR) technique to account for the temporal degradation of circuit and guarantee lifetime reliability of circuit under NBTI is proposed.
Abstract: Temporal performance degradation in VLSI circuits due to Negative Bias Temperature Instability (NBTI) has emerged as a challenging design issue in nano-scale technology. In this paper, we analyze the impact of NBTI degradation in circuit performance in terms of timing, and show that under worst case scenario, one can expect more than a 10% degradation in the maximum circuit delay after 3 years (~10^8 seconds) operation time. Based on this observation, we propose an efficient transistor-level sizing algorithm based on a modified Lagrangian Relaxation (LR) technique to account for the temporal degradation of circuit and guarantee lifetime reliability of circuit under NBTI. The technique reformulates the sizing problem by considering the fact that only the rising (0 → 1) delays of CMOS logic gates are affected by the NBTI. Experimental results on several ISCAS'85 benchmarks have shown that our proposed transistor-level sizing approach can reduce the area overhead of conventional cell-level sizing method by an average of 43%.

Proceedings ArticleDOI
04 Oct 2006
TL;DR: Construction of soft error masking latches (SEM-latches) capable of masking transient pulses occurring on combinational circuits and experimental results show that the proposed method has higher soft error tolerant capability than the existing methods.
Abstract: In recent high-density and low-power VLSIs, soft errors occurring not only on memory systems and the latches of logic circuits but also on the combinational parts of logic circuits seriously affect the operation of systems. The conventional soft error tolerant methods for soft errors on the combinational parts do not provide sufficiently high soft error tolerance with a small performance penalty. This paper proposes a class of soft error masking circuits by using a Schmitt trigger circuit and pass transistors. The paper also presents construction of soft error masking latches (SEM-Latches) capable of masking transient pulses occurring on combinational circuits. Moreover, experimental results show that the proposed method has higher soft error tolerant capability than the existing methods. For driving voltage VDD=3.3V, the proposed method is capable of masking transient pulses of magnitude 4.0V or less.

Proceedings ArticleDOI
25 Apr 2006
TL;DR: This paper proposes a highly parallel FPGA design for the Floyd-Warshall algorithm to solve the all-pairs shortest-paths problem in a directed graph to maximize parallelism in the presence of significant data dependences.
Abstract: With rapid advances in VLSI technology, field programmable gate arrays (FPGAs) are receiving the attention of the parallel and high performance computing community. In this paper, we propose a highly parallel FPGA design for the Floyd-Warshall algorithm to solve the all-pairs shortest-paths problem in a directed graph. Our work is motivated by a computationally intensive bio-informatics application that employs this algorithm. The design we propose makes efficient and maximal utilization of the large amount of resources available on an FPGA to maximize parallelism in the presence of significant data dependences. Experimental results from a working FPGA implementation on the Cray XD1 show a speedup of 22 over execution on the XD1's processor.
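
For readers unfamiliar with the algorithm named in this abstract, the following is a minimal sequential sketch of the textbook Floyd-Warshall recurrence in Python. It is not the paper's FPGA design; the function name and the adjacency-matrix input format are assumptions made for illustration. The comments mark where the data dependences mentioned in the abstract arise: each pass over `k` uses the distances produced by the previous pass.

```python
from math import inf


def floyd_warshall(weights):
    """Return all-pairs shortest-path distances for a directed graph.

    weights[i][j] is the weight of edge i -> j, or inf if there is no edge.
    The input matrix is not modified.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    for i in range(n):
        dist[i][i] = min(dist[i][i], 0)

    # Pass k allows vertex k as an intermediate hop. It depends on the
    # results of pass k - 1, so the outer loop is inherently sequential.
    for k in range(n):
        # For a fixed k, the (i, j) updates are independent of each other,
        # which is the part a parallel implementation can exploit.
        for i in range(n):
            d_ik = dist[i][k]
            if d_ik == inf:
                continue
            for j in range(n):
                candidate = d_ik + dist[k][j]
                if candidate < dist[i][j]:
                    dist[i][j] = candidate
    return dist


if __name__ == "__main__":
    graph = [
        [0, 3, inf, 7],
        [8, 0, 2, inf],
        [5, inf, 0, 1],
        [2, inf, inf, 0],
    ]
    for row in floyd_warshall(graph):
        print(row)
```

The triple loop performs on the order of n^3 updates for n vertices, which is why the algorithm is a natural target for hardware acceleration.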


Proceedings ArticleDOI
06 Mar 2006
TL;DR: This paper presents a novel parametric waveform model based on the Weibull function to represent particle strikes at individual nodes in the circuit and describes the construction of the SET descriptor that efficiently captures the correlation between the transient waveforms and their associated rate distribution functions.
Abstract: Soft errors have emerged as an important reliability challenge for nanoscale VLSI designs. In this paper, we present a fast and efficient soft error rate (SER) computation algorithm for combinational circuits. We first present a novel parametric waveform model based on the Weibull function to represent particle strikes at individual nodes in the circuit. We then describe the construction of the SET descriptor that efficiently captures the correlation between the transient waveforms and their associated rate distribution functions. The proposed algorithm consists of operations to inject, propagate and merge SET descriptors while traversing forward along the gates in a circuit. The parameterized waveforms enable an efficient static approach to calculate the SER of a circuit. We exercise the proposed approach on a wide variety of combinational circuits and observe that our algorithm has linear runtime with the size of the circuit. The runtimes for soft error estimation were observed to be in the order of about one second, compared to several minutes or even hours for previously proposed methods.

Journal ArticleDOI
TL;DR: A new degree computationless modified Euclid (DCME) algorithm and its dedicated architecture for Reed-Solomon (RS) decoder, which can completely remove the degree computation and comparison circuits and provide the short latency and low-cost RS decoding.
Abstract: This paper proposes a new degree computationless modified Euclid (DCME) algorithm and its dedicated architecture for Reed-Solomon (RS) decoder. This architecture has low hardware complexity compared with conventional modified Euclid (ME) architectures, since it can completely remove the degree computation and comparison circuits. The architecture employing a systolic array requires only the latency of 2t clock cycles to solve the key equation without initial latency. In addition, the DCME architecture using 3t+2 basic cells has regularity and scalability since it uses only one processing element. Hence, the proposed DCME architecture provides the short latency and low-cost RS decoding. The DCME architecture has been synthesized using the 0.25-μm Faraday CMOS standard cell library and operates at 200 MHz. The gate count of the DCME architecture is 21 760. Hence, the RS decoder using the proposed DCME architecture can reduce the total gate count by at least 23% and the total latency to at least 10% compared with conventional ME decoders.

Patent
26 Apr 2006
TL;DR: In this paper, a method of routing a random logic macro (RLM) that is used multiple times in a hierarchical VLSI design without having to route each individual instantiation independently is presented.
Abstract: A method of routing a random logic macro (RLM) that is used multiple times in a hierarchical VLSI design without having to route each individual instantiation independently. Once an RLM has been routed and timed it can be copied and reused in a physical design as is, and does not require any wiring changes. This method is an advantage over existing art because it conserves area, improves wireability, and reduces the time required for routing and timing each RLM instance. Furthermore, each RLM possesses the same timing and power characteristics, which improves overall circuit performance.

Journal ArticleDOI
TL;DR: A new VLSI architecture that can insert invisible or visible watermarks in images in the discrete cosine transform domain is presented that incorporates low-power techniques such as dual voltage, dual frequency, and clock gating to reduce the power consumption and exploits pipelining and parallelism extensively in order to achieve high performance.
Abstract: In this brief, we present a new VLSI architecture that can insert invisible or visible watermarks in images in the discrete cosine transform domain. The proposed architecture incorporates low-power techniques such as dual voltage, dual frequency, and clock gating to reduce the power consumption and exploits pipelining and parallelism extensively in order to achieve high performance. The supply voltage level and the operating frequency are chosen for each module so as to maintain the required bandwidth and throughput match among the different modules. A prototype VLSI chip was designed and verified using various Cadence and Synopsys tools based on TSMC 0.25-μm technology with 1.4 M transistors and 0.3 mW of estimated dynamic power.

Journal ArticleDOI
TL;DR: An application-independent defect tolerant design flow to minimize customized postfabrication design efforts to be performed per chip and two mapping algorithms, recursive and greedy, which make the connection between defect-unaware design steps and the final defect-aware mapping step are presented.
Abstract: Self-assembled nanofabrication processes yield regular and reconfigurable devices. However, defect densities in this emerging nanotechnology are higher than those in conventional lithography-based VLSI. In this article, we present an application-independent defect tolerant design flow to minimize customized postfabrication design efforts to be performed per chip. In this flow, higher level design steps are not needed to be aware of the existence and the location of defects in the chip. Only a final mapping step is required to be defect aware. Application independence of this flow minimizes the number of per-chip design steps, making it appropriate for high volume production. We also present two mapping algorithms, recursive and greedy, which make the connection between defect-unaware design steps and the final defect-aware mapping step. Experiments show that the results obtained by the greedy algorithm are very close to the exact solutions. Using these algorithms, we analyze the manufacturing yield of molecular crossbars under different defect distribution models. We report on the size of the minimum crossbar to be fabricated such that a defect-free crossbar of the desirable size can be found with a guaranteed manufacturing yield.

Proceedings ArticleDOI
21 May 2006
TL;DR: A high performance VLSI architecture of FME is described to achieve the capacity of encoding the high-resolution real-time video stream for HDTV to provide the processing capacity of more than 250K MB/sec which is enough for 1080HD (1920×1088) video streams at frame rate of 30fps.
Abstract: Fractional motion estimation (FME) on sub-pixels accounts for over 45% of the computational complexity of the H.264 encoding process. Therefore, a high performance VLSI architecture for FME is described in this paper to achieve the capacity of encoding high-resolution real-time video streams for HDTV. Our design improves on an existing work by introducing a pipeline strategy in the sub-pixel interpolation unit, which avoids the long delay paths in the 6-tap 1-D FIR filter and so increases the clock frequency up to 200 MHz. Moreover, a 16-pixel search engine is adopted to remove the redundant interpolation area and parallelize the various block size searches, which saves more than half of the clock cycles in processing a macroblock. Our design is implemented with only 189K gates at an operating frequency of 200 MHz in the worst case (285 MHz in the typical case). It can provide a processing capacity of more than 250K MB/sec, which is enough for 1080HD (1920×1088) video streams at a frame rate of 30 fps. It is a useful intellectual property (IP) design for multimedia systems.
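
The throughput claim in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the standard H.264 macroblock size of 16×16 pixels and reads "MB/sec" as macroblocks per second; the function names are hypothetical and only the figures (1920×1088, 30 fps, 200 MHz, 250K MB/sec) come from the abstract.

```python
def macroblocks_per_second(width, height, fps, mb_size=16):
    """Macroblocks the encoder must process per second.

    Assumes width and height are multiples of mb_size, which holds
    for 1920x1088 with 16x16 macroblocks.
    """
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return mbs_per_frame * fps


def cycle_budget_per_macroblock(clock_hz, mb_rate):
    """Clock cycles available per macroblock at a given macroblock rate."""
    return clock_hz / mb_rate


# 1080HD at 30 fps: 120 x 68 = 8160 macroblocks per frame.
required = macroblocks_per_second(1920, 1088, 30)
print(required)  # 244800, below the stated capacity of 250K MB/sec

# Sustaining 250K MB/sec at the worst-case 200 MHz clock leaves
# 800 cycles per macroblock.
print(cycle_budget_per_macroblock(200_000_000, 250_000))  # 800.0
```

Under these assumptions the required rate of 244,800 macroblocks per second sits just under the stated capacity, which is consistent with the abstract's claim that the design suffices for 1080HD at 30 fps.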