Showing papers on "Very-large-scale integration" published in 2004


Book
21 May 2004
TL;DR: The authors draw upon extensive industry and classroom experience to introduce today's most advanced and effective chip design practices, present extensively updated coverage of every key element of VLSI design, and illuminate the latest design challenges with 65 nm process examples.
Abstract: For both introductory and advanced courses in VLSI design, this authoritative, comprehensive textbook is highly accessible to beginners, yet offers unparalleled breadth and depth for more experienced readers. The Fourth Edition of CMOS VLSI Design: A Circuits and Systems Perspective presents broad and in-depth coverage of the entire field of modern CMOS VLSI design. The authors draw upon extensive industry and classroom experience to introduce today's most advanced and effective chip design practices. They present extensively updated coverage of every key element of VLSI design, and illuminate the latest design challenges with 65 nm process examples. This book contains unsurpassed circuit-level coverage, as well as a rich set of problems and worked examples that provide deep practical insight to readers at all levels.

2,355 citations


Journal ArticleDOI
T. Karnik1, P. Hazucha1
TL;DR: This paper presents radiation particle interactions with silicon, charge collection effects, soft errors, and their effect on VLSI circuits, and discusses the impact of SEUs on system reliability.
Abstract: Radiation-induced single event upsets (SEUs) pose a major challenge for the design of memories and logic circuits in high-performance microprocessors in technologies beyond 90 nm. Historically, we have considered power-performance-area trade-offs. There is a need to include the soft error rate (SER) as another design parameter. In this paper, we present radiation particle interactions with silicon, charge collection effects, soft errors, and their effect on VLSI circuits. We also discuss the impact of SEUs on system reliability. We describe an accelerated measurement of SERs using a high-intensity neutron beam, the characterization of SERs in sequential logic cells, and technology scaling trends. Finally, some directions for future research are given.

531 citations


Proceedings ArticleDOI
24 May 2004
TL;DR: In this article, the authors discuss the impact of digital control in high-frequency switched-mode power supplies (SMPS), including point-of-load and isolated DC-DC converters, microprocessor power supplies, power factor correction rectifiers, electronic ballasts, etc., where high efficiency, static and dynamic regulation, low size and weight, as well as low controller complexity and cost are very important.
Abstract: In this paper, we discuss the impact of digital control in high-frequency switched-mode power supplies (SMPS), including point-of-load and isolated DC-DC converters, microprocessor power supplies, power-factor-correction rectifiers, electronic ballasts, etc., where switching frequencies are typically in the hundreds of kHz to MHz range, and where high efficiency, static and dynamic regulation, low size and weight, as well as low controller complexity and cost are very important. To meet these application requirements, a digital SMPS controller may include fast, small analog-to-digital converters, hardware-accelerated programmable compensators, programmable digital modulators with very fine time resolution, and a standard microcontroller core to perform programming, monitoring and other system interface tasks. Based on recent advances in circuit and control techniques, together with rapid advances in digital VLSI technology, we conclude that high-performance digital controller solutions are both feasible and practical, leading to much enhanced system integration and performance gains. Examples of experimentally demonstrated results are presented, together with pointers to areas of current and future research and development.

474 citations


Journal ArticleDOI
TL;DR: Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000 e-8 bg560 device in non-feedback modes, which is faster and 79% more efficient in terms of equivalent throughput/slice than the fastest previous FPGA implementation known to date.
Abstract: This paper presents novel high-speed architectures for the hardware implementation of the Advanced Encryption Standard (AES) algorithm. Unlike previous works which rely on look-up tables to implement the SubBytes and InvSubBytes transformations of the AES algorithm, the proposed design employs combinational logic only. As a direct consequence, the unbreakable delay incurred by look-up tables in the conventional approaches is eliminated, and the advantage of subpipelining can be further explored. Furthermore, composite field arithmetic is employed to reduce the area requirements, and different implementations for the inversion in subfield GF(2^4) are compared. In addition, an efficient key expansion architecture suitable for the subpipelined round units is also presented. Using the proposed architecture, a fully subpipelined encryptor with 7 substages in each round unit can achieve a throughput of 21.56 Gbps on a Xilinx XCV1000 e-8 bg560 device in non-feedback modes, which is faster and is 79% more efficient in terms of equivalent throughput/slice than the fastest previous FPGA implementation known to date.

450 citations
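
The abstract above mentions inversion in the subfield GF(2^4) as a building block of a combinational SubBytes. As a rough illustration only, the following Python sketch models GF(2^4) multiplication and inversion in software. The reduction polynomial x^4 + x + 1 and the function names `gf16_mul` and `gf16_inv` are assumptions made for this example; the abstract does not state which polynomial or which inversion circuit the authors chose, and this is not their hardware design.

```python
# Software model of GF(2^4) arithmetic (illustrative sketch, not the paper's circuit).
# ASSUMPTION: the field is built with the irreducible polynomial x^4 + x + 1.
IRREDUCIBLE = 0b10011  # x^4 + x + 1


def gf16_mul(a: int, b: int) -> int:
    """Multiply two GF(2^4) elements, each encoded as a 4-bit integer."""
    product = 0
    for _ in range(4):
        if b & 1:            # add (XOR) the current multiple of a
            product ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if a & 0b10000:      # degree reached 4: reduce modulo IRREDUCIBLE
            a ^= IRREDUCIBLE
    return product


def gf16_inv(a: int) -> int:
    """Invert a nonzero element as a^14 (since a^15 = 1); 0 maps to 0."""
    a2 = gf16_mul(a, a)
    a4 = gf16_mul(a2, a2)
    a8 = gf16_mul(a4, a4)
    return gf16_mul(gf16_mul(a8, a4), a2)  # a^(8 + 4 + 2)


if __name__ == "__main__":
    for element in range(1, 16):
        assert gf16_mul(element, gf16_inv(element)) == 1
    print("all 15 nonzero elements invert correctly")
```

Exponentiation by squaring is only one of several ways to invert in such a small field; a direct truth-table or a minimized Boolean network are alternatives, which is presumably the kind of trade-off the paper compares.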


Book
18 Oct 2004
TL;DR: This book presents VLSI Architectures for Discrete Wavelet Transforms and Coding Algorithms in JPEG 2000, a guide to data compression techniques used in the development of JPEG 2000.
Abstract: Preface. 1 Introduction to Data Compression. 2 Source Coding Algorithms. 3 JPEG-Still Image Compression Standard. 4 Introduction to Discrete Wavelet Transform. 5 VLSI Architectures for Discrete Wavelet Transforms. 6 JPEG 2000 Standard. 7 Coding Algorithms in JPEG 2000. 8 Code Stream Organization and File Format. 9 VLSI Architectures for JPEG 2000. 10 Beyond Part 1 of JPEG 2000. Index. About the Authors.

347 citations


Book
01 Jan 2004
TL;DR: This book introduces the essentials of VLSI: fabrication, circuits, interconnects, combinational and sequential logic design, system architectures, and more, and demonstrates how to reflect this VLSI knowledge in a state-of-the-art design methodology that leverages the FPGA's most valuable characteristics while mitigating its limitations.
Abstract: Everything FPGA designers need to know about FPGAs and VLSI. Digital designs once built in custom silicon are increasingly implemented in field programmable gate arrays (FPGAs). Effective FPGA system design requires a strong understanding of VLSI issues and constraints, and an understanding of the latest FPGA-specific techniques. In this book, Princeton University's Wayne Wolf covers everything FPGA designers need to know about all these topics: both the "how" and the "why." Wolf begins by introducing the essentials of VLSI: fabrication, circuits, interconnects, combinational and sequential logic design, system architectures, and more. Next, he demonstrates how to reflect this VLSI knowledge in a state-of-the-art design methodology that leverages the FPGA's most valuable characteristics while mitigating its limitations. Coverage includes: how VLSI characteristics affect FPGAs and FPGA-based logic design; how classical logic design techniques relate to FPGA-based logic design; understanding FPGA fabrics, the basic programmable structures of FPGAs; specifying and optimizing logic to address size, speed, and power consumption; Verilog, VHDL, and software tools for optimizing logic and designs; the structure of large digital systems, including register-transfer design methodology; building large-scale platform and multi-FPGA systems; and a start-to-finish DSP case study addressing a wide range of design problems. Prentice Hall Professional Technical Reference, Upper Saddle River, NJ 07458, www.phptr.com. ISBN: 0-13-142461-0

248 citations


Journal ArticleDOI
TL;DR: The ACE16k, as mentioned in this paper, is a member of the third generation of the ACE chips. It is designed in a 0.35-µm standard CMOS technology and exhibits peak computing figures of 330 GOPS, 3.6 GOPS/mm^2 and 82.5 GOPS/W.
Abstract: Today, with 0.18-µm technologies mature and stable enough for mixed-signal design, with a large variety of CMOS compatible optical sensors available, and with 0.09-µm technologies knocking at the door of designers, we can face the design of integrated systems, instead of just integrated circuits. In fact, significant progress has been made in the last few years toward the realization of vision systems on chips (VSoCs). Such VSoCs are eventually targeted to integrate within a semiconductor substrate the functions of optical sensing, image processing in space and time, high-level processing, and the control of actuators. The consecutive generations of ACE chips define a roadmap toward flexible VSoCs. These chips consist of arrays of mixed-signal processing elements (PEs) which operate in accordance with single instruction multiple data (SIMD) computing architectures and exhibit the functional features of CNN Universal Machines. They have been conceived to cover the early stages of the visual processing path in a fully-parallel manner, and hence more efficiently than DSP-based systems. Across the different generations, different improvements and modifications have been made looking to converge with the newest discoveries of neurobiologists regarding the behavior of natural retinas. This paper presents considerations pertaining to the design of a member of the third generation of ACE chips, namely to the so-called ACE16k chip. This chip, designed in a 0.35-µm standard CMOS technology, contains about 3.75 million transistors and exhibits peak computing figures of 330 GOPS, 3.6 GOPS/mm^2 and 82.5 GOPS/W. Each PE in the array contains a reconfigurable computing kernel capable of calculating linear convolutions on 3×3 neighborhoods in less than 1.5 µs, imagewise Boolean combinations in less than 200 ns, imagewise arithmetic operations in about 5 µs, and CNN-like temporal evolutions with a time constant of about 0.5 µs. Unfortunately, the many ideas underlying the design of this chip cannot be covered in a single paper; hence, this paper is focused on, first, placing the ACE16k in the ACE chip roadmap and, then, discussing the most significant modifications of ACE16k versus its predecessors in the family.

230 citations


Proceedings ArticleDOI
07 Nov 2004
TL;DR: This work uses an array of dynamic PLAs which require only metal and via mask customization in order to implement a new design, and demonstrates that this approach strikes a reasonable compromise between ASIC and field programmable design methodologies in terms of placed-and-routed area and delay.
Abstract: In recent times there has been a substantial increase in the cost and complexity of fabricating a VLSI chip. The lithography masks themselves can cost between /spl epsi/ and /spl ges/. It is conjectured that due to these increasing costs, the number of ASIC starts in the last few years has declined. We address this problem by using an array of dynamic PLAs which require only metal and via mask customization in order to implement a new design. This would allow several similar-sized designs to share the same base set of masks (right up to the metal layers) and only have different metal and via masks. We have implemented our methodology for both combinational and sequential designs, and demonstrate that our approach strikes a reasonable compromise between ASIC and field programmable design methodologies in terms of placed-and-routed area and delay. Our method has a 2.89× (3.58×) delay overhead and a 4.96× (3.44×) area overhead compared to standard cells for combinational (sequential) designs.

187 citations


Book
01 May 2004
TL;DR: This monograph details cutting-edge design techniques for the low power circuitry required by the many new miniaturized business and consumer products driving the electronics market.
Abstract: Designers developing the low voltage, low power chips that enable small, portable devices face a very particular set of challenges. This monograph details cutting-edge design techniques for the low power circuitry required by the many new miniaturized business and consumer products driving the electronics market. Table of contents: Chapter 1: Low-Power CMOS VLSI Design. Chapter 2: Circuit Techniques for Low-Power Design. Chapter 3: Low-Voltage Low-Power Adders. Chapter 4: Low-Voltage Low-Power Multipliers. Chapter 5: Low-Voltage Low-Power Read-Only Memories. Chapter 6: Low-Voltage Low-Power Static Random-Access Memories. Chapter 7: Low-Voltage Low-Power Dynamic Random-Access. Chapter 8: Large Low-Power VLSI System Design and Applications. Index.

175 citations


Proceedings ArticleDOI
Shekhar Borkar1
04 Dec 2004
TL;DR: Potential solutions in process technology, circuits, and microarchitectures to exploit future gigascale integration capacity are discussed and the system on a chip (SOC) concept will help integrate diverse functional blocks, providing valued performance.
Abstract: VLSI system performance increased by five orders of magnitude in the last three decades, made possible by continued technology scaling, improving transistor performance to increase frequency, increasing integration capacity to realize complex architectures, and reducing energy consumed per logic operation to keep power dissipation within limit. The technology treadmill will continue, providing integration capacity of billions of transistors; however, power, energy consumption, and variations will be the barriers. Performance at any cost will not be an option in the future; VLSI systems will have to emphasize performance delivered in a given power envelope, with complexity limited by energy efficiency and variability. This talk will discuss potential solutions in process technology, circuits, and microarchitectures to exploit future gigascale integration capacity. The system on a chip (SOC) concept will help integrate diverse functional blocks, providing valued performance. The talk will conclude with recommendations to the VLSI system designers and microarchitects on how to exploit these emerging paradigms.

155 citations


Proceedings ArticleDOI
23 May 2004
TL;DR: A heuristic algorithm is proposed to construct good irregular LDPC codes subject to two constraints that ensure the effective LDPC encoder and decoder hardware implementations.
Abstract: This paper presents a design approach for low-density parity-check (LDPC) coding system hardware implementation by jointly conceiving irregular LDPC code construction and VLSI implementations of encoder and decoder. The key idea is to construct good irregular LDPC codes subject to two constraints that ensure the effective LDPC encoder and decoder hardware implementations. We propose a heuristic algorithm to construct such implementation-aware irregular LDPC codes that can achieve very good error correction performance. The encoder and decoder hardware architectures are correspondingly presented.

Proceedings ArticleDOI
27 Jun 2004
TL;DR: This paper uses quadratic programming to optimize the total wire-length of the placement and a deterministic recursive partition rectangle packing algorithm based on LFF principles with consideration of congestion to implement the placement in an estimated fixed die area.
Abstract: In VLSI module placement, interconnection behavior becomes increasingly important. The less flexibility first (LFF) principle is derived from accumulated human experience. An interconnection-driven VLSI module placement algorithm based on LFF principles is proposed in this paper. We first use quadratic programming to optimize the total wire-length of the placement and then use a deterministic recursive partition rectangle packing algorithm based on LFF principles, with consideration of congestion, to implement the placement in an estimated fixed die area. Experimental results show the efficiency and effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: This brief shows that a conventional semi-custom design-flow based on a positive feedback adiabatic logic (PFAL) cell library allows any VLSI designer to design and verify complex adiabatic systems quickly and easily, thus enjoying the energy reduction benefits of adiabatic logic.
Abstract: This brief shows that a conventional semi-custom design-flow based on a positive feedback adiabatic logic (PFAL) cell library allows any VLSI designer to design and verify complex adiabatic systems (e.g., arithmetic units) quickly and easily, thus enjoying the energy reduction benefits of adiabatic logic. A family of semi-custom PFAL carry lookahead adders and parallel multipliers was designed in a 0.6-µm CMOS technology and verified. Post-layout simulations show that semi-custom adiabatic arithmetic units can save energy by a factor of 17 at 10 MHz and about 7 at 100 MHz, as compared to a logically equivalent static CMOS implementation. The energy saving obtained is also better when compared to other custom adiabatic circuit realizations and maintains high values (3 to 6) even when the losses in power-clock generation are considered.

Journal ArticleDOI
TL;DR: A new scalable single-chip communication architecture for heterogeneous resources, adaptive system-on-a-chip (aSOC) and supporting software for application mapping that exhibits hardware simplicity and optimized support for compile-time scheduled communication is described.
Abstract: A dramatic increase in single chip capacity has led to a revolution in on-chip integration. Design reuse and ease of implementation have become important aspects of the design process. This paper describes a new scalable single-chip communication architecture for heterogeneous resources, adaptive system-on-a-chip (aSOC), and supporting software for application mapping. This architecture exhibits hardware simplicity and optimized support for compile-time scheduled communication. To illustrate the benefits of the architecture, four high-bandwidth signal processing applications including an MPEG-2 video encoder and a Doppler radar processor have been mapped to a prototype aSOC device using our design mapping technology. Through experimentation it is shown that aSOC communication outperforms a hierarchical bus-based system-on-chip (SoC) approach by up to a factor of five. A VLSI implementation of the communication architecture indicates clock rates of 400 MHz in 0.18-µm technology for sustained on-chip communication. In comparison to previously-published results for an MPEG-2 decoder, our on-chip interconnect shows a runtime improvement of over a factor of four.

Journal ArticleDOI
Louise H. Trevillyan1, David S. Kung1, Ruchir Puri1, Lakshmi Reddy1, Michael A. Kazda1 
TL;DR: With larger chip images and increasingly aggressive technologies, key design processes must interoperate. PDS accomplishes technology closure through interacting processes of logic optimization, placement, timing, clock insertion, and routing, all using a common infrastructure with robust variable-accuracy analysis abstractions.
Abstract: With larger chip images and increasingly aggressive technologies, key design processes must interoperate. PDS, a physical-synthesis system, accomplishes technology closure through interacting processes of logic optimization, placement, timing, clock insertion, and routing, all using a common infrastructure with robust variable-accuracy analysis abstractions.

Book
01 Aug 2004
TL;DR: A vital tool for professional engineers, as well as graduate students of engineering, the text explains the design issues, guidelines, and CAD tools for the power distribution of the VLSI chip and package, and provides numerous examples for its effective application.
Abstract: A hands-on troubleshooting guide for VLSI network designers. The primary goal in VLSI (very large scale integration) power network design is to provide enough power lines across a chip to reduce voltage drops from the power pads to the center of the chip. Voltage drops caused by the power network's metal lines coupled with transistor switching currents on the chip cause power supply noises that can affect circuit timing and performance, thus providing a constant challenge for designers of high-performance chips. Power Distribution Network Design for VLSI provides detailed information on this critical component of circuit design and physical integration for high-speed chips. A vital tool for professional engineers (especially those involved in the use of commercial tools), as well as graduate students of engineering, the text explains the design issues, guidelines, and CAD tools for the power distribution of the VLSI chip and package, and provides numerous examples for its effective application. Features of the text include: an introduction to power distribution network design; design perspectives, such as power network planning, layout specifications, decoupling capacitance insertion, modeling, and analysis; electromigration phenomena; IR drop analysis methodology; commands and user interfaces of the VoltageStorm(TM) CAD tool; microprocessor design examples using on-chip power distribution; flip-chip and package design issues; and power network measurement techniques from real silicon. The author includes several case studies and a glossary of key words and basic terms to help readers understand and integrate basic concepts in VLSI design and power distribution.

Proceedings ArticleDOI
Zhan Guo1, P. Nilsson1
16 Aug 2004
TL;DR: The lattice decoder is shown to approach the performance of the maximum-likelihood decoder for MIMO wireless systems with low complexity. Compared to a conventional VLSI implementation, the proposed K-best Schnorr-Euchner architecture results in up to 37% computation reductions, 20% area savings and more than 5 times decoding throughput improvements.
Abstract: The lattice decoder is shown to approach the performance of the maximum-likelihood decoder for MIMO wireless systems with low complexity. A VLSI architecture of the K-best Schnorr-Euchner lattice decoder is proposed in this paper. The architecture is optimized on both algorithm and architecture levels, and supports a dynamic range of SNR ≤ 30 dB. Compared to a conventional VLSI implementation of the lattice decoder for MIMO systems, the proposed architecture results in up to 37% computation reductions, 20% area savings and more than 5 times decoding throughput improvements. The proposed architecture is implemented with 0.35 µm technology for a system of 4 transmit/receive antennas and 16-QAM modulation. The results show that a decoding throughput of 53.3 Mbits/s can be achieved, and the decoding latency is less than 2.5 µs.

Proceedings ArticleDOI
Peter Feldmann1, Frank Liu2
07 Nov 2004
TL;DR: The new RecMOR algorithm decomposes the large matrix-transfer function recursively, and applies SVDMOR compression adaptively to the sub-blocks of the transfer function, resulting in a reduced order model that is sparse, efficient, and directly usable as an efficient substitute of the subcircuit in circuit simulations.
Abstract: In the process of designing state-of-the-art VLSI circuits we often encounter large but highly structured linear subcircuits with a large number of terminals. Classical examples are power supply networks, clock distribution networks, large data buses, etc. Various applications would benefit from efficient high level models of such networks. Unfortunately the existing model-order-reduction algorithms are not adapted to handle more than a few tens of terminals. This talk introduces RecMOR, an algorithm for the computation of reduced order models of structured linear circuits with numerous I/O ports. The algorithm exploits certain regularities of the subcircuit response that are typical in numerous applications of interest. When these regularities are present, the normally dense matrix-transfer function of the subcircuit contains sub-blocks that in some sense are significantly low rank and can be compactly modeled by the recently introduced SVDMOR algorithm. The new RecMOR algorithm decomposes the large matrix-transfer function recursively, and applies SVDMOR compression adaptively to the sub-blocks of the transfer function. The result is a reduced order model that is sparse, efficient, and directly usable as an efficient substitute of the subcircuit in circuit simulations. The method is illustrated on several circuit examples.

Journal ArticleDOI
TL;DR: In comparison with a two-layer network implementing the same filters, this network results in a more symmetric circuit design with lower quiescent power dissipation, albeit at the expense of twice as many transistors.
Abstract: This paper describes the electronic implementation of a four-layer cellular neural network architecture implementing two components of a functional model of neurons in the visual cortex: linear orientation selective filtering and half wave rectification. Separate ON and OFF layers represent the positive and negative outputs of two-phase quadrature Gabor-type filters, whose orientation and spatial-frequency tunings are electronically adjustable. To enable the construction of a multichip network to extract different orientations in parallel, the chip includes an address event representation (AER) transceiver that accepts and produces two-dimensional images that are rate encoded as spike trains. It also includes routing circuitry that facilitates point-to-point signal fan in and fan out. We present measured results from a 32×64 pixel prototype, which was fabricated in the TSMC 0.25-µm process on a 3.84 by 2.54 mm die. Quiescent power dissipation is 3 mW and is determined primarily by the spike activity on the AER bus. Settling times are on the order of a few milliseconds. In comparison with a two-layer network implementing the same filters, this network results in a more symmetric circuit design with lower quiescent power dissipation, albeit at the expense of twice as many transistors.

Proceedings ArticleDOI
11 Oct 2004
TL;DR: A statistical sizing approach that takes into account randomness in gate delays by formulating a robust linear program that can be solved efficiently in VLSI circuits is presented.
Abstract: In this paper, we approach the gate sizing problem in VLSI circuits in the context of increasing variability of process and circuit parameters as technology scales into the nanometer regime. We present a statistical sizing approach that takes into account randomness in gate delays by formulating a robust linear program that can be solved efficiently. We demonstrate the efficiency and computational tractability of the proposed algorithm on the various ISCAS'85 benchmark circuits. Across the benchmarks, compared to the deterministic approach, the power savings range from 23% to 30% for the same timing target and yield level, the average power saving being 28%. The runtime is reasonable, ranging from a few seconds to around 10 minutes, and grows linearly.

Proceedings ArticleDOI
27 Jan 2004
TL;DR: This paper first formulates a k-cofamily-based register binding algorithm targeting the multiplexer optimization problem, then further reduces the multiplexer width through an efficient port assignment algorithm, and achieves significantly better results consistently.
Abstract: Data path connection elements, such as multiplexers, consume a significant amount of area on a VLSI chip, especially for FPGA designs. Multiplexer optimization is a difficult problem because both register binding and port assignment to reduce total multiplexer connectivity during high-level synthesis are NP-complete problems. In this paper, we first formulate a k-cofamily-based register binding algorithm targeting the multiplexer optimization problem. We then further reduce the multiplexer width through an efficient port assignment algorithm. Experimental results show that we are 44% better overall than the left-edge register binding algorithm on the total usage of multiplexer inputs and 7% better than a bipartite graph-based algorithm. For large designs, we are able to achieve significantly better results consistently. After technology mapping, placement and routing for an FPGA architecture, it shows considerably positive impacts on chip area, delay and power consumption.
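
The abstract above uses the left-edge register binding algorithm as its baseline. For orientation, here is a minimal Python sketch of that baseline idea in a first-fit form: variables are visited in order of their birth time and each is placed in the first register that is already free. The input format (half-open lifetime intervals in control steps) and the name `left_edge_binding` are assumptions made for this example. It does not implement the paper's k-cofamily formulation, and it ignores multiplexer cost, which is exactly the weakness the paper targets.

```python
from typing import Dict, List, Tuple


def left_edge_binding(lifetimes: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Bind variables to registers with a first-fit left-edge heuristic.

    lifetimes maps a variable name to a half-open interval [birth, death)
    measured in control steps (hypothetical input format).
    Returns a mapping from variable name to register index.
    """
    binding: Dict[str, int] = {}
    register_free_at: List[int] = []  # step at which each register becomes free

    # Visit variables by increasing birth time (the "left edge" of the interval).
    ordered = sorted(lifetimes.items(), key=lambda kv: (kv[1][0], kv[1][1], kv[0]))
    for name, (birth, death) in ordered:
        for reg, free_at in enumerate(register_free_at):
            if free_at <= birth:              # register is idle: reuse it
                register_free_at[reg] = death
                binding[name] = reg
                break
        else:                                 # no idle register: allocate a new one
            register_free_at.append(death)
            binding[name] = len(register_free_at) - 1
    return binding


if __name__ == "__main__":
    example = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
    print(left_edge_binding(example))
```

In the example, at most two lifetimes overlap at any control step, so two registers suffice; which source feeds which register, and hence how wide the input multiplexers become, is not considered by this baseline.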

Journal ArticleDOI
TL;DR: This work proposes to make system-level interconnects more robust using encoding that simultaneously addresses error-correction requirements and crosstalk noise avoidance, and gives algorithms for obtaining optimal encodings and a practical class of codes called boundary-shift codes.
Abstract: Aggressive process scaling and increasing clock rates have made crosstalk noise an important issue in VLSI design. Switching on long, adjacent bus wires can lead to timing violations and logic faults. At the same time, system-level interconnects have also become more susceptible to other less predictable forms of interference such as noise induced by power grid fluctuations, electromagnetic interference, and alpha-particle radiation. Previous work has treated these systematic and nonsystematic forms of noise separately. We propose to make system-level interconnects more robust using encoding that simultaneously addresses error-correction requirements and crosstalk noise avoidance. This is more efficient than satisfying these requirements separately. We give algorithms for obtaining optimal encodings and present a practical class of codes called boundary-shift codes. We evaluate the overhead of our method, and make comparisons to using error-correction with simple shielding.

Journal ArticleDOI
TL;DR: This work presents an efficient VLSI architecture for the implementation of 1D lifting discrete wavelet transform that folds the computations of all resolution levels into the same low-pass and high-pass units to achieve higher hardware utilization.
Abstract: The lifting scheme has recently been developed as a flexible tool suitable for constructing biorthogonal wavelets. We present an efficient VLSI architecture for the implementation of 1D lifting discrete wavelet transform. The architecture folds the computations of all resolution levels into the same low-pass and high-pass units to achieve higher hardware utilization. Because of its modular, regular, and flexible structure, the design is scalable for different resolution levels. In addition, its area is independent of the length of the 1D input sequence and its latency is independent of the number of resolution levels. Since the architecture has a similar topology to a scan chain, we can modify it easily to become a testable scan-based design by adding very few hardware resources. For the computations of N-sample 1D k-level analysis (5, 3) lifting wavelet transform, the design takes N+1 clock cycles, and requires two multipliers, four adders, and (3 + 225 × 2^k) registers. In the simulation, it works with a clock period of 10 ns and achieves a processing rate of about 100 × 10^6 samples/sec for k-level lifting wavelet transform.
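
To make the (5, 3) lifting transform named in the abstract above more concrete, the following Python sketch computes one analysis level with a predict step (high-pass) followed by an update step (low-pass), plus the matching inverse. It uses the integer-rounded form with mirrored boundaries, which is an assumption chosen for illustration; the abstract does not say which rounding or boundary handling the architecture uses, and the sketch says nothing about the folded hardware itself. The function names are hypothetical.

```python
from typing import List, Tuple


def lifting53_forward(x: List[int]) -> Tuple[List[int], List[int]]:
    """One level of the integer (5, 3) lifting transform on an even-length signal.

    Returns (s, d): the low-pass and high-pass coefficient sequences.
    """
    if len(x) < 2 or len(x) % 2:
        raise ValueError("expected an even-length signal with at least 2 samples")
    even, odd = x[0::2], x[1::2]
    half = len(even)
    # Predict: d[n] = odd[n] - floor((even[n] + even[n+1]) / 2),
    # mirroring the last even sample at the right boundary.
    d = [odd[n] - ((even[n] + even[min(n + 1, half - 1)]) >> 1) for n in range(half)]
    # Update: s[n] = even[n] + floor((d[n-1] + d[n] + 2) / 4),
    # mirroring the first detail coefficient at the left boundary.
    s = [even[n] + ((d[max(n - 1, 0)] + d[n] + 2) >> 2) for n in range(half)]
    return s, d


def lifting53_inverse(s: List[int], d: List[int]) -> List[int]:
    """Undo lifting53_forward by running the lifting steps in reverse order."""
    half = len(s)
    even = [s[n] - ((d[max(n - 1, 0)] + d[n] + 2) >> 2) for n in range(half)]
    odd = [d[n] + ((even[n] + even[min(n + 1, half - 1)]) >> 1) for n in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5, 6, 7, 8]
    low, high = lifting53_forward(signal)
    assert lifting53_inverse(low, high) == signal
    print(low, high)
```

A k-level analysis would reapply `lifting53_forward` to the low-pass output `s` of the previous level; because each lifting step is undone exactly by its mirror image, reconstruction is lossless for integer inputs.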

Journal ArticleDOI
TL;DR: This paper proposes a regular distributed register (RDR) microarchitecture, which offers high regularity and direct support of multicycle on-chip communication and demonstrates promising results on a number of real-life examples.
Abstract: For multigigahertz designs in nanometer technologies, data transfers on global interconnects take multiple clock cycles. In this paper, we propose a regular distributed register (RDR) microarchitecture, which offers high regularity and direct support of multicycle on-chip communication. The RDR microarchitecture divides the entire chip into an array of islands so that all local computation and communication within an island can be performed in a single clock cycle. Each island contains a cluster of computational elements, local registers, and a local controller. On top of the RDR microarchitecture, novel layout-driven architectural synthesis algorithms have been developed for multicycle communication, including scheduling-driven placement, placement-driven simultaneous scheduling with rebinding, and distributed control generation, etc. The experimentation on a number of real-life examples demonstrates promising results. For data flow intensive examples, we obtain a 44% improvement on average in terms of the clock period and a 37% improvement on average in terms of the final latency, over the traditional flow. For designs with control flow, our approach achieves a 28% clock-period reduction and a 23% latency reduction on average.

Journal Article
TL;DR: The paper covers techniques to cope with ever-increasing leakage power as well as dynamic power of CMOS VLSIs and touches on new applications and markets which will be opened up by low-power VLSIs.
Abstract: The paper covers techniques to cope with ever-increasing leakage power as well as dynamic power of CMOS VLSIs. The techniques presented range from the software, system, and circuit levels to the device level. The novel trend is to look into cooperative approaches between disciplines, such as software-circuit cooperation and circuit-technology cooperation. The biggest challenge that System-on-a-Chip designers will face in the future is that transistors in digital and memory circuits become more and more leaky with each technology generation. Topics aimed at breaking through this stringent problem are described. Approaches to lowering power at the system level are also discussed. The paper touches on new applications and markets which will be opened up by low-power VLSIs. Key words: digital, memory, application, low power, VLSI, leakage

Proceedings ArticleDOI
07 Jun 2004
TL;DR: The results demonstrate that the heuristic is a practical method of reducing partitioning run time while providing a result that is close to the optimal for a given circuit.
Abstract: This paper presents the Quantum-Dot Cellular Automata (QCA) physical design problem, in the context of the VLSI physical design problem. The problem is divided into three subproblems: partitioning, placement, and routing of QCA circuits. This paper presents an ILP formulation and heuristic solution to the partitioning problem, and compares the two sets of results. Additionally, we compare a human-generated circuit to the ILP and Heuristic solutions. The results demonstrate that the heuristic is a practical method of reducing partitioning run time while providing a result that is close to the optimal for a given circuit.


Proceedings ArticleDOI
19 Jul 2004
TL;DR: The proposed RNS image coding scheme is based on the modified CRT and its associated residue-to-binary conversion and moduli selection methods and is more efficient than the scheme by Ammar et al. (2001) in terms of VLSI implementation.
Abstract: In this paper, we carry out a study on the application of the RNS (residue number system) in digital image processing and propose an RNS image coding scheme that offers high-speed and low-power VLSI implementation for secure image processing. The proposed scheme is more efficient than the RNS image coding scheme of Ammar et al. (2001) in that the proposed method encrypts the entire image and does not require any additional component other than a standard RNS system. Further, the proposed scheme is based on the modified CRT and its associated residue-to-binary conversion and moduli selection methods and is more efficient than the scheme by Ammar et al. (2001) in terms of VLSI implementation. The design of an encoder and decoder pair for greyscale images is carried out using MATLAB and some VLSI tools. The preliminary results of the MATLAB simulation demonstrate the security capability of the proposed image coding scheme.
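
As background for the residue-to-binary conversion mentioned in this abstract, the following minimal sketch shows binary-to-residue encoding and reconstruction with the standard Chinese remainder theorem (CRT). It is not the paper's modified CRT or its moduli selection method. The moduli set (7, 8, 9) and all function names are assumptions, chosen only because the moduli are pairwise coprime and their product, 504, covers 8-bit greyscale values.

```python
from math import gcd, prod

# Assumed example moduli: pairwise coprime, product 504 >= 256.
MODULI = (7, 8, 9)


def to_residues(value, moduli=MODULI):
    """Encode an integer as one residue per modulus."""
    return [value % m for m in moduli]


def from_residues(residues, moduli=MODULI):
    """Decode with the standard CRT (not the paper's modified CRT)."""
    for i, a in enumerate(moduli):
        for b in moduli[i + 1:]:
            if gcd(a, b) != 1:
                raise ValueError("moduli must be pairwise coprime")
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        # pow(m_i, -1, m) is the modular inverse of m_i modulo m (Python 3.8+).
        total += r * m_i * pow(m_i, -1, m)
    return total % big_m


def add_residues(a, b, moduli=MODULI):
    """Add two RNS numbers channel by channel, with no carries between channels."""
    return [(x + y) % m for x, y, m in zip(a, b, moduli)]
```

For example, the pixel value 200 encodes to the residues [4, 0, 2], and `from_residues([4, 0, 2])` recovers 200. Each residue channel is processed independently, which is the property usually cited as the reason RNS arithmetic suits high-speed, low-power hardware.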

Proceedings ArticleDOI
26 Oct 2004
TL;DR: It is shown that LEOSLC can be used to effectively debug, diagnose, and localize defects in a broken scan chain.
Abstract: Scan chain diagnostics have become more important than ever due to the increasing complexity of VLSI designs, as more and more scan latches/flip-flops are utilized in designs, especially in microprocessors. At the same time, the off-state leakage current of CMOS technology grows exponentially from one generation to the next one. This fact imposes a big challenge on the chip design, packaging, cooling, etc. However, innovative applications, based on the detection of light emission due to off-state leakage current (LEOSLC) have been developed for testing and diagnosing modern VLSI circuits. We show that LEOSLC can be used to effectively debug, diagnose, and localize defects in a broken scan chain.

Journal ArticleDOI
TL;DR: A new VLSI architecture, namely the transpose free row column decomposition method (TF-RCDM), for 2-D DCT/IDCT is proposed, and the proposed architecture has achieved the smallest word-length among the reported 2-D DCT architectures.
Abstract: This paper first reviews two-dimensional discrete cosine transform (2-D DCT) and inverse DCT (IDCT) architectures. Then a new VLSI architecture, namely the transpose free row column decomposition method (TF-RCDM), for 2-D DCT/IDCT is proposed. The new RCDM architecture replaces the transpose circuits with permutation networks and parallel memory modules. As a result, the timing overhead of I/O operations is eliminated and the hardware complexity is largely reduced. An accuracy testing system is designed to find the optimum word-length parameters. Based on the accuracy testing system, the proposed architecture has achieved the smallest word-length among the reported 2-D DCT architectures. Synthesis results showed that with a 0.25-μm CMOS technology library, the area was about 1.5 mm^2 and the speed was about 125 MHz.
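
The row-column decomposition that this abstract builds on can be sketched in a few lines. The block below shows the conventional method, with an explicit transpose between two passes of a 1-D DCT, and checks it against the direct 2-D definition. It assumes an orthonormal DCT-II and a square block, and the function names are hypothetical. It does not implement the paper's transpose-free variant, which replaces the transpose stage with permutation networks and parallel memory modules.

```python
import math


def dct_1d(v):
    """Orthonormal 1-D DCT-II of a sequence (assumed normalization)."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def dct_2d_row_column(block):
    """Conventional RCDM: 1-D DCT on rows, transpose, 1-D DCT again."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)


def dct_2d_direct(block):
    """Direct 2-D DCT-II of a square block, used only as a reference."""
    n = len(block)

    def c(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    return [[c(u) * c(v) * sum(
                block[i][j]
                * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                * math.cos(math.pi * (2 * j + 1) * v / (2 * n))
                for i in range(n) for j in range(n))
             for v in range(n)]
            for u in range(n)]
```

The two functions agree to within floating-point error. The decomposition needs two passes of N one-dimensional transforms instead of a double sum per coefficient, at the cost of the intermediate transpose that TF-RCDM is designed to avoid.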