Showing papers on "Very-large-scale integration" published in 2009


Journal ArticleDOI
TL;DR: This paper evaluates 3D NoC architectures and shows that they outperform traditional 2D implementations in throughput, latency, energy dissipation and wiring area overhead.
Abstract: The Network-on-Chip (NoC) paradigm has emerged as a revolutionary methodology for integrating a very high number of intellectual property (IP) blocks in a single die. The achievable performance benefit arising out of adopting NoCs is constrained by the performance limitation imposed by the metal wire, which is the physical realization of communication channels. With technology scaling, relying on material innovation alone will extend the lifetime of conventional interconnect systems by only a few technology generations. According to the International Technology Roadmap for Semiconductors (ITRS), new interconnect paradigms are needed for the longer term. The conventional two-dimensional (2D) integrated circuit (IC) has limited floor-planning choices, which limits the performance enhancements arising out of NoC architectures. Three-dimensional (3D) ICs are capable of achieving better performance, functionality, and packaging density compared to more traditional planar ICs. On the other hand, NoC is an enabling solution for integrating large numbers of embedded cores in a single die. 3D NoC architectures combine the benefits of these two new domains to offer an unprecedented performance gain. In this paper we evaluate the performance of 3D NoC architectures and demonstrate their superior functionality in terms of throughput, latency, energy dissipation and wiring area overhead compared to traditional 2D implementations.

474 citations


Journal ArticleDOI
TL;DR: Experimental data are presented that demonstrate how the VLSI neural network can learn to classify patterns of neural activities, even when they are highly correlated.
Abstract: Real-time classification of patterns of spike trains is a difficult computational problem that both natural and artificial networks of spiking neurons are confronted with. The solution to this problem not only could contribute to understanding the fundamental mechanisms of computation used in the biological brain, but could also lead to efficient hardware implementations of a wide range of applications ranging from autonomous sensory-motor systems to brain-machine interfaces. Here we demonstrate real-time classification of complex patterns of mean firing rates, using a VLSI network of spiking neurons and dynamic synapses which implement a robust spike-driven plasticity mechanism. The learning rule implemented is a supervised one: a teacher signal provides the output neuron with an extra input spike-train during training, in parallel to the spike-trains that represent the input pattern. The teacher signal simply indicates if the neuron should respond to the input pattern with a high rate or with a low one. The learning mechanism modifies the synaptic weights only as long as the current generated by all the stimulated plastic synapses does not match the output desired by the teacher, as in the perceptron learning rule. We describe the implementation of this learning mechanism and present experimental data that demonstrate how the VLSI neural network can learn to classify patterns of neural activities, also in the case in which they are highly correlated.
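
The abstract compares its learning mechanism to the perceptron rule, where weights change only while the output disagrees with the teacher. A minimal rate-based sketch of that classic rule follows. It is not the chip's spike-driven circuit, and the function names, learning rate, and toy patterns are hypothetical choices made for illustration.

```python
def classify(weights, bias, pattern):
    """Return 1 (high rate) or 0 (low rate) for one input pattern."""
    activation = sum(w * x for w, x in zip(weights, pattern)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(patterns, targets, epochs=100, lr=0.1):
    """Classic perceptron rule: weights move only when output and teacher disagree."""
    weights = [0.0] * len(patterns[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for pattern, target in zip(patterns, targets):
            if classify(weights, bias, pattern) != target:
                # Teacher wants a high rate: potentiate; a low rate: depress.
                sign = 1.0 if target == 1 else -1.0
                weights = [w + lr * sign * x for w, x in zip(weights, pattern)]
                bias += lr * sign
                mistakes += 1
        if mistakes == 0:
            break  # learning stops once every output matches the teacher
    return weights, bias


# Hypothetical, strongly overlapping mean-rate patterns (arbitrary units).
PATTERNS = [(1.0, 0.9, 0.1), (0.9, 1.0, 0.2), (0.9, 0.1, 1.0), (1.0, 0.2, 0.9)]
TARGETS = [1, 1, 0, 0]
```

Calling `train_perceptron(PATTERNS, TARGETS)` returns weights that separate the two classes even though every pattern shares a strongly active first input.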

228 citations


Book
11 Mar 2009
TL;DR: EDA/VLSI practitioners and researchers in need of fluency in an "adjacent" field will find this an invaluable reference to the basic EDA concepts, principles, data structures, algorithms, and architectures for the design, verification, and test of VLSI circuits.
Abstract: This book provides broad and comprehensive coverage of the entire EDA flow. EDA/VLSI practitioners and researchers in need of fluency in an "adjacent" field will find this an invaluable reference to the basic EDA concepts, principles, data structures, algorithms, and architectures for the design, verification, and test of VLSI circuits. Anyone who needs to learn the concepts, principles, data structures, algorithms, and architectures of the EDA flow will benefit from this book. Covers complete spectrum of the EDA flow, from ESL design modeling to logic/test synthesis, verification, physical design, and test - helps EDA newcomers to get "up-and-running" quickly. Includes comprehensive coverage of EDA concepts, principles, data structures, algorithms, and architectures - helps all readers improve their VLSI design competence. Contains latest advancements not yet available in other books, including test compression, ESL design modeling, large-scale floorplanning, placement, routing, synthesis of clock and power/ground networks - helps readers to design/develop testable chips or products. Includes industry best-practices wherever appropriate in most chapters - helps readers avoid costly mistakes. Table of Contents: Chapter 1: Introduction; Chapter 2: Fundamentals of CMOS Design; Chapter 3: Design for Testability; Chapter 4: Fundamentals of Algorithms; Chapter 5: Electronic System-Level Design and High-Level Synthesis; Chapter 6: Logic Synthesis in a Nutshell; Chapter 7: Test Synthesis; Chapter 8: Logic and Circuit Simulation; Chapter 9: Functional Verification; Chapter 10: Floorplanning; Chapter 11: Placement; Chapter 12: Global and Detailed Routing; Chapter 13: Synthesis of Clock and Power/Ground Networks; Chapter 14: Fault Simulation and Test Generation.

200 citations


01 Dec 2009
TL;DR: In this article, a novel Error-Tolerant Adder, Type II (ETAII), is proposed that relaxes the strict accuracy requirement and achieves more than 60% improvement in Power-Delay Product (PDP) over conventional adders.
Abstract: The occurrence of errors is inevitable in modern VLSI technology, and overcoming all possible errors is an expensive task. It not only consumes a lot of power but also degrades the speed performance. By adopting an emerging concept in VLSI design and test, Error-Tolerance (ET), we managed to develop a novel Error-Tolerant Adder which we named the Type II (ETAII). The circuit to some extent is able to ease the strict restriction on accuracy to achieve tremendous improvements in both the power consumption and speed performance. When compared to its conventional counterparts, the proposed ETAII is able to achieve more than 60% improvement in the Power-Delay Product (PDP). The proposed ETAII is an enhancement of our earlier design, the ETAI, which has problems adding small-number inputs.
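
The abstract does not describe the circuit itself. To illustrate how an adder can trade accuracy for a shorter carry chain, here is a sketch under a stated assumption: the operands are split into fixed-size blocks, and each block's carry-in is predicted from the block directly below it only. The block size, word width, and function name are hypothetical, and this should not be read as the authors' exact design.

```python
def segmented_add(a, b, width=16, block=4):
    """Approximate addition with a carry chain limited to one block.

    Assumption for illustration: the carry into block i is computed from the
    operand bits of block i-1 alone, ignoring any carry that entered block i-1.
    The result is truncated to `width` bits.
    """
    mask = (1 << block) - 1
    result = 0
    carry_in = 0
    for shift in range(0, width, block):
        xa = (a >> shift) & mask
        xb = (b >> shift) & mask
        # Sum generator: uses the speculative carry from the block below.
        result |= ((xa + xb + carry_in) & mask) << shift
        # Carry generator: looks at this block's operand bits only.
        carry_in = (xa + xb) >> block
    return result
```

Most inputs add exactly, for example `segmented_add(0x1234, 0x1111)` gives `0x2345`. An error appears only when a carry has to ripple across more than one block, as in `0x00FF + 0x0001`, where the exact result is `0x0100` but the sketch returns `0`.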

173 citations


Proceedings ArticleDOI
24 May 2009
TL;DR: This work presents a current-mode conductance-based neuron circuit, with spike-frequency adaptation, refractory period, and bio-physically realistic dynamics which is compact, low-power and compatible with fast asynchronous digital circuits.
Abstract: Silicon neuron circuits emulate the electrophysiological behavior of real neurons. Many circuits can be integrated on a single Very Large Scale Integration (VLSI) device, and form large networks of spiking neurons. Connectivity among neurons can be achieved by using time multiplexing and fast asynchronous digital circuits. As the basic characteristics of the silicon neurons are determined at design time, and cannot be changed after the chip is fabricated, it is crucial to implement a circuit which represents an accurate model of real neurons, but at the same time is compact, low-power and compatible with asynchronous logic. Here we present a current-mode conductance-based neuron circuit, with spike-frequency adaptation, refractory period, and bio-physically realistic dynamics which is compact, low-power and compatible with fast asynchronous digital circuits.

128 citations


Journal ArticleDOI
TL;DR: This paper presents a high-throughput decoder architecture for generic quasi-cyclic low-density parity-check (QC-LDPC) codes and an approximate layered decoding approach is explored to reduce the critical path of the layered LDPC decoder.
Abstract: This paper presents a high-throughput decoder architecture for generic quasi-cyclic low-density parity-check (QC-LDPC) codes. Various optimizations are employed to increase the clock speed. A row permutation scheme is proposed to significantly simplify the implementation of the shuffle network in LDPC decoder. An approximate layered decoding approach is explored to reduce the critical path of the layered LDPC decoder. The computation core is further optimized to reduce the computation delay. It is estimated that 4.7 Gb/s decoding throughput can be achieved at 15 iterations using the current technology.

125 citations


BookDOI
08 May 2009
TL;DR: This book provides readers a broad knowledge on the entire embedded memory technologies in order to better comprehend the technologies and create optimal memory solutions in real applications.
Abstract: The book provides a comprehensive and in-depth view on the state-of-the-art embedded memory technologies. The book helps practicing engineers grasp key technology attributes and advanced design techniques in nano-scale VLSI design. It also helps them make decisions concerning the right design tradeoffs in real product development. This book first provides an overview on the landscape and trend of embedded memory in various VLSI system designs, including high-performance microprocessors, low-power mobile handheld devices, micro-controllers, and various consumer electronics. It then shows an in-depth view on each different type of embedded memory technology, including high-speed SRAM, ultra-low-voltage and alternative SRAM, embedded DRAM, embedded nonvolatile memory, and emerging or so-called universal memories such as FeRAM, MRAM, and PRAM. Each topic covers all the key technology attributes from a product application perspective, ranging from technology scaling challenges to advanced circuit techniques for achieving optimal design tradeoff in performance and power. As VLSI systems become increasingly dependent on on-die memory to provide adequate memory bandwidth for various applications, the book gives readers a broader view of this important field and helps them to achieve their optimal design goals for different applications. This book provides readers a broad knowledge on the entire embedded memory technologies in order to better comprehend the technologies and create optimal memory solutions in real applications.

94 citations


15 May 2009
TL;DR: This dissertation presents the decoder architectures for regular and irregular LDPC codes that provide substantial gains over existing academic and commercial implementations and utilize an on-the-fly computation paradigm which permits scheduling of the computations in a way that the memory requirements and re-computations are reduced.
Abstract: Area and Energy Efficient VLSI Architectures for Low-Density Parity-Check Decoders Using an On-the-Fly Computation (December 2006). Kiran Kumar Gunnam, M.S., Texas A&M University. Co-Chairs of Advisory Committee: Dr. Gwan Choi, Dr. Scott Miller. The VLSI implementation complexity of a low-density parity-check (LDPC) decoder is largely influenced by the interconnect and the storage requirements. This dissertation presents the decoder architectures for regular and irregular LDPC codes that provide substantial gains over existing academic and commercial implementations. Several structured properties of LDPC codes and decoding algorithms are observed and are used to construct hardware implementation with reduced processing complexity. The proposed architectures utilize an on-the-fly computation paradigm which permits scheduling of the computations in a way that the memory requirements and re-computations are reduced. Using this paradigm, the run-time configurable and multi-rate VLSI architectures for the rate compatible array LDPC codes and irregular block LDPC codes are designed. Rate compatible array codes are considered for DSL applications. Irregular block LDPC codes are proposed for IEEE 802.16e, IEEE 802.11n, and IEEE 802.20. When compared with a recent implementation of an 802.11n LDPC decoder, the proposed decoder reduces the logic complexity by 6.45x and memory complexity by 2x for a given data throughput. When compared to the latest reported multi-rate decoders, this decoder design has an area

85 citations


Journal ArticleDOI
TL;DR: An H.264/AVC baseline-profile real-time encoder for HDTV-1080p at 30 fps is proposed in this paper and the design considerations for chief components, including high throughput integer motion estimation, data reusing fractional motion estimation, and hardware friendly mode reduction for intra prediction are described.
Abstract: An H.264/AVC baseline-profile real-time encoder for HDTV-1080p at 30 fps is proposed in this paper. On the basis of the specifications and algorithm optimizations, the dedicated hardware engines and one 32-bit media embedded processor (MeP) equipped with hardware extensions are mapped into the three-stage macroblock pipelining system architecture. This paper describes the design considerations for chief components, including high throughput integer motion estimation, data reusing fractional motion estimation, and hardware friendly mode reduction for intra prediction. The 11.5 Gbps 64 Mb system-in-silicon DRAM is embedded to alleviate the external memory bandwidth. Using TSMC one-poly six-metal 0.18 µm CMOS technology, the prototype chip is implemented with 1140 k logic gates and 108.3 KB internal SRAM. The SoC core occupies 27.1 mm² die area and consumes 1.41 W at 200 MHz execution speed in typical work conditions.

82 citations


Proceedings ArticleDOI
19 Jan 2009
TL;DR: This paper reports early efforts to accelerate transistor model evaluations using a Graphics Processing Unit (GPU), integrates the accelerator with a commercial fast SPICE tool, and demonstrates that significant speedups can be obtained.
Abstract: SPICE based circuit simulation is a traditional workhorse in the VLSI design process. Given the pivotal role of SPICE in the IC design flow, there has been significant interest in accelerating SPICE. Since a large fraction (on average 75%) of the SPICE runtime is spent in evaluating transistor model equations, a significant speedup can be obtained if these evaluations are accelerated. This paper reports on our early efforts to accelerate transistor model evaluations using a Graphics Processing Unit (GPU). We have integrated this accelerator with a commercial fast SPICE tool. Our experiments demonstrate that significant speedups (2.36× on average) can be obtained. The asymptotic speedup that can be obtained is about 4×. We demonstrate that with circuits consisting of as few as about 1000 transistors, speedups in the neighborhood of this asymptotic value can be obtained. By utilizing the recently announced (but not currently available) quad GPU systems, this speedup could be enhanced further, especially for larger designs.
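
The "about 4×" asymptotic figure is what Amdahl's law gives when 75% of the runtime is accelerated. The abstract does not name the model, so the sketch below assumes Amdahl's law, and the function names are hypothetical.

```python
def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law: `fraction` of the runtime is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def asymptotic_speedup(fraction):
    """Limit of overall_speedup as the accelerated part becomes free."""
    return 1.0 / (1.0 - fraction)


# With 75% of SPICE runtime spent in transistor model evaluation:
print(asymptotic_speedup(0.75))  # 4.0
```

Under this model the serial 25% of the runtime caps the gain, however fast the GPU evaluates the transistor models.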

80 citations


Proceedings ArticleDOI
31 Dec 2009
TL;DR: This paper presents an Integer Linear Programming (ILP) formulation for application mapping onto mesh-based Network-on-Chips that minimizes the energy consumption of the system, and experimentally investigates the impact of the mesh size on the application mapping and total communication.
Abstract: Ever-shrinking technologies in the VLSI era have made it possible to place several modules onto a single die. However, the need for new communication methods has also increased dramatically, since traditional bus-based systems suffer from signal propagation delays, signal integrity problems, and poor scalability. Network-on-Chip (NoC) is the biggest step towards resolving the communication bottleneck of System-on-Chip (SoC) architectures. In this paper, we present an Integer Linear Programming (ILP) formulation for application mapping onto mesh-based Network-on-Chips to minimize the energy consumption of the system. The proposed method obtains optimal or close to optimal results within the given computation time limit. We also experimentally investigate the impact of the size of the mesh architecture on the application mapping and total communication.
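
To make the mapping problem concrete, the sketch below enumerates every placement of cores onto mesh tiles and keeps the cheapest. This is exhaustive search, not the paper's ILP formulation, and it is practical only for tiny meshes. It assumes communication energy is proportional to traffic volume times Manhattan hop count, which is a common proxy but is not stated in the abstract. All names are hypothetical.

```python
from itertools import permutations


def manhattan(tile_a, tile_b, cols):
    """Hop count between two tiles of a row-major numbered mesh."""
    row_a, col_a = divmod(tile_a, cols)
    row_b, col_b = divmod(tile_b, cols)
    return abs(row_a - row_b) + abs(col_a - col_b)


def best_mapping(traffic, rows, cols):
    """Return (cost, tiles) where tiles[core] is the tile assigned to that core.

    traffic: dict {(src_core, dst_core): volume}, cores numbered from 0.
    Cost proxy (assumption): sum of volume * hop count over all flows.
    """
    n_cores = 1 + max(max(src, dst) for src, dst in traffic)
    best = None
    for tiles in permutations(range(rows * cols), n_cores):
        cost = sum(volume * manhattan(tiles[src], tiles[dst], cols)
                   for (src, dst), volume in traffic.items())
        if best is None or cost < best[0]:
            best = (cost, tiles)
    return best


# Four cores communicating in a ring, mapped onto a 2x2 mesh.
RING = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (3, 0): 10}
```

For `RING` on a 2x2 mesh the best cost is 40, because the ring can be laid out so that every flow travels exactly one hop.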

Journal ArticleDOI
TL;DR: A probabilistic model is presented which incorporates processing and design parameters and enables quantitative analysis of the impact of metallic CNTs on leakage, noise margin, and delay variations of CNFET-based digital logic circuits and provides design and processing guidelines for very large scale integration (VLSI)-scale metallic-CNT-tolerant digital circuits.
Abstract: Metallic carbon nanotubes (CNTs) pose a major barrier to the design of digital logic circuits using CNT field-effect transistors (CNFETs). Metallic CNTs create source to drain shorts in CNFETs, resulting in undesirable effects such as excessive leakage and degraded noise margins. No known CNT growth technique guarantees 0% metallic CNTs. Therefore, special processing techniques are required for removing metallic CNTs after CNT growth. This paper presents a probabilistic model which incorporates processing and design parameters and enables quantitative analysis of the impact of metallic CNTs on leakage, noise margin, and delay variations of CNFET-based digital logic circuits. With practical constraints on these key circuit performance metrics, the model provides design and processing guidelines that are required for very large scale integration (VLSI)-scale metallic-CNT-tolerant digital circuits.

Journal ArticleDOI
TL;DR: An edge-oriented area-pixel scaling processor is presented; it is implemented with a low-complexity VLSI architecture to keep cost low, and it performs better than previous low-complexity techniques in both quantitative evaluation and visual quality.
Abstract: Image scaling is a very important technique and has been widely used in many image processing applications. In this paper, we present an edge-oriented area-pixel scaling processor. To achieve the goal of low cost, the area-pixel scaling technique is implemented with a low-complexity VLSI architecture in our design. A simple edge catching technique is adopted to preserve the image edge features effectively so as to achieve better image quality. Compared with the previous low-complexity techniques, our method performs better in terms of both quantitative evaluation and visual quality. The seven-stage VLSI architecture of our image scaling processor contains 10.4 K gates and yields a processing rate of about 200 MHz by using TSMC 0.18-µm technology.

Proceedings ArticleDOI
05 Jan 2009
TL;DR: This paper presents application and effectiveness of Hierarchical particle swarm optimization (HPSO) algorithm for automatic sizing of low-power analog circuits and shows that HPSO algorithm converges to a better solution, compared to PSO and GA.
Abstract: This paper presents the application and effectiveness of the Hierarchical particle swarm optimization (HPSO) algorithm for automatic sizing of low-power analog circuits. For the purpose of comparison, circuits are also designed using PSO and Genetic Algorithm (GA). CMOS technologies from 0.35 µm down to 0.13 µm are used. PVT (process, voltage, temperature) variations are considered during the design of circuits. We show that the HPSO algorithm converges to a better solution, compared to PSO and GA. For the CMOS Miller OTA, the circuit designed by the HPSO algorithm even outperforms a recently reported manually designed circuit. For the first time, a design of this OTA at 0.4 V supply voltage is also presented. For this new design, the HPSO algorithm has taken 23.5 minutes of CPU time on a Sun system with a 1.2 GHz processor and 8 GB RAM.

01 Jan 2009
TL;DR: By applying energy-delay tradeoffs on various levels, an adder topology is developed yielding up to 20% performance improvement and 4.5× energy reduction over existing designs.

Book
22 Oct 2009
TL;DR: This book presents algorithms to analyze the effects of radiation particle strikes and processing variations on the electrical behavior of VLSI circuits and circuit design techniques to mitigate the impact of these problems.
Abstract: This book describes the design of resilient VLSI circuits. VLSI design has become more challenging recently, due to the detrimental effects of radiation particle strikes and processing variations. This book presents algorithms to analyze the effects of these issues on the electrical behavior of VLSI circuits and circuit design techniques to mitigate the impact of these problems.

01 Jan 2009
TL;DR: Designs of two different array multipliers are presented, one using carry-look-ahead (CLA) logic for addition of partial product terms and another introducing a Carry Save Adder (CSA) in partial product lines.
Abstract: In this paper, designs of two different array multipliers are presented, one using carry-look-ahead (CLA) logic for addition of partial product terms and another introducing a Carry Save Adder (CSA) in partial product lines. The multipliers presented in this paper were all modeled using VHDL (Very High Speed Integrated Circuit Hardware Description Language) for 32-bit unsigned data. The comparison is done on the basis of three performance parameters, i.e. area, speed and power consumption. Designing an integrated circuit that is efficient in terms of area, power and speed has become a challenging task in the modern VLSI design field. Previously in the literature, performance analysis was carried out between a multiplier using a Ripple Carry Adder (RCA) and one using CLA. In this work, the same multiplier is designed using CSA logic and its performance is compared with the multiplier designed using CLA logic. The multiplier with CSA gives better results in terms of speed (78.3% improvement), area (reduced by 4.2%) and power consumption (decreased by 1.4%).
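
The speed advantage of carry-save addition comes from deferring carry propagation: each partial product is folded into a (sum, carry) pair with purely bitwise logic, and only one carry-propagate addition is needed at the end. The sketch below is a behavioral model in Python, not the paper's VHDL, and its function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """3:2 compressor applied to whole words.

    Returns (s, c) such that x + y + z == s + c, with no carry rippling
    between bit positions inside this step.
    """
    s = x ^ y ^ z                                # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carry, shifted left
    return s, c


def csa_multiply(a, b, width=32):
    """Unsigned multiply that accumulates partial products in carry-save form."""
    s, c = 0, 0
    for i in range(width):
        if (b >> i) & 1:
            s, c = carry_save_add(s, c, a << i)  # fold in partial product i
    return s + c  # the single final carry-propagate addition
```

For example, `csa_multiply(12345, 6789)` equals `12345 * 6789`. In hardware the final addition would be done by a conventional adder, while every earlier stage has a delay of roughly one full adder.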

Journal ArticleDOI
TL;DR: A modified approach for MIMO detection is proposed, which takes advantage of the quadrature amplitude modulation (QAM) constellation structure to accelerate the detection procedure and achieves low-power operation by extending the minimum number of paths and reducing the number of required computations for each path extension.
Abstract: Maximum-likelihood (ML) detection for higher order multiple-input-multiple-output (MIMO) systems faces a major challenge in computational complexity. This limits the practicality of these systems from an implementation point of view, particularly for mobile battery-operated devices. In this paper, we propose a modified approach for MIMO detection, which takes advantage of the quadrature amplitude modulation (QAM) constellation structure to accelerate the detection procedure. This approach achieves low-power operation by extending the minimum number of paths and reducing the number of required computations for each path extension, which results in an order-of-magnitude reduction in computations in comparison with existing algorithms. This paper also describes the very-large-scale integration (VLSI) design of the low-power path metric computation unit. The approach is applied to a 4×4, 64-QAM MIMO detector system. Results show negligible performance degradation compared with conventional algorithms while reducing the complexity by more than 50%.

Proceedings ArticleDOI
15 Sep 2009
TL;DR: This paper presents a new architecture and circuit implementation of a 1-D median filter that has linear hardware complexity, minimal latency and achieves throughput of 1/2 of the sampling rate.
Abstract: This paper presents a new architecture and circuit implementation of a 1-D median filter. The proposed circuit belongs to the class of non-recursive sorting network architectures that process the input samples sequentially in a word-based manner. In comparison to the related schemes, it maintains sorting of samples from the previous position of the sliding window, positioning only the incoming sample to the correct rank. Unlike existing 1-D filter implementations, the circuit has linear hardware complexity, minimal latency and achieves throughput of 1/2 of the sampling rate. Experimental evaluation and comparisons show high efficiency of our design.
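
The key idea of keeping the previous window sorted and ranking only the incoming sample can be shown in software. The sketch below is a behavioral analogue, not the proposed circuit. The function name is hypothetical, and the FIFO plus sorted list stand in for whatever storage the hardware uses.

```python
from bisect import bisect_left, insort
from collections import deque


def median_filter(samples, window=3):
    """1-D running median that keeps the window sorted between steps.

    Only the incoming sample is ranked (insort) and only the outgoing sample
    is removed; the window is never re-sorted from scratch. One output is
    produced per sample once the window is full.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    fifo = deque()   # arrival order, to know which sample leaves
    ordered = []     # the same samples, kept in sorted order
    output = []
    for sample in samples:
        fifo.append(sample)
        insort(ordered, sample)                       # place newcomer at its rank
        if len(fifo) > window:
            oldest = fifo.popleft()
            del ordered[bisect_left(ordered, oldest)]  # drop the outgoing sample
        if len(fifo) == window:
            output.append(ordered[window // 2])
    return output
```

For example, `median_filter([1, 9, 2, 8, 3], 3)` returns `[2, 8, 3]`.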

Proceedings ArticleDOI
16 Dec 2009
TL;DR: Voltage mode quaternary CMOS circuit design using 90 nm technology is presented, suitable to be implemented in classical CMOS VLSI technology.
Abstract: The good characteristics and advantages of multi-valued logic (MVL) electronic systems and circuits have created great interest in their practical implementation. This paper presents voltage mode quaternary CMOS circuit design using 90 nm technology. Basic gates such as quaternary inverter, NMAX, NMIN and quaternary multiplexer are designed and simulated. Low power consumption of 14 µW is observed at 2.2 GHz with 1.2 V power supply. Circuits are verified using HSPICE simulations. The circuits described here are also suitable to be implemented in classical CMOS VLSI technology.
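
The abstract names the gates but does not define them. The sketch below uses the definitions commonly adopted in multi-valued logic, which are assumed here: the quaternary inverter maps x to 3 - x, NMIN and NMAX are the inverted minimum and maximum, and the multiplexer selects one of four inputs. Function names are hypothetical.

```python
LEVELS = 4  # quaternary logic: values 0, 1, 2, 3


def _check(value):
    if value not in range(LEVELS):
        raise ValueError(f"not a quaternary value: {value!r}")
    return value


def q_not(x):
    """Quaternary inverter (assumed definition): 0<->3, 1<->2."""
    return (LEVELS - 1) - _check(x)


def q_nmin(a, b):
    """Inverted minimum, the quaternary counterpart of NAND."""
    return q_not(min(_check(a), _check(b)))


def q_nmax(a, b):
    """Inverted maximum, the quaternary counterpart of NOR."""
    return q_not(max(_check(a), _check(b)))


def q_mux(select, inputs):
    """4-to-1 multiplexer steered by a single quaternary select signal."""
    if len(inputs) != LEVELS:
        raise ValueError("expected four data inputs")
    return _check(inputs[_check(select)])
```

Restricted to the values 0 and 3, `q_nmin` and `q_nmax` behave like binary NAND and NOR, which is one reason these gates are a natural basic set.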

Proceedings ArticleDOI
26 Jul 2009
TL;DR: This work proposes GPU-based parallel computing techniques and applies them on simultaneous gate sizing and threshold voltage assignment for accelerating VLSI circuit optimization, aimed to fully utilize the benefits of GPU through efficient task scheduling and memory organization.
Abstract: The progress of GPU (Graphics Processing Unit) technology opens a new avenue for boosting computing power. This work is an attempt to exploit GPU for accelerating VLSI circuit optimization. We propose GPU-based parallel computing techniques and apply them on simultaneous gate sizing and threshold voltage assignment, which is often employed in practice for performance and power optimization. These techniques are aimed to fully utilize the benefits of GPU through efficient task scheduling and memory organization. Compared to conventional sequential computation, our techniques can provide up to 56× speedup without any sacrifice on solution quality.

Book
16 Aug 2009
TL;DR: SiLVR is a nonlinear response surface modeling (RSM) and performance-driven dimensionality reduction strategy, that uses the concepts of projection pursuit and latent variable regression to obtain an absolute improvement in modeling error of up to 34%, over the best quadratic RSM method.
Abstract: As VLSI technology moves to the nanometer scale for transistor feature sizes, the impact of manufacturing imperfections results in large variations in the circuit performance. Traditional CAD tools are not well-equipped to handle this scenario, since they do not model this statistical nature of the circuit parameters and performances, or if they do, the existing techniques tend to be over-simplified or intractably slow. We draw upon ideas for attacking parallel problems in other technical fields, such as computational finance, machine learning and hydrology, and synthesize them with innovative attacks for our problem domain of integrated circuits, to develop novel solutions to problems of efficient statistical analysis of circuits in the nanometer regime. In particular, this thesis makes three contributions: (1) SiLVR, a nonlinear response surface modeling (RSM) and performance-driven dimensionality reduction strategy, that uses the concepts of projection pursuit and latent variable regression to obtain an absolute improvement in modeling error of up to 34%, over the best quadratic RSM method. SiLVR also captures the designer's insight into the circuit behavior, by automatically extracting quantitative measures of relative global sensitivities and nonlinear correlation. (2) Fast Monte Carlo simulation of circuits using quasi-Monte Carlo, showing speedups of 2× to 50× over standard Monte Carlo. (3) Statistical blockade, an efficient method for sampling rare events and estimating their probability distribution using limit results from extreme value theory, applied to high-replication circuits like SRAM cells.
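
Quasi-Monte Carlo replaces pseudo-random samples with a low-discrepancy sequence that covers the sample space more evenly. The sketch below illustrates this with a 2-D Halton sequence on a toy integrand, the area of a quarter circle, rather than on a circuit simulation. The choice of sequence, the integrand, and all names are assumptions made for illustration and are not taken from the thesis.

```python
import random


def van_der_corput(index, base=2):
    """Radical inverse of `index` in `base`: the 1-D low-discrepancy sequence."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def halton_point(index, bases=(2, 3)):
    """One point of the Halton sequence in the unit square."""
    return tuple(van_der_corput(index, base) for base in bases)


def estimate_pi(points):
    """Estimate pi from the fraction of points inside the unit quarter circle."""
    pts = list(points)
    inside = sum(1 for x, y in pts if x * x + y * y <= 1.0)
    return 4.0 * inside / len(pts)


def qmc_estimate(n):
    return estimate_pi(halton_point(i) for i in range(1, n + 1))


def mc_estimate(n, seed=0):
    rng = random.Random(seed)
    return estimate_pi((rng.random(), rng.random()) for _ in range(n))
```

Comparing `qmc_estimate(n)` with `mc_estimate(n)` for growing `n` shows the usual pattern: the Halton-based estimate tends to settle faster than the pseudo-random one. The same property is what makes quasi-Monte Carlo attractive when each sample is an expensive circuit simulation.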

Journal ArticleDOI
TL;DR: This paper proposes a scalable VLSI architecture for VBSME in H.264/AVC based on a full-search motion estimation algorithm that shows higher throughput rate with less hardware.
Abstract: Variable block-size motion estimation (VBSME) has become an important technique in H.264/AVC to improve video quality. In this paper, we propose a scalable VLSI architecture for VBSME in H.264/AVC based on a full-search motion estimation algorithm. A new scan order is introduced to re-use the sum of absolute differences (SAD) values of smaller sub-blocks on an "as-early-as-possible" basis, thus the complexity of the required hardware resources, such as registers, multiplexers, and controls is reduced. It also spreads the timing for the final SAD outputs so that the number of output buses is reduced. The architecture is flexible and scalable with regard to the size of the searching windows and PE arrays. Compared to the conventional approaches, the architecture shows higher throughput rate with less hardware. After logic synthesis using DongbuAnam 0.18 µm standard cell library, the number of gates is 39K (16 PEs) in two-input equivalent NAND gates and the maximum operating clock frequency is 416 MHz (256 fps@CIF).
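
SAD re-use rests on the fact that the SAD of a large block is the sum of the SADs of the sub-blocks that tile it. The sketch below computes the sixteen 4x4 SADs of a 16x16 macroblock once and derives the larger partitions by addition. It does not model the paper's scan order or timing, it covers only some of the H.264 partition sizes, and all names are hypothetical.

```python
def sad_4x4_grid(cur, ref):
    """SADs of the sixteen 4x4 sub-blocks of a 16x16 macroblock.

    cur, ref: 16x16 lists of pixel values (current block, candidate block).
    """
    grid = [[0] * 4 for _ in range(4)]
    for r in range(16):
        for c in range(16):
            grid[r // 4][c // 4] += abs(cur[r][c] - ref[r][c])
    return grid


def merge_sads(grid):
    """Derive larger-partition SADs from the 4x4 SADs without touching pixels."""
    sad8x8 = [[grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
               + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]
               for j in range(2)] for i in range(2)]
    sad16x8 = [sad8x8[i][0] + sad8x8[i][1] for i in range(2)]  # top, bottom
    sad8x16 = [sad8x8[0][j] + sad8x8[1][j] for j in range(2)]  # left, right
    sad16x16 = sad16x8[0] + sad16x8[1]
    return {"8x8": sad8x8, "16x8": sad16x8, "8x16": sad8x16, "16x16": sad16x16}
```

Every pixel difference is computed exactly once. Everything above the 4x4 level is a handful of additions, which is why re-using small-block SADs saves hardware.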

Journal ArticleDOI
TL;DR: By exploiting the inherent symmetry of the discrete wavelet transform (DWT) algorithm and consequently storing only the nonrepetitive combinations of filter coefficients, the size of required memory can be significantly reduced.
Abstract: In this brief, we show that by exploiting the inherent symmetry of the discrete wavelet transform (DWT) algorithm and consequently storing only the nonrepetitive combinations of filter coefficients, the size of required memory can be significantly reduced. Subsequently, a memory-efficient architecture for DWT/inverse DWT is proposed. It occupies 6.5-mm² silicon area and consumes 46.8-µW power at 1 MHz for 1.2 V using 0.13-µm standard cell technology.

Book
25 Feb 2009
TL;DR: Digital VLSI Chip Design with Cadence and Synopsys CAD Tools leads students through the complete process of building a ready-to-fabricate CMOS integrated circuit using popular commercial design software.
Abstract: Digital VLSI Chip Design with Cadence and Synopsys CAD Tools leads students through the complete process of building a ready-to-fabricate CMOS integrated circuit using popular commercial design software. Detailed tutorials include step-by-step instructions and screen shots of tool windows and dialog boxes. This hands-on book is for use in conjunction with a primary textbook on digital VLSI.

Proceedings ArticleDOI
29 Sep 2009
TL;DR: This paper extends an ILP-based optimization method of the inter-FPGA connections to improve the system performance and shows that the method improved the circuit performance on a 4-FPGA system by 26.4% compared with a conventional method, on average.
Abstract: Multi-FPGA systems are widely used for rapid prototyping and logic verification of VLSIs. To implement a huge logic circuit in a multi-FPGA system, the circuit needs to be partitioned into multiple FPGAs. Because of the limited interconnection resources between FPGAs, time-multiplexed I/Os are used for inter-FPGA connections. Due to the large delay of time-multiplexed I/Os, inter-FPGA connections strongly affect the system performance. In this paper, we extend an ILP-based optimization method of the inter-FPGA connections to improve the system performance. Our method uses both a normal I/O and a time-multiplexed I/O, and decides whether each inter-FPGA signal is transferred by a time-multiplexed I/O or not. Our extended method improves the system performance considering the variation of the amount of interconnection resources, and the variation of the number of inter-FPGA signals, from an FPGA pair to another FPGA pair. Experiments showed that our method improved the circuit performance on a 4-FPGA system by 26.4% compared with a conventional method, on average.

Proceedings ArticleDOI
24 May 2009
TL;DR: A VLSI implementation of a 4×4 multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) transceiver is described that targets 1-Gbps data transmission for next-generation wireless LAN systems and incorporates a minimum mean-square error MIMO detector that drastically shortens processing latency.
Abstract: A VLSI implementation of a 4×4 multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) transceiver is described that targets 1-Gbps data transmission for next-generation wireless LAN systems. The IEEE 802.11 Very High Throughput (VHT) Study Group concluded that a signal bandwidth of more than 80 MHz is needed to achieve 1-Gbps throughput in the MAC layer. The proposed architecture is suitable for VLSI implementation that meets this specification and enables real-time processing in a 4×4 MIMO-OFDM configuration. It incorporates a minimum mean-square error (MMSE) MIMO detector that drastically shortens processing latency. Evaluation of a MIMO-OFDM transceiver implemented in CMOS with 128, 256, or 512 OFDM subcarriers showed that the power dissipation ranged from 451 to 577 mW.

Journal Article
TL;DR: This paper evaluates and compares the performance of various XOR-XNOR circuits based on TSMC 0.18 µm process models and reveals that the proposed circuit exhibits lower PDP and EDP and is more power-efficient and faster than the best available XOR-XNOR circuits in the literature.
Abstract: New methodologies for XOR-XNOR circuits are proposed to improve speed and power, as these circuits are basic building blocks of many arithmetic circuits. This paper evaluates and compares the performance of various XOR-XNOR circuits. Performance is evaluated from HSPICE simulation results based on TSMC 0.18 µm process models across supply voltages from 0.6 V to 3.3 V. Simulation results reveal that the proposed circuit exhibits lower PDP and EDP and is more power-efficient and faster than the best available XOR-XNOR circuits in the literature.
Keywords—Exclusive-OR (XOR), Exclusive-NOR (XNOR), High speed, Low power, Arithmetic Circuits.
I. INTRODUCTION
While the growth of the electronics market has driven the VLSI industry towards very high integration density, system-on-chip designs, and operating frequencies beyond a few GHz, critical concerns have arisen over the severe increase in power consumption and the need to reduce it further. Moreover, the explosive growth in the demand for and popularity of portable electronics is driving designers to strive for smaller silicon area, higher speeds, longer battery life, and more reliability. Power is one of the premium resources a designer tries to save when designing a system. The XOR-XNOR circuits are basic building blocks in various circuits, especially arithmetic circuits (full adders and multipliers), compressors, comparators, parity checkers, code converters, error-detecting or error-correcting codes, and phase detector circuits in PLLs. The performance of complex logic circuits is affected by the individual performance of the XOR-XNOR circuits that are included in them (1)-(6). Therefore, careful design and analysis is required for XOR-XNOR circuits to obtained -full output
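
The PDP and EDP figures of merit named in the abstract above are conventionally the power-delay product (average power times propagation delay) and the energy-delay product (PDP times delay). A minimal sketch of how candidate gates could be ranked on them follows; the design names and the power and delay values are hypothetical placeholders, not results from the paper.

```python
# Sketch: comparing logic-gate designs by PDP and EDP.
# All design names and numbers are hypothetical, for illustration only.

def power_delay_product(avg_power_w: float, delay_s: float) -> float:
    """PDP = average power x propagation delay, in joules."""
    return avg_power_w * delay_s


def energy_delay_product(avg_power_w: float, delay_s: float) -> float:
    """EDP = PDP x propagation delay, in joule-seconds."""
    return power_delay_product(avg_power_w, delay_s) * delay_s


def rank_by(metric, designs):
    """Return designs sorted from best (lowest metric) to worst."""
    return sorted(designs, key=lambda d: metric(d[1], d[2]))


# (name, average power in watts, propagation delay in seconds)
candidates = [
    ("design_a", 4.0e-6, 120e-12),
    ("design_b", 5.0e-6, 80e-12),
]

if __name__ == "__main__":
    for name, power, delay in candidates:
        pdp = power_delay_product(power, delay)
        edp = energy_delay_product(power, delay)
        print(f"{name}: PDP={pdp:.3e} J, EDP={edp:.3e} J*s")
```

EDP weights delay more heavily than PDP does, so in this made-up data a design that draws more power but switches faster still ranks first on both metrics.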

Journal ArticleDOI
TL;DR: The authors develop a mapping flow for the dual-rail logic and quantify its cost in both logical product terms and physical implementation area and also develop area and timing models for all three schemes.
Abstract: A programmable logic array (PLA) needs its inputs available in both the positive and negative polarities. In lithographic-scale VLSI PLAs, programmable array logics (PALs) and programmable logic devices (PLDs), a buffer and inverter at the PLA input typically produce both polarities from a single-polarity input. However, the extreme regularity required for sublithographic designs has driven nanoscale architectures to consider alternative solutions. Consequently, the authors compare three schemes: one based on producing both polarities in a restoration stage (selective inversion), one based on a local inversion stage and one based on a full dual-rail logic implementation. The authors develop a mapping flow for the dual-rail logic and quantify its cost in both logical product terms and physical implementation area and also develop area and timing models for all three schemes. Mapping benchmarks from the Toronto 20 set, the authors are able to show that the local inversion scheme is faster (less than one-fifth the latency), lower energy (one-half the energy) and comparable in size to the selective inversion scheme and faster (less than half the latency), smaller (one-third of the area) and lower energy (one-ninth the energy) than the dual-rail scheme.

Proceedings ArticleDOI
16 Mar 2009
TL;DR: This paper gives an efficient buffer and interlayer via planning algorithm with linear complexity, which ensures that buffers and interlayer vias are inserted as successfully as possible in 3D ICs.
Abstract: As technology advances, the interconnect delay among modules plays a dominant role in chip performance. Buffer insertion, as a traditional approach to reduce wire delay in 2D ICs, is still necessary in 3D ICs to further optimize interconnects. Since nets that cross multiple layers in 3D ICs need to go through vertical interlayer vias, the traditional buffer planning turns into simultaneous buffer and interlayer via planning in 3D ICs. In this paper, we give an efficient buffer and interlayer via planning algorithm with linear complexity, which ensures that buffers and interlayer vias are inserted as successfully as possible. Experimental results show that 3D ICs can significantly improve the interconnect delay.