Showing papers on "Gate count" published in 2005

Open Access

Journal Article•DOI•

Toffoli network synthesis with templates

Dmitri Maslov¹, Gerhard W. Dueck², D.M. Miller¹•Institutions (2)

University of Victoria¹, University of New Brunswick²

23 May 2005-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: A basic method and a bidirectional synthesis algorithm, which produces a network of Toffoli gates realizing a given reversible specification, are presented, together with an asymptotically optimal modification of the basic synthesis algorithm that employs generalized mEXOR gates.

Abstract: Reversible logic functions can be realized as networks of Toffoli gates. The synthesis of Toffoli networks can be divided into two steps. First, find a network that realizes the desired function. Second, transform the network such that it uses fewer gates, while realizing the same function. This paper addresses the above synthesis approach. We present a basic method and, based on that, a bidirectional synthesis algorithm which produces a network of Toffoli gates realizing a given reversible specification. An asymptotically optimal modification of the basic synthesis algorithm employing generalized mEXOR gates is also presented. Transformations are then applied using template matching. The basis for a template is a network of gates that realizes the identity function. If a sequence of gates in the synthesized network matches a sequence comprised of more than half the gates in a template, then a transformation using the remaining gates in the template can be applied resulting in a reduction in the gate count for the synthesized network. All templates with up to six gates are described in this paper. Experimental results including an exhaustive examination of all 3-variable reversible functions and a collection of benchmark problems are presented. The paper concludes with suggestions for further research.

220 citations
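
The template idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' tool: the gate encoding `(controls, target)`, the helper names, and the particular five-gate CNOT identity used as `TEMPLATE` are assumptions chosen for illustration. The sketch checks that the template realizes the identity, then shows that a matched run of three gates (more than half of the template) can be replaced by the two remaining gates in reverse order.

```python
from itertools import product

def apply_toffoli(bits, controls, target):
    """Flip bits[target] when every control line is 1 (one control = CNOT)."""
    out = list(bits)
    if all(out[c] for c in controls):
        out[target] ^= 1
    return tuple(out)

def run_network(gates, bits):
    """Apply a cascade of (controls, target) gates from left to right."""
    for controls, target in gates:
        bits = apply_toffoli(bits, controls, target)
    return bits

def is_identity(gates, n):
    """True when the cascade maps every n-bit input to itself."""
    return all(run_network(gates, b) == b for b in product((0, 1), repeat=n))

def same_function(first, second, n):
    """True when two cascades realize the same reversible function."""
    return all(run_network(first, b) == run_network(second, b)
               for b in product((0, 1), repeat=n))

# Hypothetical template on lines a=0, b=1, c=2: five CNOT gates whose
# cascade is the identity (illustrative, not quoted from the paper).
TEMPLATE = [((0,), 1), ((1,), 2), ((0,), 1), ((1,), 2), ((0,), 2)]

# Three matched gates (more than half of the template) ...
MATCHED = TEMPLATE[:3]
# ... are replaced by the remaining two gates in reverse order; each gate
# is its own inverse, so no further inversion is needed.
REPLACEMENT = list(reversed(TEMPLATE[3:]))

TEMPLATE_IS_IDENTITY = is_identity(TEMPLATE, 3)
REDUCTION_IS_VALID = same_function(MATCHED, REPLACEMENT, 3)
```

Here the replacement saves one gate (three gates become two) while the realized function is unchanged.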

Journal Article•DOI•

Synthesis of Fredkin-Toffoli reversible networks

Dmitri Maslov¹, Gerhard W. Dueck², D.M. Miller¹•Institutions (2)

University of Victoria¹, University of New Brunswick²

01 Jun 2005-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: This paper presents a method that synthesizes a network with the most common reversible gates, the Toffoli gate and the Fredkin gate, and compares the results to the optimal results.

Abstract: Reversible logic has applications in quantum computing, low power CMOS, nanotechnology, optical computing, and DNA computing. The most common reversible gates are the Toffoli gate and the Fredkin gate. We present a method that synthesizes a network with these gates in two steps. First, our synthesis algorithm finds a cascade of Toffoli and Fredkin gates with no backtracking and minimal look-ahead. Next we apply transformations that reduce the number of gates in the network. Transformations are accomplished via template matching. The basis for a template is a network with m gates that realizes the identity function. If a sequence of gates in the network to be reduced matches a sequence of gates comprising more than half of a template, then a transformation that reduces the gate count can be applied. We have synthesized all three input, three output reversible functions and here compare our results to the optimal results. We also present the results of applying our synthesis tool to obtain networks for a number of benchmark functions.

107 citations

Journal Article•DOI•

Threshold network synthesis and optimization and its application to nanotechnologies

Rui Zhang¹, P. Gupta¹, Lin Zhong¹, Niraj K. Jha¹•Institutions (1)

Princeton University¹

01 Jan 2005-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: The novelty of this work lies in the introduction of the first comprehensive synthesis methodology and tool for general multilevel threshold logic design, built on top of an existing Boolean logic synthesis tool.

Abstract: We propose an algorithm for efficient threshold network synthesis of arbitrary multioutput Boolean functions. Many nanotechnologies, such as resonant tunneling diodes, quantum cellular automata, and single electron tunneling, are capable of implementing threshold logic efficiently. The main purpose of this work is to bridge the current wide gap between research on nanoscale devices and research on synthesis methodologies for generating optimized networks utilizing these devices. While functionally-correct threshold gates and circuits based on nanotechnologies have been successfully demonstrated, there exists no methodology or design automation tool for general multilevel threshold network synthesis. We have built the first such tool, threshold logic synthesizer (TELS), on top of an existing Boolean logic synthesis tool. Experiments with 56 multioutput benchmarks indicate that, compared to traditional logic synthesis, up to 80.0% and 70.6% reduction in gate count and interconnect count, respectively, is possible with the average being 22.7% and 12.6%, respectively. Furthermore, the synthesized networks are well-balanced structurally. The novelty of this work lies in the introduction of the first comprehensive synthesis methodology and tool for general multilevel threshold logic design.

91 citations
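
As a rough illustration of why threshold logic can lower gate count, the following Python sketch models a single threshold gate and compares it with a two-level AND/OR realization of the 3-input majority function. This is not TELS; the function names, weights and threshold are assumptions chosen for the example.

```python
from itertools import product

def threshold_gate(inputs, weights, threshold):
    """Output 1 when the weighted sum of the inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def majority3_threshold(a, b, c):
    """3-input majority as ONE threshold gate: weights (1, 1, 1), threshold 2."""
    return threshold_gate((a, b, c), (1, 1, 1), 2)

def majority3_boolean(a, b, c):
    """The same function as three AND gates feeding one OR gate (four gates)."""
    return (a & b) | (a & c) | (b & c)

# Exhaustively compare both realizations over all eight input patterns.
REALIZATIONS_AGREE = all(
    majority3_threshold(*x) == majority3_boolean(*x)
    for x in product((0, 1), repeat=3)
)
```

In this toy case one threshold gate stands in for four conventional gates; the reductions reported in the abstract come from the authors' benchmarks, not from this sketch.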

Patent•

Reed-solomon decoder systems for high speed communication and data storage applications

Hanho Lee¹•Institutions (1)

University of Connecticut¹

08 Sep 2005

TL;DR: In this article, a high-speed, low-complexity Reed-Solomon (RS) decoder architecture using a novel pipelined recursive Modified Euclidean (PrME) algorithm block for very high speed optical communications is provided.

Abstract: A high-speed, low-complexity Reed-Solomon (RS) decoder architecture using a novel pipelined recursive Modified Euclidean (PrME) algorithm block for very high-speed optical communications is provided. The RS decoder features a low-complexity Key Equation Solver using a PrME algorithm block. The recursive structure enables the low-complexity PrME algorithm block to be implemented. Pipelining and parallelizing allow the inputs to be received at very high fiber optic rates, and outputs to be delivered at correspondingly high rates with minimum delay. An 80-Gb/s RS decoder architecture using 0.13-μm CMOS technology in a supply voltage of 1.2 V is disclosed that features a core gate count of 393 K and operates at a clock rate of 625 MHz. The RS decoder has a wide range of applications, including fiber optic telecommunication applications, hard drive or disk controller applications, computational storage system applications, CD or DVD controller applications, fiber optic systems, router systems, wireless communication systems, cellular telephone systems, microwave link systems, satellite communication systems, digital television systems, networking systems, high-speed modems and the like.

86 citations

Proceedings Article•DOI•

High-Throughput and Low-Power Architectures for Reed Solomon Decoder

Akash Kumar, Sergei Sawitzki

01 Jan 2005

TL;DR: A uniform comparison of the algorithms and architectures used for Reed-Solomon (RS) decoders is presented; the results obtained are very encouraging in terms of both silicon area and power.

Abstract: This paper presents a uniform comparison between various algorithms and architectures used for Reed-Solomon (RS) decoders. For each design option, a detailed hardware analysis is provided in terms of gate count, latency and critical path delay. A new low-power syndrome computation is proposed in the paper. A dual-line architecture of the modified Berlekamp-Massey algorithm was chosen for Ultra Wide-band (UWB) as an application example. The results obtained are very encouraging in terms of both silicon area and power. A detailed analysis of the results is presented, and they are also compared with other published industrial and academic designs.

I. INTRODUCTION

Reed-Solomon (RS) codes have been widely used in a variety of communication systems. Continual demand for ever higher data rates and storage capacity makes it necessary to devise very high-speed implementations of RS decoders. A number of algorithms are available, and this often makes it difficult to determine the best choice due to the number of variables and trade-offs available. For the IEEE 802.15-03 standard proposal (commonly known as UWB) in particular, very high data rates for transmission are needed. Since the standard is also meant for portable devices, power consumption is of prime concern. There is no clear algorithm or architecture that can meet the low-power and high-throughput requirements of UWB. In this paper, a uniform comparison of various designs and architectures is presented. A dual-line architecture of the Berlekamp-Massey algorithm was implemented, with many other optimisations to the conventional design. In the next section we present an introduction to RS codes and the decoder structure, followed by the syndrome computation architecture. The design space is explored in the following section. We then present the results obtained for the architecture chosen for UWB, followed by some optimisations to the design. The results are then compared with existing architectures in the section on benchmarking, followed by conclusions.

73 citations

Proceedings Article•DOI•

An efficient architecture for the AES mix columns operation

Hua Li, Zachary Friggstad

23 May 2005

TL;DR: A compact architecture for the AES mix columns operation and its inverse is presented; the design has a lower gate count than other designs that implement both the forward and the inverse operation, and it is compared with previous work in this area.

Abstract: In this paper, a compact architecture for the AES mix columns operation and its inverse is presented. The hardware implementation is compared with previous work done in this area. We show that our design has a lower gate count than other designs that implement both the forward and the inverse mix columns operation.

72 citations
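
For readers unfamiliar with the operation, the sketch below is a plain software reference model of AES mix columns and its inverse on a single 4-byte column, following the standard AES definition over GF(2^8). It does not reproduce the paper's hardware architecture; the function names are assumptions, and the sample column is a commonly quoted test vector.

```python
def xtime(a):
    """Multiply a byte by 2 in GF(2^8) with the AES polynomial 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a & 0xFF

def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a = xtime(a)
        b >>= 1
    return product

def _mul_column(column, first_row):
    """Multiply a column by the circulant matrix defined by its first row."""
    result = []
    for r in range(4):
        row = first_row[-r:] + first_row[:-r]   # rotate right by r positions
        acc = 0
        for coefficient, byte in zip(row, column):
            acc ^= gmul(coefficient, byte)
        result.append(acc)
    return result

def mix_column(column):
    """Forward mix columns: circulant matrix with first row (2, 3, 1, 1)."""
    return _mul_column(column, [2, 3, 1, 1])

def inv_mix_column(column):
    """Inverse mix columns: circulant matrix with first row (14, 11, 13, 9)."""
    return _mul_column(column, [14, 11, 13, 9])

SAMPLE = [0xDB, 0x13, 0x53, 0x45]
MIXED = mix_column(SAMPLE)
RESTORED = inv_mix_column(MIXED)
```

The inverse coefficients (9, 11, 13, 14) need more `xtime` steps than the forward ones (1, 2, 3), which is one reason hardware designs look for logic shared between the two directions.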

Journal Article•DOI•

A high-speed low-complexity Reed-Solomon decoder for optical communications

Hanho Lee¹•Institutions (1)

Inha University¹

15 Aug 2005-IEEE Transactions on Circuits and Systems II: Express Briefs

TL;DR: In this paper, a high-speed low-complexity Reed-Solomon (RS) decoder architecture using a pipelined recursive modified Euclidean (PrME) algorithm block for very high speed optical communications is presented.

Abstract: This paper presents a high-speed low-complexity Reed-Solomon (RS) decoder architecture using a novel pipelined recursive modified Euclidean (PrME) algorithm block for very high-speed optical communications. The RS decoder features a low-complexity key equation solver using a PrME algorithm block. The recursive structure enables the novel low-complexity PrME algorithm block to be implemented. Pipelining and parallelizing allow the inputs to be received at very high fiber-optic rates, and outputs to be delivered at correspondingly high rates with minimum delay. This paper presents the key ideas applied to the design of an 80-Gb/s RS decoder architecture, especially those for achieving high throughput and reducing complexity. The 80-Gb/s 16-channel RS decoder has been designed and implemented using 0.13-μm CMOS technology with a supply voltage of 1.2 V. The proposed RS decoder has a core gate count of 393 K and operates at a clock rate of 625 MHz.

67 citations

Design And Implementation Of 2 Bit Ternary ALU Slice

A. P. Dhande, V. T. Ingole

01 Jan 2005

TL;DR: An ALU capable of performing basic ternary arithmetic and logic operations is proposed; it is designed for two-bit operation and can be used for n-bit operations by cascading n/2 ALU slices.

Abstract: This paper describes the architecture, design and implementation of a 2-bit ternary ALU (T-ALU) slice. The proposed ALU is designed for two-bit operation and can be used for n-bit operations by cascading n/2 ALU slices. This ALU is implemented using CMOS ternary logic gates (T-gates) for ternary arithmetic and logic circuits. Ternary gates are implemented using enhancement/depletion MOSFET technology, so the proposed ALU is suitable for LSI/VLSI implementation. The design technique used here requires only two stages, i.e. decoder and T-gates, as against the three stages, i.e. decoder, binary gates and encoder, required in conventional ternary logic implementation.

Index Terms: Ternary, Unary function, T-Gates, Literal.

I. Introduction

Alexander [1964] showed that the natural base (e = 2.71828) is the most efficient radix for implementation of switching circuits. It seems, then, that the most efficient radix for the implementation of digital systems is 3 rather than 2. A ternary logic system has 3-valued switching. A ternary system has several important advantages over binary. These can be summarized as: a reduction in the interconnections required to implement logic functions, thereby reducing chip area; more information transmitted over a given set of lines; and a smaller memory requirement for a given data length. Besides this, serial and some serial-parallel operations can be carried out at higher speed [1][2][3]. Its advantages have been confirmed in applications such as memories, communications and digital signal processing [7]. It has been proven that realization and implementation of combinational and sequential functions is possible for ternary systems [4][5][6][7]. The implementation is based around bipolar transistors, MOSFETs, etc. as basic switching elements, which are referred to as T-gates [8]. Besides this, several authors have proposed reduction techniques to realize ternary functions [9][10][11][12].

In this contribution, we propose an ALU capable of performing basic ternary arithmetic and logic operations as listed in Table 1. We also suggest a scheme that takes advantage of the minimization techniques proposed in [9][11][13] and is implemented using T-gates designed for ternary operations. This scheme shows a reduction in the gate count needed to implement ternary functions. First we describe the design of the 2-bit ALU, and then its integration into an ALU slice. The paper is organized as follows: Section II describes the basic T-gate implementation, the 2-bit ALU architecture is given in Section III, and Section IV describes the 2-bit ALU design and the ALU slice design. Experimental results and performance evaluation are given in Section V. Finally, the conclusion is given in Section VI.

Table 1: Functional Table of T-ALU

65 citations

Journal Article•

Minimization of reversible adder circuits

Saiful Islam, Md. Rafiqul Islam

01 Jan 2005-Asian Journal of Information Technology

TL;DR: This study emphasizes the design of reversible adder circuits that are efficient in terms of gate count, garbage outputs and quantum cost, and that can be technologically mapped.

Abstract: Losing information causes losing power. Information is lost when the input vector cannot be uniquely recovered from the output vector of a combinational circuit. The input vector of a reversible circuit can be uniquely recovered from the output vector. In this study we emphasize the design of reversible adder circuits that are efficient in terms of gate count, garbage outputs and quantum cost, and that can be technologically mapped. It has been analyzed and demonstrated that our proposed adder circuits show better performance than existing designs of a similar type. Technology-independent equations required to evaluate these circuits have also been given.

54 citations

Posted Content•

Comparison of the Cost Metrics for Reversible and Quantum Logic Synthesis

Dmitri Maslov, D. Michael Miller

02 Nov 2005-arXiv: Quantum Physics

TL;DR: In this paper, a breadth-first search method for determining optimal 3-line circuits composed of quantum NOT, CNOT, controlled-V, and controlled-V+ gates is introduced.

Abstract: A breadth-first search method for determining optimal 3-line circuits composed of quantum NOT, CNOT, controlled-V and controlled-V+ (NCV) gates is introduced. Results are presented for simple gate count and for technology motivated cost metrics. The optimal NCV circuits are also compared to NCV circuits derived from optimal NOT, CNOT and Toffoli (NCT) gate circuits. The work presented here provides basic results and motivation for continued study of the direct synthesis of NCV circuits, and establishes relations between function realizations in different circuit cost metrics.

48 citations

Patent•

Efficient and flexible GPS receiver baseband architecture

Hansheng Wang, Chi-Shin Wang

06 May 2005

TL;DR: In this paper, the authors proposed a new baseband integrated circuit (IC) architecture for direct sequence spread spectrum (DSSS) communication receivers, which has a single set of baseband correlators serving all channels in succession.

Abstract: The present invention provides a new baseband integrated circuit (IC) architecture for direct sequence spread spectrum (DSSS) communication receivers. The baseband IC has a single set of baseband correlators serving all channels in succession. No complex parallel channel hardware is required. A single on-chip code Numerically Controlled Oscillator (NCO) drives a pseudorandom number (PN) sequence generator, generates all code sampling frequencies, and is capable of self-correction through feedback from an off-chip processor. A carrier NCO generates corrected local frequencies. These on-chip NCOs generate all the necessary clocks. This architecture advantageously reduces the total hardware necessary for the receiver, and the baseband IC can thus be realized with a minimal gate count. The invention can accommodate any number of channels in a navigational system such as the Global Positioning System (GPS), GLONASS, WAAS, LAAS, etc. The number of channels can be increased by increasing the circuit clock speed.

Proceedings Article•DOI•

A simple and cost effective video encoder with memory-reducing CAVLC

Yeong-Kang Lai¹, Chih-Chung Chou¹, Yu-Chieh Chung¹•Institutions (1)

National Chung Hsing University¹

23 May 2005

TL;DR: The results show that a low-cost encoder is feasible, and the memory size of the proposed architecture is smaller than others.

Abstract: In this paper, a simple and cost-effective video encoder with a memory-efficient context adaptive variable length coder (CAVLC) is proposed for low-cost multimedia applications. According to the proposed memory reduction architecture, three coding level variables (prefix, length, and codeword) can be calculated on-the-fly to eliminate seven (level-VLCN, N=0 to 6) 28×64 k bit coding table memories. We implemented the design on a Xilinx FPGA prototyping board. Its maximum working frequency is 28 MHz, and the gate count is 9171 (NAND2) in TSMC 0.35 μm technology (video encoder only). The results show that a low-cost encoder is feasible, and the memory size of the proposed architecture is smaller than others.

Proceedings Article•DOI•

An hardware efficient deblocking filter for H.264/AVC

Chao-Chung Cheng¹, Tian-Sheuan Chang¹•Institutions (1)

National Chiao Tung University¹

05 Dec 2005

TL;DR: This work presents an efficient VLSI architecture for the deblocking filter in the H.264/AVC standard that can easily support real-time deblocking of 2K × 1K @ 30 Hz video; this performance can meet the requirements of high-resolution real-time applications.

Abstract: This work presents an efficient VLSI architecture for the deblocking filter in the H.264/AVC standard. The computing flow is reordered for easy hardware implementation. The resulting design can achieve 100 MHz with a gate count of 9.16 K when synthesized from a Verilog RTL design using UMC 0.18 μm CMOS technology. When clocked at 82.58 MHz, our design can easily support real-time deblocking of 2K × 1K @ 30 Hz video; this performance can meet the requirements of high-resolution real-time applications.

Proceedings Article•DOI•

A single-chip FPGA design for real-time ICA-based blind source separation algorithm

Charayaphan Charoensak¹, Farook Sattar¹•Institutions (1)

Nanyang Technological University¹

23 May 2005

TL;DR: An efficient hardware architecture for the implementation of real-time BSS that can be implemented using a low-cost FPGA is proposed and a good balance between hardware requirement (gate count and minimal clock speed) and separation performance is offered.

Abstract: Blind source separation (BSS) of independent sources from their mixtures is a common problem in real world multi-sensor applications. In this paper, we propose an efficient hardware architecture for the implementation of real-time BSS that can be implemented using a low-cost FPGA. The architecture offers a good balance between hardware requirement (gate count and minimal clock speed) and separation performance. The FPGA design implements the modified Torkkola BSS algorithm for audio signals based on the ICA (independent component analysis) technique. The separation is performed by implementing noncausal filters, instead of the typical causal filters, within the feedback network. The architecture of the hardware is described. Results of various FPGA simulations and real-time testing of the final hardware design in a real environment are given.

Proceedings Article•DOI•

Efficient FPGA implementation of an adaptive noise canceller

A. Di Stefano, A. Scaglione, C. Giaconia

04 Jul 2005

TL;DR: A hardware implementation of an adaptive noise canceller (ANC) synthesized within an FPGA, using a modified version of the least mean square (LMS) error algorithm, is presented; it is useful for enhancing the S/N ratio of data collected from sensors working in noisy environments or dealing with potentially weak signals.

Abstract: A hardware implementation of an adaptive noise canceller (ANC) is presented. It has been synthesized within an FPGA, using a modified version of the least mean square (LMS) error algorithm. The results obtained so far show a significant decrease in the required gate count when compared with a standard LMS implementation, while increasing the ANC bandwidth and signal to noise (S/N) ratio. This novel adaptive noise canceller is therefore useful for enhancing the S/N ratio of data collected from sensors (or sensor arrays) working in noisy environments, or dealing with potentially weak signals.
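
For context, the standard LMS update that such a design is compared against can be sketched in a few lines of Python. This is the textbook algorithm, not the modified version described in the abstract; the tap count, step size `mu`, the synthetic signals and the 2-tap noise path are all assumptions made for the example.

```python
import math
import random

def lms_noise_canceller(primary, reference, num_taps=4, mu=0.05):
    """Standard LMS: adapt an FIR filter on the reference input so that its
    output tracks the noise in the primary input; the error is the cleaned
    signal."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps               # reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate                 # cleaned output sample
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned, weights

def mean_square(values):
    return sum(v * v for v in values) / len(values)

random.seed(0)
N = 4000
reference = [random.uniform(-1.0, 1.0) for _ in range(N)]
clean = [0.2 * math.sin(0.05 * k) for k in range(N)]
# Hypothetical noise path: a 2-tap FIR filter applied to the reference noise.
noise = [0.8 * reference[k] - 0.3 * (reference[k - 1] if k else 0.0)
         for k in range(N)]
primary = [s + n for s, n in zip(clean, noise)]

cleaned, weights = lms_noise_canceller(primary, reference)

# Compare noise power before and after cancellation over the last quarter,
# once the filter has had time to converge.
tail = slice(3 * N // 4, N)
noise_power = mean_square(noise[tail])
residual_power = mean_square(
    [e - s for e, s in zip(cleaned[tail], clean[tail])])
```

Each sample costs `num_taps` multiplications for the filter output and another `num_taps` for the weight update, which is the kind of arithmetic a hardware implementation tries to economize on.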

Proceedings Article•DOI•

An efficient deblocking filter architecture with 2-dimensional parallel memory for H.264/AVC

Lingfeng Li¹, Satoshi Goto¹, Takeshi Ikenaga¹•Institutions (1)

Waseda University¹

18 Jan 2005

TL;DR: An efficient architecture for deblocking filter in H.264/AVC is presented and a novel 2-dimensional parallel memory scheme is employed in order to achieve highly efficient parallel access in both horizontal and vertical directions.

Abstract: In this paper, we present an efficient architecture for deblocking filter in H.264/AVC. A novel 2-dimensional parallel memory scheme is employed in order to achieve highly efficient parallel access in both horizontal and vertical directions. By using this parallel memory scheme, we also eliminate the need for a transpose circuit. Our design is implemented under 0.35 μm technology. Synthesis results show that the equivalent gate count is only 9.35K (not including SRAMs) when the maximum frequency is 100 MHz.

Proceedings Article•DOI•

A 16-bit fixed-point digital signal processor for digital power converter control

Eamon O'Malley¹, Karl Rinne¹•Institutions (1)

University of Limerick¹

06 Mar 2005

TL;DR: This paper describes a novel and highly versatile reduced instruction set (RISC) based fixed-point digital signal processor (DSP) that has been optimized for digitally controlled switched mode power converters (SMPCs).

Abstract: This paper describes a novel and highly versatile reduced instruction set (RISC) based fixed-point digital signal processor (DSP). Its architecture, instruction set, and integrated programmable digital pulse width modulator (DPWM) have been optimized for digitally controlled switched mode power converters (SMPCs). Designed using the Verilog hardware description language (HDL), the prototype DSP integrated circuit (IC) was built on a standard 0.35 μm digital CMOS process (with a 20 K gate count). It occupies less than 1.5 mm² and dissipates approximately 5 mW from a 3.3 V supply at 50 MIPS. The device provides a programmable and cost-effective solution for digitally controlled SMPCs.

Journal Article•DOI•

Bit manipulation accelerator for communication systems digital signal processor

Sug H. Jeong¹, Myung Hoon Sunwoo¹, Seong Keun Oh¹•Institutions (1)

Ajou University¹

01 Jan 2005-EURASIP Journal on Advances in Signal Processing

TL;DR: Application-specific instructions and their bit manipulation unit (BMU), which efficiently support scrambling, convolutional encoding, puncturing, interleaving, and bit stream multiplexing, are proposed.

Abstract: This paper proposes application-specific instructions and their bit manipulation unit (BMU), which efficiently support scrambling, convolutional encoding, puncturing, interleaving, and bit stream multiplexing. The proposed DSP employs the BMU supporting parallel shift and XOR (exclusive-OR) operations and bit insertion/extraction operations on multiple data. The proposed architecture has been modeled in VHDL and synthesized using the SEC 0.18 µm standard cell library, and the gate count of the BMU is only about 1700 gates. Performance comparisons show that the number of clock cycles can be reduced by about 40%-80% for scrambling, convolutional encoding, and interleaving compared with existing DSPs.

Proceedings Article•DOI•

Design of a single chip block coder for the EBCOT engine in JPEG2000

A. Kumar Gupta¹, M. Dyer¹, A. Hirsch¹, Saeid Nooshabadi¹, David Taubman¹•Institutions (1)

University of New South Wales¹

01 Jan 2005

TL;DR: The VLSI design of a BC system that can process 21 mega pixels per second is presented, which is the highest ever reported for a JPEG2000 BC engine capable of handling both normal and causal modes of operation.

Abstract: The main challenge in the VLSI design of efficient JPEG2000 hardware is the block coder (BC) engine for the embedded block coding with optimised truncation (EBCOT). In this paper, we present the VLSI design of a BC system that can process 21 megapixels per second. For the bit plane coder (BPC), we employ a concurrent symbol processing (CSP) algorithm to process all 4 sample locations within a stripe-column in a single clock cycle during a pass. The BPC produces, on average, 1.21 context data (CxD) pairs per clock cycle. In addition, we have designed an arithmetic coder (AC) that processes 2 CxDs per clock cycle. To allow for an efficient coupling of the proposed BPC and AC modules, we also propose a novel architecture for an intermediate buffer. The BC chip, implemented in TSMC 0.18 μm technology, occupies an area of 1.6 mm², with an equivalent gate count of 95,000 that includes 24576 memory bits. It runs at a clock frequency of 100 MHz. Its processing throughput is the highest ever reported for a JPEG2000 BC engine capable of handling both normal and causal modes of operation.

Proceedings Article•DOI•

An optimal adder-based hardware architecture for the DCT/SA-DCT

Andrew Kinane¹, Valentin Muresan¹, Noel E. O'Connor¹•Institutions (1)

Dublin City University¹

24 Jun 2005

TL;DR: This work proposes a new multiplier-less serial datapath based solely on adders and multiplexers to improve area and power and implements the SA-DCT packing with minimal switching using efficient addressing logic with a transpose memory RAM.

Abstract: The explosive growth of the mobile multimedia industry has accentuated the need for efficient VLSI implementations of the associated computationally demanding signal processing algorithms. This need becomes greater as end-users demand increasingly enhanced features and more advanced underpinning video analysis. One such feature is object-based video processing as supported by the MPEG-4 core profile, which allows content-based interactivity. MPEG-4 has many computationally demanding underlying algorithms, an example of which is the Shape Adaptive Discrete Cosine Transform (SA-DCT). The dynamic nature of the SA-DCT processing steps poses significant VLSI implementation challenges, and many of the previously proposed approaches use area- and power-hungry multipliers. Most also ignore the subtleties of the packing steps and the manipulation of the shape information. We propose a new multiplier-less serial datapath based solely on adders and multiplexers to improve area and power. The adder cost is minimised by employing resource re-use methods. The number of (physical) adders used has been derived using a common sub-expression elimination algorithm. Additional energy efficiency is factored into the design by employing guarded evaluation and local clock gating. Our design implements the SA-DCT packing with minimal switching using efficient addressing logic with a transpose memory RAM. The entire design has been synthesized using TSMC 0.09 μm TCBN90LP technology, yielding a gate count of 12028 for the datapath and its control logic.

Proceedings Article•DOI•

Improved FFSBM algorithm and its VLSI architecture for variable block size motion estimation of H.264

Li Zhang¹, Wen Guo¹•Institutions (1)

Chinese Academy of Sciences¹

01 Jan 2005

TL;DR: Experimental result shows that this algorithm-hardware co-design gives better area/throughput tradeoff than the existing ones and is a proper solution for H.264's variable block size motion estimation.

Abstract: The video coding standard H.264/AVC has adopted variable block size motion estimation to improve coding efficiency, which has brought a heavy computation burden. The FFSBM (fast full search block matching) algorithm has been proposed to reduce the complexity. This paper proposes an improved FFSBM to adaptively reduce the complexity of FFSBM according to the degree of motion activity. A modular 2-D VLSI architecture to implement the improved algorithm is also proposed; the size of the PE array is carefully selected to reduce the gate count. Experimental results show that this algorithm-hardware co-design gives a better area/throughput tradeoff than the existing ones and is a proper solution for H.264's variable block size motion estimation.

Journal Article•DOI•

Design of low-cost FPGA hardware for real-time ICA-based blind source separation algorithm

Charayaphan Charoensak¹, Farook Sattar¹•Institutions (1)

Nanyang Technological University¹

01 Jan 2005-EURASIP Journal on Advances in Signal Processing

TL;DR: The FPGA design implements the modified Torkkola's BSS algorithm for audio signals based on ICA (independent component analysis) technique, which reduces the required length of the unmixing filters as well as provides better separation and faster convergence.

Abstract: Blind source separation (BSS) of independent sources from their convolutive mixtures is a problem in many real-world multi-sensor applications. In this paper, we propose and implement an efficient FPGA hardware architecture for the realization of a real-time BSS. The architecture can be implemented using a low-cost FPGA (field programmable gate array). The architecture offers a good balance between hardware requirements (gate count and minimal clock speed) and separation performance. The FPGA design implements the modified Torkkola's BSS algorithm for audio signals based on the ICA (independent component analysis) technique. Here, the separation is performed by implementing noncausal filters, instead of the typical causal filters, within the feedback network. This reduces the required length of the unmixing filters and provides better separation and faster convergence. A description of the hardware and a discussion of some issues regarding the practical hardware realization are presented. Results of various FPGA simulations, as well as real-time testing of the final hardware design in a real environment, are given.

Proceedings Article•DOI•

FPGA-based conformance testing and system prototyping of an MPEG-4 SA-DCT hardware accelerator

Andrew Kinane¹, A. Casey¹, Valentin Muresan¹, Noel E. O'Connor¹•Institutions (1)

Dublin City University¹

11 Dec 2005

TL;DR: Two FPGA implementations of a shape adaptive discrete cosine transform (SA-DCT) accelerator are presented, and the proposed accelerator meets real-time constraints on both platforms with a gate count of approximately 40k, and outperforms the optimised reference software implementation by 20 times.

Abstract: Two FPGA implementations of a shape adaptive discrete cosine transform (SA-DCT) accelerator are presented in this paper: one PCI-based and the other AMBA-based. The former is used for conformance testing with the MPEG-4 standard requirements. The latter is an alternative platform for system prototyping and has an architecture more representative of a mobile device. The proposed accelerator meets real-time constraints on both platforms with a gate count of approximately 40k, and outperforms the optimised reference software implementation by 20 times. It is estimated that the accelerator consumes 250 mW on a Virtex-E FPGA and 79 mW on a Virtex-II FPGA in the worst-case scenario.

Proceedings Article•DOI•

A 16,000-gate-count optically reconfigurable gate array in a standard 0.35 μm CMOS technology

Minoru Watanabe¹, Fuminori Kobayashi¹•Institutions (1)

Kyushu Institute of Technology¹

23 May 2005

TL;DR: This paper presents the new design of a 16000-gate-count ORGA using a standard 0.35 μm 3-metal CMOS process technology and extracts photodiode characteristics from experimental results using an estimation chip and an evaluation of optical reconfiguration circuits using HSPICE simulation.

Abstract: Up to now, we have fabricated 68-gate-count optically reconfigurable gate arrays (ORGA), the reconfiguration period of which has been confirmed as less than 10 ns. As the next step, we have begun development of high-gate-count ORGA. The new ORGA-VLSI chip can achieve a 16000-gate-count through reduction of photodiode size, photodiode spacing, and through introduction of a small optical reconfiguration circuit, that do not exceed the resolution of available optical components. This paper presents the new design of a 16000-gate-count ORGA using a standard 0.35 μm 3-metal CMOS process technology. In addition, photodiode characteristics are extracted from experimental results using an estimation chip and an evaluation of optical reconfiguration circuits using HSPICE simulation.

Patent•

Processors for multi-dimensional sequence comparisons

Laurence Cooke, Stephen Zweig

08 Jun 2005

TL;DR: Processors and processing methods, suitable for use with recursive analytical techniques such as the Smith-Waterman algorithm, are disclosed for high-speed computerized comparison analysis of multiple linear symbol or character sequences, such as biological nucleic acid sequences, protein sequences, or other long linear arrays of characters.

Abstract: Improved processors and processing methods are disclosed for high-speed computerized comparison analysis of multiple linear symbol or character sequences, such as biological nucleic acid sequences, protein sequences, or other long linear arrays of characters. These improved processors and processing methods, which are suitable for use with recursive analytical techniques such as the Smith-Waterman algorithm, and the like, are optimized for minimum gate count and maximum clock cycle computing efficiency. This is done by interleaving multiple linear sequence comparison operations per processor, which optimizes use of the processor's resources. In use, a plurality of such processors are embedded in high-density integrated circuit chips, and run synchronously to efficiently analyze long sequences. Such processor designs and methods exceed the performance of currently available designs, and facilitate lossless higher dimensional sequence comparison analysis between three or more linear sequences.

Proceedings Article•DOI•

Universal Architectures for Reed-Solomon Error-and-Erasure Decoder

Fu-Ke Chang¹, Chien-Ching Lin¹, Hsie-Chia Chang¹, Chen-Yi Lee¹•Institutions (1)

National Chiao Tung University¹

01 Jan 2005

TL;DR: The proposed design, based on the Montgomery multiplication algorithm, can support various finite field degrees, different primitive polynomials, and erasure decoding functions, and features an on-the-fly finite field inversion table for high speed error evaluation.

Abstract: This paper presents a universal architecture for a Reed-Solomon (RS) error-and-erasure decoder that can accommodate any codeword with different code parameters and finite field definitions. In comparison with other reconfigurable RS decoders, the proposed design, based on the Montgomery multiplication algorithm, can support various finite field degrees, different primitive polynomials, and erasure decoding functions. In addition, the decoder features an on-the-fly finite field inversion table for high-speed error evaluation. The area-efficient design approach is also presented. Implemented with 1.2 V 0.13 μm 1P8M technology, this decoder, correcting up to 16 errors, can operate at 300 MHz and reach a 2.4 Gb/s data rate. The total gate count is about 54K and the core size is 0.36 mm². The average power consumption is 20.2 mW.

Proceedings Article•DOI•

A 51,272-gate-count Dynamic Optically Reconfigurable Gate Array in a standard 0.35 μm CMOS Technology

Minoru Watanabe, Fuminori Kobayashi

13 Sep 2005-The Japan Society of Applied Physics

Proceedings Article•DOI•

High level synthesis for data-driven applications

Etienne Bergeron¹, X. Saint-Mleux¹, Marc Feeley¹, Jean-Pierre David¹•Institutions (1)

Université de Montréal¹

08 Jun 2005

TL;DR: This paper presents an environment for the high level description, refinement, synthesis and verification of data driven architectures and shows how HDL can be used as the intermediate language of a compiler for an even higher level functional programming language.

Abstract: John von Neumann proposed his famous architecture in a context where hardware was very expensive and bulky. His goal was to maximize functionality with minimal hardware. Presently, logic gates are nearly free and single chips contain billions of gates. However, most current designs are still based on von Neumann's architecture because processors are built on this model. Nevertheless, the main current challenge is to be able to design, refine, synthesize and verify new architectures in minimum time and with maximum computational performance regardless of the gate count. Data driven architectures enable a high level of parallelism because instead of a single controller managing all the resources (and often a single ALU), tens or hundreds of small controllers can now operate in parallel on local processing units. This paper presents an environment for the high level description, refinement, synthesis and verification of such systems. Our own HDL is presented with its compiler and we show how it can be used as the intermediate language of a compiler for an even higher level functional programming language. Ongoing work enables the interfacing with other languages (from both hardware and software communities). We also intend to target asynchronous designs.

Journal Article•DOI•

Design and FPGA implementation of an MPEG based video scalar with reduced on-chip memory utilization

S. Ramachandran¹, Sumana Srinivasan¹•Institutions (1)

Indian Institutes of Technology¹

01 Jun 2005-Journal of Systems Architecture

TL;DR: Transmission of high-resolution pictures of XGA format and above, even after compression, demands very high serial channel bandwidth, far exceeding the maximum prescribed by the MPEG-2 standards; this can be circumvented by downscaling and then compressing before transmission, trading off a little image quality, as presented in this paper.

Real-time FPGA realization of an UWB transceiver physical layer

Darryn Lowe¹•Institutions (1)

University of Wollongong¹

01 Jan 2005

TL;DR: In this paper, an original ultra-wideband (UWB) physical layer (PHY) specification is developed and implemented in digital logic, which is based on a combination of complementary code division multiplexing (CCDM) and multicode interleaved direct sequence (MCIDS) spreading, which provides an additional fixed process gain as well as multipath robustness.

Abstract: An original ultra-wideband (UWB) physical layer (PHY) specification is developed and implemented in digital logic. The novelty of this UWB PHY is based on a combination of complementary code division multiplexing (CCDM), which yields a low-interference signal with a variable process gain, and multicode interleaved direct sequence (MCIDS) spreading, which provides an additional fixed process gain as well as multipath robustness. To operate at the high sample rates needed for UWB, the digital logic, realized in a Virtex-II field programmable gate array (FPGA), has a highly-pipelined architecture for real-time signal processing. In addition, the gate count is minimized by avoiding the use of explicit buffer memory wherever possible. The performance of the transceiver is analyzed under a variety of UWB channels and impairments. It is concluded that the proposed UWB PHY offers robust performance in real-world environments and that it is viable for use in future communication systems.
