
Showing papers on "Gate count" published in 2005


Journal ArticleDOI
TL;DR: A basic method and a bidirectional synthesis algorithm which produces a network of Toffoli gates realizing a given reversible specification, and an asymptotically optimal modification of the basic synthesis algorithm employing generalized mEXOR gates is presented.
Abstract: Reversible logic functions can be realized as networks of Toffoli gates. The synthesis of Toffoli networks can be divided into two steps. First, find a network that realizes the desired function. Second, transform the network such that it uses fewer gates, while realizing the same function. This paper addresses the above synthesis approach. We present a basic method and, based on that, a bidirectional synthesis algorithm which produces a network of Toffoli gates realizing a given reversible specification. An asymptotically optimal modification of the basic synthesis algorithm employing generalized mEXOR gates is also presented. Transformations are then applied using template matching. The basis for a template is a network of gates that realizes the identity function. If a sequence of gates in the synthesized network matches a sequence comprised of more than half the gates in a template, then a transformation using the remaining gates in the template can be applied resulting in a reduction in the gate count for the synthesized network. All templates with up to six gates are described in this paper. Experimental results including an exhaustive examination of all 3-variable reversible functions and a collection of benchmark problems are presented. The paper concludes with suggestions for further research.

220 citations
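
The template idea in the abstract above can be sketched in a few lines of Python. This is only an illustration: the gate encoding `(controls, target)`, the helper names, and the 5-gate CNOT identity used as `TEMPLATE` are assumptions chosen for the example, not taken from the paper. Because every Toffoli gate is its own inverse, a matched run of more than half of an identity template can be replaced by the remaining gates in reverse order.

```python
from itertools import product

def apply_toffoli(bits, controls, target):
    """Flip bits[target] when every control line is 1 (no controls acts as NOT)."""
    out = list(bits)
    if all(out[c] for c in controls):
        out[target] ^= 1
    return tuple(out)

def run_network(gates, bits):
    """Apply a cascade of (controls, target) gates from left to right."""
    for controls, target in gates:
        bits = apply_toffoli(bits, controls, target)
    return bits

def truth_table(gates, n):
    """Output pattern of the network for every n-bit input pattern."""
    return [run_network(gates, b) for b in product((0, 1), repeat=n)]

def is_identity(gates, n):
    return truth_table(gates, n) == list(product((0, 1), repeat=n))

# Hypothetical 5-gate template on lines 0, 1, 2 that realizes the identity.
TEMPLATE = [((0,), 1), ((1,), 2), ((0,), 1), ((1,), 2), ((0,), 2)]

# A matched prefix of 3 gates (more than half of 5) ...
matched = TEMPLATE[:3]
# ... can be swapped for the other 2 gates in reverse order.
replacement = TEMPLATE[3:][::-1]
```

Here `matched` and `replacement` compute the same function, so substituting one for the other removes one gate from the network.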


Journal ArticleDOI
TL;DR: This paper presents a method that synthesizes a network with the most common reversible gates, the Toffoli gate and the Fredkin gate, and compares the results to the optimal results.
Abstract: Reversible logic has applications in quantum computing, low power CMOS, nanotechnology, optical computing, and DNA computing. The most common reversible gates are the Toffoli gate and the Fredkin gate. We present a method that synthesizes a network with these gates in two steps. First, our synthesis algorithm finds a cascade of Toffoli and Fredkin gates with no backtracking and minimal look-ahead. Next we apply transformations that reduce the number of gates in the network. Transformations are accomplished via template matching. The basis for a template is a network with m gates that realizes the identity function. If a sequence of gates in the network to be reduced matches a sequence of gates comprising more than half of a template, then a transformation that reduces the gate count can be applied. We have synthesized all three input, three output reversible functions and here compare our results to the optimal results. We also present the results of applying our synthesis tool to obtain networks for a number of benchmark functions.

107 citations


Journal ArticleDOI
TL;DR: The novelty of this work lies in the introduction of the first comprehensive synthesis methodology and tool for general multilevel threshold logic design, built on top of an existing Boolean logic synthesis tool.
Abstract: We propose an algorithm for efficient threshold network synthesis of arbitrary multioutput Boolean functions. Many nanotechnologies, such as resonant tunneling diodes, quantum cellular automata, and single electron tunneling, are capable of implementing threshold logic efficiently. The main purpose of this work is to bridge the current wide gap between research on nanoscale devices and research on synthesis methodologies for generating optimized networks utilizing these devices. While functionally-correct threshold gates and circuits based on nanotechnologies have been successfully demonstrated, there exists no methodology or design automation tool for general multilevel threshold network synthesis. We have built the first such tool, threshold logic synthesizer (TELS), on top of an existing Boolean logic synthesis tool. Experiments with 56 multioutput benchmarks indicate that, compared to traditional logic synthesis, up to 80.0% and 70.6% reduction in gate count and interconnect count, respectively, is possible with the average being 22.7% and 12.6%, respectively. Furthermore, the synthesized networks are well-balanced structurally. The novelty of this work lies in the introduction of the first comprehensive synthesis methodology and tool for general multilevel threshold logic design.

91 citations
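
To illustrate why threshold logic can lower gate count, here is a minimal Python sketch of a single threshold gate. The function name and the weight and threshold values are assumptions made for the example; it does not reproduce the TELS synthesis flow. A gate outputs 1 when the weighted sum of its inputs reaches the threshold, so a 3-input majority function that needs several AND/OR gates in conventional logic fits in one threshold gate.

```python
from itertools import product

def threshold_gate(weights, threshold, inputs):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return int(total >= threshold)

def majority_boolean(a, b, c):
    """Conventional realization: three AND gates feeding an OR gate."""
    return (a & b) | (b & c) | (a & c)

# One threshold gate with unit weights and threshold 2 realizes majority.
MAJ_WEIGHTS, MAJ_THRESHOLD = (1, 1, 1), 2

majority_matches = all(
    threshold_gate(MAJ_WEIGHTS, MAJ_THRESHOLD, bits) == majority_boolean(*bits)
    for bits in product((0, 1), repeat=3)
)
```

With the same weights, changing only the threshold gives a 3-input OR (threshold 1) or a 3-input AND (threshold 3).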


Patent
08 Sep 2005
TL;DR: In this article, a high-speed, low-complexity Reed-Solomon (RS) decoder architecture using a novel pipelined recursive Modified Euclidean (PrME) algorithm block for very high speed optical communications is provided.
Abstract: A high-speed, low-complexity Reed-Solomon (RS) decoder architecture using a novel pipelined recursive Modified Euclidean (PrME) algorithm block for very high-speed optical communications is provided. The RS decoder features a low-complexity Key Equation Solver using a PrME algorithm block. The recursive structure enables the low-complexity PrME algorithm block to be implemented. Pipelining and parallelizing allow the inputs to be received at very high fiber optic rates, and outputs to be delivered at correspondingly high rates with minimum delay. An 80-Gb/s RS decoder architecture using 0.13-μm CMOS technology in a supply voltage of 1.2 V is disclosed that features a core gate count of 393 K and operates at a clock rate of 625 MHz. The RS decoder has a wide range of applications, including fiber optic telecommunication applications, hard drive or disk controller applications, computational storage system applications, CD or DVD controller applications, fiber optic systems, router systems, wireless communication systems, cellular telephone systems, microwave link systems, satellite communication systems, digital television systems, networking systems, high-speed modems and the like.

86 citations


Proceedings ArticleDOI
01 Jan 2005
TL;DR: A uniform comparison between various algorithms and architectures used for Reed Solomon (RS) decoder, and the results obtained are very encouraging both in terms of silicon area and power.
Abstract: This paper presents a uniform comparison between various algorithms and architectures used for Reed Solomon (RS) decoder. For each design option, a detailed hardware analysis is provided, in terms of gate count, latency and critical path delay. A new low-power syndrome computation is proposed in the paper. Dual-line architecture of modified Berlekamp Massey algorithm was chosen for Ultra Wide-band (UWB) as an application example. The results obtained are very encouraging both in terms of silicon area and power. A detailed analysis of results is presented and they are also compared with other published industrial and academic designs. I. INTRODUCTION Reed Solomon (RS) codes have been widely used in a variety of communication systems. Continual demand for ever higher data rates and storage capacity makes it necessary to devise very high-speed implementations of RS decoders. A number of algorithms are available and this often makes it difficult to determine the best choice due to the number of variables and trade-offs available. For IEEE 802.15-03 standard proposal (commonly known as UWB) in particular, very high data rates for transmission are needed. Since the standard is also meant for portable devices, power consumption is of prime concern. There is no clear algorithm or architecture that can meet the low-power and high-throughput requirements of UWB. In this paper, a uniform comparison of various designs and architecture is presented. Dual-line architecture of Berlekamp Massey algorithm was implemented, with a lot of other optimisations to the conventional design. In the next section we present an introduction to RS codes and the decoder structure, followed by syndrome computation architecture. The design space is explored in the following section. We then present the results obtained for the architecture chosen for UWB followed by some optimisations to the design. The results are then compared with existing architectures in the section on benchmarking followed by conclusions.

73 citations


Proceedings ArticleDOI
23 May 2005
TL;DR: A compact architecture for the AES mix columns operation and its inverse is presented; the design has a lower gate count than other designs that implement both the forward and the inverse mix columns operation, and is compared with previous work in this area.
Abstract: In this paper, a compact architecture for the AES mix columns operation and its inverse is presented. The hardware implementation is compared with previous work done in this area. We show that our design has a lower gate count than other designs that implement both the forward and the inverse mix columns operation.

72 citations
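
For readers unfamiliar with the operation, the following Python sketch shows the standard AES MixColumns transform and its inverse on a single 4-byte column. It is a software reference only and says nothing about the compact hardware architecture the paper proposes; the function names are assumptions made for the example. Both directions share the same circulant matrix structure over GF(2^8) and differ only in their coefficients, which is what makes a shared datapath attractive.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF

def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def mix_column(col, coeffs=(2, 3, 1, 1)):
    """Multiply a column by the circulant matrix whose first row is coeffs."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            acc ^= gmul(coeffs[(j - i) % 4], col[j])
        out.append(acc)
    return out

def inv_mix_column(col):
    """Inverse transform: same structure, first row 0e 0b 0d 09."""
    return mix_column(col, coeffs=(14, 11, 13, 9))

column = [0xDB, 0x13, 0x53, 0x45]
mixed = mix_column(column)
restored = inv_mix_column(mixed)
```

Applying `inv_mix_column` to `mixed` returns the original column, which is a quick sanity check for any implementation of the pair.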


Journal ArticleDOI
Hanho Lee1
TL;DR: In this paper, a high-speed low-complexity Reed-Solomon (RS) decoder architecture using a pipelined recursive modified Euclidean (PrME) algorithm block for very high speed optical communications is presented.
Abstract: This paper presents a high-speed low-complexity Reed-Solomon (RS) decoder architecture using a novel pipelined recursive modified Euclidean (PrME) algorithm block for very high-speed optical communications. The RS decoder features a low-complexity key equation solver using a PrME algorithm block. The recursive structure enables the novel low-complexity PrME algorithm block to be implemented. Pipelining and parallelizing allow the inputs to be received at very high fiber-optic rates, and outputs to be delivered at correspondingly high rates with minimum delay. This paper presents the key ideas applied to the design of an 80-Gb/s RS decoder architecture, especially that for achieving high throughput and reducing complexity. The 80-Gb/s 16-channel RS decoder has been designed and implemented using 0.13-μm CMOS technology in a supply voltage of 1.2 V. The proposed RS decoder has a core gate count of 393 K and operates at a clock rate of 625 MHz.

67 citations


01 Jan 2005
TL;DR: An ALU capable of performing basic ternary arithmetic and logic operations is proposed; it is designed for two-bit operation and can be used for n-bit operations by cascading n/2 ALU slices.
Abstract: This paper describes the architecture, design and implementation of a 2-bit ternary ALU (T-ALU) slice. The proposed ALU is designed for two-bit operation and can be used for n-bit operations by cascading n/2 ALU slices. The ALU is implemented using CMOS ternary logic gates (T-Gates) for ternary arithmetic and logic circuits. Ternary gates are implemented using enhancement/depletion MOSFET technology, so the proposed ALU is suitable for LSI/VLSI implementation. The design technique used here requires only two stages, i.e. decoder and T-gates, as against the three stages, i.e. decoder, binary gates and encoder, required in conventional ternary logic implementations. Index Terms: Ternary, Unary function, T-Gates, Literal. I. Introduction Alexander [1964] showed that the natural base (e = 2.71828) is the most efficient radix for the implementation of switching circuits. It seems that 3, rather than 2, is the most efficient radix for the implementation of digital systems. A ternary logic system has 3-valued switching. A ternary system has several important advantages over binary. They can be summarized as a reduction in the interconnections required to implement logic functions, thereby reducing chip area; more information transmitted over a given set of lines; and a lower memory requirement for a given data length. Besides this, serial and some serial-parallel operations can be carried out at higher speed [1][2][3]. Its advantages have been confirmed in applications such as memories, communications and digital signal processing [7]. It has been proven that the realization and implementation of combinational and sequential functions is possible for ternary systems [4][5][6][7]. The implementation is based around bipolar transistors, MOSFETs, etc. as basic switching elements, which are referred to as T-Gates [8]. Besides this, several authors have proposed reduction techniques to realize ternary functions [9][10][11][12].
In this contribution, we propose an ALU capable of performing the basic ternary arithmetic and logic operations listed in Table 1. We also suggest a scheme that takes advantage of the minimization techniques proposed in [9][11][13] and is implemented using T-gates designed for ternary operations. This scheme reduces the number of gates needed to implement ternary functions. First we describe the design of the 2-bit ALU and then integrate it into an ALU slice. The paper is organized as follows: Section II describes the basic T-Gate implementation, the 2-bit ALU architecture is given in Section III, and Section IV describes the 2-bit ALU design and ALU slice design. Experimental results and performance evaluation are given in Section V. Finally, the conclusion is given in Section VI. Table 1: Functional Table of T-ALU

65 citations


Journal Article
TL;DR: This study emphasizes the design of reversible adder circuits that are efficient in terms of gate count, garbage outputs and quantum cost and that can be technologically mapped.
Abstract: Losing information causes a loss of power. Information is lost when the input vector cannot be uniquely recovered from the output vector of a combinational circuit. The input vector of a reversible circuit can be uniquely recovered from the output vector. In this study we emphasize the design of reversible adder circuits that are efficient in terms of gate count, garbage outputs and quantum cost and that can be technologically mapped. It has been analyzed and demonstrated that our proposed adder circuits perform better than similar existing designs. Technology-independent equations required to evaluate these circuits are also given.

54 citations


Posted Content
TL;DR: In this paper, a breadth-first search method for determining optimal 3-line circuits composed of quantum NOT, CNOT, controlled-V, and controlled V+ gates is introduced.
Abstract: A breadth-first search method for determining optimal 3-line circuits composed of quantum NOT, CNOT, controlled-V and controlled-V+ (NCV) gates is introduced. Results are presented for simple gate count and for technology motivated cost metrics. The optimal NCV circuits are also compared to NCV circuits derived from optimal NOT, CNOT and Toffoli (NCT) gate circuits. The work presented here provides basic results and motivation for continued study of the direct synthesis of NCV circuits, and establishes relations between function realizations in different circuit cost metrics.

48 citations
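
The V and V+ gates named in the abstract above are the "square roots of NOT" from the standard quantum-circuit literature. The Python sketch below checks their defining identities with plain 2x2 complex matrices; the helper names are assumptions made for the example, and the sketch does not reproduce the paper's breadth-first search.

```python
def matmul(a, b):
    """Product of two 2x2 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(m):
    """Conjugate transpose of a 2x2 matrix."""
    return [[m[j][i].conjugate() for j in range(2)] for i in range(2)]

def close(a, b, tol=1e-12):
    """Entry-wise comparison with a small tolerance."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))

s = (1 + 1j) / 2
V = [[s, -1j * s], [-1j * s, s]]   # V = (1+i)/2 * [[1, -i], [-i, 1]]
V_DAG = dagger(V)                  # the V+ gate
NOT = [[0, 1], [1, 0]]
IDENTITY = [[1, 0], [0, 1]]

v_squared_is_not = close(matmul(V, V), NOT)
v_dag_squared_is_not = close(matmul(V_DAG, V_DAG), NOT)
v_times_v_dag_is_identity = close(matmul(V, V_DAG), IDENTITY)
```

These three identities are what allow a Toffoli gate to be decomposed into two-qubit NCV gates, which is why NCV gate count and NCT gate count can rank circuits differently.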


Patent
06 May 2005
TL;DR: In this paper, the authors proposed a new baseband integrated circuit (IC) architecture for direct sequence spread spectrum (DSSS) communication receivers, which has a single set of baseband correlators serving all channels in succession.
Abstract: The present invention provides a new baseband integrated circuit (IC) architecture for direct sequence spread spectrum (DSSS) communication receivers. The baseband IC has a single set of baseband correlators serving all channels in succession. No complex parallel channel hardware is required. A single on-chip code Numerically Controlled Oscillator (NCO) drives a pseudorandom number (PN) sequence generator, generates all code sampling frequencies, and is capable of self-correction through feedback from an off-chip processor. A carrier NCO generates corrected local frequencies. These on-chip NCOs generate all the necessary clocks. This architecture advantageously reduces the total hardware necessary for the receiver, and the baseband IC can thus be realized with a minimal gate count. The invention can accommodate any number of channels in a navigational system such as the Global Positioning System (GPS), GLONASS, WAAS, LAAS, etc. The number of channels can be increased by increasing the circuit clock speed.

Proceedings ArticleDOI
23 May 2005
TL;DR: The results show that a low-cost encoder is feasible, and the memory size of the proposed architecture is smaller than others.
Abstract: In this paper, a simple and cost effective video encoder with memory efficient context adaptive variable length coder (CAVLC) is proposed for low cost multimedia applications. According to the proposed memory reduction architecture, three coding level variables (prefix, length, and codeword) can be calculated on-the-fly to eliminate seven (level-VLCN, N=0 to 6) 28×64 k bit coding table memories. We implemented the design on a Xilinx FPGA prototyping board. Its maximum working frequency is 28 MHz. And the gate count is 9171 (NAND2) in TSMC 0.35 μm technology (only the video encoder). The results show that a low-cost encoder is feasible, and the memory size of the proposed architecture is smaller than others.

Proceedings ArticleDOI
05 Dec 2005
TL;DR: This work presents an efficient VLSI architecture for the deblocking filter in H.264/AVC standard that can easily support real-time deblocking of 2K × 1K @ 30 Hz video application; this high performance can meet high resolution real-time application requirement.
Abstract: This work presents an efficient VLSI architecture for the deblocking filter in H.264/AVC standard. The computing flow is reordered for easy hardware implementation. The resulting design can achieve 100 MHz with a gate count of 9.16 K when synthesized from Verilog RTL design by using UMC 0.18 μm CMOS technology. When clocked at 82.58 MHz, our design can easily support real-time deblocking of 2K × 1K @ 30 Hz video application; this high performance can meet high resolution real-time application requirement.
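
The clock figure quoted above can be turned into a per-macroblock cycle budget with a short calculation. The Python sketch below assumes that "2K × 1K" means 2048 × 1024 luma samples and that the filter works on the 16 × 16 macroblocks H.264/AVC uses; neither reading is confirmed by the abstract, and the function name is hypothetical.

```python
def cycles_per_macroblock(clock_hz, width, height, fps, mb_size=16):
    """Clock cycles available per macroblock at a given frame size and rate."""
    macroblocks_per_frame = (width // mb_size) * (height // mb_size)
    macroblocks_per_second = macroblocks_per_frame * fps
    return clock_hz / macroblocks_per_second

# Assumed interpretation of the abstract's figures: 2048 x 1024 at 30 frames/s.
budget = cycles_per_macroblock(82.58e6, width=2048, height=1024, fps=30)
```

Under these assumptions the budget comes out at roughly 336 cycles per macroblock, which would suggest that the quoted 82.58 MHz is the minimum clock for a design that spends about that many cycles on each macroblock.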

Proceedings ArticleDOI
23 May 2005
TL;DR: An efficient hardware architecture for the implementation of real-time BSS that can be implemented using a low-cost FPGA is proposed and a good balance between hardware requirement (gate count and minimal clock speed) and separation performance is offered.
Abstract: Blind source separation (BSS) of independent sources from their mixtures is a common problem in real world multi-sensor applications. In this paper, we propose an efficient hardware architecture for the implementation of real-time BSS that can be implemented using a low-cost FPGA. The architecture offers a good balance between hardware requirement (gate count and minimal clock speed) and separation performance. The FPGA design implements the modified Torkkola BSS algorithm for audio signals based on the ICA (independent component analysis) technique. The separation is performed by implementing noncausal filters, instead of the typical causal filters, within the feedback network. The architecture of the hardware is described. Results of various FPGA simulations and real-time testing of the final hardware design in a real environment are given.

Proceedings ArticleDOI
04 Jul 2005
TL;DR: A hardware implementation of an adaptive noise canceller (ANC) synthesized within an FPGA, using a modified version of the least mean square (LMS) error algorithm, useful for enhancing the S/N ratio of data collected from sensors working in noisy environment, or dealing with potentially weak signals.
Abstract: A hardware implementation of an adaptive noise canceller (ANC) is presented. It has been synthesized within an FPGA, using a modified version of the least mean square (LMS) error algorithm. The results obtained so far show a significant decrease of the required gate count when compared with a standard LMS implementation, while increasing the ANC bandwidth and signal to noise (S/N) ratio. This novel adaptive noise canceller is then useful for enhancing the S/N ratio of data collected from sensors (or sensor arrays) working in noisy environment, or dealing with potentially weak signals.
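
As background for the abstract above, here is a minimal Python sketch of a textbook LMS adaptive noise canceller. It is the standard algorithm, not the modified version the paper implements, and the noise path (taps 0.8 and -0.3), step size and signal are hypothetical values chosen for the example. The filter learns how the reference noise leaks into the primary input and subtracts its estimate, so the error output approaches the wanted signal.

```python
import math
import random

def lms_cancel(primary, reference, taps=2, mu=0.05):
    """Standard LMS: return (error samples, final weights)."""
    weights = [0.0] * taps
    history = [0.0] * taps            # recent reference samples, newest first
    errors = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        y = sum(w * h for w, h in zip(weights, history))   # noise estimate
        e = d - y                                          # cleaned output
        weights = [w + mu * e * h for w, h in zip(weights, history)]
        errors.append(e)
    return errors, weights

def noise_path(noise, n):
    """Hypothetical 2-tap path from the noise source to the primary sensor."""
    return 0.8 * noise[n] - 0.3 * (noise[n - 1] if n else 0.0)

rng = random.Random(0)
N = 4000
noise = [rng.uniform(-1.0, 1.0) for _ in range(N)]
signal = [0.2 * math.sin(2 * math.pi * n / 50) for n in range(N)]
primary = [signal[n] + noise_path(noise, n) for n in range(N)]

cleaned, weights = lms_cancel(primary, noise)
```

Each update costs one multiply-accumulate per tap for filtering and another for adaptation, which is why multiplier count dominates the gate count of a straightforward hardware LMS.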

Proceedings ArticleDOI
18 Jan 2005
TL;DR: An efficient architecture for deblocking filter in H.264/AVC is presented and a novel 2-dimensional parallel memory scheme is employed in order to achieve highly efficient parallel access in both horizontal and vertical directions.
Abstract: In this paper, we present an efficient architecture for deblocking filter in H.264/AVC. A novel 2-dimensional parallel memory scheme is employed in order to achieve highly efficient parallel access in both horizontal and vertical directions. By using this parallel memory scheme, we also eliminate the need for a transpose circuit. Our design is implemented under 0.35 μm technology. Synthesis results show that the equivalent gate count is only 9.35K (not including SRAMs) when the maximum frequency is 100 MHz.

Proceedings ArticleDOI
06 Mar 2005
TL;DR: This paper describes a novel and highly versatile reduced instruction set (RISC) based fixed-point digital signal processor (DSP) that has been optimized for digitally controlled switched mode power converters (SMPCs).
Abstract: This paper describes a novel and highly versatile reduced instruction set (RISC) based fixed-point digital signal processor (DSP). Its architecture, instruction set, and integrated programmable digital pulse width modulator (DPWM) have been optimized for digitally controlled switched mode power converters (SMPCs). Designed using the Verilog hardware description language (HDL), the prototype DSP integrated circuit (IC) was built on a standard 0.35 μm digital CMOS process (with a 20 K gate count). It occupies less than 1.5 mm² and dissipates approximately 5 mW from a 3.3 V supply at 50 MIPS. The device provides a programmable and cost effective solution for digitally controlled SMPCs.

Journal ArticleDOI
TL;DR: Application-specific instructions and their bit manipulation unit (BMU), which efficiently support scrambling, convolutional encoding, puncturing, interleaving, and bit stream multiplexing, are proposed.
Abstract: This paper proposes application-specific instructions and their bit manipulation unit (BMU), which efficiently support scrambling, convolutional encoding, puncturing, interleaving, and bit stream multiplexing. The proposed DSP employs the BMU supporting parallel shift and XOR (exclusive-OR) operations and bit insertion/extraction operations on multiple data. The proposed architecture has been modeled in VHDL and synthesized using the SEC 0.18µm standard cell library, and the gate count of the BMU is only about 1700 gates. Performance comparisons show that the number of clock cycles can be reduced by about 40%-80% for scrambling, convolutional encoding, and interleaving compared with existing DSPs.

Proceedings ArticleDOI
01 Jan 2005
TL;DR: The VLSI design of a BC system that can process 21 mega pixels per second is presented, which is the highest ever reported for a JPEG2000 BC engine capable of handling both normal and causal modes of operation.
Abstract: The main challenge in the VLSI design of an efficient JPEG2000 hardware is the block coder (BC) engine for the embedded block coding with optimised truncation (EBCOT). In this paper, we present the VLSI design of a BC system that can process 21 mega pixels per second. For the bit plane coder (BPC), we employ a concurrent symbol processing (CSP) algorithm to process all 4 sample locations within a stripe-column in a single clock cycle during a pass. The BPC produces, on average, 1.21 context data (CxD) pairs per clock cycle. In addition, we have designed an arithmetic coder (AC) that processes 2 CxDs/clock cycle. To allow for an efficient coupling of the proposed BPC and AC modules, we also propose a novel architecture for an intermediate buffer. The BC chip, implemented on TSMC 0.18 μm technology, occupies an area of 1.6 mm², with an equivalent gate count of 95,000 that includes 24576 memory bits. It runs at a clock frequency of 100 MHz. Its processing throughput is the highest ever reported for a JPEG2000 BC engine capable of handling both normal and causal modes of operation.

Proceedings ArticleDOI
24 Jun 2005
TL;DR: This work proposes a new multiplier-less serial datapath based solely on adders and multiplexers to improve area and power and implements the SA-DCT packing with minimal switching using efficient addressing logic with a transpose memory RAM.
Abstract: The explosive growth of the mobile multimedia industry has accentuated the need for efficient VLSI implementations of the associated computationally demanding signal processing algorithms. This need becomes greater as end-users demand increasingly enhanced features and more advanced underpinning video analysis. One such feature is object-based video processing as supported by MPEG-4 core profile, which allows content-based interactivity. MPEG-4 has many computationally demanding underlying algorithms, an example of which is the Shape Adaptive Discrete Cosine Transform (SA-DCT). The dynamic nature of the SA-DCT processing steps pose significant VLSI implementation challenges and many of the previously proposed approaches use area and power consumptive multipliers. Most also ignore the subtleties of the packing steps and manipulation of the shape information. We propose a new multiplier-less serial datapath based solely on adders and multiplexers to improve area and power. The adder cost is minimised by employing resource re-use methods. The number of (physical) adders used has been derived using a common sub-expression elimination algorithm. Additional energy efficiency is factored into the design by employing guarded evaluation and local clock gating. Our design implements the SA-DCT packing with minimal switching using efficient addressing logic with a transpose memory RAM. The entire design has been synthesized using TSMC 0.09μm TCBN90LP technology yielding a gate count of 12028 for the datapath and its control logic.

Proceedings ArticleDOI
01 Jan 2005
TL;DR: Experimental result shows that this algorithm-hardware co-design gives better area/throughput tradeoff than the existing ones and is a proper solution for H.264's variable block size motion estimation.
Abstract: The video coding standard H.264/AVC has adopted variable block size motion estimation to improve coding efficiency, which has brought heavy computation burden. The FFSBM (fast full search block matching) algorithm has been proposed to reduce the complexity. This paper proposes an improved FFSBM to adaptively reduce the complexity of FFSBM according to the degree of motion activity. A modular 2-D VLSI architecture to implement the improved algorithm is also proposed; the size of the PE array is carefully selected to reduce the gate count. Experimental result shows that this algorithm-hardware co-design gives better area/throughput tradeoff than the existing ones and is a proper solution for H.264's variable block size motion estimation.

Journal ArticleDOI
TL;DR: The FPGA design implements the modified Torkkola's BSS algorithm for audio signals based on ICA (independent component analysis) technique, which reduces the required length of the unmixing filters as well as provides better separation and faster convergence.
Abstract: Blind source separation (BSS) of independent sources from their convolutive mixtures is a problem in many real-world multi-sensor applications. In this paper, we propose and implement an efficient FPGA hardware architecture for the realization of a real-time BSS. The architecture can be implemented using a low-cost FPGA (field programmable gate array). The architecture offers a good balance between hardware requirement (gate count and minimal clock speed) and separation performance. The FPGA design implements the modified Torkkola's BSS algorithm for audio signals based on ICA (independent component analysis) technique. Here, the separation is performed by implementing noncausal filters, instead of the typical causal filters, within the feedback network. This reduces the required length of the unmixing filters as well as provides better separation and faster convergence. Description of the hardware as well as discussion of some issues regarding the practical hardware realization are presented. Results of various FPGA simulations as well as real-time testing of the final hardware design in real environment are given.

Proceedings ArticleDOI
11 Dec 2005
TL;DR: Two FPGA implementations of a shape adaptive discrete cosine transform (SA-DCT) accelerator are presented and the proposed accelerator meets real time constraints on both platforms with a gate count of approximately 40k, and outperforms the optimised reference software implementation by a factor of 20.
Abstract: Two FPGA implementations of a shape adaptive discrete cosine transform (SA-DCT) accelerator are presented in this paper: one PCI-based and the other AMBA-based. The former is used for conformance testing with the MPEG-4 standard requirements. The latter is an alternative platform for system prototyping and has an architecture more representative of a mobile device. The proposed accelerator meets real time constraints on both platforms with a gate count of approximately 40k, and outperforms the optimised reference software implementation by a factor of 20. It is estimated that the accelerator consumes 250 mW on a Virtex-E FPGA and 79 mW on a Virtex-II FPGA in the worst case scenario.

Proceedings ArticleDOI
23 May 2005
TL;DR: This paper presents the new design of a 16000-gate-count ORGA using a standard 0.35 μm 3-metal CMOS process technology and extracts photodiode characteristics from experimental results using an estimation chip and an evaluation of optical reconfiguration circuits using HSPICE simulation.
Abstract: Up to now, we have fabricated 68-gate-count optically reconfigurable gate arrays (ORGA), the reconfiguration period of which has been confirmed as less than 10 ns. As the next step, we have begun development of high-gate-count ORGA. The new ORGA-VLSI chip can achieve a 16000-gate-count through reduction of photodiode size, photodiode spacing, and through introduction of a small optical reconfiguration circuit, that do not exceed the resolution of available optical components. This paper presents the new design of a 16000-gate-count ORGA using a standard 0.35 μm 3-metal CMOS process technology. In addition, photodiode characteristics are extracted from experimental results using an estimation chip and an evaluation of optical reconfiguration circuits using HSPICE simulation.

Patent
08 Jun 2005
TL;DR: In this paper, the Smith-Waterman algorithm is used for high-speed computerized comparison analysis of multiple linear symbol or character sequences, such as biological nucleic acid sequences, protein sequences, or other long linear arrays of characters.
Abstract: Improved processors and processing methods are disclosed for high-speed computerized comparison analysis of multiple linear symbol or character sequences, such as biological nucleic acid sequences, protein sequences, or other long linear arrays of characters. These improved processors and processing methods, which are suitable for use with recursive analytical techniques such as the Smith-Waterman algorithm, and the like, are optimized for minimum gate count and maximum clock cycle computing efficiency. This is done by interleaving multiple linear sequence comparison operations per processor, which optimizes use of the processor's resources. In use, a plurality of such processors are embedded in high-density integrated circuit chips, and run synchronously to efficiently analyze long sequences. Such processor designs and methods exceed the performance of currently available designs, and facilitate lossless higher dimensional sequence comparison analysis between three or more linear sequences.
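
For reference, the Smith-Waterman recurrence mentioned in the abstract above can be sketched in Python as follows. This is the plain software form with a linear gap penalty; the scoring values and function name are assumptions made for the example, and the sketch does not model the interleaved hardware processors the patent describes. Each cell depends only on its left, upper and upper-left neighbours, which is the property that lets hardware evaluate a whole anti-diagonal in parallel.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)         # previous row of the score matrix
    best = 0
    for ca in a:
        curr = [0]                    # column 0 is always 0
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(0, diag, up, left)   # 0 restarts the local alignment
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
embedded = smith_waterman_score("GGACGTCC", "TTACGTAA")
```

Only two rows are kept in memory, so the sketch uses space proportional to the shorter dimension while still visiting every cell once.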

Proceedings ArticleDOI
01 Jan 2005
TL;DR: The proposed design, based on the Montgomery multiplication algorithm, can support various finite field degrees, different primitive polynomials, and erasure decoding functions, and features an on-the-fly finite field inversion table for high speed error evaluation.
Abstract: This paper presents the universal architecture for Reed Solomon (RS) error-and-erasure decoder that can accommodate any codeword with different code parameters and finite field definitions. In comparison with other reconfigurable RS decoders, the proposed design, based on the Montgomery multiplication algorithm, can support various finite field degrees, different primitive polynomials, and erasure decoding functions. In addition, the decoder features an on-the-fly finite field inversion table for high speed error evaluation. The area efficient design approach is also presented. Implemented with 1.2 V 0.13 μm 1P8M technology, this decoder, correcting up to 16 errors, can operate at 300 MHz and reach a 2.4 Gb/s data rate. The total gate count is about 54K and the core size is 0.36 mm². The average power consumption is 20.2 mW.
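
As a rough illustration of why Montgomery multiplication suits a decoder that must handle different field degrees and polynomials, here is a minimal Python sketch of the textbook bit-serial Montgomery product in GF(2^m), which computes a·b·x^(-m) mod f(x). It is not the paper's hardware architecture. The function names are hypothetical, and the field used in the usage note (GF(2^8) with polynomial 0x11B) is an assumption chosen only for demonstration.

```python
def mont_mul_gf2m(a, b, f, m):
    """Bit-serial Montgomery product a * b * x^(-m) mod f over GF(2).

    a, b: field elements as integers with fewer than m bits.
    f:    field polynomial of degree m with constant term 1.
    """
    c = 0
    for i in range(m):
        if (a >> i) & 1:   # add b when bit i of a is set
            c ^= b
        if c & 1:          # adding f clears the low bit of c
            c ^= f
        c >>= 1            # exact division by x
    return c


def gf2m_mul(a, b, f, m):
    """Plain shift-and-add product a * b mod f, used as a reference."""
    r = 0
    for i in range(m):
        if (b >> i) & 1:
            r ^= a
        a <<= 1            # multiply a by x
        if (a >> m) & 1:   # reduce when the degree reaches m
            a ^= f
    return r
```

In this sketch the field degree m and polynomial f are ordinary arguments, so the same loop serves any field definition. Multiplying the Montgomery result by x^m mod f (0x1B for the assumed polynomial 0x11B) recovers the ordinary product.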


Proceedings ArticleDOI
08 Jun 2005
TL;DR: This paper presents an environment for the high level description, refinement, synthesis and verification of data driven architectures and shows how HDL can be used as the intermediate language of a compiler for an even higher level functional programming language.
Abstract: John von Neumann proposed his famous architecture in a context where hardware was very expensive and bulky. His goal was to maximize functionality with minimal hardware. Presently, logical gates are nearly free and single chips contain billions of gates. However, most current designs are still based on Von Neumann's architecture because processors are built on this model. Nevertheless, the main current challenge is to be able to design, refine, synthesize and verify new architectures in a minimum time and with a maximum computational performance regardless of the gate count. Data driven architectures enable a high level of parallelism because instead of a single controller managing all the resources (and often a single ALU), tens or hundreds of small controllers can now operate in parallel on local processing units. This paper presents an environment for the high level description, refinement, synthesis and verification of such systems. Our own HDL is presented with its compiler and we show how it can be used as the intermediate language of a compiler for an even higher level functional programming language. Ongoing work enables the interfacing with other languages (from both hardware and software communities). We also intend to target asynchronous designs.

Journal ArticleDOI
TL;DR: Transmission of high-resolution pictures of XGA format and above, even after compression, demands a serial channel bandwidth far exceeding the maximum prescribed by the MPEG-2 standards; this can be circumvented by downscaling and then compressing before transmission, trading off a little image quality, as presented in this paper.

01 Jan 2005
TL;DR: In this paper, an original ultra-wideband (UWB) physical layer (PHY) specification is developed and implemented in digital logic, which is based on a combination of complementary code division multiplexing (CCDM) and multicode interleaved direct sequence (MCIDS) spreading, which provides an additional fixed process gain as well as multipath robustness.
Abstract: An original ultra-wideband (UWB) physical layer (PHY) specification is developed and implemented in digital logic. The novelty of this UWB PHY is based on a combination of complementary code division multiplexing (CCDM), which yields a low-interference signal with a variable process gain, and multicode interleaved direct sequence (MCIDS) spreading, which provides an additional fixed process gain as well as multipath robustness. To operate at the high sample rates needed for UWB, the digital logic, realized in a Virtex-II field programmable gate array (FPGA), has a highly-pipelined architecture for real-time signal processing. In addition, the gate count is minimized by avoiding the use of explicit buffer memory wherever possible. The performance of the transceiver is analyzed under a variety of UWB channels and impairments. It is concluded that the proposed UWB PHY offers robust performance in real-world environments and that it is viable for use in future communication systems.