
Showing papers on "Gate count published in 2006"


Journal ArticleDOI
Cary Gunn
TL;DR: Luxtera has demonstrated the technology required to implement CMOS photonics, including everything needed for 10-Gbps operation and for scaling to 100 Gbps and 1 Tbps, and product development is underway.
Abstract: Luxtera has demonstrated the technology required to implement CMOS photonics, and product development is underway. It has also demonstrated all the technology required for 10-Gbps operation, in addition to that required to scale to 100 Gbps and 1 Tbps. A single 10-Gbps channel today integrates tens of optical components into a single die alongside circuitry of modest gate count, 100,000 per transceiver. For the first time, high-speed optical communications directly between silicon die are possible at a price-performance point competitive with traditional electrical interconnects

493 citations


Journal ArticleDOI
TL;DR: The algorithm uses the positive-polarity Reed-Muller expansion of a reversible function to synthesize the function as a network of Toffoli gates, and is able to quickly synthesize all four-variable and most five-variable reversible functions that were in the test suite.
Abstract: Reversible logic finds many applications, especially in the area of quantum computing. A completely specified n-input, n-output Boolean function is called reversible if it maps each input assignment to a unique output assignment and vice versa. Logic synthesis for reversible functions differs substantially from traditional logic synthesis and is currently an active area of research. The authors present an algorithm and tool for the synthesis of reversible functions. The algorithm uses the positive-polarity Reed-Muller expansion of a reversible function to synthesize the function as a network of Toffoli gates. At each stage, candidate factors, which represent subexpressions common between the Reed-Muller expansions of multiple outputs, are explored in the order of their attractiveness. The algorithm utilizes a priority-based search tree, and heuristics are used to rapidly prune the search space. The synthesis algorithm currently targets the generalized n-bit Toffoli gate library. However, other algorithms exist that can convert an n-bit Toffoli gate into a cascade of smaller Toffoli gates. Experimental results indicate that the authors' algorithm quickly synthesizes circuits when tested on the set of all reversible functions of three variables. Furthermore, it is able to quickly synthesize all four-variable and most five-variable reversible functions that were in the test suite. The authors also present results for some benchmark functions widely discussed in literature and some new benchmarks that the authors have developed. The algorithm is shown to synthesize many, but not all, randomly generated reversible functions of as many as 16 variables with a maximum gate count of 25
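Note: the positive-polarity Reed-Muller (PPRM) expansion mentioned in this abstract writes each output as an XOR of AND-products of uncomplemented inputs. The sketch below is not the authors' synthesis algorithm; it only shows, under our own naming (`pprm_coefficients`, `is_reversible`, `output_truth_table` are hypothetical helpers), how PPRM coefficients can be obtained from a truth table with the standard XOR butterfly transform, using a 3-bit Toffoli gate as the example.

```python
def is_reversible(perm):
    """A completely specified n-input, n-output function is reversible
    exactly when its truth table is a permutation of 0 .. 2**n - 1."""
    return sorted(perm) == list(range(len(perm)))


def output_truth_table(perm, k):
    """Truth table of output bit k of the function given as a permutation."""
    return [(value >> k) & 1 for value in perm]


def pprm_coefficients(truth_table):
    """Return PPRM coefficients of a single-output Boolean function.

    truth_table[i] is the output for the input assignment encoded by i.
    coeffs[m] == 1 means the AND of the variables in bitmask m appears
    in the XOR-of-products expansion (m == 0 is the constant 1 term).
    """
    coeffs = list(truth_table)
    n = len(coeffs).bit_length() - 1
    assert len(coeffs) == 1 << n, "truth table length must be a power of two"
    for bit in range(n):
        step = 1 << bit
        for i in range(len(coeffs)):
            if i & step:
                coeffs[i] ^= coeffs[i ^ step]
    return coeffs


# 3-bit Toffoli gate: bits 0 and 1 are controls (a, b), bit 2 is the target c.
toffoli = [i ^ (4 if (i & 3) == 3 else 0) for i in range(8)]
target_pprm = pprm_coefficients(output_truth_table(toffoli, 2))
# Nonzero coefficients sit at masks 0b011 (a AND b) and 0b100 (c): c XOR ab.
```

The control outputs of the Toffoli gate reduce to single-variable terms, which is why a PPRM-based search can treat one gate as one factor substitution.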

377 citations


Journal ArticleDOI
TL;DR: Two hardware architectures are proposed that can support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead compared to previous approaches and an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation is proposed.
Abstract: Variable block-size motion estimation (VBSME) has become an important video coding technique, but it increases the difficulty of hardware design. In this paper, we use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures. Furthermore, we propose two hardware architectures that can support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead compared to previous approaches. By broadcasting reference pixel rows and propagating partial sums of absolute differences (SADs), the first design has fewer reference pixel registers and a shorter critical path. The second design utilizes a two-dimensional distortion array and one adder tree with a reference buffer that maximizes data reuse between successive search candidates. The first design is suitable for low resolution or a small search range, and the second design has the advantages of supporting a high degree of parallelism and VBSME. Finally, we propose an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation (IME). Its processing ability is eight times that of a single SAD tree, but the reference buffer size is only doubled. Moreover, the most critical issue of H.264 IME, its huge memory bandwidth, is overcome. We are able to save 99.9% of off-chip memory bandwidth and 99.22% of on-chip memory bandwidth. We demonstrate a 720p, 30-fps solution at 108 MHz with a 330.2k gate count and 208 kbits of on-chip memory.
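Note: the SAD reuse that makes VBSME affordable can be illustrated in a few lines. The sketch below is a software illustration only, not the paper's hardware architecture: it computes the sixteen 4x4 SADs of a 16x16 macroblock once and sums them into the 8x8 and 16x16 SADs, the way an adder tree would. Only the square block sizes are shown, and the function names (`sad`, `vbsme_sads`) are our own.

```python
def sad(cur, ref, top, left, size):
    """Sum of absolute differences over a size x size block."""
    return sum(
        abs(cur[top + r][left + c] - ref[top + r][left + c])
        for r in range(size)
        for c in range(size)
    )


def vbsme_sads(cur, ref):
    """Return (sad4, sad8, sad16) for one 16x16 macroblock candidate.

    sad4 is a 4x4 grid of 4x4-block SADs; the larger SADs are built purely
    by adding those partial results, so no pixel is read twice.
    """
    sad4 = [[sad(cur, ref, 4 * r, 4 * c, 4) for c in range(4)] for r in range(4)]
    sad8 = [
        [
            sad4[2 * r][2 * c] + sad4[2 * r][2 * c + 1]
            + sad4[2 * r + 1][2 * c] + sad4[2 * r + 1][2 * c + 1]
            for c in range(2)
        ]
        for r in range(2)
    ]
    sad16 = sum(sum(row) for row in sad8)
    return sad4, sad8, sad16


# Synthetic data, chosen only for illustration.
current = [[(7 * r + 3 * c) % 256 for c in range(16)] for r in range(16)]
reference = [[(5 * r + 11 * c) % 256 for c in range(16)] for r in range(16)]
sad4, sad8, sad16 = vbsme_sads(current, reference)
```

The rectangular partitions (16x8, 8x16, 8x4, 4x8) follow the same pattern by adding pairs of neighbouring entries.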

269 citations


Journal ArticleDOI
TL;DR: This article studies the synthesis of reversible networks of Toffoli gates and introduces two new iterative synthesis procedures employing Reed-Muller spectra, which are shown to complement earlier synthesis approaches.
Abstract: This paper presents novel techniques for the synthesis of reversible networks of Toffoli gates, as well as improvements to previous methods. Gate count and technology-oriented cost metrics are used. Our synthesis techniques are independent of the cost metrics. Two new iterative synthesis procedures employing Reed-Muller spectra are introduced and shown to complement earlier synthesis approaches. The template simplification suggested in earlier work is enhanced through the introduction of a faster and more efficient template application algorithm, an updated (shorter) classification of the templates, and the presentation of new templates of sizes 7 and 9. A novel "resynthesis" approach is introduced wherein a sequence of gates is chosen from a network, and the reversible specification it realizes is resynthesized as an independent problem in the hope of reducing the network cost. Empirical results are presented to show that the methods are effective both in terms of the realization of all 3x3 reversible functions and larger reversible benchmark specifications.

114 citations


Proceedings ArticleDOI
24 Jan 2006
TL;DR: A near optimal hardware architecture for deblocking filter in H.264/MPEG-4 AVC with novel filtering order and a data reuse strategy that result in significant saving in filtering time, local memory usage, and memory traffic is proposed.
Abstract: We propose a near optimal hardware architecture for the deblocking filter in H.264/MPEG-4 AVC. We propose a novel filtering order and a data reuse strategy that result in significant savings in filtering time, local memory usage, and memory traffic. Every 16×16 macroblock requires 192 filtering operations. After a few initialization cycles, our 5-stage pipelined architecture is able to perform one filtering operation per cycle. Compared with some state-of-the-art designs, our architecture delivers the fastest level of performance while using a much smaller gate count and memory. We have implemented and integrated the proposed deblocking filter into an H.264 main profile video decoder and verified it with an FPGA prototype.
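Note: the figure of 192 filtering operations per macroblock can be reproduced with a short count. This is one plausible accounting, not taken from the paper: it assumes 4:2:0 sampling, edges on a 4-sample grid for luma, two filtered edges per direction in each 8x8 chroma block, and one "filtering operation" per line of samples crossing an edge. The helper name `filter_ops` is ours.

```python
def filter_ops(block_size, edges_per_direction):
    """Count filtering operations for one block.

    Each filtered edge is crossed by `block_size` lines of samples, and
    edges are filtered in both the vertical and horizontal directions.
    """
    return 2 * edges_per_direction * block_size


luma_ops = filter_ops(block_size=16, edges_per_direction=4)        # 16x16 luma
chroma_ops = 2 * filter_ops(block_size=8, edges_per_direction=2)   # Cb and Cr
total_ops = luma_ops + chroma_ops
print(luma_ops, chroma_ops, total_ops)  # 128 64 192
```

At one operation per cycle, this count gives a lower bound of 192 cycles per macroblock before initialization overhead.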

67 citations


Journal ArticleDOI
TL;DR: A new degree computationless modified Euclid (DCME) algorithm and its dedicated architecture for Reed-Solomon (RS) decoder, which can completely remove the degree computation and comparison circuits and provide the short latency and low-cost RS decoding.
Abstract: This paper proposes a new degree computationless modified Euclid (DCME) algorithm and its dedicated architecture for Reed-Solomon (RS) decoder. This architecture has low hardware complexity compared with conventional modified Euclid (ME) architectures, since it can completely remove the degree computation and comparison circuits. The architecture employing a systolic array requires only the latency of 2t clock cycles to solve the key equation without initial latency. In addition, the DCME architecture using 3t+2 basic cells has regularity and scalability since it uses only one processing element. Hence, the proposed DCME architecture provides the short latency and low-cost RS decoding. The DCME architecture has been synthesized using the 0.25-μm Faraday CMOS standard cell library and operates at 200 MHz. The gate count of the DCME architecture is 21,760. Hence, the RS decoder using the proposed DCME architecture can reduce the total gate count by at least 23% and the total latency to at least 10% compared with conventional ME decoders.

65 citations


Journal ArticleDOI
TL;DR: The proposed lossless IntFFT architecture can achieve comparable SQNR and BER performance with reduced memory usage; the quantization loss of the fixed-point FFT and the IntFFT is also derived and compared.
Abstract: In this paper, a VLSI architecture based on radix-2^2 integer fast Fourier transform (IntFFT) is proposed to demonstrate its efficiency. The IntFFT algorithm guarantees the perfect reconstruction property of transformed samples. For a 64-point radix-2^2 FFT architecture, the proposed architecture uses 2 sets of complex multipliers (six real multipliers) and has 6 pipeline stages. By exploiting the symmetric property of lossless transform, the memory usage is reduced by 27.4%. The whole design is synthesized and simulated with a 0.18-μm TSMC 1P6M standard cell library and its reported equivalent gate count usage is 17,963 gates. The whole chip size is 975 μm × 977 μm with a core size of 500 μm × 500 μm. The core power consumption is 83.56 mW. A Simulink-based orthogonal frequency demodulation multiplexing platform is utilized to compare the conventional fixed-point FFT and proposed IntFFT from the viewpoint of system-level behavior in terms of signal-to-quantization-noise ratio (SQNR) and bit error rate (BER). The quantization loss analysis of these two types of FFT is also derived and compared. Based on the simulation results, the proposed lossless IntFFT architecture can achieve comparable SQNR and BER performance with reduced memory usage.

36 citations


Journal ArticleDOI
TL;DR: A novel genetic-algorithm-based approach is developed that employs a gate-level encoding scheme allowing flexible changes to the functions and interconnections of its logic cells, and adopts a multi-objective fitness evaluation mechanism with weight-vector adaptation and circuit simulation.
Abstract: Evolutionary design of circuits (EDC), an important branch of evolvable hardware which emphasizes circuit design, is a promising way to realize automated design of electronic circuits. In order to improve the efficiency, scalability and optimization capability of evolutionary design of logic circuits, a novel genetic-algorithm-based approach was developed. It employs a gate-level encoding scheme that allows flexible changes to the functions and interconnections of its logic cells, and it adopts a multi-objective fitness evaluation mechanism with weight-vector adaptation and circuit simulation. In addition, it features an adaptation strategy that enables crossover probability and mutation probability to vary with the individuals' diversity and the genetic-search process. It was validated by experiments on arithmetic circuits, especially digital multipliers, from which a few functionally correct circuits with novel structures, lower gate count and higher operating speed were obtained. Some of the evolved circuits are the most efficient or largest ones (in terms of gate count or problem scale) as far as we know. Moreover, some novel and general principles have been discerned from the EDC results, which are easy to verify but difficult for human experts to uncover with existing knowledge. These results argue that the approach is promising and worthy of further research.

33 citations


Patent
21 Apr 2006
TL;DR: An authentication system and a method for signing data are disclosed; the system uses a hardware-software partitioned approach and compares favourably in performance and other parameters with a complete hardware or full software implementation.
Abstract: An authentication system and a method for signing data are disclosed. The system uses a hardware-software partitioned approach. In its implementation, the system of the invention compares favourably in performance and other parameters with a complete hardware or full software implementation. In particular, the gate count is reduced. The disclosed system also makes it difficult for hackers to attack the system using simple power analysis.

29 citations


Proceedings ArticleDOI
14 May 2006
TL;DR: A new CORDIC algorithm and architectures that can easily generate close-to-optimum rotation sequences with small lookup table sizes are proposed, with better performance in terms of area, speed and power consumption.
Abstract: In this paper, we propose a new CORDIC algorithm and architectures which can generate close-to-optimum rotation sequences easily with small lookup table sizes. This new design is particularly suitable for the applications of adjustable-length FFT. In all, the required number of shift-and-add operations for micro-rotations and scale-factor compensations is only n/2, where n is the output precision. For design verification, we synthesized both serial and pipelined architectures by using Synopsys Design Compiler based on UMC 0.18 µm, 1P6M CMOS technology. The synthesized 16-bit pipelined FFT PE runs at 222 MHz, with a total gate count of 89,263 and a low power consumption of 26.75 mW. It meets the FFT speed requirements of most OFDM-based communication systems, including DAB, DVB, 802.16 and VDSL. Compared with a conventional multiplier-based FFT PE and the existing CORDIC-based FFT PEs, the proposed designs have better performance in terms of area, speed and power consumption.
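Note: for readers unfamiliar with CORDIC, the sketch below shows the conventional rotation-mode algorithm, which uses one shift-and-add micro-rotation per bit of precision plus a scale-factor correction. It is the textbook baseline, not the paper's algorithm; the paper's rotation-sequence selection and its n/2 operation count are not reproduced here. Floating point stands in for the fixed-point shifts of real hardware, and the function name `cordic_rotate` is ours.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate the vector (x, y) by `angle` radians with conventional CORDIC.

    Each iteration multiplies by 2**-i (a right shift in hardware) and adds,
    steering the residual angle z toward zero. Converges for |angle| up to
    roughly 1.74 rad.
    """
    # Accumulated gain of the micro-rotations; hardware usually precomputes it.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle
    for i in range(iterations):
        direction = 1.0 if z >= 0 else -1.0
        shift = 2.0 ** -i
        x, y = x - direction * y * shift, y + direction * x * shift
        z -= direction * math.atan(shift)
    return x / gain, y / gain


cos_45, sin_45 = cordic_rotate(1.0, 0.0, math.pi / 4)
```

In an FFT processing element, this rotation replaces the complex twiddle-factor multiplication, which is where the area saving over a multiplier-based design comes from.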

23 citations


Proceedings ArticleDOI
21 May 2006
TL;DR: A multi-mode Reed-Solomon decoder based on the reformulated inversionless Berlekamp-Massey algorithm, which can retain the throughput rate of the reformulated architecture in many practical applications and provides both area and speed advantages.
Abstract: We present a multi-mode Reed-Solomon decoder based on the reformulated inversionless Berlekamp-Massey algorithm, which can retain the throughput rate of the reformulated architecture in many practical applications. With the developed coefficient-selector-free multi-mode arrangement, the resulting design possesses not only area-efficient property but also very simple and regular interconnect topology that makes it very suitable for VLSI realization. Implementation results show that the achievable throughput rate of the developed decoder for n ≤ 255 and 0 ≤ t ≤ 8, implemented in the UMC 0.18-μm 1P6M process, is 3.2 Gbps at the maximum clock rate of 400 MHz and the total gate count is 22,931. Compared with the existing work based on the extended Euclidean algorithm, our development provides both area and speed advantages and can be used for multi-standard applications.

Proceedings ArticleDOI
21 May 2006
TL;DR: The proposed enhanced degree computationless modified Euclid's (E-DCME) algorithm for Reed-Solomon decoder has short critical path delay and small area compared with the conventional modified Euclidean algorithm (ME) and the existing DCME algorithm.
Abstract: This paper proposes an enhanced degree computationless modified Euclid's (E-DCME) algorithm for Reed-Solomon decoder. The critical path delay of the proposed E-DCME algorithm requires only T_Mul + T_ADD + T_MUX. In addition, the proposed E-DCME algorithm can reduce the number of basic cells used and has a latency of 2t - 1 clock cycles for solving the key equations. Hence, the proposed E-DCME algorithm has short critical path delay and small area compared with the conventional modified Euclid's algorithm (ME) and the existing DCME algorithm. The gate count of the proposed E-DCME architecture is 17,840. Therefore, the E-DCME architecture can reduce the gate count by about 18% compared with the existing DCME architecture.

Proceedings ArticleDOI
01 Aug 2006
TL;DR: This research exploits the multi-casting nature present in various application system task graphs and presents a novel & improved MLPR architecture with broadcast capability, resulting in reduced logic usage & increased performance.
Abstract: Modern FPGAs provide increased gate count with decreased power consumption. Several IP cores along with embedded processor and memory provide a great opportunity of implementing system-on-chip (SoC) designs on configurable devices. Networks-on-Chip (NoC) is an emerging style of SoC design, introduced to overcome the communication and performance bottlenecks of a shared-bus approach. The multi local port router (MLPR) presents a novel design alternative to the traditional NoC design. This new methodology offers numerous advantages including bandwidth optimization and reduced network area & power consumption, resulting eventually in improved performance of the NoC system. Unlike the bus-based systems, communication in NoCs has until now been between pairs of cores, with no scope for multi-casting. In this research, we advance a step further in the pursuit of a high performance FPGA-based NoC system. We exploit the multi-casting nature present in various application system task graphs and present a novel & improved MLPR architecture with broadcast capability. We present the modified architecture, the decoding scheme and the stripped-down crosspoint matrix, resulting in reduced logic usage & increased performance. We report the synthesis and the simulation results.

Proceedings ArticleDOI
25 Apr 2006
TL;DR: An optically differential reconfigurable gate array (ODRGA-VLSI) with no overhead and fast reconfiguration capability is developed with the aim of providing a virtual gate count that is much larger than those of currently available VLSIs.
Abstract: Optically reconfigurable gate arrays (ORGAs) offer the possibility of providing a virtual gate count that is much larger than those of currently available VLSIs by exploiting the large storage capacity of holographic memory. We developed an optically differential reconfigurable gate array (ODRGA-VLSI) with no overhead and fast reconfiguration capability. This paper presents the results of development of a perfect optical reconfigurable system with the ODRGA-VLSI chip and holographic memory. Experimental results of the reconfiguration procedure and circuit performance on a gate array are also presented.

Cary Gunn
30 Jun 2006
TL;DR: For the first time, high-speed optical communications directly between silicon die are possible at a price-performance point competitive with traditional electrical interconnects.
Abstract: Luxtera has demonstrated the technology required to implement CMOS photonics, and product development is underway. It has also demonstrated all the technology required for 10-Gbps operation, in addition to that required to scale to 100 Gbps and 1 Tbps. A single 10-Gbps channel today integrates tens of optical components into a single die alongside circuitry of modest gate count, 100,000 per transceiver. For the first time, high-speed optical communications directly between silicon die are possible at a price-performance point competitive with traditional electrical interconnects

Patent
18 May 2006
TL;DR: An authentication system and a method for signing data are disclosed; the system uses a hardware-software partitioned approach and compares favourably in performance and other parameters with a complete hardware or full software implementation.
Abstract: An authentication system and a method for signing data are disclosed. The system uses a hardware-software partitioned approach. In its implementation, the system of the invention compares favourably in performance and other parameters with a complete hardware or full software implementation. In particular, the gate count is reduced. The disclosed system also makes it difficult for hackers to attack the system using simple power analysis.

Book ChapterDOI
01 Mar 2006
TL;DR: In this article, the first 1,632 gate-count zero-overhead VLSI chip fabricated using 0.35 um CMOS process technology is presented, which is also the largest gate-count ORGA.
Abstract: A Zero-Overhead Dynamic Optically Reconfigurable Gate Array (ZO-DORGA), based on a concept using junction capacitance of photodiodes and load capacitance of gates constructing a gate array as configuration memory, has been proposed to realize a single instruction set computer that requires zero-overhead fast reconfiguration. To date, although the concept and architecture have been proposed and some simulation results of designs have been presented, a ZO-ORGA VLSI chip has never been fabricated. In this paper, the first 1,632 gate-count zero-overhead VLSI chip fabricated using 0.35 um CMOS process technology is presented. The 1,632 ZO-DORGA-VLSI is not only the first prototype VLSI chip; it is also the largest gate-count ORGA. Such a large gate count ORGA had never been fabricated until this study. The performance of ZO-DORGA-VLSI is clarified and discussed using experimental results.

Proceedings ArticleDOI
18 Jun 2006
TL;DR: This paper investigates the integration of a 64-bit LNS arithmetic unit into a conventional microprocessor to devise an LNS unit that can be faster than an FPU for a broad range of applications, and to minimize the added hardware.
Abstract: This paper investigates the integration of a 64-bit LNS arithmetic unit into a conventional microprocessor. The goals are to devise an LNS unit that can be faster than an FPU for a broad range of applications, and to minimize the added hardware. Two ways of implementing the logarithmic sum and difference functions are studied. One way uses higher-order Taylor series implemented by look-up tables and interpolation, while the other is based on a CORDIC engine. It is shown that a look-up table based implementation is fairly competitive to a floating-point unit in terms of clock rate, overall latency and repeat rate, at the expense of some cache pressure, while the CORDIC-based implementation is fast, has a repeat rate of one clock cycle, and supports complex operations but at the cost of a higher gate count.
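Note: the "logarithmic sum and difference functions" in this abstract are what make LNS addition expensive. A minimal sketch, assuming positive operands only (no sign bit or zero handling) and using `math.log2` directly where the paper uses look-up tables with interpolation or a CORDIC engine; all function names are ours.

```python
import math


def lns_encode(value):
    """Represent a positive real number by its base-2 logarithm."""
    return math.log2(value)


def lns_decode(log_value):
    return 2.0 ** log_value


def lns_mul(a, b):
    """Multiplication is a single addition in the log domain."""
    return a + b


def lns_div(a, b):
    return a - b


def lns_add(a, b):
    """log2(2**a + 2**b) = max + log2(1 + 2**(min - max)): the 'sum' function."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


def lns_sub(a, b):
    """log2(2**a - 2**b) for a > b: the 'difference' function."""
    return a + math.log2(1.0 - 2.0 ** (b - a))


three, five = lns_encode(3.0), lns_encode(5.0)
product = lns_decode(lns_mul(three, five))      # close to 15
total = lns_decode(lns_add(three, five))        # close to 8
difference = lns_decode(lns_sub(five, three))   # close to 2
```

The two `math.log2(1 ± 2**d)` terms are the nonlinear functions a hardware unit has to approximate, which is the trade-off between table size and gate count that the abstract describes.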

Journal Article
TL;DR: The 1,632 ZO-DORGA-VLSI is not only the first prototype VLSI chip; it is also the largest gate-count ORGA, and such a large gate count ORGA had never been fabricated until this study.
Abstract: A Zero-Overhead Dynamic Optically Reconfigurable Gate Array (ZO-DORGA), based on a concept using junction capacitance of photodiodes and load capacitance of gates constructing a gate array as configuration memory, has been proposed to realize a single instruction set computer that requires zero-overhead fast reconfiguration. To date, although the concept and architecture have been proposed and some simulation results of designs have been presented, a ZO-ORGA VLSI chip has never been fabricated. In this paper, the first 1,632 gate-count zero-overhead VLSI chip fabricated using 0.35 um CMOS process technology is presented. The 1,632 ZO-DORGA-VLSI is not only the first prototype VLSI chip; it is also the largest gate-count ORGA. Such a large gate count ORGA had never been fabricated until this study. The performance of ZO-DORGA-VLSI is clarified and discussed using experimental results.

Proceedings ArticleDOI
21 May 2006
TL;DR: An MDCT-based PAM algorithm and its dedicated architecture are proposed to accelerate PAM calculation; the filterbanks in AAC can be reduced from three to two, and a look-up table replaces the computation of the spreading function.
Abstract: In this paper, we propose a low-complexity architecture design for the psycho-acoustic model (PAM). PAM is a key component of the MPEG-2/4 advanced audio coding (AAC) encoder. It accounts for a heavy computation load in the AAC encoder and makes the encoder hard to implement on portable devices under real-time conditions. To overcome the problems described above, we propose an MDCT-based PAM algorithm and its dedicated architecture to accelerate PAM calculation. The main advantage of the MDCT-based PAM is that the filterbanks in AAC can be reduced from three to two. Furthermore, we use a look-up table method to replace the computation of the spreading function. In addition, the logarithmic number system (LNS) is used to reduce the computation load of many special functions and the data word length. In the hardware architecture design, pipelining is used to increase the throughput of the PAM. Besides, a logarithmic unit that converts data into log scale in one cycle is used in our design. The proposed PAM architecture is implemented in UMC 0.18-μm CMOS technology. The total gate count is 69,476.

Proceedings ArticleDOI
01 Jan 2006
TL;DR: A DAB receiver baseband chip consisting of a DAB baseband decoder and a MPEG L2 audio decoder, fabricated using standard 0.18 micron CMOS technology and achieved extremely low power dissipation of 30 mW and low gate count of only 100K logic.
Abstract: This paper reports a DAB receiver baseband chip consisting of a DAB baseband decoder and an MPEG L2 audio decoder. The chip complies with European DAB standard Eureka 147 and newly announced Chinese DAB standard GY/T214 2006. The chip was fabricated using standard 0.18 micron CMOS technology and achieved extremely low power dissipation of 30 mW and low gate count of only 100K logic. A DAB receiving prototype was built and tested. The test results prove that the IC design is successful.

Proceedings Article
01 Jan 2006
TL;DR: An area-efficient ARIA processor that uses a 32-bit architecture and a new implementation technique for the diffusion layer is presented, improving on previous designs by 7% in area and 13% in speed.
Abstract: Recently, the importance of area-efficient implementation of cryptographic algorithms for portable devices has been increasing. Previous ARIA (Academy, Research Institute, Agency) implementations, which usually concentrate on speed, are not suitable for mobile devices in terms of area and power. Thus in this paper, we present an area-efficient ARIA processor which uses a 32-bit architecture. Using a new implementation technique for the diffusion layer, the proposed processor occupies a chip area of 11,301 gates. For a 128-bit master key, the ARIA processor needs 87 clock cycles to generate initial round keys, n8 clock cycles to encrypt, and 256 clock cycles to decrypt a 128-bit block of data. The processor also supports 192-bit and 256-bit master keys. These results represent improvements of 7% in area and 13% in speed over previous designs.

Book ChapterDOI
10 Dec 2006
TL;DR: A low-power H.264 deblocking filter algorithm is proposed in which filtering is skipped on some pixels when pixel differences satisfy specific conditions, reducing power consumption by up to 20.3%.
Abstract: This paper proposes a low-power H.264 deblocking filter algorithm. In the H.264 deblocking filter, filtering can be skipped on some pixels when pixel differences satisfy specific conditions. Furthermore, filtering can be skipped entirely when the quantization parameter is less than 16. By exploiting this feature, the whole deblocking filter or some of its parts can be deactivated during execution, and its power consumption can be reduced by up to 20.3%. A low-power H.264 deblocking filter architecture is also proposed. A simple control circuit can totally or partially deactivate the deblocking filter, and common hardware performs both horizontal and vertical filtering. The proposed low-power deblocking filter was implemented in a silicon chip using 0.35 μm standard cell technology. The gate count is about 20,000 gates. The maximum operating frequency is 108 MHz. The maximum throughput is 30 frame/s with the CCIR601 image format.

Proceedings ArticleDOI
26 Apr 2006
TL;DR: The decoder achieves a throughput of 312 Mb/s at an operating frequency of 69 MHz with 20 decoding iterations, and the gate count is 2M gates.
Abstract: We have designed and implemented an LDPC decoder with a memory-reduction method to achieve high throughput and practical hardware size for long code lengths. The decoder decodes (3,6)-regular, 11520-bit LDPC codes using a modified min-sum algorithm. The decoder achieves a throughput of 312 Mb/s at an operating frequency of 69 MHz with 20 decoding iterations. The gate count is 2M gates.
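Note: the abstract does not say how its min-sum algorithm is modified, so the sketch below shows only the plain min-sum check-node update as a reference point. In a (3,6)-regular code every check node has six incoming messages, which is the case used in the example. The function name `check_node_update` and the sample log-likelihood ratios are ours.

```python
def check_node_update(incoming):
    """Plain min-sum check-node update.

    For each edge j, the outgoing message takes the product of the signs
    and the minimum of the magnitudes of all incoming messages except j.
    """
    outgoing = []
    for j in range(len(incoming)):
        others = incoming[:j] + incoming[j + 1:]
        negatives = sum(1 for value in others if value < 0)
        sign = -1.0 if negatives % 2 else 1.0
        outgoing.append(sign * min(abs(value) for value in others))
    return outgoing


# Six messages, one per edge of a degree-6 check node (illustrative values).
messages = [2.0, -0.5, 3.0, 1.5, -4.0, 0.7]
updated = check_node_update(messages)
print(updated)  # [0.5, -0.7, 0.5, 0.5, -0.5, 0.5]
```

Hardware implementations typically keep only the two smallest magnitudes and the overall sign parity per check node, which is one common way to cut message memory.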

Proceedings ArticleDOI
21 May 2006
TL;DR: Wang et al. presented a novel preview-based coarse-grain reconfigurable image signal processor (CRISP) for digital still cameras (DSCs), which uses simpler image pipelines in preview mode and extends the flexibility required in picture-taking mode with appropriate hardware resources.
Abstract: This paper presents a novel preview-based coarse-grain reconfigurable image signal processor (CRISP) for digital still cameras (DSCs). The two modes in DSCs, which have quite different hardware considerations, make traditional implementation methods inefficient. One is preview mode, which has real-time constraints, and the other is picture-taking mode, which requires high flexibility and the capability to run various algorithms. The low-cost design of CRISP uses simpler image pipelines in preview mode and extends the flexibility required in picture-taking mode by devoting appropriate hardware resources. Algorithmic similarity in image pipelines and successful hardware classification lead to a combination of low cost and high efficiency. Coarse-grain modules connected by reconfigurable interconnection make it a good compromise between dedicated hardware and DSPs, each of which suits only one of the two DSC modes. The experimental results show that its total gate count is 38.6K with 5.8K bytes of memory. It can save more than 75% of the area of a high-end DSP such as the Trimedia TM1300. Besides, CRISP reduces the execution cycle count of image pipeline tasks such as 2-D filters to only 0.17% of that required by the TM1300.

Proceedings ArticleDOI
24 Jan 2006
TL;DR: High-speed reconfigurable processors can be changed from one context to another context at every clock cycle in a few nanoseconds, but their die size limits the number of reconfiguration contexts of currently available DAP/DNA and DRP chips to 4-16.
Abstract: High-speed reconfigurable processors have been developed in recent years: they are DAP/DNA chips and DRP chips [1][2]. These devices can be changed from one context to another context at every clock cycle in a few nanoseconds. However, their die size limits the number of reconfiguration contexts of currently available DAP/DNA and DRP chips to 4-16.

Proceedings ArticleDOI
01 Dec 2006
TL;DR: Experimental results show that the HCGP architecture is scalable and can be used with the state-of-the-art, high gate count FPGAs and is shown to provide superior speed and cost compared to partial crossbar.
Abstract: Multi-FPGA systems (MFSs) are used as custom computing machines, logic emulators, and rapid prototyping vehicles. A key aspect of these systems is their programmable routing architecture, which is the manner in which wires, FPGAs, and Field-Programmable Interconnect Devices (FPIDs) are connected. Several routing architectures for MFSs have been proposed, and previous research has shown that the partial crossbar is one of the best existing architectures. A new routing architecture, called the Hybrid Complete-Graph and Partial-Crossbar (HCGP), was proposed by Khalid and was shown to provide superior speed and cost compared to the partial crossbar. In this paper we address the issue of scalability of the HCGP routing architecture. The motivation for this work was to evaluate the suitability of the HCGP architecture for a future rapid prototyping system product that was being developed at Cadence. Experimental results show that the HCGP architecture is scalable and can be used with state-of-the-art, high gate count FPGAs.

01 Jan 2006
TL;DR: In this article, a near optimal hardware architecture for the deblocking filter in H.264/MPEG-4 AVC is proposed, which uses a novel filtering order and a data reuse strategy that result in significant savings in filtering time, local memory usage, and memory traffic.
Abstract: We propose a near optimal hardware architecture for the deblocking filter in H.264/MPEG-4 AVC. We propose a novel filtering order and a data reuse strategy that result in significant savings in filtering time, local memory usage, and memory traffic. Every 16×16 macroblock requires 192 filtering operations. After a few initialization cycles, our 5-stage pipelined architecture is able to perform one filtering operation per cycle. Compared with some state-of-the-art designs, our architecture delivers the fastest level of performance while using a much smaller gate count and memory. We have implemented and integrated the proposed deblocking filter into an H.264 main profile video decoder and verified it with an FPGA prototype.

Proceedings ArticleDOI
01 Dec 2006
TL;DR: This paper proposes an enhanced degree computationless modified Euclid's (E-DCME) algorithm for Reed-Solomon decoder that can reduce the number of multiplexers compared with the existing DCME algorithm.
Abstract: This paper proposes an enhanced degree computationless modified Euclid's (E-DCME) algorithm for Reed-Solomon decoder. The proposed E-DCME algorithm can reduce the number of multiplexers compared with the existing DCME algorithm. The critical path delay of the proposed E-DCME algorithm requires only T_Mul + T_ADD + T_MUX, while that of the existing DCME algorithm requires T_Mul + T_ADD + 2T_MUX. In addition, the proposed E-DCME algorithm uses 3t basic cells and has a latency of 2t - 1 clock cycles for solving the key equation, whereas the existing DCME algorithm requires 3t + 2 basic cells and 2t clock cycles. The gate count of the proposed E-DCME architecture is 17,840. Therefore, the E-DCME architecture can reduce the gate count by about 18% compared with the existing DCME architecture.

Patent
21 Dec 2006
TL;DR: A convolutional interleaving and de-interleaving circuit and method are disclosed, in which a controller enables the address generators to provide or store the corresponding channel addresses and a single shared adder is used.
Abstract: A convolutional interleaving and de-interleaving circuit and method are disclosed. The convolutional interleaving and de-interleaving circuit comprises an initial address generator, a first address generator, a second address generator, an address mixer, an adder, a controller and a memory. The controller enables the address generators to provide or store the corresponding channel addresses, and a single shared adder is used. Appropriately arranging the memory addresses reduces the number of registers required, so the required gate count and the chip layout area can be decreased.
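Note: the behaviour such a circuit has to reproduce can be modelled in software. The sketch below is a generic convolutional (Forney-style) interleaver built from per-branch delay lines, not the patented memory-addressing scheme: branch i delays its symbols by i * depth branch visits, and the de-interleaver uses the mirrored delays. The class name, parameters and fill value are our own assumptions.

```python
from collections import deque


class ConvolutionalInterleaver:
    """Commutated bank of FIFO delay lines, one symbol in and one out per step."""

    def __init__(self, branches, depth, reverse=False, fill=0):
        # Interleaver: branch i holds i * depth symbols.
        # De-interleaver (reverse=True): branch i holds (branches - 1 - i) * depth.
        delays = [
            (branches - 1 - i if reverse else i) * depth for i in range(branches)
        ]
        self.lines = [deque([fill] * delay) for delay in delays]
        self.branch = 0

    def push(self, symbol):
        line = self.lines[self.branch]
        self.branch = (self.branch + 1) % len(self.lines)
        if not line:  # zero-delay branch passes the symbol straight through
            return symbol
        line.append(symbol)
        return line.popleft()


BRANCHES, DEPTH = 4, 2
interleaver = ConvolutionalInterleaver(BRANCHES, DEPTH)
deinterleaver = ConvolutionalInterleaver(BRANCHES, DEPTH, reverse=True)

data = list(range(1, 101))
interleaved = [interleaver.push(symbol) for symbol in data]
recovered = [deinterleaver.push(symbol) for symbol in interleaved]

# End-to-end delay of the pair, in symbols.
latency = BRANCHES * (BRANCHES - 1) * DEPTH
```

A memory-based implementation stores all delay lines in one RAM and replaces each FIFO with a read/write address pair, which is where address generators and a shared adder, as described in the abstract, come in.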