
Showing papers on "Gate count published in 2008"


Book ChapterDOI
01 Apr 2008
TL;DR: A new family of stream ciphers, Grain, is proposed; it is based on two shift registers and a nonlinear output function, and its speed can be easily increased at the expense of extra hardware.
Abstract: A new family of stream ciphers, Grain, is proposed. Two variants, an 80-bit and a 128-bit variant, are specified, denoted Grain and Grain-128 respectively. The designs target hardware environments where gate count, power consumption and memory are very limited. Both variants are based on two shift registers and a nonlinear output function. The ciphers also have the additional feature that the speed can be easily increased at the expense of extra hardware.
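The two-register structure described in this abstract can be illustrated with a toy sketch. The 8-bit register lengths, the tap positions, and the output function below are hypothetical choices made for brevity; they are not the Grain specification and offer no security.

```python
def keystream(lfsr, nfsr, nbits):
    """Toy keystream generator: an LFSR, an NFSR, and a nonlinear output bit.

    All register sizes and taps are hypothetical, NOT those of Grain.
    Both registers are lists of 8 bits (0 or 1).
    """
    lfsr, nfsr = list(lfsr), list(nfsr)  # work on copies
    out = []
    for _ in range(nbits):
        # Nonlinear output function mixing bits from both registers.
        out.append(nfsr[0] ^ lfsr[3] ^ (lfsr[1] & nfsr[5]))
        # Linear feedback: XOR of selected LFSR taps.
        lf = lfsr[0] ^ lfsr[2] ^ lfsr[7]
        # Nonlinear feedback for the NFSR, masked with one LFSR bit.
        nf = lfsr[0] ^ nfsr[0] ^ nfsr[3] ^ (nfsr[2] & nfsr[6])
        # Shift both registers by one position.
        lfsr = lfsr[1:] + [lf]
        nfsr = nfsr[1:] + [nf]
    return out


bits = keystream([1, 0, 0, 0, 0, 0, 0, 0], [0] * 8, 16)
```

In a hardware version of such a design, producing several keystream bits per clock would plausibly mean replicating the feedback and output logic, which is the kind of speed-for-area trade the abstract mentions.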

225 citations


Journal ArticleDOI
TL;DR: In this paper, the authors demonstrate the feasibility of incorporating fully automated frequency response measurement capabilities in digital PWM controllers at relatively low additional cost using a Verilog-coded implementation with low tens of thousands of logic gates and about 10 kB of memory.
Abstract: Recent work has shown the feasibility of integrating nonparametric frequency-domain system identification functionality into digital controllers for switched-mode pulse-width modulated (PWM) dc-dc power converters. The resulting discrete-time frequency response can be used for design, diagnostic, or self-tuning purposes. The success of these applications depends on the fidelity of the identified frequency responses and the degree to which the process is automated, as well as the costs, in terms of gate count, time duration of identification, and effect on output voltage, incurred to obtain these benefits. This paper demonstrates the feasibility of incorporating fully automated frequency response measurement capabilities in digital PWM controllers at relatively low additional cost. In particular, it is shown that relatively accurate and smooth frequency response data can be obtained using a Verilog-coded implementation with low tens of thousands of logic gates and about 10 kB of memory. The identification process can be accomplished in several hundred milliseconds and the output voltage can be kept within specified bounds during the entire process. Experimental results are provided for four different PWM dc-dc converters, including a synchronous buck with two different filter capacitors, a boost operating in continuous conduction mode (CCM), and a boost operating in discontinuous conduction mode (DCM).

94 citations


Journal ArticleDOI
TL;DR: This article extends RMRLS, a reversible logic synthesis tool, to include additional gate types, and finds that these additional gates reduce the average gate count for three-variable functions from 6.10 to 4.56, and improve the synthesis results of many larger functions, both in terms of gate count and quantum cost.
Abstract: Reversible logic has applications in low-power computing and quantum computing. Most reversible logic synthesis methods are tied to particular gate types, and cannot synthesize large functions. This article extends RMRLS, a reversible logic synthesis tool, to include additional gate types. While classic RMRLS can synthesize functions using NOT, CNOT, and n-bit Toffoli gates, our work details the inclusion of n-bit Fredkin and Peres gates. We find that these additional gates reduce the average gate count for three-variable functions from 6.10 to 4.56, and improve the synthesis results of many larger functions, both in terms of gate count and quantum cost.
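For readers unfamiliar with the gate types named in this abstract, the following minimal sketch models the 3-bit Toffoli, Fredkin, and Peres gates (and the 2-bit CNOT) as functions on bits and checks that each is a bijection. It uses the standard textbook definitions of these gates, covers only the 3-bit cases rather than the n-bit generalizations, and is unrelated to the RMRLS implementation; the helper name `is_reversible` is an illustrative choice.

```python
from itertools import product


def cnot(a, b):
    # Controlled NOT: flips b when a is 1.
    return a, a ^ b


def toffoli(a, b, c):
    # Flips c when both controls a and b are 1.
    return a, b, c ^ (a & b)


def fredkin(a, b, c):
    # Controlled swap: swaps b and c when a is 1.
    return (a, c, b) if a else (a, b, c)


def peres(a, b, c):
    # Peres gate: (a, a XOR b, c XOR ab).
    return a, a ^ b, c ^ (a & b)


def is_reversible(gate, n):
    # A gate is reversible if it maps the 2**n inputs to 2**n distinct outputs.
    inputs = list(product((0, 1), repeat=n))
    return len({gate(*bits) for bits in inputs}) == len(inputs)
```

A Peres gate computes what a Toffoli followed by a CNOT on the first two lines would compute, so a library that includes it can replace two gates with one in some circuits, which is consistent with the gate-count reduction reported above.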

85 citations


Journal ArticleDOI
TL;DR: This paper presents a perfect dynamic optically reconfigurable gate array (DORGA) architecture emulation using a holographic memory and a conventional ORGA-VLSI and the performance, particularly the reconfiguration context retention time, was measured experimentally.
Abstract: This paper presents a perfect dynamic optically reconfigurable gate array (DORGA) architecture emulation using a holographic memory and a conventional ORGA-VLSI. In ORGAs, although a large virtual gate count can be realized by exploiting the large-capacity storage capability of a holographic memory, the actual gate count, which is the gate count of a programmable gate array VLSI, is important to increase the instantaneous performance. Nevertheless, in previously proposed ORGA-VLSIs, the static configuration memory to store a single configuration context consumed a large implementation area of the ORGA-VLSIs and prevented the realization of large-gate-count ORGA-VLSIs. Therefore, a DORGA architecture has been proposed in order to increase the gate density. It uses the junction capacitance of photodiodes as dynamic memory, thereby obviating the static configuration memory. However, to date, demonstration of a perfect optically reconfigurable architecture for DORGA-VLSIs has never been presented. Therefore, in this study, the DORGA architecture was perfectly emulated, and the performance, particularly the reconfiguration context retention time, was measured experimentally. The advantages of this architecture are discussed in relation to the results.

77 citations


Journal ArticleDOI
TL;DR: Two design techniques are proposed for high-throughput low-density parity-check (LDPC) decoders: a broadcasting technique mitigates routing congestion by reducing the total global wirelength and an interlacing technique increases the decoder throughput by processing two consecutive frames simultaneously.
Abstract: Two design techniques are proposed for high-throughput low-density parity-check (LDPC) decoders. A broadcasting technique mitigates routing congestion by reducing the total global wirelength. An interlacing technique increases the decoder throughput by processing two consecutive frames simultaneously. The brief discusses how these techniques can be used for both fully parallel and partially parallel LDPC decoders. For fully parallel decoders with code lengths in the range of a few thousand bits, the half-broadcasting technique reduces the total global wirelength by about 26% without any hardware overhead. The block interlacing scheme is applied to the design of two fully parallel decoders, increasing the throughput by 60% and 71% at the cost of 5.5% and 9.5% gate count overhead, respectively.
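One way to read the interlacing figures is as a throughput-per-gate ratio. The short calculation below derives that ratio from the percentages quoted in the abstract; the metric and the function name are illustrative assumptions, not quantities reported by the authors.

```python
def throughput_per_gate_gain(throughput_gain, gate_overhead):
    """Ratio of new to old throughput per gate.

    Both arguments are fractions, e.g. 0.60 for a 60% throughput increase.
    """
    return (1 + throughput_gain) / (1 + gate_overhead)


# The two fully parallel decoders quoted in the abstract.
first = throughput_per_gate_gain(0.60, 0.055)   # roughly 1.52x
second = throughput_per_gate_gain(0.71, 0.095)  # roughly 1.56x
```

Under this reading, both decoders deliver roughly one and a half times the throughput per gate of their non-interlaced counterparts.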

72 citations


Journal ArticleDOI
TL;DR: A low-latency and hardware-efficient ME design with three design techniques that adopts parallel instead of serial multiresolution search, and applies a mode-filtering approach to further reduce the bandwidth and cycles and share the buffer of IME and FME.
Abstract: Motion estimation (ME) in high-definition H.264 video coding presents a significant design challenge for memory bandwidth, latency, and cost because of its large search range and various modes. To conquer this problem, this paper presents a low-latency and hardware-efficient ME design with three design techniques. The first technique on integer-pel ME (IME) adopts parallel instead of serial multiresolution search so that we can process 1080p @ 60 fps videos with ±128 search range within just 256 cycles, 5.95-KB buffers, and 213.7 K gates. The second technique on fractional-pel ME (FME) uses a single-iteration six-point search to reduce the cycle count by half with similar gate count and negligible quality loss. The third technique applies a mode-filtering approach to further reduce the bandwidth and cycles and share the buffer of IME and FME. The final ME implementation with 0.13-μm process can support processing of 1080p @ 60 fps with just 128.8 MHz, 282.6 K gates, and 8.54-KB buffer, which saves 60% gate count, and 68.9% SRAM buffers when compared with the previous design.

64 citations


Journal ArticleDOI
TL;DR: This paper describes the design and VLSI implementation of a highly efficient, single-port SRAM-based deblocking filter which can achieve 204 cycles/macroblock throughput for H.264/AVC real-time decoding and achieves zero stall cycle in normal pipeline flow, making the best out of a pipelined architecture.
Abstract: This paper describes the design and VLSI implementation of a highly efficient, single-port SRAM-based deblocking filter. It can achieve 204 cycles/macroblock throughput for H.264/AVC real-time decoding. Several deblocking filter designs in the literature have been compared and the possibility of realizing them in a pipeline is studied. Eventually we came up with a completely new design which has a five-stage pipeline with gated clock to increase system throughput while reducing power. Data hazards and structure hazards, which are the two most critical issues for a pipelined filter, are analyzed and resolved. Efficient memory organization for both on-chip SRAM and transposition buffers is employed. By using innovative hybrid edge filtering sequence and out-of-order memory update scenario, we obtain zero stall cycle in normal pipeline flow, making the best out of a pipelined architecture. Compared with existing designs, our design achieves at least 18% clock cycle reduction, as well as 20% lower power consumption owing to its efficient pipeline and memory architecture. The total gate count is comparable to other designs in literature without using any expensive two-port or dual-port on-chip SRAMs.

63 citations


Book ChapterDOI
01 Jan 2008
TL;DR: This paper will formulate a framework for defining the problem space around low cost RFID systems to enable the engineering of solutions and for evaluating those solutions for their effectiveness in the context of networked low cost RFID systems.
Abstract: There are various solutions expounded upon to address security vulnerabilities and privacy violations of low cost RFID systems. This paper will formulate a framework for defining the problem space around low cost RFID systems to enable the engineering of solutions and for evaluating those solutions for their effectiveness in the context of networked low cost RFID systems.

57 citations


Journal ArticleDOI
TL;DR: An improved butterfly structure and an address generation method for fast Fourier transform (FFT) using reduced logic to generate the addresses, avoiding the parity check and barrel shifters commonly used in FFT implementations are presented.
Abstract: In this study, an improved butterfly structure and an address generation method for fast Fourier transform (FFT) are presented. The proposed method uses reduced logic to generate the addresses, avoiding the parity check and barrel shifters commonly used in FFT implementations. A general methodology for radix-2 N-point transforms is derived and the signal flow graph for a 16-point FFT is presented. Furthermore, as a case study, a 16-point FFT with 32-bit complex numbers is synthesized using a CMOS 0.18 μm technology. The circuit gate count analysis indicates that significant logic reduction can be achieved with improved throughput compared to the conventional implementations.

36 citations


Proceedings ArticleDOI
18 May 2008
TL;DR: The Givens Rotation based factorization algorithm is revised and an efficient scheme working in the real number domain is developed, which can reduce the computing complexity to almost one half by exploiting the symmetric property.
Abstract: Complex QR factorization is a fundamental operation used in various MIMO signal detection algorithms. In this paper, we revise the Givens Rotation based factorization algorithm and develop an efficient scheme working in the real number domain. The complex matrix is first extended into a block-wise symmetric real number counterpart. The proposed scheme can reduce the computing complexity to almost one half by exploiting the symmetric property. Computing complexity analysis also shows the superiority of our scheme over various factorization schemes. Finally, subject to the EWC 802.11n recommendation, a novel systolic array design featuring fully parallel and deeply pipelined processing is presented. The CORDIC algorithm is employed to implement the required rotation operations with low circuit complexity. Synthesis results in a TSMC 0.18 μm process indicate that the proposed design, with a gate count of merely 17.06 K and a maximum clock rate of 202 MHz, can admit a new 2 × 2 complex matrix for factorization every 8 clock cycles.
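The idea of factorizing a complex matrix in the real domain can be sketched as follows. The code assumes the commonly used real-valued embedding [[Re, -Im], [Im, Re]], which may differ from the exact block-wise extension used in the paper, and it applies plain Givens rotations without the symmetry-based savings or the CORDIC hardware the abstract describes. The function names are illustrative.

```python
import math


def real_embedding(H):
    """Map an n x m complex matrix to a 2n x 2m real matrix [[Re, -Im], [Im, Re]]."""
    top = [[z.real for z in row] + [-z.imag for z in row] for row in H]
    bottom = [[z.imag for z in row] + [z.real for z in row] for row in H]
    return top + bottom


def givens_triangularize(M):
    """Return the upper-triangular factor R obtained by Givens rotations."""
    R = [list(row) for row in M]
    n_rows, n_cols = len(R), len(R[0])
    for j in range(n_cols):
        # Zero column j from the bottom up, rotating adjacent row pairs.
        for i in range(n_rows - 1, j, -1):
            a, b = R[i - 1][j], R[i][j]
            r = math.hypot(a, b)
            if r == 0.0:
                continue  # nothing to eliminate
            c, s = a / r, b / r
            for k in range(n_cols):
                u, v = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * u + s * v
                R[i][k] = -s * u + c * v
    return R


H = [[1 + 2j, 3 - 1j], [0.5j, 2 + 0j]]
R = givens_triangularize(real_embedding(H))
```

Each rotation is an orthogonal transform, so column norms are preserved while the entries below the diagonal are driven to zero.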

36 citations


Journal ArticleDOI
TL;DR: This work presents an FFT/IFFT core compiler particularly suited for the VLSI implementation of OFDM communication systems, which employs an architecture template based on the pipelined cascade principle and produces macrocells with lower circuit complexity expressed as gate count and RAM/ROM bits.
Abstract: This work presents an FFT/IFFT core compiler particularly suited for the VLSI implementation of OFDM communication systems. The tool employs an architecture template based on the pipelined cascade principle. The generated cores support run-time programmable length and transform type selection, enabling seamless integration into multiple mode and multiple standard terminals. A distinctive feature of the tool is its accuracy-driven configuration engine which automatically profiles the internal arithmetic and generates a core with minimum operands bit-width and thus minimum circuit complexity. The engine performs a closed-loop optimization over three different internal arithmetic models (fixed-point, block floating-point and convergent block floating-point) using the numerical accuracy budget given by the user as a reference point. The flexibility and re-usability of the proposed macrocell are illustrated through several case studies which encompass all current state-of-the-art OFDM communications standards (WLAN, WMAN, xDSL, DVB-T/H, DAB and UWB). Implementation results of the generated macrocells are presented for two deep sub-micron standard-cell libraries (65 and 90 nm) and commercially available FPGA devices. When compared with other tools for automatic FFT core generation, the proposed environment produces macrocells with lower circuit complexity expressed as gate count and RAM/ROM bits, while keeping the same system level performance in terms of throughput, transform size and numerical accuracy.

Proceedings ArticleDOI
21 Jan 2008
TL;DR: A new decomposition theory that is based on the properties of threshold functions is presented, which produces circuits that are better than the previous state of the art and uses a new method of algebraic factorization called the min-max factorization.
Abstract: Scaling is currently the most popular technique used to improve performance metrics of CMOS circuits. This cannot go on forever because the properties that are responsible for the functioning of MOSFETs no longer hold in nano dimensions. Recent research into nano devices has shown that nano devices can be an alternative to CMOS when scaling of CMOS becomes infeasible in the near future. This is motivating the need for stable and mature design automation techniques for threshold logic since it is the design abstraction used for most nano-devices. This paper presents a new decomposition theory that is based on the properties of threshold functions. The main contributions of this paper are: (1) A new method of algebraic factorization called the min-max factorization. (2) A decomposition theory that uses this new factorization to identify and characterize threshold functions. (3) A new threshold logic synthesis methodology that uses the decomposition theory. This synthesis methodology produces circuits that are better than the previous state of the art (27% better gate count and comparable circuit depth).

Proceedings ArticleDOI
01 Nov 2008
TL;DR: The proposed receiver achieves system performance as PER = 0.01 at SNR less than 5 dB, which is better than standard specifications, and a new decision-feedback algorithm for residue phase error tracking is presented.
Abstract: This paper presents an IEEE 802.15.4 (ZigBee) practicable baseband processor including transmitter and receiver. To estimate and compensate carrier phase error at baseband, the receiver allows full digital solution for carrier phase synchronization. An existing packet detection algorithm for spread spectrum communication system is used to estimate large carrier frequency offset. This paper also presents a new decision-feedback algorithm for residue phase error tracking. The proposed receiver achieves system performance as PER = 0.01 at SNR less than 5 dB, which is better than standard specifications. The baseband processor was implemented in simple hardware architecture and designed with low power technique. The chip is fabricated with the TSMC 0.18 μm 1P6M CMOS technology with a gate count of 78 k. The area is 1.633 mm × 1.633 mm, and power consumption is about 1.7 mW at receiver mode under supply voltage of 1.8 V and operating frequency of 4 MHz.

Proceedings ArticleDOI
01 Oct 2008
TL;DR: A novel and scalable technique for inserting observation points to aid compression by reducing pattern count and data volume is presented.
Abstract: As digital circuits grow in gate count so does the data volume required for manufacturing test. To address this problem several test compression techniques have been developed. This paper presents a novel and scalable technique for inserting observation points to aid compression by reducing pattern count and data volume. Experimental results presented for industrial circuits demonstrate the effectiveness of the method.

Proceedings ArticleDOI
24 Oct 2008
TL;DR: High throughput architecture of an encoder and a decoder for a quasi-cyclic low-density parity-check (LDPC) code and a new systematic encoding method carried out by polynomial manipulation are proposed.
Abstract: High throughput architecture of an encoder and a decoder for a quasi-cyclic low-density parity-check (LDPC) code is proposed. A new systematic encoding method is carried out by polynomial manipulation. The proposed decoder architecture, where the check-node process is split into two processes so that the memory access becomes column-wise, enables overlapped message-passing for any parity-check matrix. The hardware architecture for the check-node processes utilizing a quasi-cyclic structure does not require complex multiplexers. Hardware employing the proposed architecture for a (1440,1344) LDPC code designed for high throughput millimeter wave application is evaluated using 65 nm CMOS technology. The gate count of the encoder for 3 Gbps and 6 Gbps throughput is 2.5 k and 3.1 k, respectively, and the gate count of the decoder for 8 iterations is 304 k and 409 k, respectively. A bit-error rate of 10⁻⁶ is obtained at Eb/N0 of 5.9 dB, and the estimated power consumption of the decoder is 58 mW for 3 Gbps and 86 mW for 6 Gbps.

Proceedings ArticleDOI
01 Nov 2008
TL;DR: This paper presents coarse-grained reconfigurable architecture supporting floating-point operations, where each integer processing element is paired with its neighbor to perform floating point operations.
Abstract: This paper presents coarse-grained reconfigurable architecture supporting floating-point operations, where each integer processing element is paired with its neighbor to perform floating point operations. One processing element in a pair is in charge of the mantissa part, and the other is in charge of the exponent part. With an 8 × 2 array of processing elements, 8 floating-point operations can be performed at the same time. The chip is fabricated in MagnaChip/Hynix 0.18 μm technology with the gate count of 363,013 and clock frequency of 116.8 MHz in the typical case.

Proceedings ArticleDOI
01 Nov 2008
TL;DR: All modules (vertex shader, clipping engine, triangle setup engine, rasterizer, pixel shader and raster operator) of a 3D pipeline are developed on FPGA using RTL design to support OpenGL ES 2.0 and Shader Model 3.0.
Abstract: This paper proposes effective 3D graphics hardware. It is designed to support OpenGL ES 2.0 and Shader Model 3.0. We develop all modules (vertex shader, clipping engine, triangle setup engine, rasterizer, pixel shader and raster operator) of the 3D pipeline on FPGA using RTL design. The proposed hardware, whose total gate count is about 1.486 M, operates at 100 Mpixels/sec at the pixel shader. Compared to another company's product, the proposed architecture results in about 50% improvement in terms of cycles.

Proceedings ArticleDOI
01 Nov 2008
TL;DR: A novel pipeline architecture is proposed to speed up the CABAC decoder, which has strong data dependency while decoding a plurality of bins, and the context models in memory are re-arranged by applying two principles in order to reduce the usage of memory space and to lower the frequency of memory access.
Abstract: We present a high-throughput and low-cost context adaptive binary arithmetic (CABAC) decoder for H.264/AVC. Since the CABAC decoder has strong data dependency while decoding a plurality of bins, we propose a novel pipeline architecture to speed up this operation. Based on different types of syntax elements, two approaches to improve throughput are proposed. In addition, we re-arrange the context models in memory by applying two principles in order to reduce the usage of memory space and to lower the frequency of memory access. The proposed CABAC decoder is already integrated in an H.264 decoder and is able to achieve real-time decoding for H.264/AVC high profile HD level 4.1. The implemented design can operate at 250 MHz with 35.6 k gate count under 0.18 μm silicon technology.

Proceedings ArticleDOI
25 May 2008
TL;DR: A testable 2-D motion estimation design at the bit level (TMEbit) based on the C-testability conditions is proposed, and the bit-level cell functions are made bijective in order to meet the testability conditions.
Abstract: In this paper, a testable 2-D motion estimation (TME) design at the bit level (TMEbit) based on the C-testability conditions is proposed. In order to meet the testability conditions, the bit-level cell functions are made bijective. Our C-testability conditions guarantee about 100% fault coverage for single cell fault model with a constant number of test patterns. The number of test patterns is 128. To verify the proposed technique, an experimental chip is implemented with TSMC 0.18 μm technology. According to experimental results, the gate count of the design is about 159 K and the design can operate at the frequency up to 100 MHz. The hardware overhead used to make it C-testable is about 7%.

Journal ArticleDOI
01 Dec 2008
TL;DR: The proposed unified pipelined architecture outperforms many recent implementations in terms of gate count and is capable of processing a 4 × 4 residual block in 4 clock cycles.
Abstract: This paper deals with the process of Transformation and Quantization that is carried out on each inter-predicted residual block in a video encoding process and their reduced complexity hardware implementation. H.264/AVC utilizes a 4 × 4 integer transform, which is derived from the 4 × 4 DCT. We propose a reduced complexity algorithm and a pipelined structure for the Core forward integer transform module. A multiplier-less architecture is realized with fewer shifts and adds compared to existing works. The corresponding inverse transform is exactly reversible. Each of the transformed coefficients is quantized by a scalar quantizer. The quantization step size can be varied from macroblock to macroblock. The proposed unified pipelined architecture outperforms many recent implementations in terms of gate count and is capable of processing a 4 × 4 residual block in 4 clock cycles.
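To make the "multiplier-less" idea concrete, the sketch below computes the standard H.264 4 × 4 core forward transform Y = Cf · X · Cfᵀ using only additions, subtractions, and one-bit left shifts, and includes a direct matrix-product reference for comparison. This is the well-known textbook butterfly, shown as an assumption-free baseline; it is not the reduced-complexity algorithm or the pipelined structure proposed in the paper, and the function names are illustrative.

```python
# Core forward transform matrix of H.264/AVC.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def butterfly_1d(x0, x1, x2, x3):
    """Apply CF to a 4-element vector using only adds and 1-bit shifts."""
    s0, s1 = x0 + x3, x1 + x2
    d0, d1 = x0 - x3, x1 - x2
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]


def forward_transform(block):
    """Compute CF * X * CF^T for a 4x4 integer block, without multiplications."""
    rows = [butterfly_1d(*r) for r in block]       # X * CF^T, row by row
    cols = [butterfly_1d(*c) for c in zip(*rows)]  # CF applied to each column
    return [list(r) for r in zip(*cols)]           # transpose back to row-major


def reference_transform(block):
    """Same result via explicit matrix products, for checking only."""
    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
                for i in range(4)]
    cf_t = [list(r) for r in zip(*CF)]
    return matmul(matmul(CF, block), cf_t)
```

Each 1-D butterfly uses eight additions or subtractions and two shifts, which is why the transform maps naturally onto small adder-only datapaths.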

Proceedings ArticleDOI
24 Feb 2008
TL;DR: A novel programmable pattern recognition system employing all-optical logic gates is proposed, and key functions are experimentally demonstrated at 42 Gbit/s.
Abstract: We propose a novel programmable pattern recognition system employing all-optical logic gates and experimentally demonstrate key functions at 42 Gbit/s. Gate count is independent of target length and the temporal position of the target is identified.

Journal ArticleDOI
01 Dec 2008
TL;DR: A fast GME algorithm is proposed that combines temporal prediction with skipping of redundant computation; 91% of the memory bandwidth and 80% of the iterations are saved, while the performance is maintained, compared to Gradient Descent in the MPEG-4 Verification Model.
Abstract: Global motion estimation and compensation (GME/GMC) is an important video processing technique and has been applied to many applications including video segmentation, sprite/mosaic generation, and video coding. In MPEG-4 Advanced Simple Profile (ASP), GME/GMC is adopted to compensate camera motions. Since GME is important, many GME algorithms have been proposed. These algorithms have two common characteristics: huge computation complexity and ultra-large memory bandwidth. Hence, for real-time applications, a hardware accelerator for GME is required. However, GME poses many hardware design challenges, such as irregular memory access and huge memory bandwidth, and only a few hardware architectures have been proposed. In this paper, we first analyze three typical GME algorithms and then propose a fast GME algorithm. By using temporal prediction and skipping redundant computation, 91% of the memory bandwidth and 80% of the iterations are saved, while the performance is maintained, compared to Gradient Descent in the MPEG-4 Verification Model. Based on our proposed algorithm, a hardware architecture for GME is also presented. A new scheduling scheme, Reference-Based Scheduling, is developed to solve the irregular memory access problem. An interleaved memory arrangement is applied to satisfy the memory access requirement of interpolation. The total gate count of the hardware implementation is 131 K with the Artisan 0.18 μm cell library, and the internal memory size is about 7.9 Kb. Its processing capability is MPEG-4 ASP@L3, which is 352×288 at 30 fps, at 30 MHz.

Journal ArticleDOI
TL;DR: A decoupled MFD architecture is introduced in order to easily add or remove the codecs and it reduces the gate count by sharing the large-size common resources.
Abstract: We propose a VLSI design of Multi-Format Decoder (MFD) to support multiple video codec standards such as MPEG-2, MPEG-4, H.264 and VC-1. A decoupled MFD architecture is introduced in order to easily add or remove the codecs. The decoupled architecture preserves the stability of the previously designed and verified codecs. It also reduces the gate count by sharing the large-size common resources. The design size is 2.4M gates and the operating clock frequency is 225MHz in the 65nm process.

Proceedings ArticleDOI
04 Jan 2008
TL;DR: This work presents a systematic method for designing reversible arithmetic circuits for finite fields (Galois fields) of the form GF(2^m), and shows that an adder over GF(2^m) can be designed with m garbage bits and a PB multiplier with 2m garbage bits.
Abstract: Motivated by the potential of reversible computing, we present a systematic method for designing reversible arithmetic circuits for finite fields (Galois fields) of the form GF(2^m). It is shown that an adder over GF(2^m) can be designed with m garbage bits and a PB multiplier with 2m garbage bits. To tackle the problem of errors in computation, we also extend the circuit with an error detection feature. Gate count and technology-oriented cost metrics are used for evaluation. The expression for the upper bound on gate size is also derived for special primitive polynomials. Our technique, when compared with an existing CAD tool, gives the same gate size and quantum cost.
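As background for the arithmetic involved, the sketch below shows ordinary (non-reversible) software versions of addition and polynomial-basis multiplication in GF(2^m), with field elements encoded as integers whose bits are polynomial coefficients. The function names and the example field GF(2^4) with reduction polynomial x^4 + x + 1 are illustrative assumptions; the sketch does not model the reversible circuits or garbage bits discussed in the paper.

```python
def gf_add(a, b):
    """Addition in GF(2^m) is a bitwise XOR of the coefficient vectors."""
    return a ^ b


def gf_mul(a, b, m, poly):
    """Shift-and-add polynomial-basis multiplication in GF(2^m).

    `poly` is the reduction polynomial including its x^m term,
    e.g. 0b10011 for x^4 + x + 1. Inputs are assumed to be below 2**m.
    """
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a      # add the current multiple of a
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> m:           # degree reached m: reduce modulo poly
            a ^= poly
    return result


# Example in GF(2^4): x * x^3 = x^4 = x + 1 under x^4 + x + 1.
product = gf_mul(0b0010, 0b1000, 4, 0b10011)
```

Because addition is a plain XOR, it is its own inverse, which hints at why an adder over GF(2^m) maps naturally onto CNOT-style reversible gates.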

Proceedings ArticleDOI
01 Nov 2008
TL;DR: This paper proposes a new application specific processor and compiler targeting H.264 inverse transform and inverse quantization based on the 6-stage pipelined dual issue VLIW+SIMD architecture, and compiler mapping techniques such as CKF (compiler known function), inline assembly and CGD (code generator description).
Abstract: This paper proposes a new application specific processor and compiler targeting H.264 inverse transform and inverse quantization. They are based on the 6-stage pipelined dual issue VLIW+SIMD architecture, efficient instructions for inverse transform and inverse quantization, and compiler mapping techniques such as CKF (compiler known function), inline assembly and CGD (code generator description). The proposed architecture whose approximate gate count is about 130 K runs at 100 MHz. Compared to the ARM1020E processor, the proposed architecture and compiler result in about 20~46% improvement in terms of total cycles as well as smaller hardware complexity.

Journal ArticleDOI
Donghyun Kim, Lee-Sup Kim
TL;DR: The proposed texture coordinate interpolation architecture uses fewer silicon gates than the architecture using dividers, and the gate count reduction ratios are 25.2% and 37.0% for 16- and 32-bit texture coordinates, respectively.

Proceedings ArticleDOI
14 Apr 2008
TL;DR: This paper proposes a novel application-specific hybrid coarse-grained reconfigurable architecture with a flexible network on chip (NoC) mechanism which supports reuse of reference frame blocks between the processing elements through NoC routers and reduces the transactions from/to the main memory.
Abstract: This paper proposes a novel application-specific hybrid coarse-grained reconfigurable architecture with a flexible network on chip (NoC) mechanism. The architecture supports variable block size motion estimation (VBSME) with far fewer resources than ASIC-based and coarse-grained reconfigurable architectures. The intelligent NoC router supports the full search motion estimation algorithm as well as other fast search algorithms like diamond, hexagon, big hexagon and spiral. Our model is a hierarchical hybrid processing element based 2D architecture which supports reuse of reference frame blocks between the processing elements through NoC routers. This reduces the transactions from/to the main memory. The proposed architecture is described in Verilog HDL and synthesized with a 90 nm CMOS standard cell library. Results show that our architecture reduces the gate count by 7x compared to its ASIC counterpart that only supports the full search method. Moreover, the proposed architecture operates at a frequency comparable to the ASIC-based implementation to sustain 30 fps. Our approach is based on a simple design which utilizes a high level of parallelism with intensive data reuse. Therefore, the proposed architecture supports run-time reconfiguration for any block size and for any search pattern depending on the application requirement.

Journal ArticleDOI
01 Sep 2008
TL;DR: Comparisons with other studies show the excellent properties of the proposed architecture in terms of gate count, memory size and clock-cycle/macroblock.
Abstract: This work presents an efficient architecture design for deblocking filter in H.264/AVC using a novel fast-deblocking boundary-strength (FDBS) technique. Based on the FDBS technique, the proposed architecture divides the deblocking process into three filtering modes, namely offset-based, standard-based and diagonal-based filtering modes, to reduce the blocking artifact and improve the video quality in H.264/AVC. The proposed architecture is designed in Verilog HDL, simulated with Quartus II and synthesized using a 0.18 μm CMOS cell library with the Synopsys Design Compiler. Simulation results demonstrate good performance in PSNR improvement and bit-rate reduction. Additionally, verification results through physical chip design reveal that the proposed architecture design can support 1,280 × 720@30 Hz processing throughput while clocking at 100 MHz. Comparisons with other studies show the excellent properties of the proposed architecture in terms of gate count, memory size and clock-cycle/macroblock.

Proceedings ArticleDOI
01 Jun 2008
TL;DR: Smart pulse-based stochastic-logic blocks are constructed to provide an efficient architecture that is able to implement Bayesian techniques, thus providing a low-cost solution in terms of gate count and power dissipation.
Abstract: We present a pattern recognition methodology based on stochastic logic. The technique implements a parallel comparison of input data from a set of sensors to various pre-stored categories. Smart pulse-based stochastic-logic blocks are constructed to provide an efficient architecture that is able to implement Bayesian techniques, thus providing a low-cost solution in terms of gate count and power dissipation. The proposed architecture is applied to a specific navigation problem demonstrating that the system provides an almost optimal solution.

Proceedings ArticleDOI
07 Jul 2008
TL;DR: Two hardware-oriented algorithms are proposed to increase the coding speed and reduce the computation complexity of the fast motion estimation (FME) algorithm, speeding up the coding time of the original standard by 71%.
Abstract: In this paper, two hardware-oriented fast motion estimation algorithms and their implementations into a 2D systolic array for variable block size motion estimation architecture are presented. Two hardware-oriented algorithms are proposed to increase the coding speed and reduce the computation complexity of the fast motion estimation (FME) algorithm. The results show that the proposed FME algorithm can speed up the coding time of the original standard by 71% with a slight PSNR loss and bit rate increase. Therefore, the hardware architecture designs for the proposed algorithms, with considerations of both motion vector cost and the sum of absolute difference (SAD) distortion, are implemented. The chip, which is realized in TSMC 0.13 μm 1P8M CMOS technology, can be operated at 200 MHz with a gate count of 191k including the memory modules.