Showing papers on "Gate count" published in 2001


Journal ArticleDOI
TL;DR: Experimental results show that exploiting integer bitwidth substantially reduces the gate count of PICO-synthesized hardware accelerators across a range of applications.
Abstract: Program-in chip-out (PICO) is a system for automatically synthesizing embedded hardware accelerators from loop nests specified in the C programming language. A key issue confronted when designing such accelerators is the optimization of hardware by exploiting information that is known about the varying number of bits required to represent and process operands. In this paper, we describe the handling and exploitation of integer bitwidth in PICO. A bitwidth analysis procedure is used to determine bitwidth requirements for all integer variables and operations in a C application. Given known bitwidths for all variables, complex problems arise when determining a program schedule that specifies on which function unit (FU) and at what time each operation executes. If operations are assigned to FUs with no knowledge of bitwidth, bitwidth-related cost benefit is lost when each unit is built to accommodate the widest operation assigned. By carefully placing operations of similar width on the same unit, hardware costs are decreased. This problem is addressed using a preliminary clustering of operations that is based jointly on width and implementation cost. These clusters are then honored during resource allocation and operation scheduling to create an efficient width-conscious design. Experimental results show that exploiting integer bitwidth substantially reduces the gate count of PICO-synthesized hardware accelerators across a range of applications.

95 citations
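The cost argument in the abstract above (each function unit is built for the widest operation it hosts) can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not PICO's clustering algorithm: the linear cost model, the operation widths, and the names `fu_cost` and `cluster_by_width` are all hypothetical, and scheduling constraints are ignored.

```python
# Illustrative cost model (an assumption, not PICO's): a function unit (FU)
# costs as much as the widest operation assigned to it.

def fu_cost(assignment):
    """Total cost = sum over FUs of the widest operation each must support."""
    return sum(max(widths) for widths in assignment if widths)

def cluster_by_width(op_widths, num_fus):
    """Group operations of similar width by sorting and slicing into FUs."""
    ordered = sorted(op_widths, reverse=True)
    per_fu = -(-len(ordered) // num_fus)  # ceiling division
    return [ordered[i:i + per_fu] for i in range(0, len(ordered), per_fu)]

op_widths = [32, 8, 30, 6]                    # hypothetical operation bitwidths
width_blind = [[32, 8], [30, 6]]              # assignment that ignores width
width_aware = cluster_by_width(op_widths, 2)  # [[32, 30], [8, 6]]

print(fu_cost(width_blind))  # 62
print(fu_cost(width_aware))  # 40
```

With the same four operations and two units, placing operations of similar width together lowers the modeled cost from 62 to 40 width units.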


Patent
19 Apr 2001
TL;DR: Describes a programmable error-correction decoder embodied in an integrated circuit, and an error-correction decoding method, that perform high-speed error correction for digital communication channels and digital data storage applications.
Abstract: A programmable error-correction decoder embodied in an integrated circuit and error correction decoding method that performs high-speed error correction for digital communication channels and digital data storage applications. The decoder carries out error detection and correction for digital data in a variety of data transmission and storage applications. The decoder has three basic modules, including a syndrome computation module, a Berlekamp-Massey computation module, and a Chien-Forney module. The syndrome computation module calculates syndromes, which are intermediate values required to find error locations and values. The Berlekamp-Massey module implements a Berlekamp-Massey algorithm that converts the syndromes to intermediate results known as lambda (Λ) and omega (Ω) polynomials. The Chien-Forney module uses modified Chien-search and Forney algorithms to calculate actual error locations and error values. The decoder can decode a range of BCH and Reed-Solomon codes and shortened versions of these codes and can switch between these codes, and between different block lengths, while operating on the fly without any delay between adjacent blocks of data that use different codes. Translator and inverse-translator circuits are employed that allow optimal choice of the internal on-chip Galois field representation for maximizing chip speed and minimizing chip gate count by making possible the use of a novel quadratic-subfield modular multiplier and a novel power-subfield integrated Galois-field divider. A simplified Chien-Forney algorithm is implemented that requires fewer computations to determine error magnitudes for Reed-Solomon codes with offsets compared to conventional approaches, and which allows the same circuitry to be used for different codes with arbitrary offsets.

90 citations


Journal ArticleDOI
TL;DR: This paper compares several postcursor equalization and trellis decoding algorithms with respect to performance, hardware complexity, and critical path and shows that parallel decision-feedback decoders (PDFD) offer the best tradeoff.
Abstract: 1000BASE-T Gigabit Ethernet employs eight-state 4-dimensional trellis-coded modulation to achieve robust 1-Gb/s transmission over four pairs of Category-5 copper cabling. This paper compares several postcursor equalization and trellis decoding algorithms with respect to performance, hardware complexity, and critical path. It is shown that parallel decision-feedback decoders (PDFD) offer the best tradeoff. The example of a 14-tap PDFD, however, shows that it is challenging to meet the required throughput of 1 Gb/s using current standard-cell CMOS technology. A modified approach is proposed which uses decision-feedback prefilters followed by a one-tap PDFD. This considerably reduces hardware complexity and improves the throughput while still meeting the bit-error-rate requirement. The critical path is further reduced by employing a look-ahead technique. The proposed joint equalizer and trellis decoder architecture has been implemented in a 3.3-V 0.25-μm standard-cell CMOS process. It achieves a throughput of 1 Gb/s with a 125 MHz clock. Compared to a 14-tap PDFD, the design improves both gate count and throughput by a factor of two, while suffering only a 1.3-dB performance degradation.

50 citations


Proceedings ArticleDOI
06 May 2001
TL;DR: Proposes and implements a hardwired DA (distributed arithmetic) method with radix-2 multibit coding for minimum resource use, together with symmetric transpose memory for high speed.
Abstract: In this paper, we evaluate a hardware implementation method for a general DCT/IDCT-compatible architecture with minimum resource use and high speed. We propose and implement the hardwired DA (distributed arithmetic) method with radix-2 multibit coding for minimum resource use, and we use symmetric transpose memory for high speed. Generally, the IDCT procedure consists of two 1D-IDCT procedures and one transpose. This architecture shows that some resources of the IDCT core are reusable for the DCT process. We propose a general scheme for the processing element, whose gate count is 8.6 K for DCT and 9.2 K for IDCT, through Verilog HDL simulation in 0.65 μm SOG technology. We also verify that the simulation results using Matlab are acceptable for IEEE Std 1180-1990.

23 citations
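For readers unfamiliar with distributed arithmetic, the sketch below shows the textbook bit-serial form: an inner product with fixed coefficients is computed from table lookups, shifts, and adds, with no multiplier. It is an illustration only and assumes unsigned inputs; the paper's radix-2 multibit coding and transpose memory are not modeled, and the function names and sample values are hypothetical.

```python
def build_da_lut(coeffs):
    """LUT[m] = sum of the coefficients whose index bit is set in mask m."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (m >> k) & 1)
            for m in range(1 << n)]

def da_inner_product(coeffs, xs, bits):
    """Compute sum(c * x) for unsigned `bits`-bit inputs via lookups, shifts, adds."""
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one LUT address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        acc += lut[addr] << b  # weight this partial sum by 2**b
    return acc

coeffs = [3, -1, 4, 2]  # hypothetical fixed coefficients
xs = [5, 7, 0, 9]       # hypothetical 4-bit unsigned inputs
print(da_inner_product(coeffs, xs, bits=4))  # 26
```

The lookup table grows as 2^n in the number of coefficients, which is why practical designs split or recode it.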


Proceedings ArticleDOI
06 May 2001
TL;DR: This paper describes a 10.7 Gb/s throughput FEC (Forward Error Correction) codec LSI for optical transmission systems that uses a time-multiplexed Reed-Solomon (RS) decoder, which is shared among 4 RS codewords and processes 5 parallel digits.
Abstract: This paper describes a 10.7 Gb/s throughput FEC (Forward Error Correction) codec LSI for optical transmission systems. In order to reduce the power consumption and logic size, the FEC codec uses a time-multiplexed Reed-Solomon (RS) decoder, which is shared among 4 RS codewords and processes 5 parallel digits. The time-multiplexed RS decoder requires only 58% of the gates and 75% of the power consumption of the conventional decoder. As a result, the codec achieves a low power consumption of only 3.31 W and a low gate count of only 1.1 Mgates using 0.18 μm CMOS technology.

11 citations


01 Jan 2001
TL;DR: This thesis proposes and analyzes various methods of implementing barrel shifters to understand the tradeoffs of various barrel shifter design approaches in order to recognize where each may be most useful.
Abstract: Barrel shifters are arithmetic and logic circuits that may be utilized to shift or rotate data in a general-purpose microprocessor or digital signal processor. This thesis proposes and analyzes various methods of implementing barrel shifters. The purpose of this thesis is to understand the tradeoffs of various barrel shifter design approaches in order to recognize where each may be most useful. Each design is a compromise between gate count and critical path latency. In an attempt to reduce both, the proposed designs utilize a number of innovative design techniques. The techniques can be divided into two categories: those addressing uni-directional result computation and those providing the logic necessary to implement all operations with uni-directional hardware support. Four design schemes were employed to test each of the techniques: Mux-based Data Reversal, Mask-based Data Reversal, Mask-based Two's Complement, and Mask-based One's Complement. The mux-based and mask-based descriptors indicate the uni-directional result computation method, while the rest specify the mechanism used to emulate bi-directional operations with uni-directional hardware support. Analysis of each design reveals some unique findings. First, the designs using the two's complement and one's complement mechanisms were found to have a critical path latency much higher than expected; thus they are of very limited use unless the shift/rotate amount arrives earlier than the data to be shifted or rotated. Second, the optimal designs were found to be the Mux-based Data Reversal and Mask-based Data Reversal approaches. Each had comparable area-delay products. If gate count minimization is the primary concern, then the mux-based approach is preferred. Likewise, critical path latency minimization is achieved with the mask-based approach. Thus, no single design is preferred for all circumstances. Instead, use is highly dependent on the particular demands placed on the circuit.

10 citations
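The data-reversal idea named in the abstract above can be sketched in software: a shifter that only shifts right can also shift left if the data is bit-reversed before and after. The code below is a behavioral illustration under assumptions, not the thesis's circuit. The 8-bit width and all function names are hypothetical, and the shift amount is assumed to lie in 0..width-1.

```python
WIDTH = 8  # hypothetical data width

def shift_right_log(x, amount, width=WIDTH):
    """Logical right shift built from cascaded 2:1 mux stages (1, 2, 4, ... bits)."""
    stage = 0
    while (1 << stage) < width:
        if (amount >> stage) & 1:  # mux select for this stage
            x >>= (1 << stage)
        stage += 1
    return x & ((1 << width) - 1)

def reverse_bits(x, width=WIDTH):
    """Mirror the bit order of x within `width` bits."""
    r = 0
    for i in range(width):
        r |= ((x >> i) & 1) << (width - 1 - i)
    return r

def shift_left_via_reversal(x, amount, width=WIDTH):
    """Left shift using only the right shifter: reverse, shift right, reverse."""
    return reverse_bits(shift_right_log(reverse_bits(x, width), amount, width), width)

print(bin(shift_left_via_reversal(0b00010110, 3)))  # 0b10110000
```

In hardware the two reversals are wiring plus a row of multiplexers, which is the gate-count cost that the reversal approach trades against a second shifter.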


Proceedings ArticleDOI
23 Oct 2001
TL;DR: A new VLSI architecture for a high-radix modular multiplier, used to compute the RSA public-key cryptosystem with a modified Montgomery algorithm, achieves good chip area and speed for smart cards.
Abstract: We propose a new VLSI architecture for a high-radix modular multiplier to compute the RSA public-key cryptosystem based on the modified Montgomery algorithm. A 1024-bit RSA crypto-coprocessor has been implemented based on our proposed VLSI architecture. The proposed architecture operates in a pipelined fashion and takes about u + 6√u clock cycles to compute one u-bit modular multiplication and about 1.5u(u + 6√u) clock cycles to calculate a u-bit modular exponentiation. The simulation shows that the gate count of the processor is about 38K, and the time to calculate a 1024-bit modular exponentiation is about 374 ms at 5 MHz. Compared with previous methods, our proposed architecture can achieve good performance in chip area and speed for smart cards.

6 citations
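The timing figure in the abstract above can be checked by hand. Reading the cycle-count expression as u + 6√u per modular multiplication (an assumption about the garbled source notation), the quoted 374 ms at 5 MHz follows directly for u = 1024. The function names below are hypothetical.

```python
import math

def modmul_cycles(u):
    """Approximate cycles for one u-bit modular multiplication: u + 6*sqrt(u)."""
    return u + 6 * math.sqrt(u)

def modexp_cycles(u):
    """Approximate cycles for one u-bit modular exponentiation: 1.5*u*(u + 6*sqrt(u))."""
    return 1.5 * u * modmul_cycles(u)

u = 1024
clock_hz = 5e6                # 5 MHz, as stated in the abstract
cycles = modexp_cycles(u)
seconds = cycles / clock_hz

print(modmul_cycles(u))       # 1216.0
print(cycles)                 # 1867776.0
print(round(seconds * 1000))  # 374 (milliseconds)
```

The factor 1.5u is consistent with binary exponentiation needing about u squarings plus, on average, u/2 multiplications.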


Journal ArticleDOI
TL;DR: It was found that the RSFQ system is superior in terms of the operating speed though it requires extremely large chip area, compared with the CMOS microprocessor with the same architecture.
Abstract: We propose a cell-based top-down design methodology for rapid single flux quantum (RSFQ) digital circuits. Our design methodology employs a binary decision diagram (BDD), which is currently used for the design of CMOS pass-transistor logic circuits. The main features of the BDD RSFQ circuits are the limited primitive number, dual rail nature, non-clocking architecture, and small gate count. We have made a standard BDD RSFQ cell library and prepared a top-down design CAD environment, by which we can perform logic synthesis, logic simulation, circuit simulation and layout view extraction. In order to clarify problems expected in large-scale RSFQ circuits design, we have designed a small RSFQ microprocessor based on simple architecture using our top-down design methodology. We have estimated its system performance and compared it with that of the CMOS microprocessor with the same architecture. It was found that the RSFQ system is superior in terms of the operating speed though it requires extremely large chip area.

5 citations


Patent
28 Sep 2001
TL;DR: Look-up tables in the field programmable gate array are used to store preselected values for the substitution box used in many encryption/decryption schemes, which reduces the overall gate count in the FPGA device, resulting in quicker speeds, lower power consumption, and the ability to reconfigure the device for different encryption and decryption implementations.
Abstract: To improve data encryption and/or decryption, look-up tables in the field programmable gate array are used to store preselected values for the substitution box used in many encryption/decryption schemes. Utilizing look-up tables in such a manner reduces the overall gate count in the FPGA device resulting in quicker speeds, lower power consumption, and the ability to reconfigure the device for different encryption/decryption implementations.

5 citations


Proceedings ArticleDOI
29 Nov 2001
TL;DR: A 32-bit Java embedded multimedia processor (MmP) is prototyped on a 20k-gate FPGA; with a sufficient instruction set, it is significantly faster than other Java-embedded processors on FPGAs of similar speed grade.
Abstract: Further improvement of multimedia processors is strongly expected, because they are essential for the ever-growing Internet. Processors that embed the user-friendly programming language Java as a common base are sophisticated mixed systems merging hardware and software. Field Programmable Gate Array (FPGA) prototyping is an attractive approach to reducing the cost and effort of designing them. We describe here the prototyping of a 32-bit Java embedded multimedia processor (MmP) on a 20k-gate FPGA. The main concern in our FPGA prototyping is reconciling space with clock speed. The switching speed of the FPGA is not of a high grade because of its inevitable tradeoff against gate count, which we must keep to some degree to cover inherent Java features. Nevertheless, MmP runs at more than 25 MHz owing to careful adjustment throughout the development process, from hardware description to chip implementation. MmP with a sufficient instruction set is significantly faster than other Java-embedded processors on FPGAs of similar speed grade. The approach described here naturally extracts maximum benefit for other types of ASIC and custom designs as well.

2 citations


Journal ArticleDOI
TL;DR: This paper presents a new digital FM demodulator using a discriminator with a cosine-shaped S curve, which reduces the gate count of the functional operations to 1.5k gates, equivalent to 1/10 of the present gate count.
Abstract: This paper presents a new digital FM demodulator using a discriminator with a cosine-shaped S curve. In the present arc-tangent type demodulator, three functional operations are required to calculate the instantaneous frequency. Since the gate count for such functional operations is large, it must be reduced in order to promote system-on-a-chip integration. Therefore, a cosine type demodulator using one divider is proposed. As a result, the gate count of the functional operations becomes 1.5k gates, equivalent to 1/10 of the present gate count. An experimental VCR using the proposed demodulator implemented on the device is developed to evaluate and confirm the validity of the cosine type demodulator.

Book ChapterDOI
01 Jan 2001
TL;DR: This work focuses on optimising the behaviour of the conditional code which is dominated by complex condition test expressions and achieves a factor 2 reduction of the critical path with a smaller gate count overhead when compared to traditional RT or high-level synthesis based approaches.
Abstract: Data intensive applications (i.e., multimedia) are clearly dominated by data transfer and storage issues. However, after removing the data transfer and address related bottlenecks, the control-flow mapping issues remain as important implementation overhead in a custom hardware realisation. The source of this overhead can be the presence of complex conditional code execution, loops, or a mix of both. In this work, we focus on optimising the behaviour of the conditional code which is dominated by complex condition test expressions. Our transformations aim in a first stage at increasing the degree of mutual exclusiveness of the initial condition trees. This step is complemented by optimising the decoding of the test expressions. In a second stage, architecture exploration is performed by trading off gate count against critical-path delay at the high level for the resulting code. We demonstrate the proposed transformations on a real-life driver using conventional behavioral synthesis tools as synthesis back-end. The driver selected represents the crucial timing bottleneck in a scalable architecture for MPEG-4 Wavelet Quantisation. Using our approach, we have explored the design space at the high level in a very short time and we have obtained a factor 2 reduction of the critical path with a smaller gate count overhead when compared to traditional RT or high-level synthesis based approaches, even when applied by experienced designers.

Proceedings ArticleDOI
01 Jan 2001
TL;DR: Through special extensions to the MIPS instruction and register set, the performance was increased by about 90% while only augmenting the gate count from 25,000 to 40,000.
Abstract: We present the hardware/software architecture of an application specific processor for real-time bit stream decoding of MPEG-4 object based profiles. Inclusion of complete motion vector and shape decoding enables efficient pipelining between the bit stream processor and the remaining parts of the MPEG decoder. Through special extensions to the MIPS instruction and register set, the performance was increased by about 90% while only augmenting the gate count from 25,000 to 40,000.

Journal ArticleDOI
S. Jones
01 Feb 2001
TL;DR: A novel, parallel and compact design for phased binary coding is described, which is widely used in dictionary-based lossless data compression and permits more efficient coding of dictionary references during dictionary growth.
Abstract: A novel, parallel and compact design for phased binary coding is described. Such an algorithm is widely used in dictionary-based lossless data compression and permits more efficient coding of dictionary references during dictionary growth. Parallelism within the algorithm is exploited to reduce the critical path while still maintaining a low gate count.
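As background for the abstract above, the sketch below assumes that "phased binary coding" refers to the standard phased-in (truncated) binary code, in which a reference into a dictionary of n entries costs either ⌊log2 n⌋ bits or one bit more. It is a sequential software illustration, not the parallel hardware design the paper describes, and the function name is hypothetical.

```python
def phased_in_encode(x, n):
    """Encode x in range(n) as a bit string of floor(log2 n) or floor(log2 n) + 1 bits."""
    if not 0 <= x < n:
        raise ValueError("x must be in range(n)")
    k = n.bit_length() - 1      # floor(log2(n))
    short = (1 << (k + 1)) - n  # how many values receive the shorter k-bit code
    if x < short:
        return format(x, "b").zfill(k) if k else ""
    return format(x + short, "b").zfill(k + 1)

# A dictionary that has grown to 5 entries: three 2-bit codes, two 3-bit codes.
print([phased_in_encode(x, 5) for x in range(5)])
# ['00', '01', '10', '110', '111']
```

A fixed-width code would spend 3 bits on every reference at n = 5; the phased-in code saves one bit on three of the five values while staying prefix-free.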

Proceedings ArticleDOI
Dong Ku Kim, Dong Wook Roh, Young-Kwan Choi, Sung Woo Kwon, Moon Ki Lee
06 May 2001
TL;DR: A VSB (vestigial sideband) wideband CDMA system for an overlay system is proposed, which supports gradual evolution from a narrowband CDMA system to a wideband CDMA system.
Abstract: In this paper, a VSB (vestigial sideband) wideband CDMA system for an overlay system is proposed. The proposed method, which supports gradual evolution from a narrowband CDMA system to a wideband CDMA system, is evaluated using analysis and simulation. It is also described in VHDL and verified and synthesized using Synopsys. The gate-count saving achieved by introducing CSD-CSCF and a common energy calculator is about 30.6%.