Showing papers on "Gate count published in 2003"


Proceedings ArticleDOI
08 Sep 2003
TL;DR: All templates for m ≤ 7 are described in this paper, and a transformation reducing the gate count can be applied via template matching.
Abstract: Reversible logic functions can be realized as networks of Toffoli gates. The synthesis of Toffoli networks can be divided into two steps. First, find a network that realizes the desired function. Second, transform the network such that it uses fewer gates, while realizing the same function. This paper addresses the second step. Transformations are accomplished via template matching. The basis for a template is a network with m gates that realizes the identity function. If a sequence in the network to be synthesized matches more than half of a template, then a transformation reducing the gate count can be applied. All templates for m ≤ 7 are described in this paper.

52 citations
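
The template idea in the abstract above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' algorithm: the gate encoding, the function names, and the particular five-gate identity network are assumptions chosen for illustration. It checks that the network realizes the identity, and that a three-gate run matching more than half of it can be replaced by the remaining two gates taken in reverse order.

```python
from itertools import product

# A gate is (controls, target): line indices into a tuple of bits.
# An empty controls tuple models a NOT gate; one control models a CNOT.

def apply_toffoli(bits, controls, target):
    """Flip bits[target] when every control line is 1."""
    out = list(bits)
    if all(out[c] for c in controls):
        out[target] ^= 1
    return tuple(out)

def run_network(gates, bits):
    """Apply the gates left to right to one input assignment."""
    for controls, target in gates:
        bits = apply_toffoli(bits, controls, target)
    return bits

def realizes_identity(gates, n_lines):
    """True if the network maps every input assignment to itself."""
    return all(run_network(gates, b) == b
               for b in product((0, 1), repeat=n_lines))

def same_function(net_a, net_b, n_lines):
    """True if both networks compute the same reversible function."""
    return all(run_network(net_a, b) == run_network(net_b, b)
               for b in product((0, 1), repeat=n_lines))

# Illustrative 5-gate identity network on lines a=0, b=1, c=2 (an assumption,
# not a template quoted from the paper).
g_ab = ((0,), 1)   # CNOT a -> b
g_bc = ((1,), 2)   # CNOT b -> c
g_ac = ((0,), 2)   # CNOT a -> c
template = [g_ab, g_bc, g_ab, g_bc, g_ac]

# Three matched gates (more than half of five) ...
matched = [g_ab, g_bc, g_ab]
# ... equal the remaining two gates in reverse order (each gate is self-inverse).
replacement = [g_ac, g_bc]

print(realizes_identity(template, 3))
print(same_function(matched, replacement, 3))
```

Because every Toffoli gate is its own inverse, the replacement for a matched prefix is simply the unmatched tail of the identity network read backwards, which is why matching more than half of a template yields a net reduction.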


Proceedings ArticleDOI
09 Nov 2003
TL;DR: The synthesis algorithm first finds a cascade of Toffoli and Fredkin gates with no back-tracking and minimal look-ahead, and applies transformations that reduce the size of the circuit via template matching.
Abstract: Reversible logic has applications in quantum computing, low power CMOS, nanotechnology, optical computing, and DNA computing. The most common reversible gates are the Toffoli gate and the Fredkin gate. Our synthesis algorithm first finds a cascade of Toffoli and Fredkin gates with no backtracking and minimal look-ahead. Next we apply transformations that reduce the size of the circuit. Transformations are accomplished via template matching. The basis for a template is a network with m gates that realizes the identity function. If a sequence in the network to be synthesized matches more than half of a template, then a transformation that reduces the gate count can be applied. In this paper we show that Toffoli and Fredkin gates behave in a similar manner. Therefore, some gates in the templates may not need to be specified: they can match a Toffoli or a Fredkin gate. We formalize this by introducing the box gate. All templates with fewer than six gates are enumerated and classified. We synthesize all three-input, three-output reversible functions and compare our results to those obtained previously.

50 citations


Journal ArticleDOI
18 Jul 2003
TL;DR: A generic architecture for implementing the advanced encryption standard (AES) encryption algorithm in silicon is proposed, which allows the instantiation of a wide range of chip specifications, with these taking the form of semiconductor intellectual property (IP) cores.
Abstract: A generic architecture for implementing the advanced encryption standard (AES) encryption algorithm in silicon is proposed. This allows the instantiation of a wide range of chip specifications, with these taking the form of semiconductor intellectual property (IP) cores. Cores implemented from this architecture can perform both encryption and decryption and support four modes of operation: (i) electronic codebook mode; (ii) output feedback mode; (iii) cipher block chaining mode; and (iv) ciphertext feedback mode. Chip designs can also be generated to cover all three AES key lengths, namely 128 bits, 192 bits and 256 bits. On-the-fly generation of the round keys required during decryption is also possible. The general, flexible and multi-functional nature of the approach described contrasts with previous designs which, to date, have been focused on specific implementations. The presented ideas are demonstrated by implementation in FPGA technology. However, the architecture and IP cores derived from this are easily migratable to other silicon technologies including ASIC and PLD and are capable of covering a wide range of modern communication systems cryptographic requirements. Moreover, the designs produced have a gate count and throughput comparable with or better than the previous one-off solutions.

22 citations


Patent
28 Feb 2003
TL;DR: An improved WLAN solution for embedded systems incorporates optimized partitioning and reduces power consumption and system cost by up to 50%. All silicon gates associated with the redundant RISC processor, redundant SRAM and flash memories used in prior art WLAN solutions are eliminated.
Abstract: An improved WLAN solution for embedded systems incorporating optimized partitioning; it reduces power consumption and systems cost by up to 50%. All silicon gates associated with the redundant RISC processor, redundant SRAM and flash memories used in prior art WLAN solutions are eliminated. The invention includes a low gate count PHY Accelerator ASIC, a dual core processor (DCP), a portion of the PHY in software, and an innovative software MAC architecture supported by minimal hardware acceleration. The DCP is a standard off-the-shelf component incorporating DSP and RISC processors. It executes software portions of the MAC and PHY. The DCP communicates with the PHY Accelerator through a novel parallel interface that improves throughput while reducing processing requirements on the DCP. Also, the PHY accelerator, or certain portions of it, may be embedded into the DCP. The invention includes a novel “resource utilization scheme”, whereby the various DCP resources get judiciously re-deployed.

20 citations


Proceedings ArticleDOI
23 May 2003
TL;DR: In connection with developing a compact beamformer architecture, recursive algorithms were investigated including an original design and a technique developed by another research group, and a piecewise-linear approximation approach was also investigated.
Abstract: Modern diagnostic ultrasound beamformers require delay information for each sample along the image lines. In order to avoid storing large amounts of focusing data, delay generation techniques have to be used. In connection with developing a compact beamformer architecture, recursive algorithms were investigated. These included an original design and a technique developed by another research group. A piecewise-linear approximation approach was also investigated. Two imaging setups were targeted: conventional beamforming with a sampling frequency of 40 MHz and subsample precision of 2 bits, and an oversampled beamformer that performs sparse sample processing by reconstructing the in-phase and quadrature components of the echo signal for 512 focal points. The algorithms were synthesized for an FPGA device XCV2000E-7, for a phased array image with a depth of 15 cm. Their performance was as follows: (1) For the best parametric approach, the gate count was 2095, the maximum operation speed was 131.9 MHz, the power consumption at 40 MHz was 10.6 mW, and it requires four 12-bit words for each image line and channel. (2) For the piecewise-linear approximation, the corresponding numbers are 1125 gates, 184.9 MHz, 7.8 mW, and fifteen 16-bit words.

20 citations


01 Jan 2003
TL;DR: This paper addresses the second step of the synthesis of Toffoli networks, which is to transform the network such that it uses fewer gates, while realizing the same function.
Abstract: Reversible logic functions can be realized as networks of Toffoli gates. The synthesis of Toffoli networks can be divided into two steps. First, find a network that realizes the desired function. Second, transform the network such that it uses fewer gates, while realizing the same function. This paper addresses the second step. Transformations are accomplished via template matching. The basis for a template is a network with m gates that realizes the identity function. If a sequence in the network to be synthesized matches more than half of a template, then a transformation reducing the gate count can be applied. All templates for m ≤ 7 are described in this paper.

12 citations


Proceedings ArticleDOI
22 Apr 2003
TL;DR: This paper presents the first 476 gate count ODRGA-VLSI chip with a standard 0.35 µm 3-metal CMOS process technology using an improved dynamic optical differential reconfiguration circuit developed to reduce the implementation area of optical reconfiguration circuits.
Abstract: An optically differential reconfigurable gate array (ODRGA) is a type of field programmable gate array (FPGA), but its gate array can be reconfigured optically in less than 6 ns. We have fabricated a 68 gate-count ODRGA. However, optical differential reconfiguration circuits, which are capable of optical detection of configuration contexts and which can support reconfiguration of an arbitrary part of its gate array bit-by-bit, occupy up to 47% of the implementation area of the ODRGA-VLSI chip and prevent realization of a high gate-count ODRGA. Therefore, a dynamic optical differential reconfiguration circuit was developed to reduce the implementation area of optical reconfiguration circuits. It has been evaluated separately. This paper presents the first 476 gate count ODRGA-VLSI chip with a standard 0.35 µm 3-metal CMOS process technology using an improved dynamic optical differential reconfiguration circuit. In addition, the dynamic reconfiguration frequency and performance of logic blocks are shown using HSPICE simulation results. Finally, this paper presents an estimation of the use of a standard 0.35 µm 3-metal 14.2 × 74.2 mm chip.

12 citations


Proceedings ArticleDOI
K.L. Heo1, Jaehyun Baek1, Myung Hoon Sunwoo1, B.G. Jo, B.S. Son 
03 Nov 2003
TL;DR: This paper proposes a fast Fourier transform (FFT) processor using a new in-place strategy and the mixed-radix algorithm that can reduce the gate count and memory size compared with existing FFT processors.
Abstract: This paper proposes a fast Fourier transform (FFT) processor using a new in-place strategy and the mixed-radix algorithm. The proposed processor uses only two N-word memories for a continuous flow FFT implementation, due to the new in-place strategy, while existing continuous FFT processors use four N-word memories. In addition, the proposed processor satisfies both small area and real-time processing requirement. The gate count of the processor is 37,000 and the number of clock cycles is 640 for a 512-point FFT. Hence, the proposed FFT processor can reduce the gate count and memory size compared with existing FFT processors.

8 citations


Journal Article
TL;DR: In this paper, the authors describe the efficient implementation of Maximum Distance Separable (MDS) mappings and Substitution-boxes (S-boxes) in gate-level hardware for application to Substitution-Permutation Network (SPN) block cipher design.
Abstract: This paper describes the efficient implementation of Maximum Distance Separable (MDS) mappings and Substitution-boxes (S-boxes) in gate-level hardware for application to Substitution-Permutation Network (SPN) block cipher design. Different implementations of parameterized MDS mappings and S-boxes are evaluated using gate count as the space complexity measure and gate levels traversed as the time complexity measure. On this basis, a method to optimize MDS codes for hardware is introduced by considering the complexity analysis of bit parallel multipliers. We also provide a general architecture to implement any invertible S-box which has low space and time complexities. As an example, two efficient implementations of Rijndael, the Advanced Encryption Standard (AES), are considered to examine the different tradeoffs between speed and time.

8 citations


Proceedings ArticleDOI
01 Sep 2003
TL;DR: The FPGA implementation of the proposed video-scaling algorithm is capable of processing high-resolution, color pictures of sizes up to 1024x768 pixels at the real time video rate of 30 frames/second and compares favorably with another ASIC implementation.
Abstract: A novel architecture suitable for FPGA/ASIC implementation of a video scaler is presented. The scheme proposed here results in enormous savings of the memory normally required, without compromising image quality. In the present work, an SVGA-compatible video sequence is scaled up to XGA format. The upscaling operation for a video sequence is carried out by scaling up the image input, followed by downscaling and filtering. The FPGA implementation of the proposed video-scaling algorithm is capable of processing high-resolution, color pictures of sizes up to 1024x768 pixels at the real time video rate of 30 frames/second. The design has been realized by RTL compliant Verilog coding, and fits into a single chip with a gate count utilization of two million gates. For lower resolution pictures, the mapped device can be scaled down. The present FPGA implementation compares favorably with another ASIC implementation.

8 citations


Journal ArticleDOI
TL;DR: A parameterized digital signal processor (DSP) core for an embedded digital signal processing system designed to achieve demodulation/synchronization with better performance and flexibility is proposed.
Abstract: This paper proposes a parameterized digital signal processor (DSP) core for an embedded digital signal processing system designed to achieve demodulation/synchronization with better performance and flexibility. The features of this DSP core include a parameterized data path, dual MAC unit, subword MAC, and optional function-specific blocks for accelerating communication system modulation operations. This DSP core also has a low-power structure, which includes the gray-code addressing mode, pipeline sharing, and advanced hardware looping. Users can select the parameters and special functional blocks based on the character of their applications and then generate a DSP core. The DSP core has been implemented via a cell-based design method using synthesizable Verilog code with TSMC 0.35 µm SPQM and 0.25 µm 1P5M libraries. The equivalent gate count of the core area without memory is approximately 50 k. Moreover, the maximum operating frequency of a 16 × 16 version is 100 MHz (0.35 µm) and 140 MHz (0.25 µm).
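
The abstract above lists a gray-code addressing mode among the low-power features. The following is a minimal sketch of the general reasoning, not the paper's circuit; all function names are hypothetical. Stepping through Gray-coded sequential addresses toggles exactly one address bit per step, whereas plain binary counting toggles more on average, and fewer toggling lines generally means less switching activity on an address bus.

```python
def to_gray(n):
    """Convert a non-negative integer to its reflected binary Gray code."""
    return n ^ (n >> 1)

def from_gray(g):
    """Invert to_gray by XOR-folding the shifted code back together."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def bit_toggles(sequence):
    """Count how many bits change in total between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))

# Sixteen sequential addresses, in binary and in Gray-coded form.
binary_addresses = list(range(16))
gray_addresses = [to_gray(a) for a in binary_addresses]

print(bit_toggles(binary_addresses), bit_toggles(gray_addresses))
```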

Journal ArticleDOI
TL;DR: By modifying the BPE logical combinational circuit, both IMDCT (inverse modified discrete cosine transform) and FFT functions can be obtained simultaneously from a single VLSI chip and significantly reduces the cost of DAB receivers.
Abstract: This paper proposes a circuit-sharing approach to improve efficiency for the key digital audio broadcasting (DAB) techniques, i.e., MPEG1-audio decoding and orthogonal frequency division multiplexing (OFDM). Because OFDM's fast Fourier transform (FFT) requires heavy computational power for implementation, a single butterfly processing element (BPE) is adopted to reduce the chip area required for FFT. Furthermore, by modifying the BPE logical combinational circuit, both IMDCT (inverse modified discrete cosine transform) and FFT functions can be obtained simultaneously from a single VLSI chip. Therefore, the proposed technique reduces hardware overhead, enhances circuit efficiency and significantly reduces the cost of DAB receivers. The proposed circuit is simulated as a VLSI prototype chip using a 0.35 µm CMOS process, with a chip area of about 22.09 mm² and a total gate count of approximately 10839 (excluding ROM and RAM).

ReportDOI
01 Dec 2003
TL;DR: This project examines the use of optical logic for implementing encryption in the photonic domain to achieve the requisite encryption rates and applies techniques to the development of a 'toy' algorithm that may pave the way for more robust optical algorithms.
Abstract: With the build-out of large transport networks utilizing optical technologies, more and more capacity is being made available. Innovations in Dense Wave Division Multiplexing (DWDM) and the elimination of optical-electrical-optical conversions have brought on advances in communication speeds as we move into 10 Gigabit Ethernet and above. Of course, there is a need to encrypt data on these optical links as the data traverses public and private network backbones. Unfortunately, as the communications infrastructure becomes increasingly optical, advances in encryption (done electronically) have failed to keep up. This project examines the use of optical logic for implementing encryption in the photonic domain to achieve the requisite encryption rates. In order to realize photonic encryption designs, technology developed for electrical logic circuits must be translated to the photonic regime. This paper examines two classes of all optical logic (SEED, gain competition) and how each discrete logic element can be interconnected and cascaded to form an optical circuit. Because there is no known software that can model these devices at a circuit level, the functionality of the SEED and gain competition devices in an optical circuit were modeled in PSpice. PSpice allows modeling of the macro characteristics of the devices in context of a logic element as opposed to device level computational modeling. By representing light intensity as voltage, 'black box' models are generated that accurately represent the intensity response and logic levels in both technologies. By modeling the behavior at the systems level, one can incorporate systems design tools and a simulation environment to aid in the overall functional design. Each black box model of the SEED or gain competition device takes certain parameters (reflectance, intensity, input response), and models the optical ripple and time delay characteristics. 
These 'black box' models are interconnected and cascaded in an encrypting/scrambling algorithm based on a study of candidate encryption algorithms. We found that a low gate count, cascadable encryption algorithm is most feasible given device and processing constraints. The modeling and simulation of optical designs using these components is proceeding in parallel with efforts to perfect the physical devices and their interconnect. We have applied these techniques to the development of a 'toy' algorithm that may pave the way for more robust optical algorithms. These design/modeling/simulation techniques are now ready to be applied to larger optical designs in advance of our ability to implement such systems in hardware.

Proceedings ArticleDOI
01 Jan 2003
TL;DR: By analyzing and improving the high-radix Montgomery algorithm and the FIPS method, a new algorithm suitable for signature card applications is proposed, and the former architecture of two multipliers computing in parallel is improved so that the longest critical path of the whole design is greatly shortened.
Abstract: In this paper, a new implementation method to optimize a 1024-bit RSA crypto processor is presented. By analyzing and improving the high-radix Montgomery algorithm and the FIPS method, we propose a new algorithm that is suitable for signature card applications. A corresponding RAM management approach is introduced. We have improved the former architecture of two multipliers computing in parallel so that the longest critical path of the whole design is greatly shortened. Performance analysis is performed. As a case study, a 1024-bit RSA crypto processor is implemented. The average operating time to calculate 1024-bit modular exponentiation is 1.58M cycles. Based on the TSMC 0.25 µm standard cell library, the synthesis gate count is about 36K and the highest frequency is 66 MHz. At this speed, 1024-bit message encryption needs only 27.7 ms.
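
For readers unfamiliar with the Montgomery method mentioned in the abstract above, the following is a minimal sketch of textbook Montgomery multiplication and exponentiation on whole integers. It is not the paper's improved high-radix or FIPS variant; the function names and the small test modulus are assumptions for illustration. It requires an odd modulus smaller than 2**bits and Python 3.8 or later (for the modular inverse via `pow`).

```python
def montgomery_setup(modulus, bits):
    """Return (R, n_prime) with R = 2**bits and n_prime = -modulus^-1 mod R."""
    r = 1 << bits
    n_prime = (-pow(modulus, -1, r)) % r
    return r, n_prime

def mont_mul(a, b, modulus, bits, n_prime):
    """Compute a * b * R^-1 mod modulus without dividing by the modulus."""
    r_mask = (1 << bits) - 1
    t = a * b
    m = ((t & r_mask) * n_prime) & r_mask   # makes t + m*modulus divisible by R
    u = (t + m * modulus) >> bits           # exact division by R
    return u - modulus if u >= modulus else u

def mont_pow(base, exponent, modulus, bits):
    """Left-to-right square-and-multiply using Montgomery products."""
    r, n_prime = montgomery_setup(modulus, bits)
    x = (base * r) % modulus                # operand in Montgomery form
    acc = r % modulus                       # the value 1 in Montgomery form
    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc, modulus, bits, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, modulus, bits, n_prime)
    return mont_mul(acc, 1, modulus, bits, n_prime)  # leave Montgomery form

# Small illustrative check against Python's built-in modular exponentiation.
print(mont_pow(7, 65537, 1000003, 20) == pow(7, 65537, 1000003))
```

The appeal for hardware is that the reduction step uses only multiplications, masks and shifts by a power of two, so no trial division by the modulus is needed inside the exponentiation loop.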

Patent
Daniel Watkins1
29 Apr 2003
TL;DR: A method for reducing circuit gate count is disclosed, based on generating a new file from a source file and a parameter file, where the source file comprises a first circuit defined in a hardware description language, the new file comprises a functionally equivalent second circuit, and the parameter file specifies a faster clock frequency for the second circuit.
Abstract: A method for reducing circuit gate count is disclosed. The method generally comprises the steps of (A) generating a new file from a source file and a parameter file, wherein the source file comprises a first circuit defined in a hardware description language, the new file comprises a second circuit defined in the hardware description language, the parameter file comprises a second clock frequency for the second circuit that is faster than a first clock frequency for the first circuit, and the first circuit is functionally equivalent to the second circuit, (B) generating a first gate count by synthesizing a first design from the source file, (C) generating a second gate count by synthesizing a second design from the new file and (D) generating a statistic by comparing the first gate count to the second gate count.

Journal ArticleDOI
14 Oct 2003
TL;DR: This paper presents the implementation of a wireless multimedia DSP chip for mobile applications that employs parallel processing techniques, such as SIMD, vector processing, DSP schemes and adopts low power features for wireless applications.
Abstract: The paper presents the implementation of a wireless multimedia DSP chip for mobile applications. The implemented DSP chip supports communication instructions for Viterbi, timing synchronization, etc. as well as multimedia instructions. The DSP can handle variable length data and perform four MACs in a cycle. The proposed DSP employs parallel processing techniques, such as SIMD, vector processing, DSP schemes and adopts low power features for wireless applications. The implemented DSP chip includes test circuits and various peripherals, such as DMA, bus arbitration, timer, etc. This chip has been modeled by Verilog HDL and implemented using the 0.35 µm HCB60 library. The total gate count excluding memory is about 170,000 gates and the clock frequency is 100 MHz.

Patent
19 Jun 2003
TL;DR: In this article, a complementary code decoder technique is provided where the encoded input data is first parallelized, and correlation values are generated by a correlator circuit that is capable of changing its correlation characteristics depending on at least one control signal.
Abstract: A complementary code decoder technique is provided where the encoded input data is first parallelized. From the parallelized data, correlation values are generated by a correlator circuit that is capable of changing its correlation characteristics depending on at least one control signal. Different control signals are sequentially provided to the correlator circuit thereby driving the correlator circuit to sequentially generate multiple correlation values from the parallelized data, based on different correlation characteristics. From the multiple correlation values, the correlation value that represents the optimum correlation is identified. This technique significantly reduces the gate count of the decoder structure, thus saving chip area and manufacturing costs.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: The architecture has been successfully synthesized in a 0.13 µm process, resulting in a net-list of about 23000 gates, and a clock frequency of 195 MHz, making the performance/gate count ratio very competitive.
Abstract: This paper describes the design and implementation of a 16-bit fixed point DSP processor. The processor is intended as a platform for hardware accelerators and allows additional computational units and assembler instructions to be added. The I/O facilities can also be customized to the needs of a specific application. Benchmarking has shown that the processor, without any hardware accelerators, has a performance comparable to single MAC commercial DSP processors. The architecture has been successfully synthesized in a 0.13 µm process, resulting in a net-list of about 23000 gates, and a clock frequency of 195 MHz, making the performance/gate count ratio very competitive. It is also small enough to integrate 100 heterogeneous processors on a chip for example for communication infrastructure applications. The complete design time, including architecture and instruction set planning, assembler, debugger, instruction set simulator, RTL code and complete verification was about half a person-year.

Book ChapterDOI
Fei Li1, Lei He1, Joseph M. Basile2, Rakesh Patel2, Hema Ramamurthy2 
10 Sep 2003
TL;DR: An in-depth study of high-level leakage modeling and reduction in the context of a full custom design environment, and proposes a methodology to estimate the circuit area, minimum and maximum leakage current, and maximum power-up current, introduced by leakage reduction using sleep transistor insertion.
Abstract: Reducing the ever-growing leakage current is critical to high performance and power efficient designs. We present an in-depth study of high-level leakage modeling and reduction in the context of a full custom design environment. We propose a methodology to estimate the circuit area, minimum and maximum leakage current, and maximum power-up current, introduced by leakage reduction using sleep transistor insertion, for any given logic function. We build novel estimation metrics based on logic synthesis and gate level analysis using only a small number of typical circuits, but no further logic synthesis and gate level analysis are needed during our estimation. Compared to time-consuming logic synthesis and gate level analysis, the average errors for circuits from a leading industrial design project are 23.59% for area and 21.44% for maximum power-up current. In contrast, estimation based on quick synthesis leads to an 11x area difference in gate count for an 8-bit adder.

Proceedings ArticleDOI
23 Feb 2003
TL;DR: The study results indicate that the FHT design using 16-chip sequence achieves 90% reduction in hardware resources (equivalent gate count) as compared to the design which uses 256-chip sequence.
Abstract: In code division multiple access (CDMA) systems the base station identifies each user in a cell by unique orthogonal (Walsh) codes. The Walsh codes are generated at the transmitter using a Walsh-Hadamard function. A Fast Hadamard Transformer (FHT) is used at the receiver to decode the transmitted codes. The purpose of this study is to design a FHT which utilizes less hardware resources as compared to the existing designs and also suggest means for reducing the input length of the Walsh sequence. Our study results indicate that the FHT design using 16-chip sequence achieves 90% reduction in hardware resources (equivalent gate count) as compared to the design which uses 256-chip sequence. Also, the maximum frequency of operation of the 16-chip FHT (35.679 MHz) is more than double as compared to the 256-chip FHT (16.025 MHz).
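
The decoding step described in the abstract above can be sketched in software. The following is a minimal illustration of a fast Walsh-Hadamard transform used to identify which Walsh code was sent; it is not the paper's hardware design. The function names are hypothetical, and natural (Sylvester) Hadamard ordering with chips represented as +1/-1 is an assumption.

```python
def fwht(values):
    """In-place-style fast Walsh-Hadamard transform (natural ordering)."""
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    half = 1
    while half < n:
        # One butterfly stage: combine pairs that are `half` apart.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                x, y = data[i], data[i + half]
                data[i], data[i + half] = x + y, x - y
        half *= 2
    return data

def walsh_row(index, n):
    """Row `index` of the n x n Sylvester Hadamard matrix as +1/-1 chips."""
    return [1 - 2 * (bin(index & j).count("1") % 2) for j in range(n)]

def decode(received):
    """Return the code index whose correlation magnitude is largest."""
    spectrum = fwht(received)
    return max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))

# Send code 5 of a 16-chip set, flip one chip to mimic an error, then decode.
sent = walsh_row(5, 16)
noisy = sent[:]
noisy[3] = -noisy[3]
print(decode(sent), decode(noisy))
```

The transform needs (n/2)·log2(n) butterflies, that is 32 for a 16-chip sequence and 1024 for a 256-chip sequence, which gives a rough sense of why shorter sequences need far less arithmetic.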

Proceedings ArticleDOI
22 Jul 2003
TL;DR: A flexible motion estimator to meet the processing speed of all formats with a common architecture, wherein there are four searching algorithms built to satisfy the various processing-time required.
Abstract: Currently, various video formats, such as QCIF, CIF, CCIR601 and HDTV, are widely used in the world. Since their resolution is different, the processing speed required is different for motion estimation. Hence we need to design the specific hardware architecture for each format. In this study, we propose a flexible motion estimator to meet the processing speed of all formats with a common architecture, wherein there are four searching algorithms built to satisfy the various processing-time required. For applying to low-power systems, the computational kernel employs four processing-elements in this chip. With timing mode control, the throughput rate of the proposed motion estimator can achieve from 3k to 180k blocks to meet different applications while this chip works on 50MHz. The total gate count is less than 5k and the power dissipation is no more than 0.1mW in the worst case. Hence the very low-power motion estimation is appropriate for portable systems.

Journal ArticleDOI
TL;DR: Experimental results show that a pure combinational logic design can easily achieve a baseband-interfacing throughput of over 100 Mbps with a cost-effective gate count of 21702.
Abstract: In this paper, we propose a pure combinational logic design for the implementation of the IEEE 802.11 Medium Access Control (MAC) protocol, in contrast to firmware implementation based on an embedded micro-engine. In order to have a timely response, a Control Frame Handler is also included in our MAC controller. A further improvement for timely manipulation of the time-critical management frames (such as Beacon, ATIM and Probe Response) is subsequently developed in our revised edition. Equipped with a self-developed PCMCIA unit, the functions of our MAC controller have been verified using two Altera EPF 10K-100 ARC240-2 FPGAs in an on-line MAC-to-MAC data exchange fashion. Experimental results show that a pure combinational logic design can easily achieve a baseband-interfacing throughput of over 100 Mbps with a cost-effective gate count of 21702.

Proceedings ArticleDOI
14 Oct 2003
TL;DR: This work introduces CalmADM, an audio DSP module based on Samsung's 16-bit microprocessor, CalmRISC16, and its 24-bit DSP coprocessor, and two new architectures are adopted, one of which is shared data cache and the other is sequential stream buffer.
Abstract: We introduce CalmADM, an audio DSP module based on CalmRISC. CalmADM is based on Samsung's 16-bit microprocessor, CalmRISC16, and its 24-bit DSP coprocessor, CalmMAC24. Two new architectures are adopted in CalmADM. One is shared data cache. With this new caching scheme, we can reduce on-chip memory area while not losing cache performance and programming flexibility. The other is sequential stream buffer. This small buffer takes full charge of input/output audio stream data. Therefore, data caches in CalmADM do not suffer from performance loss caused by cache misses for input/output stream data. The area of CalmADM is about 290K in gate count including on-chip cache memories. The performance we achieved is 38 MIPS for 5.1-channel Dolby AC3 decoding and 40 MIPS for 7.1-channel MPEG2 audio layer2 decoding, including off-chip memory access overhead.