
Showing papers on "Gate count" published in 2009


Journal ArticleDOI
Jun Lin1, Zhongfeng Wang2, Li Li1, Jin Sha1, Minglun Gao1 
TL;DR: A new algorithm that can efficiently generate all the control signals for the shuffle network used in flexible low-density parity-check (LDPC) decoders is proposed and a low-complexity reconfigurable shuffle network architecture for flexible LDPC decoders is developed.
Abstract: In this brief, a new algorithm that can efficiently generate all the control signals for the shuffle network used in flexible low-density parity-check (LDPC) decoders is proposed. Employing the proposed algorithm, the hardware complexity of the controller of shuffle networks using the Benes network structure can be significantly reduced. In addition, a low-complexity reconfigurable shuffle network architecture for flexible LDPC decoders is developed. Both the Benes network and the controller can be tailored to fit specific applications. Consequently, an efficient shuffle network for WiMAX LDPC decoders is presented. Synthesis results demonstrate that with the SMIC 0.18-μm complementary metal-oxide-semiconductor process, the total gate count of the proposed shuffle network is only 16 000. The area saving is between 26.6% and 71.1% compared to related works in the literature.

33 citations
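
The abstract above builds its shuffle network on a Benes structure. As a rough illustration of how such a network grows with port count, the sketch below counts the stages and 2x2 switches of a standard Benes network using the textbook formulas (2·log2(N) − 1 stages, N/2 switches per stage). The function name and the example sizes are assumptions made for illustration; this is not the tailored network or the control-signal algorithm from the paper.

```python
def benes_size(n_inputs: int) -> tuple[int, int]:
    """Return (stages, switches) for a standard Benes network.

    Assumes n_inputs is a power of two and every stage is fully populated
    with 2x2 switches (no application-specific pruning).
    """
    if n_inputs < 2 or n_inputs & (n_inputs - 1):
        raise ValueError("n_inputs must be a power of two >= 2")
    log_n = n_inputs.bit_length() - 1      # log2 of the port count
    stages = 2 * log_n - 1                 # butterfly + inverse butterfly, middle stage shared
    switches = stages * (n_inputs // 2)    # each stage holds N/2 two-input switches
    return stages, switches


if __name__ == "__main__":
    for n in (8, 64, 128):
        print(n, benes_size(n))
```

Each 2x2 switch needs one control bit, so the switch count is also the number of control signals a controller has to generate per permutation.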


Proceedings ArticleDOI
06 Nov 2009
TL;DR: In this paper, a linear/non-linear digital controller is presented which allows a Buck converter to recover from a load transient event with near-optimal voltage deviation and recovery time.
Abstract: A linear/non-linear digital controller is presented which allows a Buck converter to recover from a load transient event with near-optimal voltage deviation and recovery time. It is demonstrated that near-optimal transient performance can be obtained without information pertaining to the Buck converter's output inductor. The proposed controller can also be extended to applications which require load-line regulation. Unlike previous digital time-optimal controllers, the proposed controller does not require digital multiplier or divider blocks nor does it require two-dimensional look-up tables. Thus, the controller can be implemented with a significantly low gate count allowing for the use of low-cost FPGAs or CPLDs. Furthermore, the proposed controller provides an excellent transient response as it is capable of reacting asynchronously to a load transient event.

33 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a new 4*4 parity preserving reversible logic gate, IG, which makes any fault that affects no more than a single signal readily detectable at the circuit's primary outputs.
Abstract: Reversible logic is emerging as an important research area having its application in diverse fields such as low power CMOS design, digital signal processing, cryptography, quantum computing and optical information processing. This paper presents a new 4*4 parity preserving reversible logic gate, IG. The proposed parity preserving reversible gate can be used to synthesize any arbitrary Boolean function. It makes any fault that affects no more than a single signal readily detectable at the circuit's primary outputs. It is shown that a fault tolerant reversible full adder circuit can be realized using only two IGs. The proposed fault tolerant full adder (FTFA) is used to design other arithmetic logic circuits for which it is used as the fundamental building block. It has also been demonstrated that the proposed design offers less hardware complexity and is more efficient in terms of gate count, garbage outputs and constant inputs than the existing counterparts. Keywords: Reversible Logic, Parity Preserving Reversible Gate, IG Gate, FTFA and Carry Skip Logic. doi: 10.3329/jbas.v32i2.2431 Journal of Bangladesh Academy of Sciences Vol.32(2) 2008 234-250

28 citations
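
The two properties named in the abstract above (reversibility and parity preservation) can be checked exhaustively for any small gate. The listing does not give the IG gate's equations, so the sketch below uses the well-known Fredkin and Toffoli gates as stand-ins; the helper names are hypothetical.

```python
from itertools import product


def fredkin(a, b, c):
    """Fredkin (controlled-swap) gate: swaps b and c when a is 1."""
    return (a, c, b) if a else (a, b, c)


def toffoli(a, b, c):
    """Toffoli gate: flips c when both a and b are 1."""
    return (a, b, c ^ (a & b))


def is_reversible(gate, width):
    """A gate is reversible if it maps distinct inputs to distinct outputs."""
    outputs = {gate(*bits) for bits in product((0, 1), repeat=width)}
    return len(outputs) == 2 ** width


def is_parity_preserving(gate, width):
    """Parity preserving: XOR of the inputs equals XOR of the outputs."""
    return all(
        sum(bits) % 2 == sum(gate(*bits)) % 2
        for bits in product((0, 1), repeat=width)
    )


if __name__ == "__main__":
    for name, gate in (("Fredkin", fredkin), ("Toffoli", toffoli)):
        print(name, is_reversible(gate, 3), is_parity_preserving(gate, 3))
```

A single-bit fault flips the overall parity, which is why a parity mismatch at the outputs of a parity preserving circuit exposes it.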


Proceedings ArticleDOI
26 Jul 2009
TL;DR: This paper presents an NoC (Networks-on-Chip) router with an SDRAM-aware flow control based on a priority-based arbitration that improves memory latency and memory utilization; a multi-scheduling scheme performed by multiple SDRAM-aware routers helps to achieve better SDRAM performance and save the hardware cost of the NoC platform.
Abstract: In this paper, we present an NoC (Networks-on-Chip) router with an SDRAM-aware flow control. Based on a priority-based arbitration, it schedules packets to improve memory utilization and reduce memory latency. Moreover, our multi-scheduling scheme performed by the multiple SDRAM-aware routers helps to achieve better SDRAM performance and save the hardware cost of NoC platform. Experimental results show that our SDRAM-aware router improves memory latency by 18% and memory utilization by 4.9% on average with over 42% saving of gate count of the NoC platform with dual memory subsystem.

28 citations


Proceedings Article
16 Jun 2009
TL;DR: This paper presents a video decoder chip for H.264/AVC high profile, MPEG-1/2 main profile and AVS Jprofile, which is capable of 60fps 1080p decoding at 200MHz by applying a dedicated DRAM sub-system and 2-D cache architecture.
Abstract: In this paper, we present a video decoder chip for H.264/AVC high profile, MPEG-1/2 main profile and AVS Jprofile, which is capable of 60fps 1080p decoding at 200MHz. By applying a dedicated DRAM sub-system and a 2-D cache architecture, 50% of pins for DRAM connection and 36% of power consumption are saved, compared to state-of-the-art work in a system perspective. Meanwhile, 38% of gate count is reduced by applying resource sharing architectures between the 3 supported video formats.

27 citations


Proceedings ArticleDOI
Yexin Zheng1, Chao Huang1
19 Jan 2009
TL;DR: This paper proposes an efficient synthesis heuristic which provides high quality synthesis results of Toffoli network in more reasonable computation time and maximally decreases function complexity during synthesis steps using a weighted, directed graph.
Abstract: Reversible logic studies have promising potential on energy lossless circuit design, quantum computation, nanotechnology, etc. Reversible logic features a one-to-one input output correspondence which makes the logic synthesis for reversible functions differ greatly from that for traditional Boolean functions. Exact synthesis methods can provide optimal solutions in terms of the total number of reversible gates in the synthesis results. Unfortunately, they may suffer from long computation time, due to the fact that the search space is likely to grow exponentially as the circuit size increases. Therefore, in this paper, we propose an efficient synthesis heuristic which provides high quality synthesis results of Toffoli network in more reasonable computation time. We use a weighted, directed graph for reversible function representation and complexity measurement. The proposed algorithm maximally decreases function complexity during synthesis steps. It has the ability to climb out of local minima and guarantees algorithm convergence. The experimental results show that our algorithm can achieve optimal or very close to optimal solutions with computation time several orders of magnitude less than the exact methods. Compared with other heuristics, our method demonstrates superior performance in terms of reversible gate count as well as computation time.

23 citations


20 Aug 2009
TL;DR: In this article, a compact architecture for AES mix columns operation and its inverse is presented, which has a lower gate count than other designs that implement both the forward and the inverse mix column operation.
Abstract: Since the debut of the Advanced Encryption Standard (AES), it has been thoroughly studied by hardware designers with the goal of reducing the area and delay of the hardware implementation of this cryptosystem. This paper proposes an implementation of the AES mix columns operation. In this paper, a compact architecture for the AES mix columns operation and its inverse is presented. The hardware implementation is compared with previous work done in this area. We show that our design has a lower gate count than other designs that implement both the forward and the inverse mix columns operation.

20 citations
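
For readers unfamiliar with the operation being optimized above, the following is a minimal software reference of the AES mix columns operation and its inverse on a single 4-byte column, using the standard circulant matrices over GF(2^8). It is a functional sketch only, not the compact hardware architecture from the paper; the function names are assumptions.

```python
def xtime(b):
    """Multiply a byte by x (i.e. by 2) in GF(2^8) modulo the AES polynomial 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


MIX_ROW = (2, 3, 1, 1)        # first row of the forward circulant matrix
INV_MIX_ROW = (14, 11, 13, 9)  # first row of the inverse circulant matrix


def apply_circulant(first_row, column):
    """Multiply a 4-byte column by the circulant matrix defined by first_row."""
    out = []
    for r in range(4):
        acc = 0
        for c in range(4):
            acc ^= gf_mul(first_row[(c - r) % 4], column[c])
        out.append(acc)
    return out


def mix_column(column):
    return apply_circulant(MIX_ROW, column)


def inv_mix_column(column):
    return apply_circulant(INV_MIX_ROW, column)


if __name__ == "__main__":
    col = [0xDB, 0x13, 0x53, 0x45]
    mixed = mix_column(col)
    print([hex(v) for v in mixed])
    print(inv_mix_column(mixed) == col)
```

The inverse coefficients (9, 11, 13, 14) are costlier than the forward ones (1, 2, 3), which is why designs that support both directions look for shared logic.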


Journal ArticleDOI
TL;DR: A highly parallel deblocking filter architecture for H.264/AVC is proposed to process one macroblock in 48 clock cycles and give real-time support to QFHD@60fps sequences at less than 100MHz.
Abstract: In this paper, a highly parallel deblocking filter architecture for H.264/AVC is proposed to process one macroblock in 48 clock cycles and give real-time support to QFHD@60fps sequences at less than 100MHz. 4 edge filters organized in 2 groups for simultaneously processing vertical and horizontal edges are applied in this architecture to enhance its throughput. While parallelism increases, pipeline hazards arise owing to the latency of edge filters and data dependency of deblocking algorithm. To solve this problem, a zig-zag processing schedule is proposed to eliminate the pipeline bubbles. Data path of the architecture is then derived according to the processing schedule and optimized through data flow merging, so as to minimize the cost of logic and internal buffer. Meanwhile, the architecture's data input rate is designed to be identical to its throughput, while the transmission order of input data can also match the zig-zag processing schedule. Therefore no intercommunication buffer is required between the deblocking filter and its previous component for speed matching or data reordering. As a result, only one 24×64 two-port SRAM as internal buffer is required in this design. When synthesized with SMIC 130nm process, the architecture costs a gate count of 30.2k, which is competitive considering its high performance.

17 citations
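
The "48 clock cycles per macroblock" and "less than 100MHz" figures in the abstract above can be cross-checked with a short calculation. The sketch assumes QFHD means 3840×2160 pixels and that macroblocks are 16×16; neither value is stated in the abstract, and the function name is hypothetical.

```python
def required_clock_hz(width, height, fps, cycles_per_mb, mb_size=16):
    """Minimum clock rate to process every macroblock of every frame in real time."""
    mbs_wide = -(-width // mb_size)    # ceiling division
    mbs_high = -(-height // mb_size)
    return mbs_wide * mbs_high * fps * cycles_per_mb


if __name__ == "__main__":
    # Assumed QFHD frame size: 3840 x 2160 at 60 frames per second.
    hz = required_clock_hz(3840, 2160, 60, 48)
    print(f"{hz / 1e6:.3f} MHz")
```

Under these assumptions the requirement is 240 × 135 × 60 × 48 = 93 312 000 cycles per second, about 93.3 MHz, which is consistent with the stated bound.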


Proceedings ArticleDOI
11 Dec 2009
TL;DR: A novel on-the-fly key expansion structure is applied to improve the throughput and outperforms prior works with respect to the parameter throughput per kilo gates with the same process.
Abstract: This paper proposes a high-throughput cost-effective implementation of AES supporting encryption and decryption with 128-, 192-, and 256-bit cipher key. Optimum irreducible polynomial coefficients are selected to construct the composite field GF(((2^2)^2)^2) on standard and normal bases in order to minimize the gate count in SubBytes/InvSubBytes transformation. In addition, MixColumn/InvMixColumn transformations are optimized and the gate count is the lowest to our knowledge. Then, a novel on-the-fly key expansion structure is applied to improve the throughput. The performance is evaluated on SMIC 0.18µm CMOS technology and the design has been verified on FPGA. The throughput reaches 1.16Gbps at a cost of only 19476 equivalent NAND2 gates, which outperforms prior works with respect to the parameter throughput per kilo gates with the same process.

16 citations
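
The "throughput per kilo gates" figure of merit used in the abstract above follows directly from the two reported numbers. A minimal sketch, with a hypothetical function name:

```python
def throughput_per_kgate(throughput_bps, gate_count):
    """Return throughput in Mbps per thousand equivalent gates."""
    return (throughput_bps / 1e6) / (gate_count / 1e3)


if __name__ == "__main__":
    # Figures quoted in the abstract: 1.16 Gbps, 19476 equivalent NAND2 gates.
    print(round(throughput_per_kgate(1.16e9, 19476), 2))
```

With the quoted values this comes to roughly 59.6 Mbps per kilo gate.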


Posted Content
TL;DR: It has been shown that the quantum cost of earlier proposals can be further reduced with the help of existing local optimization algorithms (e.g. template matching, moving rule and deletion rule) and a systematic protocol for reduction of quantum cost has been proposed.
Abstract: Multiplier circuits play an important role in reversible computation, which is helpful in diverse areas such as low power CMOS design, optical computing, DNA computing and bioinformatics. Here we propose a new reversible multiplier circuit with optimized hardware complexity. The optimized multiplier circuit is compared with the earlier proposals. We have shown that the quantum cost of earlier proposals can be further reduced with the help of existing local optimization algorithms (e.g. template matching, moving rule and deletion rule). A systematic protocol for reduction of quantum cost has been proposed. It has also been shown that the advantage in gate count obtained in some of the earlier proposals by introduction of new reversible gates is an artifact and if it is allowed then every circuit block can be reduced to a single gate. Further, it is shown that the 4x4 reversible gates proposed for designing a component of the multiplier circuit (full adder) are neither unique nor special and many such 4x4 gates may be proposed. As an example, three such new gates have been presented here and it is shown that the proposed gates are universal. It is also shown that the total cost of our design is minimum.

15 citations


Proceedings ArticleDOI
06 Mar 2009
TL;DR: The proposed work employs a memory-less combinatorial design for the implementation of SB/ISR as an alternative to achieve higher speeds by eliminating memory access delays while retaining or enhancing the overall area efficiency.
Abstract: The most critical factors responsible for bottleneck in the design and implementation of high-speed AES (Advanced Encryption Standard) architectures for any resource constrained target platform such as an FPGA are Substitute byte/Inverse SubstituteByte and MixColumn/InverseMixcolumn operations. Most implementations conventionally make use of the memory intensive look up table approach for Substitute byte/Inverse SubstituteByte (SB/ISR) block implementations resulting in an unbreakable delay. The proposed work employs a memory-less combinatorial design for the implementation of SB/ISR as an alternative to achieve higher speeds by eliminating memory access delays while retaining or enhancing the overall area efficiency. The work also explores use of sub-pipelining to further enhance the speed and throughput of the suggested implementation. The architecture employs optimization in both inverter design and isomorphic mapping using composite field arithmetic to reduce the area requirements. The proposed design replicates the very compact SB/ISR reported in [6] and [13] with an overall reduction in area requirement of 18% and 14% respectively. The optimum construction of the composite field for the AES S-Box is selected based on the complexities of subfield operations in the design of the inverter in GF(2^8), considering the effects of irreducible polynomial coefficients and isomorphic mappings, to minimize gate count and critical path. This decreased size of the SB/ISR design could help area limited hardware implementations and also allow more copies of SB/ISR for parallelism and/or pipelining of AES. The proposed decomposition method for integrated MixColumn/InverseMixcolumn (MC/IMC) optimizes the area and path delay.

Journal ArticleDOI
TL;DR: A highly efficient CAVLC encoder is proposed for video coding application of MPEG-4 AVC/H.264 to use block-based pipelining to speed up encoding efficiency and reduce the pipeline storage elements by using the associated input buffer.
Abstract: In dealing with high-resolution video information, encoding (or decoding) with an efficient context-based adaptive variable length code (CAVLC) encoder is important. A highly efficient CAVLC encoder is proposed for video coding application of MPEG-4 AVC/H.264. The main concept is to use block-based pipelining to speed up encoding efficiency and reduce the pipeline storage elements by using the associated input buffer. We also use zero-block detection to speed up encoding efficiency and eliminate the same codeword from all the tables to save the hardware cost. Simulation results show that our design can meet the real-time processing for 1920×1088 resolution with lower operation frequency. We also accomplish the higher encoding throughput with a more complete CAVLC design than others. The proposed design has been implemented and synthesised with TSMC 0.18 µm standard cell library. The synthesis result indicates that the gate count is 12 125 with the clock constraint of 125 MHz.

Proceedings Article
01 Dec 2009
TL;DR: In this article, a low-complexity and full-mode BCH decoder with long block length for DVB-S2 application is presented with the reversed error locator polynomial.
Abstract: In this paper, a low-complexity and full-mode BCH decoder with long block length for DVB-S2 application is presented. With the reversed error locator polynomial, our proposed reversed Berlekamp-Massey algorithm features a sharing architecture to perform parallel-4 syndrome and Chien search calculations. Concatenated with the LDPC decoder, which has a long decoding latency and a short period of data output time, the proposed parallel-4 BCH decoder ensures sufficient throughput with only one bank of memory. Moreover, a composite field divider instead of a large Galois field inversion table is also presented to reduce complexity. Implemented in 0.13µm CMOS technology, our parallel-4 BCH decoder occupies a gate count of 44K and can reach 380Mb/s according to the post-layout simulations.

Proceedings ArticleDOI
23 Jan 2009
TL;DR: This paper presents area optimal integer 2-D DCT architecture for H.264/AVC codecs, which will find application in hand-held/mobile devices due to its area optimized approach.
Abstract: With continuous advancement of VLSI technology it has become possible to achieve any desired performance metric, but at a cost of increased system complexity. In this paper we present area optimal integer 2-D DCT architecture for H.264/AVC codecs. The 2-D DCT calculation is performed by utilizing the separability property: the 2-D DCT is divided into two 1-D DCT calculations that share a common memory, which considerably reduces the gate count. Due to its area optimized approach the design will find application in hand-held/mobile devices. The transform module has been coded in Verilog hardware description language (HDL) and synthesized in 0.18 μm TSMC technology.
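
The separability property mentioned in the abstract above can be illustrated in software: a 2-D transform Y = C·X·Cᵀ gives the same result as a 1-D transform over the rows followed by a 1-D transform over the columns, with an intermediate buffer in between. The sketch assumes the 4×4 forward core transform matrix of H.264 (scaling and quantization omitted); the block size and function names are assumptions, not details from the abstract.

```python
# Assumed 4x4 H.264 forward core transform matrix (no scaling).
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def transform_1d(vec):
    """Apply the 1-D transform CF to a 4-element vector."""
    return [sum(CF[k][n] * vec[n] for n in range(4)) for k in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_2d_separable(block):
    """Row pass, then column pass through a shared intermediate buffer."""
    row_pass = [transform_1d(row) for row in block]             # equals X * CF^T
    col_pass = [transform_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)                                  # equals CF * X * CF^T


def transform_2d_direct(block):
    """Reference result computed as two full matrix products."""
    return matmul(matmul(CF, block), transpose(CF))


if __name__ == "__main__":
    x = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
    print(transform_2d_separable(x) == transform_2d_direct(x))
```

In hardware terms, the same 1-D datapath can serve both passes, with the intermediate buffer playing the role of the shared memory.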

Journal ArticleDOI
TL;DR: Several novel designs are proposed to reduce the synchronization latency and hardware complexity of an OFDM baseband receiver for DVB-T/H and the pre-filling scheme reduces the latency of channel estimation.
Abstract: In this paper, an OFDM baseband receiver for DVB-T/H is presented. The receiver contains four synchronizations, an OFDM symbol synchronization, a carrier synchronization, a sampling clock synchronization and a scattered pilots synchronization. This paper proposes several novel designs to reduce the synchronization latency and hardware complexity. The carrier and clock synchronization loops are fully digitalized schemes. The scattered pilots synchronization adopts a two stages scheme to reduce the detection latency. In addition, the pre-filling scheme reduces the latency of channel estimation. The design result shows that the equivalent gate count is about 810 K gates including 102.8 KB memory.

Journal ArticleDOI
TL;DR: The proposed low-cost, low-power multistandard video decoder for high definition (HD) video applications is optimized through reducing memory bandwidth by increasing both data reuse amount and burst length of memory access as well as eliminating cycle overhead in data access for supporting HD video decoding with single AHB-based SDR memory.
Abstract: This article proposes a low-cost, low-power multistandard video decoder for high definition (HD) video applications. The proposed design supports multiple-standard (JPEG baseline, MPEG-1/2/4 Simple Profile (SP), and H.264 Baseline Profile (BP)) video decoding through interactive parsing control and common parameter bus interface. In order to reduce hardware cost, the shared adder-based structure and reusable data management are proposed to achieve hardware sharing and reduce internal memory size, respectively. In addition, the proposed design is optimized through reducing memory bandwidth by increasing both data reuse amount and burst length of memory access as well as eliminating cycle overhead in data access for supporting HD video decoding with single AHB-based SDR memory. The proposed 252Kgates/4.9kB/71mW/0.13μm multi-standard video decoder reduces gate count by 72% and power consumption by 87% as compared to the state-of-the-art design, when operating at 120MHz for real-time HD1080 video decoding with single AHB-based SDR memory.

Proceedings ArticleDOI
07 Jul 2009
TL;DR: This paper presents the fastest 317–657 ns reconfiguration demonstration of a 16-context optically reconfigurable gate array architecture.
Abstract: Demand for fast dynamic reconfiguration has increased since dynamic reconfiguration can accelerate the performance of processors. Dynamic reconfiguration has two important prerequisites: fast reconfiguration and numerous reconfiguration contexts. Unfortunately, fast reconfigurations and numerous contexts share a tradeoff relation on current VLSIs. Therefore, optically reconfigurable gate arrays were developed to resolve this dilemma. Optically reconfigurable gate arrays can realize a large virtual gate count that is much larger than those of current VLSI chips by exploiting the large storage capacity of a holographic memory. Furthermore, optically reconfigurable gate arrays can realize rapid reconfiguration using large bandwidth optical connections between a holographic memory and a programmable gate array VLSI. This paper presents the fastest 317–657 ns reconfiguration demonstration of a 16-context optically reconfigurable gate array architecture.

Proceedings ArticleDOI
28 Dec 2009
TL;DR: A novel implementation of the core processors, the integer transform and quantization for H.264 video encoder using an FPGA, capable of processing the picture frames with the desired compression controlled by the user input is proposed.
Abstract: This paper proposes a novel implementation of the core processors, the integer transform and quantization for H.264 video encoder using an FPGA. It is capable of processing the picture frames with the desired compression controlled by the user input. The algorithm and architecture of the components of the video encoder, namely integer transformation and quantization, were developed, designed and coded in Verilog. The complete H.264 video encoder was coded in Matlab in order to verify the results of the Verilog implementation. The processor is implemented on a Xilinx Virtex-II Pro XC2VP30 FPGA. The gate count of the implementation is approximately 1,057,000 working at a frequency of 208 MHz. It can process 1024x768 pixel color images in 4:2:0 format at 25 frames per second. The reconstructed picture quality is better than 35 dB.

Journal ArticleDOI
TL;DR: A new heuristic algorithm is proposed to optimize the power domain clustering in controlling-value-based (CV-based) power gating technology by considering both the switching activity of sleep signals and the overall numbers of sleep gates, and the sum of the product of p and N is optimized.
Abstract: In this paper, a new heuristic algorithm is proposed to optimize the power domain clustering in controlling-value-based (CV-based) power gating technology. In this algorithm, both the switching activity of sleep signals (p) and the overall numbers of sleep gates (gate count, N) are considered, and the sum of the product of p and N is optimized. The algorithm effectively exerts the total power reduction obtained from the CV-based power gating. Even when the maximum depth is kept to be the same, the proposed algorithm can still achieve power reduction approximately 10% more than that of the prior algorithms. Furthermore, detailed comparison between the proposed heuristic algorithm and other possible heuristic algorithms are also presented. HSPICE simulation results show that over 26% of total power reduction can be obtained by using the new heuristic algorithm. In addition, the effect of dynamic power reduction through the CV-based power gating method and the delay overhead caused by the switching of sleep transistors are also shown in this paper.

01 Jan 2009
TL;DR: A novel VLSI architecture for the demodulator for processing satellite data communication is proposed and makes extensive use of LUTs and hence is ideally suited for FPGA implementation.
Abstract: This paper proposes a novel VLSI architecture for the demodulator for processing satellite data communication. The overall receiver algorithm is divided into two parts: one to be implemented on an FPGA and the other on a DSP processor. A new distributed arithmetic based architecture for implementing a Sampling Rate Converter is also proposed. The main advantage of this architecture is that it does not employ any MAC unit, whose operational speed is, generally, a bottleneck for high filter throughput. Instead, it makes extensive use of LUTs and hence is ideally suited for FPGA implementation. Architecture for Digital Frequency Synthesizer, which gives 60 dB spectral purity, is also presented. The developed FPGA core consists of a mixer and two 193-tap RRC filters to accept modulated, 12-bit, signed ADC output at a sampling frequency of 1.536 MHz and convert it into In-phase (I) and Quadrature-phase (Q) channel outputs, each of size 16 bits, signed, at half the sampling frequency. The main design goals in this work were to maintain low system complexity and reduce power consumption and chip area requirements. These architectures were coded in Verilog HDL and implemented on Xilinx FPGA. The design was synthesized with XCV600-4 FPGA and occupies about 2360 slices with an equivalent gate count of about 45000 and operating at a maximum frequency of 19.8 MHz. The entire modulator and demodulator have been coded in Matlab in order to validate the hardware results. The hardware and MATLAB results compare favorably.

Journal ArticleDOI
TL;DR: An efficient technique for finding the mean and variance of the full-chip leakage of a candidate design, while considering logic structures and both die-to-die and within-die (WID) process variations, and taking into account the spatial correlation due to WID variations is presented.
Abstract: In this paper, we present an efficient technique for finding the mean and variance of the full-chip leakage of a candidate design, while considering logic structures and both die-to-die and within-die (WID) process variations, and taking into account the spatial correlation due to WID variations. Our model uses a "random-gate" concept to capture high-level characteristics of a candidate chip design, which are sufficient to determine its leakage. These high-level characteristics include information about the process, the standard cell library, and expected design characteristics. We show empirically that, for large gate count, the set of all chip designs that share the same high-level characteristics have approximately the same leakage, with very small error. Therefore, our model can be used as either an early or a late estimator of leakage, with high accuracy. In its simplest form, we show that full-chip-leakage estimation reduces to finding the area under a scaled version of the WID channel length autocorrelation function, which can be done in constant time.

Proceedings ArticleDOI
15 Sep 2009
TL;DR: The different paths of encryption and decryption could be chosen, and the elliptic curve (EC) is based on GF(2^163), and the EC scalar multiplication is a main operation module that includes add, Montgomery multiplier and inverse in ECC architecture.
Abstract: In this paper, we propose an elliptic curve cryptographic (ECC) architecture for a lower hardware resource. In our work, the different paths of encryption and decryption could be chosen, and the elliptic curve (EC) is based on GF(2^163). The EC scalar multiplication is a main operation module that includes add, Montgomery multiplier and inverse in ECC architecture. All modules are organized in a hierarchical structure according to their complexity. In the hardware implementation using a 0.18 µm TSMC cell library, the design has a gate count of 69 K, and the maximum speed is 181 MHz. The EC multiplication time is from 1.26 ms to 2.52 ms. The private key k is a 163-bit random number. If the private key k is chosen to be a small one, the EC multiplication time would be shorter.

Journal Article
TL;DR: In this article, an inversion/non-inversion dynamic optically reconfigurable gate array (VLSI) was proposed to achieve both advantages of rapid configuration and high gate count.
Abstract: Up to now, an optically differential reconfigurable gate array taking a differential reconfiguration strategy and a dynamic optically reconfigurable gate array taking a photodiode memory architecture have been proposed. The differential reconfiguration strategy provides a higher reconfiguration frequency, with no increase in laser power, than other optically reconfigurable gate arrays, however the differential reconfiguration strategy can not achieve a high-gate-count VLSI because of the area occupied by the static configuration memory. On the other hand, the photodiode memory architecture can achieve a high-gate-count VLSI, but its configuration is slower than that of the optically differential reconfigurable gate array using equivalent laser power. So, this paper presents a novel inversion/non-inversion dynamic optically reconfigurable gate array VLSI that combines both architectures. It thereby achieves both advantages of rapid configuration and a high gate count. The experiments undertaken in this study clarify the effectiveness of the inversion/non-inversion optical configuration method.

Proceedings ArticleDOI
29 Jul 2009
TL;DR: This paper presents the first demonstration of a 16-context DORGA architecture and presents experimental results: 530–833 ns reconfiguration times and 5-9.375 µs retention times.
Abstract: Demand for fast dynamic reconfiguration has increased since dynamic reconfiguration can accelerate the performance of implementation circuits on a programmable device. Such dynamic reconfiguration necessitates two important features: fast reconfiguration and numerous contexts. However, because fast reconfiguration and numerous contexts share a tradeoff relation on current VLSIs, optically reconfigurable gate arrays (ORGAs) have been developed to resolve this dilemma. ORGAs can realize a large virtual gate count that is much larger than those of current VLSI chips by exploiting the large storage capacity of a holographic memory. Furthermore, ORGAs can realize fast reconfiguration through use of large bandwidth optical connections between a holographic memory and a programmable gate array VLSI. Among such developments, we have been developing dynamic optically reconfigurable gate arrays (DORGAs) that realize a high gate density VLSI using a photodiode memory architecture. This paper presents the first demonstration of a 16-context DORGA architecture. Furthermore, we present experimental results: 530–833 ns reconfiguration times and 5-9.375 µs retention times.

Proceedings ArticleDOI
29 Sep 2009
TL;DR: This paper presents the world's first demonstration of an ORGA with micro electro mechanical system's (MEMS) holographic memory, which can realize a large virtual gate count that is much larger than those of current VLSI chips.
Abstract: Demand for fast dynamic reconfiguration has increased since dynamic reconfiguration can accelerate the performance of implementation circuits on its programmable gate array. Such dynamic reconfiguration requires two important features: fast reconfiguration and numerous contexts. However, fast reconfigurations and numerous contexts share a trade-off relation on current VLSIs. Therefore, optically reconfigurable gate arrays (ORGAs) have been developed to resolve this dilemma. ORGAs can realize a large virtual gate count that is much larger than those of current VLSI chips by exploiting the large storage capacity of a holographic memory. Also, ORGAs can realize fast reconfiguration through use of large bandwidth optical connections between a holographic memory and a programmable gate array VLSI. This paper presents the world's first demonstration of an ORGA with micro electro mechanical system's (MEMS) holographic memory.

Book ChapterDOI
07 Mar 2009
TL;DR: This paper presents the first demonstration of a nine-context DORGA architecture and presents experimental results: 1.2–8.97 μs reconfiguration times and 66–221 μs retention times.
Abstract: Demand for fast dynamic reconfiguration has increased since dynamic reconfiguration can accelerate the performance of implementation circuits. Such dynamic reconfiguration requires two important features: fast reconfiguration and numerous contexts. However, fast reconfiguration and numerous contexts share a trade-off relation on current VLSIs. Therefore, optically reconfigurable gate arrays (ORGAs) have been developed to resolve this dilemma. ORGAs can realize a large virtual gate count that is much larger than those of current VLSI chips by exploiting the large storage capacity of a holographic memory. Also, ORGAs can realize fast reconfiguration through use of large bandwidth optical connections between a holographic memory and a programmable gate array VLSI. Among such developments, we have been developing dynamic optically reconfigurable gate arrays (DORGAs) that realize a high gate density VLSI using a photodiode memory architecture. This paper presents the first demonstration of a nine-context DORGA architecture. Furthermore, this paper presents experimental results: 1.2–8.97 μs reconfiguration times and 66–221 μs retention times.

Proceedings ArticleDOI
28 Sep 2009
TL;DR: A power-aware variable-precision multiply-accumulate (VP-MAC) unit that makes use of a dynamic-range detection unit and a 16-bit scalable Baugh-Wooley multiplier with a fixed-width error compensation circuit for DSP applications is presented.
Abstract: An energy-efficient power-aware design is highly desirable for digital signal processing (DSP) functions that encounter a wide diversity of operating scenarios in battery-powered wireless sensor network systems. Addressing this issue, this paper presents a power-aware variable-precision multiply-accumulate (VP-MAC) unit that makes use of a dynamic-range detection unit and a 16-bit scalable Baugh-Wooley multiplier with a fixed-width error compensation circuit for DSP applications. The proposed VP-MAC contains both an 8-bit and a 16-bit multiplier and has input gating to route the data to the appropriate hardware. When 16-bit multiplication is needed, the entire multiplier is used. However, if only 8-bit multiplication is needed, only the 8-bit logic is enabled. Simulated and measured results show reductions of 43% in power consumption and 42.7% in gate count, respectively, compared with a conventional power-aware scalable pipelined MAC unit.
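The input-gating idea in this abstract can be illustrated with a small behavioural sketch. This is not the authors' circuit: the function names `fits_signed` and `vp_mac` are hypothetical, and the detection rule (use the narrow path only when both operands fit in signed 8 bits) is an assumption made for illustration. The Baugh-Wooley array and the fixed-width error compensation are not modelled.

```python
from typing import Tuple


def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable in two's complement with the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def vp_mac(acc: int, a: int, b: int) -> Tuple[int, str]:
    """Multiply-accumulate that also reports which (hypothetical) datapath is enabled."""
    if not (fits_signed(a, 16) and fits_signed(b, 16)):
        raise ValueError("operands must fit in the signed 16-bit range")
    # Assumed dynamic-range detection: the 8-bit path is chosen only when
    # both operands fit in 8 bits; otherwise the full 16-bit multiplier is used.
    path = "8-bit" if fits_signed(a, 8) and fits_signed(b, 8) else "16-bit"
    return acc + a * b, path
```

For example, `vp_mac(0, 5, -7)` selects the 8-bit path, while `vp_mac(10, 300, 2)` needs the 16-bit path because 300 exceeds the signed 8-bit range.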

Proceedings ArticleDOI
01 Dec 2009
TL;DR: PRESSNoC was implemented on a Xilinx Virtex 4 FPGA device, where it required 25.5% fewer slices than the traditional NoC with a full-fledged encoding method, and the proposed architecture showed a higher probability of reducing crosstalk interference and dynamic power consumption at the same overheads.
Abstract: We propose a Power-aware and Reliable Encoding Schemes Supported reconfigurable Network-on-Chip (PRESSNoC) architecture whereby an encoding can be selected by a REasoning And Learning (REAL) framework at run-time to fit the reliability and power requirements of the application and its execution environment. PRESSNoC was implemented on a Xilinx Virtex 4 FPGA device, where it required 25.5% fewer slices than the traditional NoC with a full-fledged encoding method. The average benefit to overhead ratio of the proposed architecture is greater than that of the traditional architecture by 71%, 32%, and 277% when we consider the individual effects of interference rate per instruction, application domains, and system characteristics, respectively. This shows that the proposed architecture has a higher probability of reducing crosstalk interference and dynamic power consumption at the same overheads.
I. INTRODUCTION
Due to advanced process technologies, the decreasing distance between wires has increased the gate count capacity of a chip, but it has also led to significant crosstalk interference among adjacent wires. Crosstalk is usually caused by undesired conductive coupling from one channel to another. This problem becomes more severe in a Network-on-Chip (NoC) because of the large number of wires used for parallel communication. Further, in advanced process technologies, the ratio of wire to gate power consumption has increased significantly, such that the power consumed by wires can no longer be neglected in any power estimation model. Due to the large number of wires in an NoC, the high dynamic power consumption incurred by parallel communications among concurrently running applications has also become a design issue.
To reduce crosstalk interference and dynamic power consumption in an NoC, a well-known and popular method is to reduce the switching activity by means of data encoding schemes, which aim to reduce the occurrence of specific data patterns, such as two/three adjacent transitions and two/three aggressors, that may cause crosstalk interference or high dynamic power consumption. Nevertheless, the traditional encoding methods are fixed; that is, they cannot adapt to the varying requirements of different applications, domains, and systems.
The proposed infrastructure mainly includes a novel reconfigurable NoC architecture design, called Power-aware and Reliable Encoding Schemes Supported reconfigurable Network-on-Chip (PRESSNoC), four data encoding strategies, and an intelligent strategy selection method with reasoning and learning. PRESSNoC supports the dynamic reconfiguration (9) of encoding methods to fit the requirements of the working set of applications connected to it. The encoding methods include DUal Cycle Encoding (DUCE), Transition and Aggressor reduced Encoding (TAE), Single Additional Flit Encoding (SAFE), and Dual Additional Flit Encoding (DAFE), which differ in the provision of reliability, power efficiency, hardware resource overhead, and performance overhead. The encoding strategy selection in PRESSNoC is achieved through a REasoning And Learning (REAL) framework that can dynamically investigate the tradeoffs among reliability requirements, power reduction requirements, performance overhead, and hardware resource utilization.
The rest of the paper is organized as follows. Section II summarizes the state-of-the-art reconfigurable NoC designs. In Section III, we describe the proposed encoding strategies with a REAL framework. Experiment results are shown in Section IV. Finally, we conclude in Section V with some future work.
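As a rough illustration of how an encoding scheme can reduce switching activity on a bus, the sketch below counts wire toggles between consecutive words and applies classic bus-invert coding. Bus-invert coding is a generic textbook technique swapped in here for illustration; it is not one of the four PRESSNoC schemes (DUCE, TAE, SAFE, DAFE). The 8-bit width, the all-zero initial bus state, and every function name are assumptions, and toggles on the extra invert line are ignored for simplicity.

```python
def transitions(prev: int, curr: int) -> int:
    """Number of data wires that toggle between two consecutive bus words."""
    return bin(prev ^ curr).count("1")


def bus_invert_encode(words, width=8):
    """Send the complement plus an invert flag when that toggles fewer data wires."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial bus state
    encoded = []
    for word in words:
        flag = 1 if transitions(prev, word) > width // 2 else 0
        sent = (~word & mask) if flag else word
        encoded.append((sent, flag))
        prev = sent  # the next comparison is against what was actually driven
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (sent, flag) pairs."""
    mask = (1 << width) - 1
    return [(~sent & mask) if flag else sent for sent, flag in encoded]


def total_transitions(words, start=0):
    """Total data-wire toggles when the words are driven in sequence."""
    total, prev = 0, start
    for word in words:
        total += transitions(prev, word)
        prev = word
    return total
```

On the alternating pattern `[0x00, 0xFF, 0x00, 0xFF]` the raw sequence toggles 24 data wires in total, while the encoded data words all stay at `0x00`, so only the invert flag changes.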

Proceedings Article
28 Sep 2009
TL;DR: A compact architecture for the AES mix columns operation and its inverse is presented and it is shown that the design has a lower gate count than other designs that implement both the forward and the inverse mix column operation.
Abstract: Since the debut of the Advanced Encryption Standard (AES), it has been thoroughly studied by hardware designers with the goal of reducing the area and delay of the hardware implementation of this cryptosystem. This paper presents a compact architecture for the AES mix columns operation and its inverse. The hardware implementation is compared with previous work done in this area. We show that our design has a lower gate count than other designs that implement both the forward and the inverse mix columns operation.
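For readers unfamiliar with the operation being implemented, the following is a minimal software reference of the standard AES MixColumns transform and its inverse on a single 4-byte column. It only shows the arithmetic in GF(2^8) as defined by the AES standard; it says nothing about the compact hardware architecture the paper proposes. The helper names (`xtime`, `gmul`, `_mix`) are illustrative choices.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 0x02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gmul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) using shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def _mix(column, coeffs):
    # The matrix is circulant: row i is the coefficient tuple rotated right by i.
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            acc ^= gmul(coeffs[(j - i) % 4], column[j])
        out.append(acc)
    return out


def mix_column(column):
    """Forward MixColumns on one column (first matrix row: 02 03 01 01)."""
    return _mix(column, (0x02, 0x03, 0x01, 0x01))


def inv_mix_column(column):
    """Inverse MixColumns on one column (first matrix row: 0E 0B 0D 09)."""
    return _mix(column, (0x0E, 0x0B, 0x0D, 0x09))
```

A quick sanity check is that `inv_mix_column(mix_column(c))` returns `c` for any column `c`. The larger coefficients of the inverse matrix are one reason a combined forward/inverse datapath is costlier than the forward one alone.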

Proceedings ArticleDOI
28 Apr 2009
TL;DR: A multi-mode Reed-Solomon decoder design based on the reformulated inversionless Berlekamp-Massey (riBM) algorithm is proposed to correct both errors and erasures for any RS code including shortened codes, making the decoder suitable for VLSI realization.
Abstract: A multi-mode Reed-Solomon (RS) decoder design based on the reformulated inversionless Berlekamp-Massey (riBM) algorithm is proposed to correct both errors and erasures for any RS code including shortened codes. Without degrading the resulting performance, we effectively improve the hardware utilization of the decoder and simplify the routing network in conventional multi-mode decoder design. With the developed multi-mode arrangement, the proposed decoder possesses not only high performance but also a simple and regular interconnect topology, making the decoder suitable for VLSI realization. Experimental results reveal that for code words of length n ≤ 255 with ν errors and ρ erasures correcting capability, 0 ≤ 2ν + ρ ≤ 16, the achievable throughput rate of the proposed decoder, implemented in a TSMC 0.13 µm 1P8M process, is 4 Gbps at a maximum operating clock of 450 MHz, and the total gate count is 50K.
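The correcting capability quoted in this abstract, 0 ≤ 2ν + ρ ≤ 16, can be read as a simple budget: each error costs two units and each erasure costs one. The sketch below checks that condition; the function names and the default `budget=16` parameter are illustrative assumptions taken from the stated bound, not part of the decoder design.

```python
def is_correctable(errors: int, erasures: int, budget: int = 16) -> bool:
    """Check the errors-and-erasures condition 2*errors + erasures <= budget."""
    if errors < 0 or erasures < 0:
        raise ValueError("counts must be non-negative")
    return 2 * errors + erasures <= budget


def max_errors(erasures: int, budget: int = 16) -> int:
    """Largest number of errors still correctable alongside the given erasures."""
    if erasures < 0 or erasures > budget:
        raise ValueError("erasures must lie between 0 and the budget")
    return (budget - erasures) // 2
```

For instance, with no erasures up to 8 errors fit the budget, while 4 erasures leave room for at most 6 errors.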