
Showing papers in "IEEE Transactions on Very Large Scale Integration (VLSI) Systems in 2009"


Journal ArticleDOI
TL;DR: Applying clock gating to the energy recovery clocked flip-flops reduces their power by more than 1000× in the idle mode with negligible power and delay overhead in the active mode.
Abstract: A significant fraction of the total power in highly synchronous systems is dissipated over clock networks. Hence, low-power clocking schemes are promising approaches for low-power design. We propose four novel energy recovery clocked flip-flops that enable energy recovery from the clock network, resulting in significant energy savings. The proposed flip-flops operate with a single-phase sinusoidal clock, which can be generated with high efficiency. In the TSMC 0.25-µm CMOS technology, we implemented 1024 proposed energy recovery clocked flip-flops through an H-tree clock network driven by a resonant clock generator producing a sinusoidal clock. Simulation results show a power reduction of 90% on the clock tree and total power savings of up to 83% as compared to the same implementation using the conventional square-wave clocking scheme and flip-flops. Using a sinusoidal clock signal for energy recovery prevents application of existing clock gating solutions. In this paper, we also propose clock gating solutions for energy recovery clocking. Applying our clock gating to the energy recovery clocked flip-flops reduces their power by more than 1000× in the idle mode with negligible power and delay overhead in the active mode. Finally, a test chip containing two pipelined multipliers, one designed with conventional square-wave clocked flip-flops and the other with the proposed energy recovery clocked flip-flops, was fabricated and measured. Based on measurement results, the energy recovery clocking scheme and flip-flops show a power reduction of 71% on the clock tree and 39% on the flip-flops, resulting in an overall power savings of 25% for the multiplier chip.
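As a rough illustration of why resonant clocking saves power, the following back-of-the-envelope Python sketch compares conventional full-swing clock dissipation (C·V²·f) against a first-order adiabatic-charging model; all parameter values are assumptions for illustration and are not taken from the paper.

```python
# First-order comparison of conventional vs. resonant (energy recovery)
# clock power -- a hedged back-of-the-envelope model, not the paper's circuit.
# All parameter values below are illustrative assumptions.

C_clk = 100e-12   # total clock capacitance (F), assumed
Vdd   = 2.5       # supply voltage (V), typical for 0.25-um CMOS
f     = 200e6     # clock frequency (Hz), assumed
R     = 5.0       # effective clock-network series resistance (ohms), assumed

# Conventional square-wave clocking dissipates the full C*V^2 each cycle.
P_conv = C_clk * Vdd**2 * f

# Adiabatic charging over a transition time T dissipates ~(RC/T)*C*V^2;
# with a sinusoidal clock, T is on the order of half the period.
T = 0.5 / f
P_resonant = 2 * (R * C_clk / T) * C_clk * Vdd**2 * f  # charge + discharge

print(f"conventional: {P_conv*1e3:.1f} mW, resonant: {P_resonant*1e3:.1f} mW")
print(f"clock-power reduction: {100*(1 - P_resonant/P_conv):.0f}%")
```

With these assumed numbers the model predicts roughly a 60% clock-power reduction; the exact constant in the adiabatic term depends on the drive waveform, so this is indicative only.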

173 citations


Journal ArticleDOI
TL;DR: The benefits and limitations of technology scaling for subthreshold logic are investigated from the 0.25-µm to the 32-nm node; scaling to 90/65-nm nodes is shown to be highly desirable for medium-throughput applications due to the large reduction in dynamic energy.
Abstract: Subthreshold logic is an efficient technique to achieve ultralow energy per operation for low-to-medium-throughput applications. In this paper, the benefits and limitations of technology scaling for subthreshold logic are investigated from the 0.25-µm to the 32-nm node. Scaling to 90/65-nm nodes is shown to be highly desirable for medium-throughput applications (1-10 MHz) due to the large reduction in dynamic energy. However, this benefit is limited at 45/32-nm nodes by high static energy due to degraded subthreshold swing and delay variability. Moreover, for low-throughput applications (10-100 kHz), this limitation is worsened by the increase of the minimum supply voltage required to achieve sufficient functional yield, which results in poor energy efficiency starting at the 0.13-µm node. Upsizing the channel length is proposed as a straightforward circuit-level technique to efficiently mitigate these effects. At the 32-nm node, this technique reduces energy per operation by 60% at medium throughput and by two orders of magnitude at low throughput.
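The tradeoff the abstract describes is the classic subthreshold minimum-energy point: dynamic energy falls quadratically with supply voltage while leakage energy grows as delay explodes. The Python sketch below sweeps supply voltage under a generic first-order model; every device parameter in it is an assumption for illustration, not a value from the paper.

```python
import numpy as np

# Minimal sketch of the classic subthreshold minimum-energy-point model;
# all device parameters are assumed, not taken from the paper.

C_eff = 50e-15      # switched capacitance per operation (F), assumed
I0    = 1e-7        # subthreshold current extrapolation constant (A), assumed
n, Ut = 1.5, 0.026  # slope factor and thermal voltage (V)
Vth   = 0.35        # threshold voltage (V), assumed
W_leak = 1000.0     # aggregate leaking-width / logic-depth factor, assumed

V = np.linspace(0.15, 0.6, 200)
t_delay = C_eff * V / (I0 * np.exp((V - Vth) / (n * Ut)))   # delay ~ C*V/Ion
E_dyn  = C_eff * V**2                                       # falls with V
E_leak = W_leak * I0 * np.exp(-Vth / (n * Ut)) * V * t_delay  # grows as delay explodes
E_tot  = E_dyn + E_leak

print(f"minimum-energy supply voltage ~ {V[np.argmin(E_tot)]:.2f} V")
```

Degraded subthreshold swing (larger n) and delay variability at 45/32 nm push this minimum up and flatten the energy savings, which is the effect the paper quantifies.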

165 citations


Journal ArticleDOI
TL;DR: This paper proposes a new general-purpose sensor processor architecture, called the Subliminal Processor, optimized across design stages including ISA definition, microarchitecture evaluation, and circuit and implementation optimization.
Abstract: Subthreshold circuits have drawn strong interest in recent ultralow-power research. In this paper, we present a highly efficient subthreshold microprocessor targeting sensor applications. It is optimized across different design stages including ISA definition, microarchitecture evaluation, and circuit and implementation optimization. Our investigation concludes that microarchitectural decisions in the subthreshold regime differ significantly from those in conventional superthreshold mode. We propose a new general-purpose sensor processor architecture, which we call the Subliminal Processor. On the circuit side, subthreshold operation is known to exhibit an optimal energy point (Vmin). However, propagation delay also becomes more sensitive to process variation and can reduce the energy scaling gain. We conduct a thorough analysis of how supply voltage and operating frequency impact energy efficiency in a statistical context. With careful library cell selection and robust static RAM design, the Subliminal Processor operates correctly down to 200 mV in a 0.13-µm technology, which is sufficiently low to operate at Vmin. Silicon measurements of the Subliminal Processor show a maximum energy efficiency of 2.6 pJ/instruction at 360 mV supply voltage and 833 kHz operating frequency. Finally, we examine the variation in frequency and Vmin across dies to verify our analysis of adaptively tuning the clock frequency and Vmin for optimal energy efficiency.

157 citations


Journal ArticleDOI
TL;DR: To explore integrated solar energy harvesting as a power source for low power systems, an array of energy scavenging photodiodes based on a passive-pixel architecture for CMOS imagers has been fabricated together with storage capacitors implemented using on-chip interconnect in a 0.35-µm bulk process.
Abstract: To explore integrated solar energy harvesting as a power source for low power systems, an array of energy scavenging photodiodes based on a passive-pixel architecture for CMOS imagers has been fabricated together with storage capacitors implemented using on-chip interconnect in a 0.35-µm bulk process. Integrated vertical plate capacitors enable dense energy storage without limiting optical efficiency. Tests were conducted with both a white light source and a green laser. Measurements indicate that an output power of 225 µW/mm² may be generated by white light with an intensity of 20 klux.

145 citations


Journal ArticleDOI
TL;DR: An inductor-less on-chip micro power management system for light energy harvesting is presented, targeting a wide variety of applications that operate in lighting environments ranging from strong sunlight to dim indoor lighting, where the output voltage from the photovoltaic cells is low.
Abstract: An inductor-less on-chip micro power management system for light energy harvesting applications is presented. We target a wide variety of applications that operate in different lighting environments, ranging from strong sunlight to dim indoor lighting, where the output voltage from the photovoltaic cells is low. A step-up charge pump is used to directly operate the circuit or to charge a rechargeable battery. The power management system's operation is discussed, and the control strategy for transferring the maximum output power from the power system is presented. A low-power circuit design is proposed for the implementation of the system's maximum-output-power control. The system was implemented in a 0.35-µm CMOS process. The chip was fabricated, and measurements were conducted under different lighting conditions to demonstrate the system operation and verify the control strategy.
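One common realization of such maximum-output-power control is a perturb-and-observe loop; the Python sketch below illustrates that flavor of control against a toy photovoltaic curve. It is an assumption-laden stand-in, not the paper's analog, low-power implementation.

```python
# Hedged sketch of perturb-and-observe maximum-power-point control, a common
# strategy for the kind of maximum-output-power tracking the paper describes.
# The P-V curve and all constants are invented for illustration.

def pv_power(v):
    """Toy photovoltaic P-V curve (illustrative only)."""
    i_sc, v_oc = 1e-3, 0.5                 # assumed short-circuit current / open-circuit voltage
    i = i_sc * (1 - (v / v_oc) ** 8)       # crude diode-like current rolloff
    return max(i, 0.0) * v

def perturb_and_observe(v=0.2, step=0.01, iters=50):
    p_prev = pv_power(v)
    direction = +1
    for _ in range(iters):
        v += direction * step              # perturb the operating voltage
        p = pv_power(v)
        if p < p_prev:                     # power dropped: reverse direction
            direction = -direction
        p_prev = p
    return v, p_prev

v_mpp, p_mpp = perturb_and_observe()
print(f"operating point ~ {v_mpp:.3f} V, {p_mpp*1e6:.1f} uW")
```

The loop ends up dithering around the maximum-power voltage, which is the behavior a hardware control loop approximates with comparators and a charge-pump clock enable.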

134 citations


Journal ArticleDOI
TL;DR: In this paper, the authors consider network-on-chip (NoC) architectures partitioned into several voltage-frequency islands (VFIs) and propose a design methodology for runtime energy management.
Abstract: The design of many-core systems-on-chip (SoCs) has become increasingly challenging due to high levels of integration, excessive energy consumption, and clock distribution problems. To deal with these issues, we consider network-on-chip (NoC) architectures partitioned into several voltage-frequency islands (VFIs) and propose a design methodology for runtime energy management. The proposed approach minimizes the energy consumption subject to performance constraints. Then, we present efficient techniques for on-the-fly workload monitoring and management to ensure that the system can cope with variability in the workload and various technology-related parameters. Simulation results demonstrate the effectiveness of our approach in reducing the overall system energy consumption for a real video application. Finally, the results and functional correctness are validated using a field-programmable gate-array (FPGA) prototype for an NoC with multiple VFIs.

133 citations


Journal ArticleDOI
TL;DR: A number of optimizations built into the skeleton and applied at compile time, depending on the user-supplied parameters, result in high-performance FPGA implementations tailored to the algorithm at hand.
Abstract: This paper presents the design and implementation of the most parameterisable field-programmable gate array (FPGA)-based skeleton for pairwise biological sequence alignment reported in the literature. The skeleton is parameterised in terms of the sequence symbol type, i.e., DNA, RNA, or protein sequences; the sequence lengths; the match score, i.e., the score attributed to a symbol match, mismatch, or gap; and the matching task, i.e., the algorithm used to match sequences, which includes global alignment, local alignment, and overlapped matching. Instances of the skeleton implement the Smith-Waterman and the Needleman-Wunsch algorithms. The skeleton has the advantage of being captured in the Handel-C language, which makes it FPGA platform-independent; hence, the same code can be ported across a variety of FPGA families. It implements the sequence alignment algorithm at hand using a pipeline of basic processing elements, which are tailored to the algorithm parameters. This paper presents a number of optimizations built into the skeleton and applied at compile time depending on the user-supplied parameters. These result in high-performance FPGA implementations tailored to the algorithm at hand. For instance, actual hardware implementations of the Smith-Waterman algorithm for protein sequence alignment achieve speedups of two orders of magnitude compared to equivalent standard desktop software implementations.
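For reference, the local-alignment recurrence that each processing element of such a skeleton evaluates is the Smith-Waterman recurrence; a minimal software version follows (the scoring parameters are assumed examples, not the skeleton's defaults).

```python
# Minimal Smith-Waterman (local alignment) sketch, mirroring the recurrence a
# pipeline of FPGA processing elements evaluates; scores are assumed examples.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align / substitute
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))
```

In hardware, one processing element is typically assigned per column of the matrix so that an entire anti-diagonal is computed each cycle, which is where the two-orders-of-magnitude speedup over software comes from.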

128 citations


Journal ArticleDOI
TL;DR: This paper presents a high-throughput decoder architecture for generic quasi-cyclic low-density parity-check (QC-LDPC) codes, and an approximate layered decoding approach is explored to reduce the critical path of the layered LDPC decoder.
Abstract: This paper presents a high-throughput decoder architecture for generic quasi-cyclic low-density parity-check (QC-LDPC) codes. Various optimizations are employed to increase the clock speed. A row permutation scheme is proposed to significantly simplify the implementation of the shuffle network in the LDPC decoder. An approximate layered decoding approach is explored to reduce the critical path of the layered LDPC decoder. The computation core is further optimized to reduce the computation delay. It is estimated that a decoding throughput of 4.7 Gb/s can be achieved at 15 iterations using current technology.

125 citations


Journal ArticleDOI
TL;DR: This work uses checkpointing with rollback recovery and active replication for tolerating transient faults, and presents several design optimization approaches which are able to find fault-tolerant implementations given a limited amount of resources.
Abstract: We present an approach to the synthesis of fault-tolerant hard real-time systems for safety-critical applications. We use checkpointing with rollback recovery and active replication for tolerating transient faults. Processes and communications are statically scheduled. Our synthesis approach decides the assignment of fault-tolerance policies to processes, the optimal placement of checkpoints and the mapping of processes to processors such that multiple transient faults are tolerated and the timing constraints of the application are satisfied. We present several design optimization approaches which are able to find fault-tolerant implementations given a limited amount of resources. The developed algorithms are evaluated using extensive experiments, including a real-life example.
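A standard way to reason about optimal checkpoint placement (the paper's exact cost model may differ) is the worst-case schedule length with n equidistant checkpoints: E(n) = C + nO + k(C/n + R) for k tolerated faults, checkpointing overhead O per checkpoint, and recovery overhead R per fault, which is minimized near n* = sqrt(kC/O). A minimal Python sketch under those assumptions:

```python
import math

# Hedged sketch of a standard worst-case checkpointing model; the paper's
# synthesis uses its own cost model, which is not reproduced here.

def worst_case_length(c, o, r, k, n):
    """WCET c, checkpoint overhead o, recovery overhead r, k faults, n checkpoints."""
    return c + n * o + k * (c / n + r)

def optimal_checkpoints(c, o, k):
    n_star = math.sqrt(k * c / o)                        # continuous optimum
    lo, hi = max(1, math.floor(n_star)), math.ceil(n_star)
    return min((lo, hi), key=lambda n: worst_case_length(c, o, 0, k, n))

c, o, r, k = 100.0, 5.0, 10.0, 2   # assumed example numbers (e.g., ms)
n = optimal_checkpoints(c, o, k)
print(n, worst_case_length(c, o, r, k, n))               # -> 6 checkpoints
```

More checkpoints shrink the re-execution lost per fault but add fixed overhead, so the optimum balances the two; the paper's synthesis additionally trades this off against replication and mapping decisions.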

124 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the proposed codes outperform other existing coding schemes in making NOC fabrics reliable and energy efficient, with lower latency.
Abstract: Network-on-chip (NOC) is emerging as a revolutionary methodology to integrate numerous intellectual property blocks in a single die. It is the packet-switching-based communications backbone that interconnects the components on a multicore system-on-chip (SoC). A major challenge that NOC design is expected to face is related to the intrinsic unreliability of the interconnect infrastructure under technology limitations. By incorporating error control coding schemes along the interconnects, NOC architectures are able to provide correct functionality in the presence of different sources of transient noise and yet have lower overall energy dissipation. In this paper, designs of novel joint crosstalk avoidance and triple-error-correction/quadruple-error-detection codes are proposed, and their performance is evaluated in different NOC fabrics. It is demonstrated that the proposed codes outperform other existing coding schemes in making NOC fabrics reliable and energy efficient, with lower latency.

112 citations


Journal ArticleDOI
TL;DR: This paper identifies and defines a new class of error-correcting codes whose redundancy makes the design of fault-secure detectors (FSD) particularly simple, and quantifies the importance of protecting encoder and decoder circuitry against transient errors.
Abstract: Memory cells have been protected from soft errors for more than a decade; due to the increase in soft error rate in logic circuits, the encoder and decoder circuitry around the memory blocks have become susceptible to soft errors as well and must also be protected. We introduce a new approach to design fault-secure encoder and decoder circuitry for memory designs. The key novel contribution of this paper is identifying and defining a new class of error-correcting codes whose redundancy makes the design of fault-secure detectors (FSD) particularly simple. We further quantify the importance of protecting encoder and decoder circuitry against transient errors, illustrating a scenario where the system failure rate (FIT) is dominated by the failure rate of the encoder and decoder. We prove that Euclidean geometry low-density parity-check (EG-LDPC) codes have the fault-secure detector capability. Using some of the smaller EG-LDPC codes, we can tolerate bit or nanowire defect rates of 10% and fault rates of 10⁻¹⁸ upsets/device/cycle, achieving a FIT rate at or below one for the entire memory system and a memory density of 10¹¹ bit/cm² with a nanowire pitch of 10 nm for memory blocks of 10 Mb or larger. Larger EG-LDPC codes can achieve even higher reliability and lower area overhead.
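For intuition, a detector declares a word fault-free when its syndrome with the parity-check matrix is zero; the toy Python sketch below shows that operation only, using a (7,4) Hamming check matrix as a stand-in, since the EG-LDPC matrices themselves are not reproduced here.

```python
import numpy as np

# Illustrative syndrome-based detection: a valid codeword c satisfies
# H @ c = 0 (mod 2). The paper's point is that EG-LDPC codes' sparse,
# redundant parity checks let each check be an isolated XOR tree, so a single
# transient fault in the detector cannot mask an error (the fault-secure
# property). The small H below is a toy Hamming check matrix, NOT EG-LDPC.

H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

def detect(word):
    syndrome = H @ word % 2        # each row = one XOR-tree parity check
    return bool(syndrome.any())    # any violated check flags an error

clean = np.zeros(7, dtype=np.uint8)    # the all-zero word is a codeword
faulty = clean.copy(); faulty[2] ^= 1  # inject a single-bit upset
print(detect(clean), detect(faulty))   # -> False True
```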

Journal ArticleDOI
TL;DR: This work presents guidelines for the CODEC design of the "forbidden pattern free crosstalk avoidance code" (FPF-CAC), and shows that, mathematically, a mapping scheme exists based on the representation of numbers in the Fibonacci numeral system.
Abstract: Interconnect delay has become a limiting factor for circuit performance in deep sub-micrometer designs. As the crosstalk in an on-chip bus is highly dependent on the data patterns transmitted on the bus, different crosstalk avoidance coding schemes have been proposed to boost the bus speed and/or reduce the overall energy consumption. Despite the availability of the codes, no systematic mapping of data words to codewords has been proposed for CODEC design. This is mainly due to the nonlinear nature of the crosstalk avoidance codes (CAC). The lack of practical CODEC construction schemes has hampered the use of such codes in practical designs. This work presents guidelines for the CODEC design of the "forbidden pattern free crosstalk avoidance code" (FPF-CAC). We analyze the properties of the FPF-CAC and show that, mathematically, a mapping scheme exists based on the representation of numbers in the Fibonacci numeral system. Our first proposed CODEC design offers near-optimal area overhead. An improved version of the CODEC is then presented, which achieves the theoretical optimum. We also investigate the implementation details of the CODECs, including design complexity and speed. Optimization schemes are provided to reduce the size of the CODEC and improve its speed.
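The core primitive is expressing a value in the Fibonacci numeral system. A minimal Python sketch of the greedy (Zeckendorf) expansion follows; note that the greedy expansion avoids adjacent 1s but is not by itself forbidden-pattern free, and the paper's CODEC applies a different digit-selection rule on top of this number system to guarantee FPF codewords.

```python
# Sketch of the Fibonacci numeral system primitive underlying the FPF-CAC
# encoder. The greedy (Zeckendorf) expansion shown here is only the starting
# point; the paper's actual FPF encoder chooses among the multiple valid
# Fibonacci representations of a value to avoid the 010/101 patterns.

def fib_weights(n_digits):
    w = [1, 2]
    while len(w) < n_digits:
        w.append(w[-1] + w[-2])          # 1, 2, 3, 5, 8, ...
    return w[:n_digits][::-1]            # most-significant weight first

def to_fibonacci(value, n_digits):
    digits = []
    for w in fib_weights(n_digits):
        if value >= w:
            digits.append(1)
            value -= w
        else:
            digits.append(0)
    assert value == 0, "value out of range for this many digits"
    return digits

for v in range(8):
    print(v, to_fibonacci(v, 4))         # 4 Fibonacci digits cover 0..11
```

Because an n-digit Fibonacci number system covers roughly phi^n values instead of 2^n, the bus needs extra wires; the paper's contribution is making that mapping systematic and near-optimal in area.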

Journal ArticleDOI
TL;DR: This paper proposes solutions that maximize the yield of wafer-to-wafer 3-D integration, assuming that the individual die can be tested on the wafers before bonding, and demonstrates that it is possible to significantly improve the yield in comparison to yield-oblivious wafer assignment methods.
Abstract: Three-dimensional integrated circuit technology with through-silicon vias offers many advantages, including improved form factor, increased circuit performance, robust heterogeneous integration, and reduced costs. Wafer-to-wafer integration supports the highest possible density of through-silicon vias and the highest throughput; however, in contrast to die-to-wafer integration, it does not benefit from the ability to bond only tested and diced good dies. In wafer-to-wafer integration, wafers are bonded together in their entirety, which can unintentionally integrate a bad die from one wafer with a good die from another wafer, reducing the yield. In this paper, we propose solutions that maximize the yield of wafer-to-wafer 3-D integration, assuming that the individual dies can be tested on the wafers before bonding. We exploit some of the available flexibility in the integration process and propose wafer assignment algorithms that maximize the number of good 3-D ICs. Our algorithms range from scalable, fast heuristics to optimal methods that exactly maximize the yield of wafer-to-wafer 3-D integration. Using realistic defect models and yield simulations, we demonstrate the effectiveness of our methods up to large numbers of wafer stacks. Our results demonstrate that it is possible to significantly improve the yield in comparison to yield-oblivious wafer assignment methods.
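To make the matching problem concrete, here is a hedged Python sketch of a greedy wafer-assignment heuristic in the spirit of the paper's fast heuristics (the paper's exact optimal methods are not reproduced); the wafer maps and lot sizes are invented for illustration.

```python
# Greedy wafer-to-wafer matching sketch: a stacked die is good only if the
# die at that site is good on every bonded wafer. Wafer maps are tuples of
# 1/0 flags (1 = good die); all data below is invented.

def aligned_yield(a, b):
    """Good 3-D stacks when wafers a and b are bonded site-by-site."""
    return sum(x & y for x, y in zip(a, b))

def greedy_match(lot_a, lot_b):
    """Pair each wafer of lot_a with the remaining lot_b wafer that
    maximizes the number of good stacked dies."""
    pairs, remaining = [], list(lot_b)
    for a in lot_a:
        best = max(remaining, key=lambda b: aligned_yield(a, b))
        remaining.remove(best)
        pairs.append((a, best))
    return pairs

# Two toy lots of three wafers with six die sites each (assumed maps).
lot_a = [(1, 1, 0, 1, 0, 1), (1, 0, 1, 1, 1, 0), (0, 1, 1, 0, 1, 1)]
lot_b = [(1, 1, 1, 0, 0, 1), (0, 1, 1, 1, 1, 0), (1, 0, 0, 1, 1, 1)]
for a, b in greedy_match(lot_a, lot_b):
    print(a, "+", b, "->", aligned_yield(a, b), "good stacked dies")
```

For two-wafer stacks this is a weighted bipartite matching problem, which is solvable exactly (e.g., by the Hungarian algorithm); the greedy pass above trades optimality for speed, mirroring the heuristic end of the paper's spectrum.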

Journal ArticleDOI
TL;DR: An analytical model for critical charge is presented in order to assess the soft error vulnerability of a 6T SRAM cell; it can serve as a tool to optimize the hibernation voltage of low-power SRAMs or the size of the MIM capacitor per cell in order to achieve a target soft error robustness.
Abstract: Scaling of transistor sizes into the nanometer regime, coupled with the reduction of supply voltage, has made SRAMs more vulnerable to soft errors than ever before. The vulnerability has been accentuated by increased variability in device parameters. In this paper, we present an analytical model for critical charge in order to assess the soft error vulnerability of a 6T SRAM cell. The model takes into account the dynamic behavior of the cell and demonstrates a simple technique to decouple the nonlinearly coupled storage nodes. Decoupling the storage nodes enables solving the associated current equations to determine the critical charge for an exponential noise current. The critical charge model thus developed incorporates both NMOS and PMOS transistor parameters. Consequently, the model can estimate critical charge variations due to variability of transistor parameters and manufacturing defects, such as resistive contacts and vias. In addition, the model can serve as a tool to optimize the hibernation voltage of low-power SRAMs or the size of the MIM capacitor per cell in order to achieve a target soft error robustness. The critical charge calculated by the model is in good agreement with SPICE simulations for a commercial 90-nm CMOS process, with a maximum discrepancy of less than 5%.
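For reference, a minimal sketch of the exponential noise-current model named in the abstract, together with a generic threshold-crossing definition of critical charge; the paper's actual derivation (which decouples the cross-coupled storage nodes) is not reproduced here, and the symbols are generic.

```latex
% Exponential noise current injecting total charge Q with time constant tau:
\[
  I_{\text{noise}}(t) = \frac{Q}{\tau}\, e^{-t/\tau},
  \qquad \int_0^{\infty} I_{\text{noise}}(t)\,\mathrm{d}t = Q .
\]
% Critical charge: the smallest injected charge that drives the struck
% storage node past the switching threshold of the opposite inverter,
% flipping the cell:
\[
  Q_{\text{crit}} = \min\bigl\{\, Q : \exists\, t^{*}>0,\;
      V_{\text{node}}(t^{*};Q) \text{ crosses } V_{\text{switch}} \,\bigr\}.
\]
```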

Journal ArticleDOI
TL;DR: The novelty of the approach lies in the fact that it performs interpolation efficiently, without the need to perform multiplication or division, and the method performs both the log() and antilog() operation using the same hardware architecture.
Abstract: The realization of functions such as log() and antilog() in hardware is of considerable relevance, due to their importance in several computing applications. In this paper, we present an approach to compute log() and antilog() in hardware. Our approach is based on a table lookup, followed by an interpolation step. The interpolation step is implemented in combinational logic, in a field-programmable gate array (FPGA), resulting in an area-efficient, fast design. The novelty of our approach lies in the fact that we perform interpolation efficiently, without the need to perform multiplication or division, and our method performs both the log() and antilog() operations using the same hardware architecture. We compare our work with existing methods and show that our approach results in significantly lower memory resource utilization for the same approximation errors. Our method also scales very well with an increase in the required accuracy, compared to existing techniques.
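To convey the flavor of multiplier-free log computation (this is not the paper's architecture), the Python sketch below combines Mitchell's approximation log2(1+m) ≈ m with a small additive correction table indexed by the top mantissa bits; the only datapath operations are a lookup and additions.

```python
import math

# Hedged illustration of table-lookup log2 without multiply/divide:
# Mitchell's approximation plus an additive correction table. This mirrors
# the lookup-plus-cheap-interpolation flavor of the paper's design only.

TABLE_BITS = 4
# Correction table: Mitchell's error sampled at segment midpoints.
CORR = [math.log2(1 + (i + 0.5) / 2**TABLE_BITS) - (i + 0.5) / 2**TABLE_BITS
        for i in range(2**TABLE_BITS)]

def log2_approx(x):
    assert x > 0
    e = x.bit_length() - 1            # integer exponent: floor(log2(x))
    m = x / (1 << e) - 1.0            # mantissa fraction in [0, 1)
    idx = int(m * 2**TABLE_BITS)      # top mantissa bits (pure wiring in HW)
    return e + m + CORR[idx]          # adds only -- no multiply or divide

for x in (3, 10, 1000):
    print(x, round(log2_approx(x), 4), round(math.log2(x), 4))
```

With a 16-entry table the result is accurate to a few thousandths of a bit on these examples; the paper's interpolation scheme achieves its accuracy/memory tradeoff differently, but the lookup-and-add structure is the shared idea.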

Journal ArticleDOI
TL;DR: A 7-GHz CMOS voltage-controlled ring oscillator that employs a multiloop technique for frequency boosting is presented, which permits lower tuning gain through the use of coarse/fine frequency control.
Abstract: A 7-GHz CMOS voltage-controlled ring oscillator that employs a multiloop technique for frequency boosting is presented in this paper. The circuit permits lower tuning gain through the use of coarse/fine frequency control. The lower tuning gain also translates into a lower sensitivity to the voltage at the control lines. Fabricated in a standard 0.13-µm CMOS process, the proposed voltage-controlled ring oscillator exhibits a low phase noise of -103.4 dBc/Hz at 1-MHz offset from the center frequency of 7.64 GHz, while consuming a current of 40 mA excluding the buffer.

Journal ArticleDOI
TL;DR: Two static random access memory (SRAM) cells that reduce the static power dissipation due to gate and subthreshold leakage currents are presented.
Abstract: In this paper, two static random access memory (SRAM) cells that reduce the static power dissipation due to gate and subthreshold leakage currents are presented. The first cell structure results in reduced gate voltages for the NMOS pass transistors, and thus lowers the gate leakage current. It reduces the subthreshold leakage current by increasing the ground level during the idle (inactive) mode. The second cell structure makes use of PMOS pass transistors to lower the gate leakage current. In addition, dual threshold voltage technology with forward body biasing is utilized with this structure to reduce the subthreshold leakage while maintaining performance. Compared to a conventional SRAM cell, the first cell structure decreases the total gate leakage current by 66% and the idle power by 58% and increases the access time by approximately 2%, while the second cell structure reduces the total gate leakage current by 27% and the idle power by 37% with no access time degradation.

Journal ArticleDOI
TL;DR: This paper presents a low-power, high-speed source-synchronous link transceiver which enables a 3.3× reduction in link power together with an 80% increase in data rate.
Abstract: Networks on chips (NoCs) are becoming popular as they provide a solution for the interconnection problems on large integrated circuits (ICs). But even in a NoC, link power can become unacceptably high and data rates are limited when conventional data transceivers are used. In this paper, we present a low-power, high-speed source-synchronous link transceiver which enables a 3.3× reduction in link power together with an 80% increase in data rate. A low-swing capacitive pre-emphasis transmitter in combination with a double-tail sense-amplifier enables speeds in excess of 9 Gb/s over a 2-mm twisted differential interconnect, while consuming only 130 fJ/transition, without the need for an additional supply. Multiple transceivers can be connected back-to-back to create a source-synchronous transceiver chain with a wave-pipelined clock, operating with 6σ offset reliability at 5 Gb/s.

Journal ArticleDOI
TL;DR: In this paper, the authors propose to achieve fault tolerance by employing redundancy at the core level instead of at the microarchitecture level, which not only maximizes the performance of the on-chip communication scheme, but also provides a unified topology to the operating system and application software running on the processor.
Abstract: Homogeneous manycore systems are emerging for tera-scale computation and typically utilize network-on-chip (NoC) as the communication scheme between embedded cores. Effective defect tolerance techniques are essential to improve the yield of such complex integrated circuits. We propose to achieve fault tolerance by employing redundancy at the core level instead of at the microarchitecture level. When faulty cores exist on-chip in this architecture, however, the physical topologies of various manufactured chips can be significantly different. How to reconfigure the system with the most effective NoC topology is a relevant research problem. In this paper, we first show that this problem is an instance of a well-known NP-complete problem. We then present novel solutions to this problem, which not only maximize the performance of the on-chip communication scheme but also provide a unified topology to the operating system and application software running on the processor. Experimental results show the effectiveness of the proposed techniques.

Journal ArticleDOI
TL;DR: The virtual embedded block scheme is proposed to model embedded blocks using existing field-programmable gate array (FPGA) tools; on selected benchmarks, the proposed architecture achieves a fourfold improvement in speed and a 25-fold reduction in area compared with a traditional FPGA device.
Abstract: This paper presents an architecture for a reconfigurable device that is specifically optimized for floating-point applications. Fine-grained units are used for implementing control logic and bit-oriented operations, while parameterized and reconfigurable word-based coarse-grained units incorporating word-oriented lookup tables and floating-point operations are used to implement datapaths. In order to facilitate comparison with existing FPGA devices, the virtual embedded block scheme is proposed to model embedded blocks using existing field-programmable gate array (FPGA) tools. This methodology involves adopting existing FPGA resources to model the size, position, and delay of the embedded elements. The standard design flow offered by FPGA and computer-aided design vendors is then applied, and static timing analysis can be used to estimate the performance of the FPGA with the embedded blocks. On selected floating-point benchmark circuits, our results indicate that the proposed architecture can achieve a fourfold improvement in speed and a 25-fold reduction in area compared with a traditional FPGA device.

Journal ArticleDOI
TL;DR: A probabilistic error model based on Bayesian networks is proposed to estimate the overall output error probability given dynamic error probabilities in each device, since this estimate is crucial for nano-domain circuit designers to be able to compare and rank designs based on the expected output error.
Abstract: In nano-domain logic circuits, errors generated are transient in nature and arise due to the uncertainty or unreliability of the computing element itself. This type of error, which we refer to as a dynamic error, is to be distinguished from traditional faults and radiation-related errors. Due to these highly likely dynamic errors, it is more appropriate to model nano-domain computing as probabilistic rather than deterministic. We propose a probabilistic error model based on Bayesian networks to estimate the expected output error probability given dynamic error probabilities in each device, since this estimate is crucial for nano-domain circuit designers to be able to compare and rank designs based on the expected output error. We estimate the overall output error probability by comparing the outputs of a dynamic error-encoded model with an ideal logic model. We prove that this probabilistic framework is a compact and minimal representation of the overall effect of dynamic errors in a circuit. We use both exact and approximate Bayesian inference schemes for the propagation of probabilities. The exact inference shows better time performance than the state of the art by exploiting conditional independencies exhibited in the underlying probabilistic framework. However, exact inference is worst-case NP-hard and can handle only small circuits. Hence, we use two approximate inference schemes for medium-size benchmarks. We demonstrate the efficiency and accuracy of these approximate inference schemes by comparing estimated results with logic simulation results. We have performed our experiments on LGSynth'93 and ISCAS'85 benchmark circuits. We exploit our probabilistic model to: 1) calculate the error sensitivity of individual gates in a circuit; 2) compute overall exact error probabilities for small circuits; 3) compute approximate error probabilities for medium-sized benchmarks using two stochastic sampling schemes; 4) compare and vet designs with respect to dynamic errors; 5) characterize the input space for desired output characteristics by utilizing the unique backtracking capability of Bayesian networks (the inverse problem); and 6) apply selective redundancy to highly sensitive nodes for error-tolerant designs.
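The quantity being estimated can be stated exactly for a tiny circuit by brute force, which is what the Bayesian network makes tractable for larger ones. The Python sketch below computes the exact output error probability of an assumed three-NAND toy circuit in which each gate flips its output independently with probability eps.

```python
from itertools import product

# Exact brute-force version of the quantity the Bayesian network estimates:
# P(circuit output != ideal output) when every gate output flips
# independently with probability eps. Circuit and eps are assumed examples.

def nand(a, b): return 1 - (a & b)

def ideal(x1, x2, x3):
    g1 = nand(x1, x2)
    g2 = nand(g1, x3)
    return nand(g1, g2)

def output_error_prob(eps):
    total = 0.0
    for bits in product([0, 1], repeat=3):        # uniform inputs, prob 1/8 each
        for flips in product([0, 1], repeat=3):   # one flip flag per gate
            p = 1.0
            for f in flips:
                p *= eps if f else (1 - eps)
            g1 = nand(bits[0], bits[1]) ^ flips[0]
            g2 = nand(g1, bits[2]) ^ flips[1]
            out = nand(g1, g2) ^ flips[2]
            if out != ideal(*bits):
                total += p / 8.0
    return total

print(output_error_prob(0.05))
```

The enumeration is exponential in both gate and input count, which is exactly why the paper turns to Bayesian inference with exploited conditional independencies, and to stochastic sampling for the medium-sized benchmarks.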

Journal ArticleDOI
TL;DR: The speculative virtual channel router was shown to have an efficiency very similar to that of the wormhole router while providing better performance, supporting its use for general-purpose designs.
Abstract: In this paper, we explore the designs of a circuit-switched router, a wormhole router, a quality-of-service (QoS) supporting virtual channel router, and a speculative virtual channel router and accurately evaluate the energy-performance tradeoffs they offer. Power results from the designs, placed and routed in a 90-nm CMOS process, show that all the architectures dissipate significant idle-state power. The additional energy required to route a packet through the router is then shown to be dominated by the datapath. This leads to the key result that, if this trend continues, the use of more elaborate control can be justified and will not be immediately limited by the energy budget. A performance analysis also shows that dynamic resource allocation leads to the lowest network latencies, while static allocation may be used to meet QoS goals. Combining the power and performance figures then allows an energy-latency product to be calculated to judge the efficiency of each of the networks. The speculative virtual channel router was shown to have an efficiency very similar to that of the wormhole router while providing better performance, supporting its use for general-purpose designs. Finally, area metrics are also presented to allow a comparison of implementation costs.

Journal ArticleDOI
TL;DR: This paper proposes the region-based routing (RBR) mechanism, which groups destinations into network regions, allowing an efficient implementation with logic blocks, and shows that the number of entries in the routing table is significantly reduced, especially for large networks.
Abstract: An efficient routing algorithm is important for large on-chip networks [network-on-chip (NoC)] to provide the required communication performance to applications. Implementing a NoC using table-based switches provides many advantages, including the possibility of changing routing algorithms and fault tolerance, due to the option of table reconfiguration. However, table-based switches have been considered unsuitable for NoCs due to their perceived high area and power consumption. In this paper, we describe the region-based routing (RBR) mechanism, which groups destinations into network regions, allowing an efficient implementation with logic blocks. RBR can also be viewed as a mechanism to reduce the number of entries in routing tables. RBR is general and can be used in conjunction with any adaptive routing algorithm. In particular, we have evaluated the proposed scheme in conjunction with a general routing algorithm, namely segment-based routing (SR), and an application-specific routing algorithm (APSRA), using regular and irregular mesh topologies. Our study shows that the number of entries in the table is significantly reduced, especially for large networks. Evaluation results show that RBR requires only four regions to support several routing algorithms in a 2-D mesh with no performance degradation. Considering link failures, our results indicate that RBR combined with SR is able to tolerate up to 7 link failures in an 8×8 mesh. RBR also reduces the area and power dissipation of an equivalent table-based implementation by factors of 8 and 10, respectively. Moreover, the degradation in performance of the network is insignificant when using APSRA combined with RBR.
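The table-reduction idea can be sketched compactly: each switch stores a handful of region records (coordinate bounds plus an output port) instead of one entry per destination. The Python toy below assumes an 8×8 mesh, a switch at (4,4), and invented region/port assignments; real RBR derives the regions from the underlying routing algorithm.

```python
# Toy region-based routing lookup: destinations are covered by rectangular
# regions, each mapped to an output port. Regions/ports below are invented
# for a switch at (4,4) in an assumed 8x8 mesh.

REGIONS = [                     # (x_min, x_max, y_min, y_max) -> output port
    ((0, 3, 0, 7), "WEST"),
    ((5, 7, 0, 7), "EAST"),
    ((4, 4, 0, 3), "SOUTH"),
    ((4, 4, 5, 7), "NORTH"),
]

def route(dest, here=(4, 4)):
    if dest == here:
        return "LOCAL"
    x, y = dest
    for (x0, x1, y0, y1), port in REGIONS:   # a few comparators in hardware
        if x0 <= x <= x1 and y0 <= y <= y1:
            return port
    raise ValueError("destination not covered by any region")

print(route((6, 2)), route((4, 7)), route((1, 1)))  # EAST NORTH WEST
```

Four region records replace up to 63 per-destination table entries here, which is the kind of compression behind the reported 8× area and 10× power reductions.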

Journal ArticleDOI
TL;DR: A genetic algorithm-based automated design technique that synthesizes an application specific NoC topology and routes the communication traces on the interconnection network and generates a Pareto curve of the solution set.
Abstract: The network-on-chip (NoC) design problem requires the generation of a power- and resource-efficient interconnection architecture that can support the communication requirements of the system-on-chip (SoC) with the desired performance. This paper presents a genetic algorithm-based automated design technique that synthesizes an application-specific NoC topology and routes the communication traces on the interconnection network. The technique operates on the system-level floorplan of the SoC and accounts for the power consumption in the physical links and the routers. The design technique solves a multi-objective problem of minimizing the power consumption and the router resources. It generates a Pareto curve of the solution set, such that each point on the curve represents a tradeoff between power consumption and the associated number of NoC routers. The performance and quality of the solutions produced by the technique are evaluated by experimentation with benchmark applications and comparisons with existing approaches.

Journal ArticleDOI
TL;DR: Two algorithms are proposed that attempt to meet the communication requirements of an on-chip application using a minimum number of network resources, generating application-specific topologies and incorporating a form of predictive analysis.
Abstract: This paper presents a new approach to the design and analysis of NoC topologies which is based on the transaction-oriented communication methods of on-chip components. We propose two algorithms that attempt to meet the communication requirements of an on-chip application using a minimum number of network resources, by generating application-specific topologies. In addition, to aid the design process of complex systems, the design method incorporates a form of predictive analysis which can estimate the degree of contention in a given system without performing detailed simulation. This predictive analysis method is used to determine the minimum frequency of operation for generated topologies, and is incorporated into the topology generation process. The proposed design method was tested using real-world applications, including an MPEG4 decoder and a multi-window display application. The generated topologies were found to offer similar or better performance when compared with regular topologies. However, the topologies generated by our method were more economical, using, on average, half the network resources of regular topologies.

Journal ArticleDOI
TL;DR: A novel flow for parasitic- and process-variation-aware design of radio-frequency integrated circuits (RFICs) is presented; this is the first work focused on a current-starved VCO in which the combined effect of parasitics and process variations has been considered.
Abstract: This paper proposes a novel flow for parasitic- and process-variation-aware design of radio-frequency integrated circuits (RFICs). A nano-CMOS current-starved voltage-controlled oscillator (VCO) circuit has been designed using this flow as a case study. The oscillation frequency is considered as the optimization objective, with the area overhead as a constraint. Extensive Monte Carlo simulations have been carried out on the parasitic-extracted netlist of the VCO to study the effect of process variation on the oscillation frequency. In the design cycle, a performance degradation of 43.5% is observed when the parasitic-extracted netlist is subjected to worst-case process variation. The proposed design flow could bring the oscillation frequency to within 4.5% of the target, leading to convergence of the complete design in only one design iteration. To the best of the authors' knowledge, this paper presents the first work focused on a current-starved VCO in which the combined effect of parasitics and process variations has been considered.

Journal ArticleDOI
TL;DR: This work describes a flexible hardware processor for performing computationally expensive modular addition, subtraction, multiplication, and inversion over prime finite fields GF(p) .
Abstract: Exchange of private information over a public medium must incorporate a method for data protection against unauthorized access. Elliptic curve cryptography (ECC) has become widely accepted as an efficient mechanism to secure sensitive data. The main ECC computation is a scalar multiplication, translating into an appropriate sequence of point operations, each involving several modular arithmetic operations. We describe a flexible hardware processor for performing the computationally expensive modular addition, subtraction, multiplication, and inversion over prime finite fields GF(p). The proposed processor supports all five primes p recommended by NIST, whose sizes are 192, 224, 256, 384, and 521 bits. It can also be programmed to automatically execute sequences of modular arithmetic operations. Our field-programmable gate-array implementation runs at 60 MHz and takes between 4 and 40 ms (depending on the prime used) to perform a typical scalar multiplication.
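For reference, a software sketch of the GF(p) primitives such a processor implements, using the NIST P-192 prime; the hardware of course uses dedicated datapaths rather than bignum software, and the inversion algorithm here (extended Euclid) is a generic choice, not necessarily the paper's.

```python
# Software sketch of modular GF(p) arithmetic over the NIST P-192 prime.
# These are the primitive operations from which elliptic-curve point
# addition/doubling, and hence scalar multiplication, are composed.

P192 = 2**192 - 2**64 - 1          # NIST P-192 prime

def mod_add(a, b, p=P192): return (a + b) % p
def mod_sub(a, b, p=P192): return (a - b) % p
def mod_mul(a, b, p=P192): return (a * b) % p

def mod_inv(a, p=P192):
    """Modular inverse via the extended Euclidean algorithm."""
    t, new_t, r, new_r = 0, 1, p, a % p
    while new_r:
        q = r // new_r
        t, new_t = new_t, t - q * new_t
        r, new_r = new_r, r - q * new_r
    if r != 1:
        raise ValueError("a is not invertible mod p")
    return t % p

a = 0x123456789ABCDEF
assert mod_mul(a, mod_inv(a)) == 1   # sanity check: a * a^-1 = 1 (mod p)
```

Inversion is by far the most expensive of the four operations, which is why hardware designs like this one implement it as a dedicated, programmable sequence rather than leaving it to software.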

Journal ArticleDOI
TL;DR: A novel BIRA approach is proposed that builds a line-based searching tree and minimizes the storage required for faulty address information by dropping all unnecessary faulty addresses for inherently repairable dies.
Abstract: With the growth of memory capacity and density, test cost and yield improvement are becoming more important. In the case of embedded memories for systems-on-a-chip (SOC), built-in redundancy analysis (BIRA) is widely used as a solution to quality and yield issues, replacing faulty cells with extra good cells. However, previous BIRA approaches focused mainly on embedded memories rather than commodity memories. Many BIRA approaches require extra hardware overhead to achieve the optimal repair rate, which means that 100% solution detection is guaranteed for intrinsically repairable dies, or they suffer a loss of repair rate to minimize the hardware overhead. In order to achieve both low area overhead and an optimal repair rate, a novel BIRA approach is proposed that builds a line-based searching tree. The proposed BIRA minimizes the storage capacity required for faulty address information by dropping all unnecessary faulty addresses for inherently repairable dies. The proposed BIRA analyzes redundancies quickly and efficiently with an optimal repair rate by using a selected fail count comparison algorithm. Experimental results show that the proposed BIRA achieves an optimal repair rate, fast analysis speed, and nearly optimal repair solutions with a relatively small area overhead.

Journal ArticleDOI
TL;DR: A low-power structure called bypass zero, feed A directly (BZ-FAD) for shift-and-add multipliers is proposed, which considerably lowers the switching activity of conventional multipliers.
Abstract: In this paper, a low-power structure called bypass zero, feed A directly (BZ-FAD) for shift-and-add multipliers is proposed. The architecture considerably lowers the switching activity of conventional multipliers. The modifications to the multiplier, which multiplies A by B, include the removal of shifting of the B register, direct feeding of A to the adder, bypassing the adder whenever possible, using a ring counter instead of a binary counter, and removal of the partial product shift. The architecture makes use of a low-power ring counter proposed in this work. Simulation results for 32-bit radix-2 multipliers show that the BZ-FAD architecture lowers the total switching activity by up to 76% and the power consumption by up to 30% when compared to the conventional architecture. The proposed multiplier can be used for low-power applications where speed is not a primary design parameter.
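A behavioral Python sketch of the bypass idea follows: the bits of B are visited in place (as a ring counter would select them), A is fed to the adder aligned by the cycle index, and the adder is bypassed on zero bits. Hardware specifics (the ring counter itself, the partial-product register partitioning) are not modeled.

```python
# Behavioral sketch of the BZ-FAD bypass idea, not a gate-level model:
# B is never shifted, A is fed directly to the adder, and the adder is
# bypassed whenever the current bit of B is zero.

def bz_fad_multiply(a, b, width=32):
    product = 0
    adds = 0
    for i in range(width):          # a ring counter would select bit i of B
        if (b >> i) & 1:            # hot bit: enable the adder
            product += a << i       # A fed directly, aligned by the counter
            adds += 1
        # zero bit: adder bypassed, so the add path does not switch
    return product, adds

p, adds = bz_fad_multiply(51, 0x0F0F)
assert p == 51 * 0x0F0F
print(f"adds performed: {adds} of 32 cycles")   # -> 8 of 32
```

The power saving comes from the skipped cycles: for operands with many zero bits, most cycles perform no addition at all, which is where the reported switching-activity reduction originates.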

Journal ArticleDOI
TL;DR: In this article, the authors present a flexible multiprocessor platform for high-throughput turbo decoding using configurable application-specific instruction-set processors (ASIPs) combined with an efficient memory and communication interconnect scheme.
Abstract: Emerging digital communication applications and the underlying architectures encounter drastically increasing performance and flexibility requirements. In this paper, we present a novel flexible multiprocessor platform for high-throughput turbo decoding. The proposed platform enables exploiting all parallelism levels of turbo decoding applications to fulfill performance requirements. In order to fulfill flexibility requirements, the platform is structured around configurable application-specific instruction-set processors (ASIPs) combined with an efficient memory and communication interconnect scheme. The designed ASIP has a single-instruction multiple-data (SIMD) architecture with a specialized and extensible instruction set and a 6-stage pipeline. The attached memories and communication interfaces enable its integration into multiprocessor architectures. These multiprocessor architectures benefit from the shuffled decoding technique recently introduced in the turbo-decoding field to achieve higher throughput. The major characteristics of the proposed platform are its flexibility and scalability, which make it reusable for all simple and double binary turbo codes of existing and emerging standards. Results obtained for double binary WiMAX turbo codes demonstrate around 250 Mb/s throughput using a 16-ASIP multiprocessor architecture.