
Showing papers in "Iet Computers and Digital Techniques in 2007"


Journal ArticleDOI
TL;DR: A novel design is provided for the BCD-digit multiplier, which can serve as the key building block of a decimal multiplier, irrespective of the degree of parallelism, in semi- and fully parallel hardware decimal multiplication units.
Abstract: With the growing popularity of decimal computer arithmetic in scientific, commercial, financial and Internet-based applications, hardware realisation of decimal arithmetic algorithms is gaining more importance. Hardware decimal arithmetic units now serve as an integral part of some recently commercialised general purpose processors, where complex decimal arithmetic operations, such as multiplication, have been realised by rather slow iterative hardware algorithms. However, with the rapid advances in very large scale integration (VLSI) technology, semi- and fully parallel hardware decimal multiplication units are expected to evolve soon. The dominant representation for decimal digits is the binary-coded decimal (BCD) encoding. The BCD-digit multiplier can serve as the key building block of a decimal multiplier, irrespective of the degree of parallelism. A BCD-digit multiplier produces a two-BCD digit product from two input BCD digits. We provide a novel design for the latter, showing some advantages in BCD multiplier implementations.
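As a functional illustration of the building block described above (a behavioural model, not the paper's circuit design), a BCD-digit multiplier maps two BCD digits to a two-digit BCD product:

```python
def bcd_digit_multiply(a, b):
    """Multiply two BCD digits (0-9); return the two-BCD-digit product
    as a (tens, units) pair, each representable as a 4-bit nibble."""
    assert 0 <= a <= 9 and 0 <= b <= 9, "operands must be valid BCD digits"
    p = a * b                   # binary product, 0..81
    return p // 10, p % 10      # tens digit, units digit

# e.g. 7 x 9 = 63 -> tens digit 6, units digit 3
```

The hardware interest lies in computing this mapping without an actual binary multiply-and-divide, but the input/output behaviour is exactly this.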

73 citations


Journal ArticleDOI
TL;DR: The development of a very-large-scale integration architecture for a high-performance watermarking chip is presented which can perform both invisible robust and invisible fragile image watermarking in the spatial domain.
Abstract: Research in digital watermarking is mature. Several software implementations of watermarking algorithms are described in the literature, but few attempts have been made to describe hardware implementations. The ultimate objective of the research is to develop low-power, high-performance, real-time, reliable and secure watermarking systems, which can be achieved through hardware implementations. The development of a very-large-scale integration architecture for a high-performance watermarking chip is presented which can perform both invisible robust and invisible fragile image watermarking in the spatial domain. The watermarking architecture is prototyped in two ways: (i) by using a Xilinx field-programmable gate array and (ii) by building a custom integrated circuit. This prototype is the first watermarking chip with both invisible robust and invisible fragile watermarking capabilities.

68 citations


Journal ArticleDOI
TL;DR: Qualitative and quantitative comparisons indicate that the proposed multipliers compare favourably with the earlier solutions.
Abstract: A new modulo 2^n + 1 multiplier architecture is proposed for operands in the weighted representation. A new set of partial products is derived, and it is shown that all required correction factors can be merged into a single constant. It is also proposed that part of the correction factor be treated as a partial product, whereas the rest is handled by the final parallel adder. The proposed multipliers utilise a total of (n+1) partial products, each n bits wide, and are built using an inverted end-around-carry, carry-save adder tree and a final adder. Qualitative and quantitative area and delay comparisons indicate that the proposed multipliers compare favourably with earlier solutions.
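A behavioural reference for what such a multiplier computes; the partial-product set and correction-factor merging are hardware details not modelled here:

```python
def mul_mod_2n_plus_1(a, b, n):
    """Reference model of modulo-(2**n + 1) multiplication of weighted
    (ordinary binary) operands."""
    modulus = (1 << n) + 1
    assert 0 <= a < modulus and 0 <= b < modulus
    return (a * b) % modulus

# For n = 4 the modulus is 17; a hardware design would instead reduce
# (n+1) partial products with an inverted end-around-carry CSA tree.
```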

65 citations


Journal ArticleDOI
TL;DR: Two complementary approaches have been proposed to improve the efficiency of RNS based on triple moduli sets: enhancing multipliers modulo 2^n+1, which perform the most complex arithmetic operation, and overloading the binary channel in order to obtain a more balanced moduli set.
Abstract: Residue number systems (RNS) are non-weighted systems that allow addition, subtraction and multiplication to be performed concurrently and independently on each residue. The triple moduli set {2^n−1, 2^n, 2^n+1} and its extensions have gained unprecedented importance in RNS, mainly because of the simplicity of the arithmetic units for the individual channels and of the converters to and from RNS. However, there is neither a perfect balance between the various elements of this moduli set nor an exact equivalence in the complexity of the individual arithmetic units for each residue. Two complementary approaches have been proposed to improve the efficiency of RNS based on this type of moduli set: enhancing multipliers modulo 2^n+1, which perform the most complex arithmetic operation, and overloading the binary channel in order to obtain a more balanced moduli set. Experimental results show that, when applied together, these techniques can improve the efficiency of the multipliers by up to 32%.
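A small sketch of the channel-wise arithmetic that makes the {2^n−1, 2^n, 2^n+1} moduli set attractive; Chinese Remainder Theorem reconstruction stands in for the output converter (requires Python 3.8+ for the modular inverse via `pow`):

```python
from math import prod

def to_rns(x, moduli):
    return [x % m for m in moduli]

def from_rns(residues, moduli):
    """Chinese Remainder Theorem reconstruction (output-converter model)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)    # modular inverse of Mi mod m
    return x % M

n = 4
moduli = [2**n - 1, 2**n, 2**n + 1]     # {15, 16, 17}, pairwise coprime
a, b = 100, 23
prod_residues = [(x * y) % m            # independent per-channel multiplies
                 for x, y, m in zip(to_rns(a, moduli), to_rns(b, moduli), moduli)]
assert from_rns(prod_residues, moduli) == a * b   # 2300 < 15*16*17 = 4080
```

Each channel's multiply is small and independent, which is exactly the parallelism the abstract refers to; the imbalance is that the 2^n+1 channel needs one extra bit and more complex logic.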

58 citations


Journal ArticleDOI
TL;DR: This work provides basic results and motivation for continued study of the direct synthesis of NCV circuits, and establishes relations between function realizations in different circuit cost metrics.
Abstract: A breadth-first search method for determining optimal three-qubit circuits composed of quantum NOT, CNOT, controlled-V and controlled-V+ (NCV) gates is introduced. Results are presented for simple gate count and for technology-motivated cost metrics. The optimal NCV circuits are also compared with NCV circuits derived from optimal NOT, CNOT and Toffoli (NCT) gate circuits. This work provides basic results and motivation for continued study of the direct synthesis of NCV circuits, and establishes relations between function realizations in different circuit cost metrics.

48 citations


Journal ArticleDOI
TL;DR: Comparisons with Gaussian specific generators show that the new architecture uses less than half the resources, provides a higher sample rate, and retains statistical quality for up to 50 billion samples, but can also generate other distributions.
Abstract: A hardware architecture for non-uniform random number generation, which allows the generator's distribution to be modified at run-time without reconfiguration is presented. The architecture is based on a piecewise linear approximation, using just one table lookup, one comparison and one subtract operation to map from a uniform source to an arbitrary non-uniform distribution, resulting in very low area utilisation and high speeds. Customisation of the distribution is fully automatic, requiring less than a second of CPU time to approximate a new distribution, and typically around 1000 cycles to switch distributions at run-time. Comparison with Gaussian-specific generators shows that the new architecture uses less than half the resources, provides a higher sample rate and retains statistical quality for up to 50 billion samples, but can also generate other distributions. When higher statistical quality is required and multiple samples are required per cycle, a two-level piecewise generator can be used, reducing the RAM required per generated sample while retaining the simplicity and speed of the basic technique.
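A software sketch of the piecewise-linear idea: the high bits of a uniform sample index a table of linear segments of the target distribution's inverse CDF, and the low bits interpolate within the segment. The table layout here is a hypothetical simplification of the hardware datapath, but it shows why switching distributions only requires regenerating the table:

```python
import math

def make_table(inv_cdf, segments=256):
    """Tabulate linear segments of the target inverse CDF; regenerating
    this table is all that is needed to switch distributions."""
    xs = [i / segments for i in range(segments + 1)]
    return [(inv_cdf(xs[i]), inv_cdf(xs[i + 1]) - inv_cdf(xs[i]))
            for i in range(segments)]

def sample(table, u):
    """Map one uniform sample u in [0, 1) to the target distribution."""
    idx = int(u * len(table))        # table lookup via the high bits
    base, span = table[idx]
    frac = u * len(table) - idx      # low bits interpolate in-segment
    return base + span * frac

# Example: approximate an exponential distribution (clipped to avoid log(0))
exp_table = make_table(lambda u: -math.log(1 - 0.999 * u))
```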

45 citations


Journal ArticleDOI
TL;DR: Key features such as assertion threading, activity monitors, assertion and cover counters and completion mode assertions are presented to provide better and more diversified ways to achieve visibility within the assertion circuits, which, in turn, lead to more efficient circuit debugging.
Abstract: Although assertions are a great tool for aiding debugging in the design and implementation verification stages, their use in silicon debug has been limited so far. A set of techniques for debugging with the assertions in either pre-silicon or post-silicon scenarios are discussed. Presented are features such as assertion threading, activity monitors, assertion and cover counters and completion mode assertions. The common goal of these checker enhancements is to provide better and more diversified ways to achieve visibility within the assertion circuits, which, in turn, lead to more efficient circuit debugging. Experimental results show that such modifications can be done with modest checker hardware overhead.

32 citations


Journal ArticleDOI
TL;DR: A field programmable gate array (FPGA)-based implementation of a physical random number generator (PRNG) that can be implemented completely in digital technology, requires no external components, is very small in area, achieves very high throughput and has good statistical properties.
Abstract: A field programmable gate array (FPGA)-based implementation of a physical random number generator (PRNG) is presented. The PRNG uses an alternating step generator construction to decorrelate an oscillator-phase-noise-based physical random source. The resulting design can be implemented completely in digital technology, requires no external components, is very small in area, achieves very high throughput and has good statistical properties. The PRNG was implemented on an FPGA device and tested using the NIST, Diehard and TestU01 random number test suites.
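The classic alternating step generator construction can be sketched in software; here ordinary LFSRs stand in for the oscillator-phase-noise source the paper decorrelates:

```python
class LFSR:
    """Fibonacci LFSR; `taps` are the bit positions XORed into the feedback."""
    def __init__(self, taps, state):
        self.taps, self.state = taps, state
        self.width = max(taps) + 1
    def step(self):
        fb = 0
        for t in self.taps:
            fb ^= (self.state >> t) & 1
        out = self.state & 1
        self.state = (self.state >> 1) | (fb << (self.width - 1))
        return out
    def output(self):
        return self.state & 1

def asg_bit(ctrl, gen_a, gen_b):
    """Alternating step generator: the control LFSR selects which of the
    two generator LFSRs advances; the output XORs their output bits."""
    if ctrl.step():
        gen_a.step()
    else:
        gen_b.step()
    return gen_a.output() ^ gen_b.output()

ctrl  = LFSR([0, 1], 0b10)       # 2-bit control register
gen_a = LFSR([0, 2], 0b101)      # 3-bit generator
gen_b = LFSR([0, 3], 0b1001)     # 4-bit generator
bits = [asg_bit(ctrl, gen_a, gen_b) for _ in range(16)]
```

The irregular clocking is what decorrelates the source: an observer of the output cannot tell which generator advanced on a given step.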

31 citations


Journal ArticleDOI
TL;DR: The results of the analysis indicate that TDIM is the most efficient of the three circuits analysed; this method has been incorporated in a high-resolution time measurement system in the sub-picosecond range and has subsequently been fabricated by Sun Microsystems.
Abstract: An increasingly important issue in the implementation of high-performance circuits using either System-on-Chip or System-in-Package technology is ensuring correct timing performance at the input/output interfaces of cores or chips. These interfaces are not accessible to conventional Automatic Test Equipment (ATE). Moreover, even if these nodes were accessible, the ATE's limited measurement accuracy would necessitate tight guard bands, adversely impacting yield. To address this issue of internal time parameter measurement, the circuitry normally resident in the ATE to perform the measurements is incorporated into the design itself. This paper is a case study of three time measurement techniques potentially suitable for circuit integration, namely the Time Difference Measurement (TDM), Successive Approximation Time Measurement (SATM) and Time Delay Interpolation Measurement (TDIM) methods. The techniques are analysed and compared for a number of design parameters, such as area overhead, ease of calibration, timing resolution, and robustness to process, temperature and supply voltage variations. The results of the analysis indicate that TDIM is the most efficient of the three circuits analysed; this method has been incorporated in a high-resolution time measurement system in the sub-picosecond range and has subsequently been fabricated by Sun Microsystems.

30 citations


Journal ArticleDOI
TL;DR: An efficient design methodology and a systematic approach for the implementation of multiplication and squaring functions for unsigned large integers, using small-size embedded multipliers are presented and a set of equations is derived to aid in the realisation.
Abstract: An efficient design methodology and a systematic approach for the implementation of multiplication and squaring functions for unsigned large integers, using small-size embedded multipliers are presented. A general architecture of the multiplier and squarer is proposed and a set of equations is derived to aid in the realisation. The inputs of the multiplier and squarer are split into several segments leading to an efficient utilisation of the small-size embedded multipliers and a reduced number of required addition operations. Various benchmarks were tested for different segments ranging from 2 to 5 targeting Xilinx Spartan-3 FPGAs. The synthesis was performed with the aid of the Xilinx ISE 7.1 XST tool. The approach was compared with the traditional technique using the same tool. The results illustrate that the design approach is very efficient in terms of both timing and area savings. Combinational delay is reduced by an average of 7.71% for the multiplier and 21.73% for the squarer. In terms of 4-input look-up tables, area is lowered by an average of 11.63% for the multiplier and 52.22% for the squarer. In the case of the multiplier, both approaches use the same number of embedded multipliers. For the squarer, the proposed approach reduces the number of required embedded multipliers by an average of 32.77% compared with the traditional technique.
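The segmentation idea can be sketched behaviourally. The 17-bit unsigned segment width below is my assumption (an 18×18 signed embedded multiplier handles 17 unsigned bits per operand); the paper's exact segment equations are not reproduced:

```python
def segmented_multiply(a, b, seg_bits=17, segments=3):
    """Multiply large unsigned integers using only small seg_bits-wide
    multiplications, modelling embedded-multiplier usage."""
    mask = (1 << seg_bits) - 1
    A = [(a >> (i * seg_bits)) & mask for i in range(segments)]
    B = [(b >> (i * seg_bits)) & mask for i in range(segments)]
    acc = 0
    for i in range(segments):
        for j in range(segments):
            acc += (A[i] * B[j]) << ((i + j) * seg_bits)  # small multiply, shift, add
    return acc
```

For squaring, A[i]*A[j] equals A[j]*A[i], so nearly half of the small multiplications can be shared, which is the intuition behind the squarer's large embedded-multiplier savings.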

24 citations


Journal ArticleDOI
TL;DR: A double rotation CORDIC algorithm with an efficient strategy to predict the rotation direction is proposed for a high-speed sine and cosine generator and complex multiplier and results show that the computation time can be improved and the overall power consumption reduced.
Abstract: Coordinate rotation digital computer (CORDIC) is a well-known algorithm using simple adders and shifters to evaluate various elementary functions. A double rotation CORDIC algorithm with an efficient strategy to predict the rotation direction is proposed for a high-speed sine and cosine generator and complex multiplier. Simulation results show that the computation time can be improved by 37.2%, 42.67% and 46.04% for 16-bit, 32-bit and 64-bit operands, respectively. In addition, the overall power consumption per CORDIC arithmetic computation can be improved by 21.2% and 38.5% for 32-bit and 64-bit operands, respectively. Thus, the proposed double rotation CORDIC processor is suitable for high-speed applications.
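For reference, the conventional single-rotation CORDIC in rotation mode is sketched below; the paper's double-rotation variant with direction prediction modifies this iteration but evaluates the same functions:

```python
import math

def cordic_sin_cos(theta, iterations=32):
    """Conventional CORDIC in rotation mode: only shifts (2**-i scalings),
    adds and a small angle table; valid for |theta| < ~1.74 rad."""
    # Pre-scale by the constant CORDIC gain so (x, y) converges to (cos, sin).
    K = 1.0
    for i in range(iterations):
        K /= math.sqrt(1 + 2.0 ** (-2 * i))
    x, y, z = K, 0.0, theta
    for i in range(iterations):
        d = 1 if z >= 0 else -1                 # rotation direction
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)           # angle table entry
    return y, x                                 # (sin, cos)
```

The direction `d` depends on the running residual `z`, which serialises the iterations; predicting the directions in advance is what enables the speed-ups reported above.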

Journal Article
TL;DR: This paper presents an extension of a DLBIST scheme to transition fault testing, targeting the transition fault model, which is widely used for complexity reasons; functional justification is used to generate the required pattern pairs.
Abstract: BIST is an attractive approach to detect delay faults due to its inherent support for at-speed test. Deterministic logic BIST (DLBIST) is a technique which was successfully applied to stuck-at fault testing. As delay faults have lower random pattern testability than stuck-at faults, the need for DLBIST schemes is increased. Nevertheless, an extension to delay fault testing is not trivial, since this necessitates the application of pattern pairs. Consequently, delay fault testing is expected to require a larger mapping effort and logic overhead than stuck-at fault testing. In this paper, we consider the so-called transition fault model, which is widely used for complexity reasons. We present an extension of a DLBIST scheme for transition fault testing. Functional justification is used to generate the required pattern pairs. The efficiency of the extended scheme is investigated by using industrial benchmark circuits.

Journal ArticleDOI
TL;DR: This work presents two high-resolution time measurement schemes for digital Built-in Self-Test (BIST) applications, namely: Two-Delay Interpolation Method and the Time Amplifier.
Abstract: The rapid pace of change in IC technology, specifically in the speed of operation, demands sophisticated design solutions for IC testing methodologies. Moreover, the current technology of System-on-Chip makes great demands on the accurate testing of internal timing parameters as access to internal nodes through input/output pins becomes more difficult. This work presents two high-resolution time measurement schemes for digital Built-in Self-Test (BIST) applications, namely the Two-Delay Interpolation Method and the Time Amplifier. The two schemes are subsequently combined to produce a novel design for BIST time measurement which offers two main advantages: a small time interval measurement capability which advances the state of the art, and a small footprint, occupying 0.2 mm² or the equivalent of 3020 transistors, compared with a recent design which has the equivalent of 4800 transistors.

Journal ArticleDOI
TL;DR: Major modifications to the PIC scheme to improve its update performance are presented and the new coding scheme is called PIC with segmented domain, which can be implemented with embedded SRAM rather than with TCAM.
Abstract: Filter encoding can effectively enhance the efficiency of ternary content addressable memory (TCAM)-based packet classification. It can minimise the range expansion problem, reduce the TCAM space requirement and improve the lookup rate for IPv6. However, additional complexity is inevitably incurred in the filter table update operations. Although the average update cost of the prefix inclusion coding (PIC) scheme is very low, the worst-case update cost can be significantly higher. Major modifications to the PIC scheme to improve its update performance are presented. The new coding scheme is called PIC with segmented domain. By dividing the field value domain into multiple segments, the mapping of field values to code points can be more structural and help avoid massive code-point relocation in the event of new insertions. Moreover, the simplified codeword lookup for the address fields can be implemented with embedded SRAM rather than with TCAM. Consequently, the lookup rate of the search engine can be improved to handle the OC-768 line rate.
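The range expansion problem mentioned above arises because a range rule (e.g. a port range) must be split into ternary prefixes before it fits a TCAM. A standard range-to-prefix split, shown to illustrate the problem PIC mitigates (this is not the PIC encoding itself), looks like:

```python
def range_to_prefixes(lo, hi, width=16):
    """Expand the integer range [lo, hi] into a minimal set of
    (value, prefix_len) ternary prefixes covering it."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block aligned at lo that still fits in [lo, hi].
        size = lo & -lo if lo else 1 << width
        while size > hi - lo + 1:
            size >>= 1
        plen = width - size.bit_length() + 1    # fixed bits in the prefix
        prefixes.append((lo, plen))
        lo += size
    return prefixes

# Classic worst case: [1, 14] on a 4-bit field needs 6 TCAM entries.
```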

Journal ArticleDOI
TL;DR: A number of real diagnosis results from the wafer testing data including both stuck-open faults and intra-gate bridging faults have confirmed the effectiveness of this new method.
Abstract: A comprehensive solution to the intra-gate diagnosis problem, including intra-gate bridging and stuck-open faults is provided. The work is based on a local transformation technique that allows transistor-level faults to be diagnosed by the commonly available gate-level fault diagnosis tools without having to deal with the complexity of a transistor-level description of the whole circuit. Three transformations are described: one for stuck-open faults, one for intra-gate resistive-open faults and one for intra-gate bridging faults. Experimental work has been conducted at NXP Semiconductors using the NXP diagnosis tool – FALOC. A number of real diagnosis results from the wafer testing data including both stuck-open faults and intra-gate bridging faults have confirmed the effectiveness of this new method.

Journal ArticleDOI
TL;DR: A comparison of clock pausing, clock stretching, and data-driven clocking schemes and how they can be applied to an existing partitioned synchronous architecture to obtain a reliable, low-latency and efficient clock control architecture is presented.
Abstract: Because of the increasing complexity of distributing a global clock over a single die, globally asynchronous and locally synchronous systems are becoming an efficient alternative technique for designing distributed systems-on-chip (SoC). A number of independently clocked synchronous domains can be integrated by clock pausing, clock stretching or data-driven clocking techniques. Such techniques are applied on point-to-point inter-domain communication schemes. Presented here is a comparison of these schemes and how they can be applied to an existing partitioned synchronous architecture to obtain a reliable, low-latency and efficient clock control architecture. The comparison highlights the advantages and disadvantages of one scheme over the other in terms of logical correctness, circuit implementation, performance and relative power consumption. Also presented are circuit solutions for stretchable and data-driven clocking schemes. These circuit solutions can be easily plugged into existing partitioned synchronous islands. To enable early evaluation of functional correctness, also proposed is the use of Petri net modelling techniques to model the asynchronous control blocks that constitute the interface between the synchronous islands.

Journal ArticleDOI
TL;DR: In this article, a multi-layer sequence pair-representation-based floorplanner is proposed to find high-quality floorplans for applications that use partial reconfiguration.
Abstract: Partial dynamic reconfiguration is an emerging area in field programmable gate arrays (FPGA) designs, which is used for saving device area and cost. In order to reduce the reconfiguration overhead, two consecutive similar sub-designs should be placed in the same locations to get the maximum reuse of common components. This requires that all the future designs be considered while floorplanning for any given design. A comprehensive framework for floorplanning designs on partial reconfigurable architecture is provided. Several reconfiguration-specific floorplanning cost functions and moves that aim to reduce the reconfiguration overhead are introduced. A new multi-layer sequence pair-representation-based floorplanner that allows overlap of static and non-static components of multiple designs and guarantees a feasible overlapping floorplan with minimal area packing is introduced. A new matching algorithm that covers all possible matchings of static blocks during floorplanning for multiple designs is presented. In our experiments, it is shown that the proposed floorplanner gives more than 50% savings in reconfiguration frames compared with the scheme where no reuse is done. Further, compared with a traditional sequential floorplanner, our floorplanner removes infeasibility in many designs, achieves an improvement of clock period by 12% on average and reduces the place and route time significantly. The proposed floorplanner could be used for finding high-quality floorplans for applications that use partial reconfiguration.

Journal ArticleDOI
TL;DR: A Pareto-based approach is proposed combining a design-time application and platform exploration with a low-complexity run-time manager to avoid conservative worst-case assumptions and eliminate large run- time overheads on the state-of-the-art RTOS kernels.
Abstract: In a Multi-Processor System-on-Chip (MP-SoC) environment, a customized run-time management layer should be incorporated on top of the basic Operating System services to alleviate the run-time decision-making and to globally optimise costs (e.g. energy consumption) across all active applications, according to application constraints (e.g. performance, user requirements) and available platform resources. To that end, to avoid conservative worst-case assumptions, while also eliminating large run-time overheads on state-of-the-art RTOS kernels, a Pareto-based approach is proposed combining a design-time application and platform exploration with a low-complexity run-time manager. The design-time exploration phase of this approach is the main contribution of this work. It is also substantiated with two real-life multimedia applications (image processing and a video codec). These are simulated on an MP-SoC platform simulator and used to illustrate the optimal trade-offs offered by the design-time exploration to the run-time manager.

Journal ArticleDOI
TL;DR: The results show that both schemes do not degrade network performance in terms of average packet latency and throughput if the flit injection rate is slower than 0.57 flit/cycle/node.
Abstract: Reducing the design complexity of switches is essential for cost reduction and power saving in on-chip networks. In wormhole-switched networks, packets are split into flits, which are then admitted into and delivered in the network. When reaching their destinations, flits are ejected from the network. Since flit admission, flit delivery and flit ejection interfere with each other directly and indirectly, techniques for admitting and ejecting flits exert a significant impact on network performance and switch cost. Different flit-admission and flit-ejection micro-architectures are investigated. In particular, for flit admission, a novel coupling scheme which binds a flit-admission queue with a physical channel (PC) is presented. This scheme simplifies the switch crossbar from 2p×p to (p+1)×p, where p is the number of PCs per switch. For flit ejection, a p-sink model that uses only p flit sinks to eject flits is proposed. In contrast to an ideal ejection model, which requires p·v flit sinks (v is the number of virtual channels per PC), the buffering cost of flit sinks becomes independent of v. The proposed flit-admission and flit-ejection schemes are evaluated with both uniform and locality traffic in a 2D 4×4 mesh network. The results show that both schemes do not degrade network performance in terms of average packet latency and throughput if the flit injection rate is slower than 0.57 flit/cycle/node.

Journal ArticleDOI
TL;DR: This paper proposes an irredundant bus encoding scheme for on-chip buses to tackle the thermal issue and results show that the encoding scheme is very efficient to reduce the on-chip bus temperature rise over substrate temperature, with much less overhead compared to other low power encoding schemes.
Abstract: As technology scales, increasing clock rates, decreasing interconnect pitch and the introduction of low-k dielectrics have made self-heating of the global interconnects an important issue in VLSI design. Further, high bus temperatures have had a negative impact on the delay and reliability of on-chip interconnects. Energy and thermal models are used to characterise the effects of self-heating on the temperature of on-chip interconnects. The results obtained show that self-heating of on-chip buses contribute significantly to the temperature of the bus, which increases as technology scales, motivating the need to find solutions to mitigate this effect. The theoretical analysis performed shows that spreading switching activities among all bus lines can effectively reduce the peak temperature of the on-chip bus. Based on this observation, a thermal spreading encoding scheme for on-chip buses is proposed to tackle the thermal issue. The results obtained show that this approach is very effective in reducing the transient peak temperature among bus lines, with much less overhead compared with other low-power encoding schemes. This technique can then be combined with low-power encoding schemes to further reduce the on-chip bus temperature.

Journal ArticleDOI
TL;DR: A tool called BITLINKER, which creates partially reconfigurable modules from the bit-streams of individual components, is described; it is also capable of performing restricted component placement and interconnect routing between the assembled components.
Abstract: A tool called BITLINKER, which creates partially reconfigurable modules from the bit-streams of individual components, is described. It is also capable of performing restricted component placement and interconnect routing between the assembled components. The resulting modules are used in applications that exploit partial dynamic reconfiguration. The tool is integrated in a design flow particularly aimed at dynamically reconfigurable platform field-programmable gate arrays (FPGAs). The associated development design flow and a run-time support system that can be used to manage module activation and data communication are described. Evaluation results obtained with a Virtex-II Pro system are also reported.

Journal ArticleDOI
Myung-Hoon Yang1, Youbean Kim1, Young-Kyu Park1, Duk C. Lee1, Sungho Kang1 
TL;DR: Experimental results for the largest ISCAS'89 benchmark circuits show that the proposed scheme can reduce the switching activity by 50% with little hardware overhead compared with previous schemes.
Abstract: A new low-power testing methodology is proposed to reduce the excessive power dissipation associated with scan-based designs during deterministic test pattern generation by linear feedback shift registers (LFSRs) in built-in self-test. This new method utilises two split LFSRs to reduce the amount of switching activity. The original test cubes are partitioned into zero-set and one-set cubes according to the specified bits in the test cubes, and the split LFSR generates a zero-set or one-set cube for the given test cube. In cases where the current scan-shift value is a don't-care bit with respect to the output values of the LFSRs, the last value shifted into the scan chain is shifted in again, so that no transition is produced. Experimental results for the largest ISCAS'89 benchmark circuits show that the proposed scheme can reduce the switching activity by 50% with little hardware overhead compared with previous schemes.
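The transition-suppression idea for don't-care bits can be sketched independently of the split-LFSR machinery (a simplified behavioural model, not the paper's hardware):

```python
def fill_dont_cares_low_power(test_cube):
    """Fill 'X' (don't-care) positions of a test cube by repeating the
    previously shifted value, so the filled bit causes no transition
    at the scan input."""
    filled, last = [], '0'
    for bit in test_cube:
        if bit == 'X':
            bit = last          # repeat-fill: zero transitions on this shift
        filled.append(bit)
        last = bit
    return ''.join(filled)

def transitions(bits):
    """Count scan-input transitions, a proxy for shift power."""
    return sum(a != b for a, b in zip(bits, bits[1:]))
```

Repeat-fill minimises transitions among all fills of the don't-care bits, which is the effect the split-LFSR scheme achieves in hardware.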

Journal ArticleDOI
TL;DR: A low-power signed pipelined truncated multiplier is proposed that can dynamically detect multiple combinations of input ranges and deactivate a large amount of the unnecessary transitions in non-effective ranges to reduce the power consumption.
Abstract: An energy-efficient multiplier is very desirable for multimedia and digital signal processing systems. In many of these systems, the effective dynamic range of input operands for multipliers is generally limited to a small range and the case with maximum range seldom occurs. In addition, the output products of multipliers are usually rounded or truncated to avoid growth in word size. Based on these features, a low-power signed pipelined truncated multiplier is proposed that can dynamically detect multiple combinations of input ranges and deactivate a large amount of the unnecessary transitions in non-effective ranges to reduce the power consumption. Moreover, the proposed multiplier can trade output precision against power consumption so as to further reduce power consumption. Experimental results show that the proposed multiplier consumes up to 90% less power than the conventional standard multiplier while still maintaining an acceptable output precision and quality.

Journal ArticleDOI
TL;DR: A feasibility study of accelerating fault simulation by emulation on FPGA is described, showing that it is beneficial to use emulation for circuits/methods that require large numbers of test vectors, e.g., sequential circuits and/or genetic algorithms.
Abstract: A feasibility study of accelerating fault simulation by emulation on field programmable gate arrays (FPGAs) is described. Fault simulation is an important subtask in test pattern generation and it is frequently used throughout the test generation process. The problems associated with fault simulation of sequential circuits are explained. Alternatives that can be considered as trade-offs in terms of the required FPGA resources and accuracy of test quality assessment are discussed. In addition, an extension to the existing environment for re-configurable hardware emulation of fault simulation is presented. It incorporates hardware support for fault dropping. The proposed approach allows simulation speed-up of 40–500 times as compared to the state-of-the-art in software-based fault simulation. On the basis of the experiments, it can be concluded that it is beneficial to use emulation for circuits/methods that require large numbers of test vectors while using simple but flexible algorithmic test vector generating circuits, for example built-in self-test.
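Fault dropping, the optimisation given hardware support here, can be shown on a toy gate-level stuck-at simulator (a hypothetical miniature model, not the emulation environment itself):

```python
def simulate(netlist, inputs, fault=None):
    """Evaluate a topologically ordered two-input netlist; optionally
    force one net to a stuck-at value."""
    values = dict(inputs)
    for net, op, ins in netlist:
        a, b = values[ins[0]], values[ins[1]]
        values[net] = {'AND': a & b, 'OR': a | b, 'XOR': a ^ b}[op]
        if fault is not None and fault[0] == net:
            values[net] = fault[1]              # inject stuck-at fault
    return values[netlist[-1][0]]               # circuit output

def fault_simulate(netlist, vectors, faults):
    """Detect faults across a vector set, dropping each fault from
    further simulation as soon as one vector detects it."""
    remaining, detected = list(faults), []
    for vec in vectors:
        good = simulate(netlist, vec)
        for flt in remaining[:]:
            if simulate(netlist, vec, flt) != good:
                detected.append(flt)
                remaining.remove(flt)           # fault dropping
    return detected

# f = (a AND b) OR c
netlist = [('g', 'AND', ('a', 'b')), ('f', 'OR', ('g', 'c'))]
vectors = [{'a': 1, 'b': 1, 'c': 0}, {'a': 0, 'b': 0, 'c': 0}]
faults = [('g', 0), ('f', 1)]                   # g stuck-at-0, f stuck-at-1
```

Dropping shrinks the fault list as simulation proceeds, which is why hardware support for it pays off when test sets are large.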

Journal ArticleDOI
TL;DR: The first in-depth study on applying statistical timing analysis with cross-chip and on-chip variations to speed-binning and guard-banding in FPGAs has been presented and the effects of timing-model with guard-banding and speed-binning on statistical performance and timing yield are quantified.
Abstract: Process variations affecting timing and power are an important issue for modern integrated circuits in nanometre technologies. Field programmable gate arrays (FPGAs) are similar to application-specific integrated circuits (ASICs) in their susceptibility to these issues, but face the unique challenge that critical paths are unknown at test time. The first in-depth study on applying statistical timing analysis with cross-chip and on-chip variations to speed-binning and guard-banding in FPGAs is presented. Considering the uniqueness of re-programmability in FPGAs, the effects of timing models with guard-banding and speed-binning on statistical performance and timing yield are quantified. A new variation-aware statistical placement, the first statistical algorithm for FPGA layout, has also been developed; for Microelectronics Center of North Carolina (MCNC) and Quartus University Interface Program (QUIP) designs it achieves a yield loss of 29.7% of the original yield loss with guard-banding and 4% of the original with speed-binning.

Journal ArticleDOI
TL;DR: By simulations, it is shown that the performance of the proposed method is close to that of the true rounding method and much better than those of other methods.
Abstract: This study presents a design method for a fixed-width two's complement squarer that receives an n-bit input and produces an n-bit squared product. To efficiently compensate for the truncation error, modified Booth-folding encoder signals are used to generate the error compensation bias. The truncated bits are divided into two groups depending on their effect on the truncation error, and a different error compensation method is applied to each group. Simulations show that the performance of the proposed method is close to that of the true rounding method and much better than those of other methods. The proposed fixed-width two's complement squarers also achieve about 34% reduction in area, 35% reduction in power consumption and 10% improvement in speed compared with conventional squarers.
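A toy model of the fixed-width trade-off can be written down directly. The sketch below truncates the final 2n-bit square rather than the individual partial products, so here a constant bias recovers exact rounding; in the paper's hardware the low-order partial products are never generated at all, and the bias derived from the Booth-folding encoder signals can only approximate rounding. All names are illustrative:

```python
# Toy model of fixed-width squaring: keep only the top n bits of the
# 2n-bit square. In this model the whole product is available, so a
# bias of one half-ulp (bias=1 below) makes truncation equal rounding;
# real fixed-width hardware never forms the low partial products and
# must approximate this bias.

def true_round_square(x, n):
    """Ideal n-bit result: round the 2n-bit square to n bits."""
    return (x * x + (1 << (n - 1))) >> n

def truncated_square(x, n, bias=0):
    """Fixed-width result: drop the low n bits, optionally adding a
    compensation bias (in half-ulp units) before truncating."""
    return (x * x + (bias << (n - 1))) >> n

def mean_abs_error(n, bias):
    """Average error of the fixed-width result over all n-bit inputs."""
    errs = [abs(truncated_square(x, n, bias) - true_round_square(x, n))
            for x in range(1 << n)]
    return sum(errs) / len(errs)
```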

Journal ArticleDOI
TL;DR: A robust new paradigm for diagnosing hold-time violation at scan chains with the ability to tolerate non-ideal conditions is presented and two algorithms including a greedy algorithm and a so-called best-alignment based algorithm are proposed.
Abstract: Hold-time violation is a common cause of failure at scan chains. A robust new paradigm for diagnosing such failures is presented. Compared to previous methods, its main advantage is the ability to tolerate non-ideal conditions, for example in the presence of certain core-logic faults or faults that manifest themselves intermittently. The diagnosis problem is first formulated as a 'delay insertion process'. Upon this formulation, two algorithms, a 'greedy' algorithm and a so-called 'best-alignment-based' algorithm, are proposed. Experimental results on a number of practical designs and ISCAS'89 benchmark circuits demonstrate the paradigm's effectiveness.
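A drastically simplified software model of the idea: assume a hold violation at cell k makes the shifted-out bits from position k on appear one place early, predict the corrupted stream for every candidate k, and greedily pick the candidate that best aligns with what was actually observed. The one-bit-skip fault model and all names are invented for illustration, not the paper's exact formulation:

```python
# Toy scan-chain hold-time diagnosis by trying a "delay insertion" at
# each cell and scoring the alignment with the observed streams.

def apply_hold_fault(bits, k):
    """Illustrative fault model: a hold violation at cell k lets data
    race through it, so bits from position k on shift one place early."""
    return bits[:k] + bits[k + 1:] + [bits[-1]]

def diagnose(loaded_patterns, observed_streams):
    """Greedy: return the fault position whose predicted corruption
    matches the observed streams on the most bits."""
    n = len(loaded_patterns[0])
    def score(k):
        return sum(p == o
                   for pat, obs in zip(loaded_patterns, observed_streams)
                   for p, o in zip(apply_hold_fault(pat, k), obs))
    return max(range(n), key=score)
```

Because the decision is a best-match score rather than an exact equality, the same scheme keeps working when a few observed bits are corrupted by intermittent behaviour.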

Journal ArticleDOI
TL;DR: The study included remodeling the entire hardware architecture removing the shifter from the scalable computing part and embedding it in the non-scalable memory unit instead, which resulted in a speedup to the complete inversion process with an area increase due to the new memory shifting unit.
Abstract: Modular inversion is a fundamental operation in several cryptographic systems. It can be computed in software or hardware, but hardware computation has proven to be faster and more secure. This research focused on improving a scalable inversion hardware architecture proposed in 2004 for the finite field GF(p). The architecture comprises two parts, a computing unit and a memory unit. The memory unit holds all the data bits of the computation, whereas the computing unit performs all the arithmetic operations on a word-by-word (digit-by-digit) basis, making the design scalable. The main objective of this paper is to show the cost and benefit of modifying the memory unit to include shifting, previously one of the tasks of the scalable computing unit. The study included remodelling the entire hardware architecture, removing the shifter from the scalable computing part and embedding it in the non-scalable memory unit instead. This modification speeds up the complete inversion process at the cost of an area increase due to the new memory shifting unit. Several design schemes are compared, giving the user the complete picture from which to choose depending on the application's needs.
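For reference, the class of algorithm such hardware implements can be modelled in software by the binary (shift-based) extended Euclidean inversion over GF(p), which uses only the shifts, additions and subtractions the architecture provides; this is a generic sketch, not the paper's word-serial datapath:

```python
# Binary (shift-based) modular inversion over GF(p), p an odd prime.
# Invariants: x1*a = u (mod p) and x2*a = v (mod p) hold throughout,
# so whichever of u, v reaches 1 yields the inverse.

def mod_inverse(a, p):
    """Return a^-1 mod p using only shifts, additions and subtractions."""
    u, v = a % p, p
    x1, x2 = 1, 0
    while u != 1 and v != 1:
        while u % 2 == 0:                      # halve u, keep invariant
            u >>= 1
            x1 = (x1 >> 1) if x1 % 2 == 0 else ((x1 + p) >> 1)
        while v % 2 == 0:                      # halve v, keep invariant
            v >>= 1
            x2 = (x2 >> 1) if x2 % 2 == 0 else ((x2 + p) >> 1)
        if u >= v:                             # subtract the smaller
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return x1 % p if u == 1 else x2 % p
```

Every step is a shift, add or subtract over the full operand width, which is exactly what a scalable design processes word by word; moving the shift into the memory unit removes it from that word-serial critical loop.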

Journal ArticleDOI
TL;DR: An efficient roll-forward checkpointing/recovery scheme for distributed systems has been presented and offers the main advantages of both the synchronous and the asynchronous approaches, that is simple recovery and simple way to create checkpoints.
Abstract: An efficient roll-forward checkpointing/recovery scheme for distributed systems is presented, improving on our earlier work. The use of forced checkpoints helps in designing a single-phase non-blocking algorithm to find consistent global checkpoints. It offers the main advantages of both the synchronous and the asynchronous approaches, that is, simple recovery and a simple way to create checkpoints. The algorithm produces a reduced number of checkpoints. Since each process independently decides whether to take a forced checkpoint, the algorithm is simple, fast and efficient. The proposed scheme performs better than some noted existing works, and the advantages stated above also ensure that it can work efficiently in mobile computing environments.
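The generic communication-induced flavour of a forced-checkpoint rule can be sketched as follows: each message piggybacks the sender's checkpoint-interval number, and a receiver checkpoints before delivery if the message comes from a later interval than its own. This is an illustrative simplification of the idea, not the paper's algorithm:

```python
# Toy communication-induced forced-checkpoint rule. Each process tracks
# its checkpoint interval; a message carrying a larger interval number
# forces a checkpoint before delivery, with no coordination messages.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 0          # current checkpoint-interval number
        self.forced = 0            # count of forced checkpoints taken

    def take_checkpoint(self, forced=False):
        self.interval += 1
        if forced:
            self.forced += 1

    def send(self):
        return self.interval       # piggybacked on every message

    def receive(self, piggyback):
        # Each process decides locally: single-phase and non-blocking.
        if piggyback > self.interval:
            self.take_checkpoint(forced=True)
```

Because the decision uses only locally held state plus the piggybacked number, no process ever blocks waiting for others, which is what makes the approach attractive for mobile environments.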

Journal ArticleDOI
TL;DR: Novel architectures for combined units that perform modulo 2^n - 1/diminished-1 modulo 2^n + 1 multiplication or sum-of-squares depending on the value of a control signal are proposed.
Abstract: Digital signal processing and multimedia applications often profit from the use of a residue number system. Among the most commonly used moduli in such systems are those of the 2^n - 1 and 2^n + 1 forms, and among the most commonly used operations are multiplication and sum-of-squares. These operations are currently performed using distinct design units and/or consecutive machine cycles. Novel architectures for combined units that perform modulo 2^n - 1/diminished-1 modulo 2^n + 1 multiplication or sum-of-squares, depending on the value of a control signal, are proposed.
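The modulo 2^n - 1 channel arithmetic mentioned above avoids division by end-around addition, since 2^n = 1 (mod 2^n - 1): the high half of the 2n-bit product is simply folded back onto the low half. A minimal software model of that reduction (illustrative, not the proposed architecture):

```python
# Modulo 2^n - 1 multiplication via end-around addition: fold the high
# half of the product back onto the low half instead of dividing.

def mul_mod_2n_minus_1(a, b, n):
    mask = (1 << n) - 1            # the modulus 2^n - 1 as a bit mask
    p = a * b
    while p > mask:
        p = (p & mask) + (p >> n)  # end-around: 2^n = 1 (mod 2^n - 1)
    return 0 if p == mask else p   # 0 and 2^n - 1 both represent zero
```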