
Showing papers in "IET Computers and Digital Techniques" in 2016


Journal ArticleDOI
TL;DR: This study proposes a majority-based evolution (MBE) SA algorithm that can be considered a variant of the well-known differential evolution algorithm for FSM state encoding targeting the optimisation of both area and power.
Abstract: State assignment (SA) for finite state machines (FSMs) is one of the crucial synthesis steps in the design and optimisation of sequential circuits. In this study, we propose a majority-based evolution (MBE) SA algorithm that can be considered a variant of the well-known differential evolution algorithm. Each individual is evolved based on selecting three random individuals, one of which is selected to be the best individual with a 50% probability. Then, for each state in the individual, a selection is made with a 50% probability between keeping the current state and replacing it with a newly computed state. The bit values of the new state are determined based on the majority values of the state of the three selected individuals under a randomly generated probability within a predetermined range. The proposed algorithm is used for FSM state encoding targeting the optimisation of both area and power. Experimental results demonstrate the effectiveness of the proposed MBE SA algorithm in comparison with other evolutionary algorithms including genetic algorithm, binary particle swarm optimisation, Tabu search and simulated evolution.

22 citations
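
The evolution step described in the MBE abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the interpretation of the "randomly generated probability" as the chance of following the majority bit, and the default range are all hypothetical, and the sketch does not repair duplicate state codes, which the abstract does not describe.

```python
import random

def pick_donors(population, best, rng):
    """Pick three random individuals; with 50% probability one becomes the best."""
    donors = rng.sample(population, 3)
    if rng.random() < 0.5:
        donors[rng.randrange(3)] = best
    return donors

def majority_code(a, b, c, n_bits, p_majority, rng):
    """Build a code bit by bit from the majority of three codes.

    Assumption: each bit follows the majority with probability p_majority
    and takes the complement otherwise.
    """
    code = 0
    for bit in range(n_bits):
        ones = ((a >> bit) & 1) + ((b >> bit) & 1) + ((c >> bit) & 1)
        value = 1 if ones >= 2 else 0
        if rng.random() >= p_majority:
            value ^= 1
        code |= value << bit
    return code

def mbe_offspring(current, donors, n_bits, p_range=(0.7, 1.0), rng=random):
    """Evolve one individual (a list of state codes, one per FSM state)."""
    p_majority = rng.uniform(*p_range)  # hypothetical range, not from the source
    child = []
    for s, code in enumerate(current):
        if rng.random() < 0.5:
            child.append(code)  # keep the current state code
        else:
            d0, d1, d2 = (d[s] for d in donors)
            child.append(majority_code(d0, d1, d2, n_bits, p_majority, rng))
    return child  # may contain duplicate codes; no repair step is shown
```

A caller would typically evaluate the offspring with an area or power cost function and keep it only if it improves on the current individual, as in differential evolution.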


Journal ArticleDOI
TL;DR: The proposed designs for IDDMM are well suited to be implemented in modern FPGAs, making use of available dedicated multipliers and memory blocks reducing drastically the FPGA's standard logic while keeping an acceptable performance compared with other implementation approaches.
Abstract: This study presents a scalable hardware architecture for modular multiplication in prime fields GF(p). A novel iterative digit-digit Montgomery multiplication (IDDMM) algorithm is proposed and two hardware architectures that compute that algorithm are described. The input operands (multiplicand, multiplier and modulus) are represented using radix β = 2^k. Multiplication over GF(p) is possible using almost the same hardware since the complexity of the multiplier's kernel module depends mainly on k and not on p. The novel hardware architectures of GF(p) multipliers were evaluated on three Xilinx FPGA families. Design trade-offs were analysed considering different operand sizes commonly used in cryptography and different digit sizes. The proposed designs for IDDMM are well suited to be implemented in modern FPGAs, making use of available dedicated multipliers and memory blocks, drastically reducing the FPGA's standard logic while keeping an acceptable performance compared with other implementation approaches. From the Virtex5 implementation, the proposed MM multiplier reaches a throughput of 242 Mbps using only 219 FPGA slices and achieving a 1024-bit modular multiplication in 4.21 μs. This is 26 times fewer area resources than similar related works in the literature, with a 7x improvement in efficiency.

22 citations
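
For readers unfamiliar with radix-2^k Montgomery multiplication, the following is a minimal software sketch of the textbook digit-serial form, in which one radix-2^k digit of the multiplier is consumed per iteration. It is not the IDDMM algorithm of the paper (whose digit-digit scheduling is not given in the abstract); the function name and parameters are illustrative, and it assumes an odd modulus p, operands below p, and 2^(k·n_digits) > p.

```python
def montgomery_multiply(a, b, p, k, n_digits):
    """Return a * b * R^-1 mod p, where R = 2^(k * n_digits).

    Assumes p is odd, 0 <= a, b < p, and R > p.
    """
    beta = 1 << k
    mask = beta - 1
    # Precompute p' = -p^-1 mod beta (modular inverse via pow, Python 3.8+).
    p_prime = (-pow(p, -1, beta)) % beta
    t = 0
    for i in range(n_digits):
        a_i = (a >> (k * i)) & mask        # i-th radix-2^k digit of a
        t += a_i * b
        q = ((t & mask) * p_prime) & mask  # makes t + q*p divisible by beta
        t = (t + q * p) >> k               # exact division by beta
    # t stays below 2p, so one conditional subtraction is enough.
    if t >= p:
        t -= p
    return t
```

The loop body depends on the digit width k while the operand size only changes the number of iterations and the width of the running sum, which is in the spirit of the abstract's remark that kernel complexity depends mainly on k and not on p.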


Journal ArticleDOI
TL;DR: A number of large-scale platforms have been developed recently that promise to accelerate progress both in understanding the biology and in supporting engineering applications, and much has been learnt in the design, development and commissioning of this machine that will inform future developments in this area.
Abstract: The inner workings of the brain as a biological information processing system remain largely a mystery to science. Yet there is a growing interest in applying what is known about the brain to the design of novel computing systems, in part to explore hypotheses of brain function, but also to see if brain-inspired approaches can point to novel computational systems capable of circumventing the limitations of conventional approaches, particularly in the light of the slowing of the historical exponential progress resulting from Moore's Law. Although there are, as yet, few compelling demonstrations of the advantages of such approaches in engineered systems, a number of large-scale platforms have been developed recently that promise to accelerate progress both in understanding the biology and in supporting engineering applications. SpiNNaker (Spiking Neural Network Architecture) is one such large-scale example, and much has been learnt in the design, development and commissioning of this machine that will inform future developments in this area.

20 citations


Journal ArticleDOI
TL;DR: A two phase heuristic technique for routing droplets on a two-dimensional DMFB that significantly reduces latest arrival time, average assay execution time and number of used cells as compared with earlier methods.
Abstract: Digital microfluidic biochips (DMFBs) have emerged as an alternative to various in-vitro diagnostic tests and are expected to be closely coupled with cyber physical systems. Efficient, error-free routing and cross-contamination minimisation are needed during bioassay operations on a DMFB. This study proposes a two phase heuristic technique for routing droplets on a two-dimensional DMFB. Initially it attempts to route the maximum number of nets concurrently, depending on the evaluated value of a proposed function named interfering index (IInet). Then exact routing is attempted based on a tabulation minimisation process. Remaining nets with interfering index values higher than the threshold are routed considering various constraints in the DMFB framework. In the second phase another metric named routable ratio (RR) is proposed and, depending on the RR metric, the routing order among conflicting paths is prioritised to avoid deadlock from there onwards until the droplet reaches its target location. Finally, we formulate the droplet movement problem as a satisfiability problem and solve it with a SAT-based solver engine if a higher number of overlapping nets (≥5) exists. Experimental results on benchmark suites I and III show that our proposed technique significantly reduces latest arrival time, average assay execution time and number of used cells compared with earlier methods.

18 citations


Journal ArticleDOI
TL;DR: The goal of this study is to highlight the design of a bio-molecular p-i-n FET with satisfactory large current using ultra low power dissipation and high quantum transmission along with satisfactory current for the proposed device during the room temperature operation.
Abstract: In this study, an electrically doped bio-molecular p-i-n field-effect transistor (FET) is designed and its electronic properties are investigated. Density functional theory along with a non-equilibrium Green's function based first principle approach is used to design the bio-molecular FET at sub-atomic region. Three Adenine and two Thymine molecules are attached together to form a 6.24 nm long and 1.40 nm wide bio p-i-n FET. This device is attached to two platinum electrodes and wrapped with a metallic cylindrical gate at high vacuum. Intrinsic n and p regions can be created within a bio-molecular device at room temperature by electrical doping without explicit dopants, which allows the device to conduct current in both forward and reverse bias. The various quantum mechanical properties have been calculated using Poisson's equations and self-consistent function for the bio-molecular FET. Among these quantum mechanical properties, the authors obtain high quantum transmission along with satisfactory current for the proposed device during room temperature operation. The goal of this study is to highlight the design of a bio-molecular p-i-n FET with satisfactorily large current and ultra low power dissipation.

17 citations


Journal ArticleDOI
TL;DR: Temperature drift analysis of metal–oxide–semiconductor field-effect transistor (MOSFET) is carried out using silicon nitride/SiO2 as dielectric film to study the fabricated ISFET behaviour to be used as pH sensor.
Abstract: In the present study, temperature drift analysis of a metal–oxide–semiconductor field-effect transistor (MOSFET) is carried out using silicon nitride/SiO2 as the dielectric film. An n-channel depletion-mode MOSFET was fabricated with a silicon nitride ion-sensitive field-effect transistor (ISFET) on the same wafer. The study presents the fabrication, simulation and characterisation of the MOSFET. The gate of the ISFET is stacked with a silicon nitride/SiO2 sensing membrane that was deposited using low pressure chemical vapour deposition. Output and transfer characteristics of the on-chip fabricated Al gate MOSFET were obtained in order to study the behaviour of the fabricated ISFET for use as a pH sensor. Silicon nitride is preferred over SiO2 as the sensing film (or dielectric, in the case of the MOSFET) because it has better sensitivity and lower drift. Process and device simulations were performed using the Silvaco® TCAD tool.

17 citations


Journal ArticleDOI
TL;DR: The authors introduce a novel approach to optimise the co-scheduling of multi-threaded applications on heterogeneous processors based on the concept of stakes function, which represents the trade-off between isolation and sharing of resources.
Abstract: Single-ISA heterogeneous multi-core processors trade off power against performance; however, threads that co-run on shared resources suffer from resource contention, which induces performance degradation and energy inefficiency. The authors introduce a novel approach to optimise the co-scheduling of multi-threaded applications on heterogeneous processors. The approach is based on the concept of a stakes function, which represents the trade-off between isolation and sharing of resources. The authors also develop a co-scheduling algorithm that uses stakes functions to optimise resource usage while mitigating resource contention, thus improving performance and energy efficiency. They validated the approach using applications from the Princeton Application Repository for Shared-Memory Computers (PARSEC) benchmark suite, obtaining up to 12.88% performance speed-up, 13.65% energy speed-up and 28.29% energy delay speed-up with respect to the standard Linux heterogeneous multi-processing scheduler.

16 citations


Journal ArticleDOI
TL;DR: This study presents the design and implementation of an efficient structure for fault tolerant bit-parallel polynomial basis multiplication and squaring over GF(2^m), based on a strategy similar to the Roving method with a minimum overhead.
Abstract: This study presents the design and implementation of an efficient structure for fault tolerant bit-parallel polynomial basis multiplication and squaring over GF(2^m), based on a strategy similar to the Roving method with a minimum overhead. The Roving method is an efficient method for circuits in which many similar and independent structures exist. The architectures of polynomial basis multiplication and squaring over binary finite fields have inherent regularity in the subsections of their structures. Therefore, they are compatible with the applied version of the Roving fault tolerant method. To generalise the proposed architecture, the multiplication and squaring operations are examined for different primitive polynomials, including general irreducible polynomials, irreducible pentanomials and irreducible trinomials. In the proposed design, the extracted common circuit has low hardware utilisation compared with that of the main circuit. The fault tolerant circuit is constructed by using three copies of the common circuit, a comparator and a voter circuit. The comparator and voter have parallel architectures with low critical path delays, which is a critical factor in any highly computational system. The design has been successfully verified and synthesised on a Virtex-4 XC4VLX200 FPGA using Xilinx ISE 11. The results show an overall improvement in speed and hardware usage compared with those of previous designs.

13 citations


Journal ArticleDOI
TL;DR: This study proposes a sign detection algorithm for the residue number system with a three-moduli set using the New Chinese Remainder Theorem II, and shows a 24% reduction in the area–delay product compared with existing algorithms for the same moduli set.
Abstract: Sign detection is an essential part of many computer hardware designs, and is not a trivial task in residue number systems because it is a function of all the residues. This study proposes an algorithm for the residue number system with a three-moduli set {2^n − 1, 2^n, 2^(n+1) − 1} using New Chinese Remainder Theorem II. The unit is built with one n-bit carry-save adder and a 2n-bit parallel prefix carry-generation unit. In the best case the Synopsys 90 nm synthesis result shows a 24% reduction in the area–delay product compared with existing algorithms for the same moduli set.

11 citations
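
To illustrate why the sign depends on all the residues, here is a small reference model that detects the sign by full reconstruction with the ordinary Chinese Remainder Theorem. It is not the New CRT-II based unit of the paper, and it assumes the common convention that the upper half of the dynamic range [M/2, M) represents negative numbers; the function names are illustrative.

```python
from math import prod

def moduli_set(n):
    """The three-moduli set {2^n - 1, 2^n, 2^(n+1) - 1}."""
    return [(1 << n) - 1, 1 << n, (1 << (n + 1)) - 1]

def to_residues(x, moduli):
    """Encode a signed integer; Python's % always returns a non-negative residue."""
    return [x % m for m in moduli]

def is_negative(residues, moduli):
    """Reference sign detection: rebuild X with the CRT and compare with M/2."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        M_i = M // m
        x += r * M_i * pow(M_i, -1, m)  # pow(..., -1, m) is the inverse mod m
    return (x % M) >= M // 2
```

A hardware design would avoid the wide multiplications and the final comparison shown here, which is the cost the proposed unit aims to reduce.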


Journal ArticleDOI
TL;DR: This study addresses the problem of generating dilutions using a combination of (1:1) and (1:2) mix/split operations, called weighted dilution (WD), presents a layout architecture to implement such WD-steps, and describes a simulation based method to find the optimal mix-split steps for generating a dilution.
Abstract: Digital microfluidics has recently emerged as an effective technology in providing inexpensive but reliable solutions to various biomedical and healthcare applications. On-chip dilution of a fluid sample to achieve a desired concentration is an important problem in the context of droplet-based microfluidic systems. Existing dilution algorithms deploy a sequence of balanced mix-split steps, where two unit-volume droplets of different concentrations are mixed, followed by a balanced-split operation to obtain two equal-sized droplets. In this study, the authors study the problem of generating dilutions using a combination of (1:1) and (1:2) mix/split operations, called weighted dilution (WD), and present a layout architecture to implement such WD-steps. The authors also describe a simulation based method to find the optimal mix-split steps for generating a dilution under various criteria such as minimisation of waste, sample, or buffer droplets. The sequences can be stored in a look-up table a priori, and used later in real time for fast generation of actuation sequences. Compared with the balanced (1:1) model, the proposed WD scheme reduces the number of mix-split steps by around 22%, and the number of waste droplets by 18%.

10 citations
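
The concentration arithmetic behind (1:1) and (1:2) mixing can be shown with a short sketch. It only computes the volume-weighted concentration of a mixed droplet; the function name is illustrative, and the layout architecture and the search for optimal mix-split sequences from the paper are not modelled.

```python
from fractions import Fraction

def mix(c_a, c_b, ratio=(1, 1)):
    """Concentration after mixing droplets of concentration c_a and c_b
    in the given volume ratio (a parts of c_a to b parts of c_b)."""
    a, b = ratio
    return (a * Fraction(c_a) + b * Fraction(c_b)) / (a + b)

# Sample (concentration 1) mixed with buffer (concentration 0):
half = mix(1, 0)                 # balanced (1:1) step
third = mix(1, 0, ratio=(1, 2))  # weighted (1:2) step
```

A single (1:2) step reaches concentrations with a denominator of 3, which a sequence of balanced steps can only approximate; this is one intuition for why mixing both ratios can shorten dilution sequences.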


Journal ArticleDOI
TL;DR: The proposed designs, a novel power efficient implicit pulsed-triggered flip-flop with embedded clock-gating and pull-up control scheme and an enhanced version (IPFF-ECGPC), are suitable for power-constrained applications in very-large-scale integration designs which are speed-insensitive.
Abstract: In this study, a novel power efficient implicit pulsed-triggered flip-flop with embedded clock-gating and pull-up control scheme (IPFF-CGPC) is proposed. By applying an XOR-based clock-gating scheme in the pulse generating stage, which conditionally disables the inverter chain when the input remains unchanged, IPFF-CGPC is able to achieve high power efficiency by eliminating redundant transitions of internal nodes. Meanwhile, a pull-up control scheme is applied to enhance the discharging path and save short-circuit power when D makes a ‘0’–‘1’ transition. To further improve the robustness of the proposed design, the XOR-based comparator in the clock-gating scheme is replaced by a transmission gate-based comparator, which results in an enhanced version (IPFF-ECGPC). Based on the SMIC 65 nm technology, extensive post-layout simulation results show that IPFF-CGPC exhibits excellent power characteristics, with a reduction of 32.06–85.89% against its rival designs at 10% data switching activity. Due to its power efficiency, its power-delay product (PDP) gains an improvement of up to 73.94% under the same condition. Moreover, IPFF-ECGPC also enjoys outstanding total-power and PDP efficiency at 10% data switching activity. Therefore, the proposed designs are suitable for power-constrained applications in very-large-scale integration designs which are speed-insensitive.

Journal ArticleDOI
TL;DR: This study presents a technique called adaptively weighted round-robin (RR) arbitration for equality of service in a many-core network-on-chip that exploits the deterministic properties of the interconnection network to achieve global fairness in terms of service provided to each node with less resource requirements compared with previous work.
Abstract: This study presents a technique called adaptively weighted round-robin (RR) arbitration for equality of service in a many-core network-on-chip. The authors concentrate on networks congested with various traffic patterns generated by the applications running on the system. The technique exploits the deterministic properties of the interconnection network (the topology and the routing algorithm) to achieve global fairness in terms of service provided to each node with fewer resource requirements than previous work. The weights for input arbitration can be adjusted to make the network better adapted to various traffic patterns. It requires no additional information in packet headers. The hardware overhead is minimal, requiring only several small counters in addition to a typical RR arbiter. The critical path delay is also reduced due to its simplicity. The authors show the effectiveness by implementing RTL models of the routers and synthesizing them with 32/28 nm process technology. SPEC CPU2006 benchmark applications are executed in a multi-programmed manner to show that the approach results in outstanding equality-of-service characteristics for real applications.
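
As a rough behavioural illustration of weighted round-robin arbitration with small per-input counters, the following sketch uses a simple credit scheme. It is a generic model, not the authors' router design: the class name, the credit refill policy and the fixed weights are assumptions, and the way the paper derives and adapts weights from topology and routing is not described in the abstract.

```python
class WeightedRoundRobinArbiter:
    """Grant one requesting input per cycle; input i may win up to
    weights[i] times before the other inputs get their turn."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.credits = list(weights)  # small per-input counters
        self.pointer = 0              # input with current priority

    def grant(self, requests):
        """requests: list of booleans. Returns the granted index or None."""
        n = len(self.weights)
        if not any(requests):
            return None
        for _ in range(2):  # second pass runs after a credit refill
            for offset in range(n):
                i = (self.pointer + offset) % n
                if requests[i] and self.credits[i] > 0:
                    self.credits[i] -= 1
                    # Keep priority until the input's credits run out.
                    self.pointer = i if self.credits[i] > 0 else (i + 1) % n
                    return i
            self.credits = list(self.weights)  # every requester is out of credit
        return None  # only zero-weight inputs are requesting
```

With weights of 2 and 1 and both inputs always requesting, the first input receives two grants for every grant to the second, so a port that carries traffic from more upstream nodes could be given a proportionally larger weight.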

Journal ArticleDOI
TL;DR: A new distributed power management scheme called duty cycle estimation-event driven duty cycling is suggested and installed locally in the RSUs in order to decrease their power consumption and to extend the lifetime of their batteries.
Abstract: In this study, a green vehicular ad-hoc network (VANET) infrastructure is suggested. The main players in such an infrastructure are the road side units (RSUs), which are able to harvest the energy needed for their work from the surrounding environment, especially solar energy. This permits the RSUs to be installed anywhere without considering power supply availability; hence, an extensive area can be covered by the VANET infrastructure with improved performance. To achieve this goal, a new distributed power management scheme called duty cycle estimation-event driven duty cycling is suggested and installed locally in the RSUs in order to decrease their power consumption and to extend the lifetime of their batteries. The embedded UBICOM IP2022 network processor platform is adopted to implement the proposed RSU and the detailed design steps are described, while the necessary values of the system components, such as the number of solar cell panels, battery cell capacity and so on, are tuned to suit the design goals. The suggested method is compared with other duty cycling methods to show its effectiveness in building a green VANET infrastructure.

Journal ArticleDOI
TL;DR: This work proposes an efficient algorithm using a literals minimisation technique to achieve squaring with improved performance with respect to area, delay and power; simulation results show better performance than past work.
Abstract: Digital multiplier and squarer circuits are indispensable in digital signal processing and cryptography. When a multiplier is used as a squarer, partial products are generated and added to obtain the final output. However, a dedicated squarer has the advantage that the generation of many partial products can be avoided by eliminating redundant bits, resulting in a simpler circuit with less hardware, propagation delay and power consumption. Our work proposes an efficient algorithm using a literals minimisation technique to achieve squaring with improved performance with respect to area, delay and power. This technique compares favourably with recent work by offering lower gate delay, transistor count and area. The proposed optimisation algorithm has been verified using different Xilinx and Altera Field Programmable Gate Array device families. Simulation results show better performance of our technique than past work in respect of delay, power and area. Moreover, the proposed technique has been compared with the well-known Radix-4 Booth encoded squarer technique. Further, an application specific integrated circuit (ASIC) implementation has been performed and the performance parameters have been compared with the earlier work, which also establishes better results for our technique.

Journal ArticleDOI
TL;DR: An overview of the state-of-the-art in SPM management techniques in many-core processors is presented, some recent research on SPM-based systems are summarised, and future research directions in this field are outlined.
Abstract: Software Programmable Memories, or SPMs, are raw on-chip memories that are not implicitly managed by the processor hardware, but explicitly by software. For example, while caches fetch data from memories automatically and maintain coherence with other caches, SPMs explicitly manage data movement between memories and other SPMs through software instructions. SPMs make the design of on-chip memories simpler, more scalable, and power efficient, but also place additional burden for programming of SPM-based processors. Traditionally, SPMs have been utilised in embedded systems, especially multimedia and gaming systems, but recently research on SPM-based systems has seen increased interest as a means to solve the memory scaling challenges of many-core architectures. This study presents an overview of the state-of-the-art in SPM management techniques in many-core processors, summarises some recent research on SPM-based systems, and outlines future research directions in this field.

Journal ArticleDOI
TL;DR: Some logic circuits of universal modules are suggested to provide an easy way to design any synch-stratum for parallel synchronisation of system blocks with arbitrary interconnection graphs and for wave synchronisation with acyclic interconnection graph.
Abstract: The problem of organising the temporal behaviour of globally asynchronous systems consisting of parallel interacting blocks is discussed. System blocks are represented by the Moore state machine model. The earlier suggested GALA (Globally Asynchronous, Locally Arbitrary) design methodology is used. This methodology is based on decomposing the system to a Processors Stratum (stratum of blocks) and a Synchronisation Stratum (synch-stratum). The synch-stratum acts as a distributed asynchronous clock network that produces local synch-signals for the processor stratum, which basically can be a synchronous prototype. The synch-stratum is a self-timed circuit that interacts with the processor stratum (system devices) via the handshake protocol. Every local device that has received the request signal from the synch-stratum produces the acknowledgment signal and sends it back. In this study, some logic circuits of universal modules are suggested. They provide an easy way to design any synch-stratum for parallel synchronisation of system blocks with arbitrary interconnection graphs and for wave synchronisation of system blocks with acyclic interconnection graph.

Journal ArticleDOI
TL;DR: The exploration of design spaces from strong to weak inversions assisted the development of a unified noise factor model that helped in noise estimation in all regions of inversion and shows reasonably better performance in terms of noise and power consumption.
Abstract: A fully integrated, low power low-noise amplifier (LNA) is implemented for 2.14 GHz band using 65-nm radio frequency CMOS technology. By taking advantage of higher transition frequencies of recent technologies, transistors are biased in the moderate inversion region thus permitting scaling down the supply voltage to 0.7 V. Further, the exploration of design spaces from strong to weak inversions assisted the development of a unified noise factor model. An optimisation is carried out based on the parameter extraction and accordingly an extraction methodology is developed. Overall, the unified model based on the parameter extraction helped in noise estimation in all regions of inversion. The resulting LNA achieves a good power match at the input where the simulated S11 parameter shows an excellent value of −22 dB. Compared to other existing subthreshold cascode LNAs reported in the literature, it shows reasonably better performance in terms of noise and power consumption with a noise figure of 3.74 dB and a moderate power gain of 8.7 dB at a core device current consumption of 450 μA.

Journal ArticleDOI
TL;DR: A task model transformation strategy and an innovative best-fit transformation (BFT) placement algorithm are proposed for a non-rectangle task model to improve the performance of an RC system in rejection rate and total execution time.
Abstract: The task scheduling and placement problem is one of the most significant and time-consuming parts of a reconfigurable computing (RC) system. Many investigators have explored the subject, and most traditional studies concentrate on the rectangle task model, which is inconsistent with the objective task shape placed in a field programmable gate array (FPGA) but reduces system complexity. The rectangle task model produces inner fragments, which reduce the utilisation of reconfigurable resources in an FPGA. In this study, a task model transformation strategy and an innovative best-fit transformation (BFT) placement algorithm are proposed for a non-rectangle task model to improve the performance of an RC system in rejection rate and total execution time. According to simulation experiments, the BFT algorithm reduced the rejection rate by 15% and 7% compared with the first-fit algorithm and the best-fit algorithm, respectively. The multi-shape placement algorithm and the 3D compaction algorithm are also compared with the BFT algorithm. The results show that the BFT algorithm has a lower total execution time in short laxity periods and a lower rejection rate in large laxity periods. Compared with the 3D compaction algorithm, the proposed algorithm reduced the total execution time by up to 10.79%.

Journal ArticleDOI
TL;DR: Both analytical and finite-element based simulation results for parameters of the designed structure are found in good agreement and show a reduction in the amplitude of spurious mode by accentuating the filter structure in its fully differential mode inherently present in the structure.
Abstract: In this paper, two microelectromechanical systems based devices are designed using analytical and finite-element analysis. The first device is a mechanically coupled ring-resonator band-pass filter with a centre frequency of 4.4 MHz and a small bandwidth of only 36 kHz. Flexural-mode ring resonators have been mechanically coupled using a soft mechanical spring to realise the filtering action. Owing to the inherent symmetry of the ring structure, a simple approach is used to access low-velocity coupling locations to set the smallest possible bandwidth. The authors also show a reduction in the amplitude of the spurious mode by accentuating the filter structure in its fully differential mode, which is inherently present in the structure. Moreover, the effect of the number of support beams and of structural damping on the frequency response of the filter has been analysed. The second device is a mechanically coupled ring-resonator array with a varying number of coupled rings. Mechanical links using short stubs connect each constituent resonator of an array to its adjacent ones at the high-velocity vibrating locations to accentuate the desired mode and reject all other spurious modes. Analytical and finite-element based simulation results for the parameters of the designed structures are found to be in good agreement.

Journal ArticleDOI
TL;DR: An efficient task migration algorithm for mesh-based multi- and many-core chips that offers 36% better performance, 28% lower energy consumption, and 7% lower temperature in comparison with the previously proposed migration algorithms is proposed.
Abstract: This study proposes an efficient task migration algorithm for mesh-based multi- and many-core chips. The proposed algorithm collects tasks running on a rectangular-based set of cores, that is, the source sub-mesh, and moves the tasks to another rectangular-based set to remove chip temperature hotspots and to provide balanced load on the chip. The proposed migration algorithm uses the concept of gathering/scattering to minimise the traffic induced by the migration. In this regard, the proposed algorithm uses a selected node in each row of the source sub-mesh to gather the tasks of all cores in the same row. The gathering node is selected based on its location in the row and the traffic rate of the other cores in the row. Once the gathering nodes have been migrated to the destination sub-mesh, they scatter their tasks according to the same pattern among the cores in their rows. Simulations of the proposed migration algorithm are done with the Access Noxim simulator under a wide range of network conditions with application graphs of D263DECMP3DEC, DMPEG4, and DVOPD. Results obtained from simulations show that the proposed algorithm offers 36% better performance, 28% lower energy consumption, and 7% lower temperature in comparison with previously proposed migration algorithms.

Journal ArticleDOI
TL;DR: A mathematical model for an eight-parallel multimode multi-path delay commutator-based FFT/IFFT processor which is suitable for the IEEE 802.11ac compliant MU-MIMO-OFDM system is presented and the data reordering, scheduling methodologies and its architectures are proposed.
Abstract: IEEE 802.11ac is the recently ratified standard developed for the fifth generation wireless fidelity technology, in which the multi-user (MU) multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) technique is adopted for high data rate communication. In a MIMO-OFDM system, the forward/inverse fast Fourier transform (FFT/IFFT) processor is a key component. On proper reception, the reordering and scheduling of data is important for the optimal utilisation of butterfly resources in the pipelined FFT/IFFT processor. In this study, a mathematical model for an eight-parallel multimode (N = 512/256/128/64) multi-path delay commutator-based FFT/IFFT processor which is suitable for the IEEE 802.11ac compliant MU-MIMO-OFDM system is presented. In addition, data reordering and scheduling methodologies and their architectures are proposed for the pre- and post-FFT/IFFT processes. The design implementations are done using TSMC 65 nm complementary metal–oxide–semiconductor technology at 160 MHz. The power and area metrics with and without clock gating are compared. The clock gated implementation reports show that the power consumption is 17.44 mW for the pre-transformed data reordering and 11.64 mW for the post-transformed data reordering, with an area occupation of 0.7694 mm² and 0.5111 mm², respectively.

Journal ArticleDOI
TL;DR: The bit by bit parallel processing at the inputs, from MSB to LSB, and the simple architecture utilising a minimum number of gates make the proposed design more energy efficient when compared with the Kth max algorithm, the tree based maximum finder, the AB based maximum finder and the IQT architecture.
Abstract: A novel combinational digital device for finding the maximum magnitude among n input numbers is proposed. This maximum magnitude generator (MaxMG) generates the maximum magnitude as an output by applying a bit by bit approach to multiple multi-bit input values simultaneously. MaxMG generates the output from most significant bit (MSB) to least significant bit (LSB) in parallel, using a minimum number of gates across the bits of the multiple input values. The minimum magnitude generator is also derived by applying the dual function to the MaxMG. The proposed design is implemented using the Synopsys 90 nm generic library and the RTL is written in Verilog HDL. The performance of the proposed design is compared with a rank based Kth max selection algorithm, a parallel tree based maximum generator that utilises a comparator-multiplexer combination, an array-based maximum finder (AB) and an improved quad tree (IQT). The bit by bit parallel processing at the inputs, from MSB to LSB, and the simple architecture utilising a minimum number of gates make the proposed design more energy efficient when compared with the Kth max algorithm, the tree based maximum finder, the AB based maximum finder, and the IQT architecture.
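
The MSB-to-LSB idea can be illustrated in software with the classic bitwise maximum search: each output bit is the OR of that bit over the inputs still in contention, and inputs with a 0 where the output is 1 drop out. This is a behavioural sketch only, with an illustrative function name; the gate-level structure of MaxMG is not given in the abstract.

```python
def max_magnitude(values, width):
    """Return the maximum of unsigned `width`-bit values, one bit at a time."""
    candidates = [True] * len(values)  # inputs still able to be the maximum
    result = 0
    for bit in range(width - 1, -1, -1):  # MSB first
        bits = [(v >> bit) & 1 for v in values]
        # Output bit is 1 if any remaining candidate has a 1 here.
        if any(c and b for c, b in zip(candidates, bits)):
            result |= 1 << bit
            # Candidates with a 0 at this position can no longer be the maximum.
            candidates = [c and bool(b) for c, b in zip(candidates, bits)]
    return result
```

Swapping the roles of 0 and 1 (AND instead of OR, eliminating candidates with a 1) gives the corresponding minimum search, in line with the dual function mentioned in the abstract.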

Journal ArticleDOI
TL;DR: A new architecture of FMA is proposed to speed up the DFP processing, and the only digit-set conversion in the entire design is combined with the rounding operation to further reduce the critical path.
Abstract: Decimal floating-point (DFP) arithmetic has attracted attention in the applications of financial and commercial computing. However, the processing efficiency of DFP is still far behind that of binary designs. On the other hand, a floating-point fused multiply-add (FMA) function is widely used in many processors within functional iterations to implement division, square root, and many other functions due to the better accuracy achieved by a single rounding of continuous multiplication and addition. In this work, a new architecture of FMA is proposed to speed up the DFP processing. Compared with previous architectures, first, the proposed design applies a specific decimal redundant encoding system. The circuits to decide and shift the rounding position on a redundant result are therefore simplified. Second, the only digit-set conversion in the entire design is combined with the rounding operation to further reduce the critical path. Third, the techniques applied in different previous FMAs are merged in the proposed design. In addition, the multiplier and adder adopted from the previous designs are further optimised. Consequently, compared with the fastest previous design, the synthesis results show about 33.7% speed advantage and about 16.6% area advantage.
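The accuracy benefit of a single rounding, which the abstract gives as the reason FMA is widely used, can be seen with Python's standard `decimal` module. This is a software illustration only and says nothing about the proposed hardware; the helper names `mul_then_add` and `fused_mul_add`, the 4-digit precision and the operand values are assumptions chosen for the example.

```python
from decimal import Decimal, ROUND_HALF_EVEN, localcontext


def mul_then_add(a, b, c, prec):
    """Separate multiply and add: the product is rounded, then the sum."""
    with localcontext() as ctx:
        ctx.prec = prec
        ctx.rounding = ROUND_HALF_EVEN
        return a * b + c


def fused_mul_add(a, b, c, prec):
    """Fused multiply-add: exact product, one rounding of the final result."""
    with localcontext() as ctx:
        ctx.prec = prec
        ctx.rounding = ROUND_HALF_EVEN
        return a.fma(b, c)


if __name__ == "__main__":
    a, b, c = Decimal("1.234"), Decimal("5.678"), Decimal("-7.006")
    # The exact product is 7.006652; at 4 digits it rounds to 7.007.
    print(mul_then_add(a, b, c, 4))   # 0.001
    print(fused_mul_add(a, b, c, 4))  # 0.000652
```

With two roundings the low-order digits of the product are lost before the cancellation, so the result keeps one significant digit; the fused form keeps all three.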

Journal ArticleDOI
TL;DR: A low complexity and area efficient reconfigurable architecture for a multimode interleaver address generator supporting multiple wireless standards is proposed; synthesis results show a 60% reduction in resource utilisation and a 46% improvement in operating frequency over existing approaches.
Abstract: Developing a reconfigurable transceiver to support multiple protocols seamlessly and efficiently is an extremely tough task. Wireless standards such as wireless local area network (IEEE 802.11a/g) and WiMAX (IEEE 802.16e) incorporate a block interleaving technique to overcome the occurrence of burst errors during transmission. Field Programmable Gate Array (FPGA) implementation of the floor and modulus (MOD) functions that perform the two step permutation for attaining the new index is quite complex. In this study, the authors propose a low complexity and area efficient reconfigurable architecture for a multimode interleaver address generator to support multiple wireless standards. In addition, novel MOD_row and MOD_column circuits are proposed to compute the MOD function for row and column counter values, respectively. The proposed address generation circuitry supports BPSK, QPSK, 16-QAM and 64-QAM modulation schemes under all possible code rates. The reconfigurable address generator for various block sizes and modulation schemes is implemented on a Xilinx Spartan XC3S400 FPGA and the functionalities are verified through simulation. The synthesis results of the proposed design show a reduction of 60% in resource utilisation and an improvement of 46% in operating frequency over the existing approaches.
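For reference, the two step permutation mentioned in this abstract can be written directly with floor and modulus operations. The sketch below follows the commonly quoted IEEE 802.11a form of the interleaver equations and is a plain software model, not the MOD_row/MOD_column hardware proposed in the paper; the function name `interleave_indices` and the parameter names `n_cbps` (coded bits per OFDM symbol) and `n_bpsc` (coded bits per subcarrier) are chosen here for illustration.

```python
def interleave_indices(n_cbps, n_bpsc):
    """Return j[k], the post-interleaving index of input bit k.

    Assumed 802.11a-style definitions:
      i = (n_cbps / 16) * (k mod 16) + floor(k / 16)
      j = s * floor(i / s) + (i + n_cbps - floor(16 * i / n_cbps)) mod s
    with s = max(n_bpsc / 2, 1).
    """
    s = max(n_bpsc // 2, 1)
    out = []
    for k in range(n_cbps):
        # First permutation: spread adjacent bits across subcarriers.
        i = (n_cbps // 16) * (k % 16) + k // 16
        # Second permutation: rotate positions within each group of s bits.
        j = s * (i // s) + (i + n_cbps - (16 * i) // n_cbps) % s
        out.append(j)
    return out


if __name__ == "__main__":
    # (n_cbps, n_bpsc) for BPSK, QPSK, 16-QAM and 64-QAM in 802.11a.
    for n_cbps, n_bpsc in [(48, 1), (96, 2), (192, 4), (288, 6)]:
        idx = interleave_indices(n_cbps, n_bpsc)
        # Every mode must yield a permutation of 0..n_cbps-1.
        assert sorted(idx) == list(range(n_cbps))
    print(interleave_indices(48, 1)[:4])  # [0, 3, 6, 9]
```

The division and modulus by non-power-of-two constants in these expressions are what make a direct FPGA mapping costly, which is the motivation the abstract gives for dedicated address generation circuits.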

Journal ArticleDOI
TL;DR: This work proposes a new processor customisation method based on fixed-point word-length optimisation, which can reduce the number of necessary LUTs and flip-flops and improve the latency of the algorithm.
Abstract: Application-specific customisation of micro-processor architectures has been widely accepted as an effective way to improve the efficiency of processor-based designs. In this work, the authors propose a new processor customisation method based on fixed-point word-length optimisation. Accuracy-aware word-length optimisation (WLO) of fixed-point circuits is an active research area with a large body of literature. For the first time, this work introduces a method to combine the WLO with the processor customisation. The data type word-lengths, the size of register-files and the architecture of the functional units are the main targets to be optimised. The accuracy requirement, defined as the worst-case error bound, is the key constraint that must be met by any solution. A custom processor design environment, called PolyCuSP, is used to realise the processor architecture based on the solution found by the proposed optimisation algorithm. The results achieved by evaluating five benchmarks show that this method can reduce the number of necessary LUTs and flip-flops by an average of 11.9% and 5.1%, respectively. The latency is also improved by an average of 33.4%. Moreover, the method was further examined through a case study on a JPEG decoder. The results suggest 16.2% and 56.2% reduction in area consumption and latency, respectively.

Journal ArticleDOI
TL;DR: This work essentially does this task in parallel for five such sets of subregions of a given restricted sized chip in digital microfluidics using an array based partitioning pin assignment technique, where the cross contamination problem has been considered, and the efficiency of proper taxonomy of a given sample has also been improved.
Abstract: Digital microfluidic biochips are reforming many areas of biochemistry, biomedical sciences, as well as microelectronics. Such a chip is known as a lab-on-a-chip because it can substitute for laboratory experiments. Nowadays, for emergency purposes and to ensure cost efficacy, multiple assay operations must be carried out simultaneously. In this context, parallelism is of utmost importance in designing a biochip while the size of the chip is a constraint. Hence, the objective of this study is to enhance the performance of a chip in terms of its throughput, electrode utilisation, and pin count. Here, the authors have considered some of the most familiar assay requirements where a sample is to be analysed using different reagents, and identify some parameter(s) of the sample(s) under consideration. Moreover, sample preparation is a vital task in a digital microfluidic biochip; thus, dilution of different samples up to different concentrations using buffer (neutral) fluid is a crucial issue. In this design, the authors effectively perform this task in parallel in a number of sub-regions of a given restricted sized chip using an array based partitioning pin-assignment technique while taking care of the cross contamination problem. The design has been verified for some significant real life assay examples.

Journal ArticleDOI
TL;DR: This work proposes a scheme for embedding two distinct signatures separately in a reconfigurable scan architecture and verifying those without conflict from the packaged chip by using two distinct test modes of the reconfigured architecture: namely, scan tree mode and SS mode.
Abstract: Signature-based authentication is often used to authenticate hardware intellectual property (IP) when it is reused on a plug-and-play system-on-chip. A signature embedded in the functional/test component of a hardware IP can easily be verified as it can be generated and observed as functional/scan output of the hardware IP for a certain input key vector. An existing scan-based approach for embedding a signature inserts it through reordering of scan cells in a single scan (SS) chain. However, it is not applicable to the recent reconfigurable scan architectures having reduced test application time. We propose a scheme for embedding two distinct signatures separately in a reconfigurable scan architecture and verifying those without conflict from the packaged chip by using two distinct test modes of the reconfigurable architecture: namely, scan tree mode and SS mode. The two signatures may include one from the logic IP source and the other from the physical IP source. The overhead in both routing and power has been minimised in our scheme. Experimental results on design overhead and robustness for ISCAS89 benchmarks are very encouraging.

Journal ArticleDOI
TL;DR: This study describes a static test compaction procedure for transition faults in circuits with multiple scan chains where each scan chain can operate independently in functional or shift mode.
Abstract: This study describes a static test compaction procedure for transition faults in circuits with multiple scan chains where each scan chain can operate independently in functional or shift mode. The procedure mixes parts of different broadside and skewed-load tests, where the parts coincide with the scan chains, in order to create new tests that detect more faults. This allows the number of tests to be reduced without reducing the fault coverage. By mixing parts of tests with different types, different scan chains are assigned different modes of operation within the same test. Experimental results are presented to demonstrate that this allows the number of tests to be reduced below the number of tests in a compact test set that consists of broadside and skewed-load tests.

Journal ArticleDOI
TL;DR: Two congestion handling strategies aiming to capture the congestion in a few bits to avoid congested routes are proposed; they improve latency by 20 and 30%, respectively, and have less area and power overhead compared with the baseline table-based approach.
Abstract: The number of cores on a chip is increasing from a few cores to thousands. However, the communication mechanisms for these systems do not scale at the same pace, leading to certain challenges. One of them is on-chip congestion. There are many table-based approaches for congestion handling and avoidance, but these are not acceptable as they impose high area and power overheads. In this study, the authors propose two congestion handling strategies aiming to capture the congestion in a few bits to avoid congested routes. The first approach, called σ n LBDR (logic based distributed routing), captures congestion present at nodes n hops away from the current node, reducing area, power and overall packet latency. However, not all nodes in the network experience the same congestion level. For this, their second approach, weighted σ n LBDR, uses a different set of bits for each node and results in a further improvement in area and power. This study shows a comparison of both approaches with each other and also with other similar approaches. From their experimental results, they show that σ n LBDR and weighted σ n LBDR improve latency by 20 and 30%, respectively, and have less area and power overhead compared with the baseline table-based approach.

Journal ArticleDOI
TL;DR: A novel scheme to design inexact computing architectures that selectively protects memory regions based on their significance, i.e., their impact on the end-to-end quality of service, as dictated by the bio-signal application characteristics is proposed.
Abstract: This paper introduces an inexact, but ultra-low power, computing architecture devoted to the embedded analysis of bio-signals. The platform operates at extremely low voltage supply levels to minimize energy consumption. In this scenario, the reliability of SRAM memories cannot be guaranteed when using conventional 6-transistor implementations. While error correction codes and dedicated SRAM implementations can ensure correct operations in this near-threshold regime, they incur significant area and energy overheads, and should therefore be employed judiciously. Herein, we propose a novel scheme to design inexact computing architectures that selectively protects memory regions based on their significance, i.e., their impact on the end-to-end quality of service, as dictated by the bio-signal application characteristics. We illustrate our scheme on an industrial benchmark application performing the power spectrum analysis (PSA) of electrocardiograms. Experimental evidence showcases that a significance-based memory protection approach leads to a small degradation in the output quality with respect to an exact implementation, while resulting in substantial energy gains, both in the memory and the processing subsystem.