
Showing papers in "Iet Computers and Digital Techniques in 2012"


Journal ArticleDOI
TL;DR: The architecture and implementation of a field-programmable gate array (FPGA) accelerator for double-precision floating-point matrix multiplication employs the block matrix multiplication algorithm, which returns the result blocks to the host processor as soon as they are computed.
Abstract: This study treats the architecture and implementation of a field-programmable gate array (FPGA) accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm, which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering and simplifies placement and routing on the chip. The authors show that such an architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources, in order to allow for a large number of processing elements (PEs). Each PE uses eight Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area and high clock frequency. Finally, the authors quantify the performance of the accelerator implemented in a Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 Giga FLOPS (GFLOPS)), by comparing it to the double-precision matrix multiplication functions from the MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core2Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
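The block-partitioned scheme the abstract describes — the accelerator producing partial result blocks, the host doing only the accumulation — can be sketched as follows. This is a minimal behavioural model in Python, not the authors' RTL; `accelerator_block_product` is a hypothetical stand-in for the PE array, and the block size is assumed to divide the matrix dimension.

```python
def accelerator_block_product(A, B):
    # Stand-in for the FPGA PE array: dense product of two square blocks.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def host_blocked_matmul(A, B, bs):
    """Host-side sketch of the paper's scheme: block pairs are streamed to
    the 'accelerator', and each returned result block is accumulated by
    the host as soon as it comes back (assumes bs divides the dimension)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                blk = accelerator_block_product(
                    [row[k:k + bs] for row in A[i:i + bs]],
                    [row[j:j + bs] for row in B[k:k + bs]])
                for r in range(bs):
                    for c in range(bs):
                        C[i + r][j + c] += blk[r][c]  # host-side accumulation
    return C
```

Because every (i, j, k) block product is independent of the accumulation order, result blocks can be returned over a full-duplex link while the next block pair is being sent, which is the property the architecture exploits.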

69 citations


Journal ArticleDOI
TL;DR: A machine learning-based predictive model design space exploration (DSE) method for high-level synthesis (HLS) is presented, which creates a predictive model for a training set until a given error threshold is reached and continues with the exploration using the predictive model avoiding time-consuming synthesis and simulations of new configurations.
Abstract: A machine learning-based predictive model design space exploration (DSE) method for high-level synthesis (HLS) is presented. The method creates a predictive model for a training set until a given error threshold is reached and then continues with the exploration using the predictive model, avoiding time-consuming synthesis and simulation of new configurations. Results show that the authors' method is on average 192 times faster than a genetic-algorithm DSE method while generating comparable results, and it achieves better results when the DSE runtime is constrained. When compared with a previously developed simulated-annealer (SA)-based method, the proposed method is on average 209 times faster, again achieving comparable results.

64 citations


Journal ArticleDOI
TL;DR: This study proposes a heuristic method and a genetic-algorithm-based method for application-specific topology generation for network-on-chip architectures, both obtaining better results than previous methods with negligible area and link-length overheads.
Abstract: Network-on-chip (NoC) is an alternative approach to traditional communication methods for system-on-chip architectures. Irregular topologies are preferable for application-specific NoC designs as they offer a huge optimisation space in contrast to their regular counterparts. Generating an application-specific topology as part of the synthesis flow of a NoC architecture is a challenging problem, as there may be several topology alternatives, each of which may be superior to the others based on different objective criteria. In this study, the authors tackle this problem and propose a heuristic method and a genetic-algorithm-based method. The heuristic method, called TopGen, is a two-phase application-specific topology generation algorithm aiming to minimise the energy consumption of the system. TopGen first decomposes the given application into clusters based on the communication traffic. It then maps the clusters onto the routers and connects them in such a way that the communication cost of the network is minimised. The second algorithm, called the genetic-algorithm-based topology generation algorithm (GATGA), is based on a genetic algorithm, which initially creates a set of solutions and uses genetic operators to reproduce new topologies from them. The authors compared their algorithms with existing methods through several multimedia benchmarks and custom-generated graphs. TopGen and GATGA obtained better results than previous methods with negligible area and link length overheads.

33 citations


Journal ArticleDOI
TL;DR: A fast internal configuration access port (ICAP) controller, FaRM, provides high-speed configuration and easy-to-use readback capabilities, reducing partial reconfiguration overhead as much as possible.
Abstract: Partial reconfiguration suffers from low performance, and thus its use is limited when the reconfiguration overhead is too high compared to the task execution time. To overcome this issue, the authors present a fast internal configuration access port (ICAP) controller, FaRM, providing high-speed configuration and easy-to-use readback capabilities, reducing configuration overhead as much as possible. In order to enhance performance, FaRM uses techniques such as master accesses, ICAP overclocking, bitstream pre-loading into the controller and a bitstream compression technique, Offset-RLE, an improvement of the run-length encoding (RLE) algorithm. Combining these approaches achieves the ICAP theoretical throughput of 800 MB/s at 200 MHz. To complete the approach, the authors provide a system-level cost model for the reconfiguration overhead that can be used during the early stages of development. The authors tested their approach on an Advanced Encryption Standard (AES) encryption/decryption architecture.
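Since the abstract does not detail Offset-RLE itself, the sketch below shows only the plain word-level RLE baseline it improves on: runs of identical configuration words (partial bitstreams typically contain long runs of identical words) are collapsed into (word, count) pairs. Function names are illustrative, not from the paper.

```python
def rle_encode(words):
    """Plain word-level run-length encoding of a configuration bitstream.
    The paper's Offset-RLE variant improves on this baseline in ways the
    abstract does not specify, so it is not reproduced here."""
    out = []
    i = 0
    while i < len(words):
        run = 1
        while i + run < len(words) and words[i + run] == words[i]:
            run += 1
        out.append((words[i], run))  # (word value, run length)
        i += run
    return out

def rle_decode(pairs):
    # Inverse transform: expand each (word, count) pair back into a run.
    return [w for w, n in pairs for _ in range(n)]
```

Decompression of this form is a simple counter plus a register, which is why RLE-style schemes suit an on-chip configuration controller that must keep up with the ICAP write port.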

27 citations


Journal ArticleDOI
TL;DR: WBS and IEN are promoted as new design concepts for designers of computer arithmetic circuits and a modulo-(2^n + 1) multiplier is presented, where partial products are represented in WBS with IEN and it is shown that by using standard reduction cells, partial products can be reduced to two.
Abstract: Most common uses of negatively weighted bits (negabits), normally assuming arithmetic value −1 (0) for logical 1 (0) state, are as the most significant bit of 2's-complement numbers and the negative component in binary signed-digit (BSD) representation. More recently, weighted bit-set (WBS) encoding of generalised digit sets and the practice of inverted encoding of negabits (IEN) have allowed for easy handling of any equally weighted mix of negabits and ordinary bits (posibits) via standard arithmetic cells (e.g., half/full adders, compressors, and counters), which are highly optimised for a host of simple and composite figures of merit involving delay, power, and area, and are continually improving due to their wide applicability. In this paper, we aim to promote WBS and IEN as new design concepts for designers of computer arithmetic circuits. We provide a few relevant examples from previously designed logical circuits and redesigns of established circuits such as 2's-complement multipliers and modified Booth recoders. Furthermore, we present a modulo-(2^n + 1) multiplier, where partial products are represented in WBS with IEN. We show that by using standard reduction cells, partial products can be reduced to two. The result is then converted, in constant time, to BSD representation and, via a simple addition, to the final sum.

19 citations


Journal ArticleDOI
TL;DR: A distributed stochastic dynamic task mapping strategy for mapping applications efficiently onto a large dynamically reconfigurable NoC shows more than 26.4% improvement in application communication distance during steady state, which implies lower energy consumption and lower execution time.
Abstract: Dynamically reconfigurable platforms based on networks-on-chip (NoCs) could be a viable option for the deployment of large heterogeneous multicore designs. The dynamic nature of these platforms means that run-time application mapping and core management represent a key challenge, since the exact task requirements and workloads are not known a priori. Considering the Manhattan distance among tasks as a measure of efficiency for a mapped application, this study proposes a distributed stochastic dynamic task mapping strategy for mapping applications efficiently onto a large dynamically reconfigurable NoC. The effectiveness of the mapping scheme is investigated considering the transient and steady states of the dynamic platform. The comparison with state-of-the-art centralised dynamic task mapping methods shows more than 26.4% improvement in application communication distance during steady state, which implies lower energy consumption and lower execution time.
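The efficiency measure used above — total Manhattan distance between communicating tasks, weighted by traffic volume — can be written down directly. This is a minimal sketch with illustrative task names and traffic values; the paper's distributed stochastic mapping heuristic itself is not reproduced.

```python
def manhattan(a, b):
    # Hop distance between two mesh node coordinates (x, y).
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def comm_distance(mapping, traffic):
    """Communication distance of a mapped application: each communicating
    task pair contributes its traffic volume times the Manhattan hop count
    between the nodes its endpoints are mapped to."""
    return sum(vol * manhattan(mapping[src], mapping[dst])
               for (src, dst), vol in traffic.items())
```

A mapping strategy that lowers this sum places heavily communicating tasks on nearby nodes, which is what translates into the lower energy and execution time the abstract reports.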

16 citations


Journal ArticleDOI
TL;DR: This study describes power-elastic systems, a method for designing systems whose operations are limited by applicable power, and concurrency management is demonstrated as an effective means of implementing run-time control, through both theoretical and numerical investigations.
Abstract: This study describes power-elastic systems, a method for designing systems whose operations are limited by applicable power. Departing from the traditional low-power design approach which minimises the power consumption for given amounts of computation throughput, power-elastic design focuses on the maximally effective use of applicable power. Centred on a run-time feedback control architecture, power-elastic systems manage their computation loads according to applicable power constraints, effectively viewing quantities of power as resources to be distributed among units of computation. Concurrency management is demonstrated as an effective means of implementing such run-time control, through both theoretical and numerical investigations. Several concurrency management techniques are studied and the effectiveness of arbitration for dynamic concurrency management with minimal prior system knowledge is demonstrated. A new type of arbitration, called soft arbitration, particularly suitable for managing the access of flexible resources such as power, is developed and proved.

15 citations


Journal ArticleDOI
TL;DR: A topology customisation technique is presented, using which on-demand network interconnects are systematically established in reconfigurable hardware to reduce the area cost of conventional rigid and general purpose on-chip networks.
Abstract: Conventional rigid and general purpose on-chip networks occupy significant logic and wire resources in field-programmable gate arrays (FPGAs). To reduce the area cost, the authors present a topology customisation technique, using which on-demand network interconnects are systematically established in reconfigurable hardware. First, the authors present a design of a customised crossbar switch, where physical topologies are identical to logical topologies for a given application. A multiprocessor system combined with the presented custom crossbar has been designed with the ESPAM design environment and prototyped in an FPGA device. Experiments with practical applications show that the custom crossbar occupies significantly less area, maintains higher performance and reduces the power consumption, when compared with general-purpose crossbars. In addition, the authors show that configuration performance and cost can be improved by reducing the functional area cost in FPGAs. Second, a customisation technique for the circuit-switched network-on-chip (NoC) is presented, where only the necessary half-duplex interconnects are established for a given application mapping. The presented customised NoC is implemented in an FPGA, and results indicate that the area is reduced by 66%, when compared with general-purpose networks.

12 citations


Journal ArticleDOI
TL;DR: A high-resolution diagnostic framework for open defects is proposed and consists of a diagnostic test-pattern generation (DTPG) and its diagnosis flow, which deduces nearly one candidate for each open-segment defect on average among all ISCAS'85 benchmark circuits.
Abstract: When an open defect occurs in one wire segment of the circuit, different logic values on the coupling wires of the physical layout may result in different faulty behaviours, the so-called Byzantine effect. Many previous studies focus on the test and diagnosis of open defects, but pattern diagnosability has not been properly addressed. Therefore, in this study, a high-resolution diagnostic framework for open defects is proposed, consisting of diagnostic test-pattern generation (DTPG) and a diagnosis flow. The branch-and-bound search associated with controllability analysis is incorporated in satisfiability-based DTPG to generate patterns for the target segment. Later, a precise diagnosis flow constructs the list of defect candidates in a dictionary-based fashion, followed by an inject-and-evaluate analysis to greatly reduce the number of candidates for silicon inspection. Experimental results show that the proposed framework runs efficiently and deduces nearly one candidate for each open-segment defect on average among all ISCAS'85 benchmark circuits.

11 citations


Journal ArticleDOI
TL;DR: A fault tolerant router design with an adaptive routing algorithm that tolerates faults in the network links and the router components is proposed and can tolerate multiple failures and prove robustness and fault tolerance with negligible impact on the performance.
Abstract: Network-on-chip (NoC) systems have been proposed to achieve high-performance computing where multiple processors are integrated into one chip. As the number of cores increases and the chips are scaled in deep submicron technology, NoC systems become subject to physical manufacturing defects and run-time vulnerabilities, which result in faults. The faults affect the performance and functionality of NoC systems and result in communication malfunctions. In this study, a fault tolerant router design with an adaptive routing algorithm that tolerates faults in the network links and the router components is proposed. The approach does not require the use of virtual channels and assures deadlock freedom. Furthermore, the experimental results show that the proposed architecture can tolerate multiple failures and prove robustness and fault tolerance with negligible impact on the performance.

11 citations


Journal ArticleDOI
TL;DR: This study presents a particle swarm optimisation (PSO)-based approach to optimise node count and path length of the binary decision diagram (BDD) representation of Boolean function.
Abstract: This study presents a particle swarm optimisation (PSO)-based approach to optimise node count and path length of the binary decision diagram (BDD) representation of Boolean function. The optimisation is achieved by identifying a good ordering of the input variables of the function. This affects the structure of the resulting BDD. Both node count and longest path length of the shared BDDs using the identified input ordering are found to be much superior to the existing results. The improvements are more prominent for larger benchmarks. The PSO parameters have been tuned suitably to explore a large search space within a reasonable computation time.

Journal ArticleDOI
TL;DR: The IVPP is implemented on a Xilinx Virtex-5 FPGA using high-level synthesis and can be used to realise and test complex algorithms for real-time image and video processing applications.
Abstract: In this study, an image and video processing platform (IVPP) based on field-programmable gate arrays (FPGAs) is presented. This hardware/software co-design platform has been implemented on a Xilinx Virtex-5 FPGA using high-level synthesis and can be used to realise and test complex algorithms for real-time image and video processing applications. The video interface blocks are implemented at the register-transfer level (RTL) and can be configured using the MicroBlaze processor, allowing the support of multiple video resolutions. The IVPP provides the required logic to easily plug in the generated processing blocks without modifying the front-end (capturing video data) and the back-end (displaying processed output data). The IVPP can be a complete hardware solution for a broad range of real-time image/video processing applications including video encoding/decoding, surveillance, detection and recognition.

Journal ArticleDOI
TL;DR: A massively parallel scheme aiming at performing all IMMs concurrently is proposed, based on the m -ary exponentiation method, which groups the exponent bits into partition so that the number of required MMs is reduced, provided that some common modular powers are pre-computed and stored for future repeated use.
Abstract: Most cryptographic systems are based on modular exponentiation (ME), which is performed using successive modular multiplications (MMs). There are thus several ways to improve the throughput of a cryptographic system implementation: one is reducing the number of required MMs, another is reducing the time spent performing a single MM, and a third consists of executing the required independent modular multiplications (IMMs) in parallel. With the purpose of further accelerating the computation of ME, the authors investigate the impact of these three strategies. First, they propose a massively parallel scheme aiming at performing all IMMs concurrently. The scheme is based on the m-ary exponentiation method, which groups the exponent bits into partitions so that the number of required MMs is reduced, provided that some common modular powers are pre-computed and stored for future repeated use. Finally, two different implementations of the MM are used: one sequential and the other systolic. The investigation culminates in a comparison of the speedups yielded against the extra costs incurred for seven different implementations: one software based and the other six hardware based.
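The m-ary grouping of exponent bits described above can be sketched in software. This is the textbook left-to-right 2^h-ary method under the usual assumptions (a precomputed power table, h-bit exponent digits), not the authors' parallel hardware scheme; the function name and default h are illustrative.

```python
def m_ary_modexp(base, exp, mod, h=4):
    """m-ary modular exponentiation with m = 2**h: precompute
    base^0 .. base^(m-1), then scan the exponent h bits at a time.
    The precomputed table trades storage for fewer modular
    multiplications, as the abstract describes."""
    m = 1 << h
    table = [1] * m                      # common modular powers, stored once
    for i in range(1, m):
        table[i] = (table[i - 1] * base) % mod
    digits = []                          # exponent split into h-bit partitions
    e = exp
    while e:
        digits.append(e & (m - 1))
        e >>= h
    result = 1
    for d in reversed(digits):           # most significant digit first
        for _ in range(h):
            result = (result * result) % mod   # h squarings per digit
        if d:
            result = (result * table[d]) % mod # one multiply per nonzero digit
    return result
```

Per digit this costs h squarings plus at most one table multiply, versus up to one multiply per bit for plain binary exponentiation — the reduction in MM count that the parallel scheme then exploits.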

Journal ArticleDOI
TL;DR: Results indicate that the proposed general SUT-RNS multiplier for the moduli set {2^n−1, 2^n, 2^n+1} is a fast fault-tolerant multiplier which outperforms existing RRNS multipliers in area, power and energy per operation.
Abstract: A residue number system (RNS) which utilises redundant encoding for the residues is called a redundant residue number system (RRNS). It can accelerate multiplication, which is a high-latency operation. Stored-unibit-transfer (SUT) redundant encoding in RRNS, called SUT-RNS, has been shown to be an efficient number system for arithmetic operations. Radix-2^h SUT-RNS multiplication has been proposed in previous studies for modulo 2^n−1, but it has not been generalised for arbitrary modulus length (n) and radix (r = 2^h). Also, SUT-RNS multiplication for modulo 2^n+1 has not been discussed. In this study the authors remove these weaknesses by proposing general radix-2^h SUT-RNS multiplication for the moduli set {2^n−1, 2^n, 2^n+1}. Moreover, the authors demonstrate that this approach enables a unified design for the moduli-set multipliers, which results in fault-tolerant SUT-RNS multipliers with low hardware redundancy. Results indicate that the proposed general SUT-RNS multiplier for the moduli set {2^n−1, 2^n, 2^n+1} is a fast fault-tolerant multiplier which outperforms existing RRNS multipliers in area, power and energy per operation.
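The arithmetic role of the moduli set is easy to demonstrate in software: forward-convert both operands, multiply independently in each residue channel, and reconstruct via the Chinese remainder theorem. This sketch deliberately omits the SUT redundant digit encoding that is the paper's actual contribution; it only illustrates why the {2^n−1, 2^n, 2^n+1} channels can be multiplied independently (it assumes Python 3.8+ for the modular-inverse form of `pow`).

```python
def rns_mul(x, y, n):
    """Multiply via the moduli set {2^n-1, 2^n, 2^n+1}: convert to residues,
    multiply channel-wise (the channels never interact, which is what makes
    RNS multiplication fast in hardware), then reconstruct with the CRT.
    Result is correct modulo M = (2^n-1) * 2^n * (2^n+1)."""
    moduli = [(1 << n) - 1, 1 << n, (1 << n) + 1]
    xr = [x % m for m in moduli]
    yr = [y % m for m in moduli]
    zr = [(a * b) % m for a, b, m in zip(xr, yr, moduli)]
    # CRT reconstruction; pow(Mi, -1, m) needs Python 3.8+.
    M = 1
    for m in moduli:
        M *= m
    z = 0
    for r, m in zip(zr, moduli):
        Mi = M // m
        z += r * Mi * pow(Mi, -1, m)
    return z % M
```

The three moduli are pairwise coprime (2^n is even, its neighbours are odd and differ by 2), so the CRT reconstruction is unique; in hardware each channel's reduction is a cheap end-around-carry operation.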

Journal ArticleDOI
TL;DR: In this study, asymmetric non-pipelined large size unsigned and signed multipliers are implemented using symmetric and asymmetric embedded multipliers, look-up tables and dedicated adders in field programmable gate arrays (FPGAs).
Abstract: In this study, asymmetric non-pipelined large size unsigned and signed multipliers are implemented using symmetric and asymmetric embedded multipliers, look-up tables and dedicated adders in field programmable gate arrays (FPGAs). Decompositions of the operands are performed for the efficient use of the embedded blocks. Partial products are organised in various configurations, and the additions of the products are realised in an optimised manner. The additions used in the implementation of the multiplication include compressor-based, Delay-Table and Ternary-adder-based approaches. These approaches have led to the minimisation of the total critical path delay with reduced utilisation of FPGA resources. The asymmetric multipliers were implemented in Xilinx FPGAs using 18×18-bit and 25×18-bit embedded signed multipliers. Implementation results demonstrate an improvement of up to 32× in delay and up to 37× in the number of embedded blocks compared with the performance of designs generated by commercial synthesis tools.
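The operand-decomposition step can be illustrated behaviourally: split each (unsigned) operand into limbs matching the embedded multiplier widths (18×18 or 25×18 on the Xilinx devices mentioned), multiply limb pairs, and sum the shifted partial products. This is only a sketch of the decomposition; the paper's compressor-based, Delay-Table and ternary-adder partial-product reduction is not modelled, and the function name is illustrative.

```python
def decomposed_mul(a, b, wa=18, wb=18):
    """Large unsigned multiplication by operand decomposition: limbs of
    width wa/wb stand in for the embedded-multiplier inputs, and the
    shifted partial-product sum stands in for the adder tree."""
    def limbs(x, w):
        out = []
        while x:
            out.append(x & ((1 << w) - 1))  # one embedded-multiplier input
            x >>= w
        return out or [0]
    total = 0
    for i, la in enumerate(limbs(a, wa)):
        for j, lb in enumerate(limbs(b, wb)):
            # Each la*lb fits one embedded multiplier; the shift places it
            # at the correct weight in the final product.
            total += (la * lb) << (i * wa + j * wb)
    return total
```

Choosing asymmetric limb widths (e.g. wa=25, wb=18) changes how many embedded blocks are consumed and how the partial products align — the trade-off the paper's configurations explore.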

Journal ArticleDOI
TL;DR: A security metric is introduced, which is based on the common selection function that is widely used in differential power analysis (DPA) attacks and a correlation measure similar to the one used in correlation power analysis [CPA] attacks.
Abstract: A new design flow for security is presented. Cryptographic circuit specifications are first refined and then mapped to a secure power-balanced library consisting of novel mixed 1-of-2 and 1-of-4 components based on N-nary logic. Logic optimisation tools are then applied to generate secure synchronous circuits for layout generation. The circuits generated are more efficient than balanced circuits generated by alternative techniques. A new method is presented for evaluating the security of such circuits. A security metric is introduced, which is based on the common selection function that is widely used in differential power analysis (DPA) attacks and a correlation measure similar to the one used in correlation power analysis (CPA) attacks. The metric enables the construction of a library of robust cryptographic components, including S-boxes, that are more resistant to attack.

Journal ArticleDOI
TL;DR: The experimental results show that the automatic design space exploration (DSE) provides significantly better configurations than the previous manual DSE approach, considering the proposed multi-objective approach.
Abstract: This work extends an earlier manual design space exploration of our Selective Load Value Prediction-based superscalar architecture to the L2 unified cache. After that, we perform an automatic design space exploration using a specially developed software tool, varying several architectural parameters. Our goal is to find optimal configurations in terms of CPI (cycles per instruction) and energy consumption. Varying 19 architectural parameters, as we propose, yields a design space of over 2.5 million billion configurations, which obviously means that only heuristic search can be considered. Therefore, we propose different methods of automatic design space exploration based on our FADSE tool, which allow us to evaluate only 2500 configurations of this huge design space. The experimental results show that our automatic design space exploration (DSE) provides significantly better configurations than our previous manual DSE approach, considering the proposed multi-objective approach.

Journal ArticleDOI
TL;DR: This work benefits from the advantages of non-contiguous processor allocation mechanisms, by allowing the tasks of the input application mapped onto disjoint regions (submeshes) and then virtually connecting them by bypassing the router pipeline stages of the inter-region routers.
Abstract: In this study, the authors propose a processor allocation mechanism for run-time assignment of a set of communicating tasks of input applications onto the processing nodes of a chip multiprocessor, when the arrival order and execution lifetime of the input applications are not known a priori. This mechanism targets the on-chip communication and aims to reduce the power and latency of the network-on-chip employed as the communication infrastructure. In this work, the authors benefit from the advantages of non-contiguous processor allocation mechanisms by allowing the tasks of the input application to be mapped onto disjoint regions (submeshes) and then virtually connecting them by bypassing the router pipeline stages of the inter-region routers. Among the different existing contiguous and non-contiguous processor allocation techniques, the authors have chosen and implemented four efficient schemes for comparison purposes: the best-fit and stack-based allocation algorithms as contiguous techniques, and the greedy-available-busy-list algorithm and run-time incremental mapping as non-contiguous techniques. Experimental results show considerable improvements over all selected contiguous and non-contiguous methods.

Journal ArticleDOI
TL;DR: In this algorithm, the operations required in several contiguous iterations of a previously reported algorithm based on the extended Euclid's algorithm are represented as a matrix and performed at once through the matrix by means of a polynomial multiply instruction on GF(2).
Abstract: The authors propose a fast inversion algorithm in the Galois field GF(2^m). In this algorithm, the operations required in several contiguous iterations of a previously reported algorithm based on the extended Euclidean algorithm are represented as a matrix. These operations are performed at once through the matrix by means of a polynomial multiply instruction on GF(2). When the word size of a processor is 32 or 64 and m is larger than 233 for National Institute of Standards and Technology (NIST)-recommended irreducible polynomials, the proposed algorithm computes inversion with fewer polynomial multiply instructions on GF(2) and exclusive-OR instructions, on average, than previously reported inversion algorithms require.
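The recurrence that the matrix formulation batches together is the plain extended Euclidean algorithm on GF(2) polynomials (bit i of an integer holds the coefficient of x^i). The sketch below shows that underlying iteration one step at a time, rather than the paper's matrix-batched word-level version; `gf2_mul` plays the role of the carry-less polynomial multiply instruction, and all function names are illustrative.

```python
def gf2_deg(a):
    # Degree of a GF(2) polynomial stored as an integer bit mask (-1 for 0).
    return a.bit_length() - 1

def gf2_mul(a, b):
    # Carry-less polynomial multiplication over GF(2) -- the operation a
    # PCLMUL-style processor instruction performs in one step.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf2_divmod(a, b):
    # Polynomial long division over GF(2): returns (quotient, remainder).
    q, db = 0, gf2_deg(b)
    while a and gf2_deg(a) >= db:
        shift = gf2_deg(a) - db
        q ^= 1 << shift
        a ^= b << shift
    return q, a

def gf2m_inverse(a, poly):
    """Inverse of nonzero a in GF(2^m) defined by the irreducible 'poly',
    via the extended Euclidean algorithm on GF(2) polynomials. The paper
    folds several of these iterations into one matrix applied with a
    single polynomial multiply; this form shows the step-by-step recurrence."""
    r0, r1 = poly, a
    s0, s1 = 0, 1
    while r1 != 1:                 # gcd(a, poly) = 1 since poly is irreducible
        q, r = gf2_divmod(r0, r1)
        r0, r1 = r1, r
        s0, s1 = s1, s0 ^ gf2_mul(q, s1)
    return s1
```

For example, in the AES field GF(2^8) with poly = 0x11B, the inverse of 0x02 is 0x8D.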

Journal ArticleDOI
TL;DR: This study describes a scheme for the random injection of single event transients/upsets to evaluate the viability of employing a COTS field-programmable gate array for an onboard, low-complexity, remote-sensing image data compressor.
Abstract: The successful use of commercial-off-the-shelf (COTS) devices on board space applications requires the use of fault mitigation methods because of the effects of space radiation in microelectronics devices. This study describes a scheme for the random injection of single event transients/upsets to evaluate the viability of employing COTS field programmable gate array for an onboard, low-complexity, remote-sensing image data compressor. The fault injection features are added to the application to be tested by modifying its hardware description language source code. Then the tests are executed by simulation, with or without the inclusion of fault mitigation methods, so that comparative evaluations can be quickly obtained. The evaluation results (robustness enhancement against area) of different fault mitigation methods are presented, with good estimates of the behaviour of the hardware implementation of the application in a space radiation environment.

Journal ArticleDOI
TL;DR: Adaptiveness of the proposed response compactor enhances the observability of scan cells cost-effectively, where observability enhancements can be tailored in a fault model-dependent or -independent manner, in either way improving test quality and/or test costs.
Abstract: Scan architectures with compression support have remedied the test time and data volume problems of today's sizable designs. On-chip compression of responses enables the transmission of reduced-volume signature information to the ATE, delivering test data volume savings, while it engenders the challenge of retaining test quality. In particular, unknown bits (x's) in responses corrupt other response bits upon being compacted altogether, masking their observation, and hence preventing the manifestation of the fault effects they possess. In this work, we propose the design and utilisation of a response compactor that can adapt to the varying density of x's in responses. In the proposed design, the fan-out of scan chains to XOR trees within the compactor can be adjusted per pattern/slice so as to minimise the corruption impact of x's. A theoretical framework is developed to guide the cost-effective synthesis of a multi-modal compactor that can deliver x-mitigation capabilities in every mode it operates. Adaptiveness of the proposed response compactor enhances the observability of scan cells cost-effectively, where observability enhancements can be tailored in a fault model-dependent or -independent manner, in either way improving test quality and/or test costs.

Journal ArticleDOI
Irith Pomeranz1
TL;DR: Under certain conditions it is possible to apply to the logic block functional broadside tests that were generated for it as a stand-alone circuit in order to maximise the fault coverage without overtesting, and reduce the computational complexity of test generation.
Abstract: When a logic block is embedded in a larger design, the input sequences applicable to it may be constrained by other logic blocks in the design. This has an impact on what would constitute overtesting of the logic block by scan-based tests. This study defines functional broadside tests that avoid overtesting for an embedded block based on functional broadside tests for the larger design. The definition is constructive and results in a procedure for generating the tests. This study compares these tests with ones generated for the logic block as a stand-alone circuit. The results demonstrate that it is important to consider in the discussion of overtesting the extent to which the functionality of an embedded logic block is utilised as a part of the design. Under certain conditions it is possible to apply to the logic block functional broadside tests that were generated for it as a stand-alone circuit in order to maximise the fault coverage without overtesting, and reduce the computational complexity of test generation.

Journal ArticleDOI
TL;DR: This study shows that the set of reachable states for a circuit with hardware reset contains the set of reachable states based on a synchronising sequence, and that different reset states differ in their sets of reachable states and sets of detectable faults.
Abstract: Functional broadside tests were defined to avoid overtesting that may occur under scan-based tests because of non-functional operation conditions created by unreachable scan-in states. Functional broadside tests were computed assuming that functional operation starts after the circuit is initialised by applying a synchronising sequence. This study discusses the definition of functional broadside tests for the case where hardware reset is used for bringing the circuit into a known state before functional operation starts. This study shows that the set of reachable states for a circuit with hardware reset contains the set of reachable states based on a synchronising sequence. Consequently, the set of functional broadside tests and the set of detectable faults for a circuit with hardware reset contain those obtained based on a synchronising sequence. In addition, there are differences between different reset states in the sets of reachable states and the sets of detectable faults. This study also discusses the case where hardware reset is provided only for a subset of the state variables (referred to as partial reset).

Journal ArticleDOI
TL;DR: Variable stages pipeline (VSP) architecture, which reduces energy consumption and improves execution time by dynamically unifying the pipeline stages, is proposed to achieve high-performance computing with low-energy consumption.
Abstract: Enhancement of mobile computers requires high-performance computing with low-energy consumption. Variable stages pipeline (VSP) architecture, which reduces energy consumption and improves execution time by dynamically unifying the pipeline stages, is proposed to meet this requirement. A VSP processor uses a special pipeline register called a latch D-flip-flop selector-cell (LDS-cell) that unifies the pipeline stages and prevents glitch propagation caused by stage unification under low-energy mode. The design of the fabricated VLSI of a VSP processor chip on 0.18 µm CMOS technology is presented. An evaluation shows that the VSP processor consumes 13% less energy than a conventional one.

Journal ArticleDOI
TL;DR: The delay estimation results of the proposed architecture show that it offers significantly lower latency than recently published high-performance decimal CORDIC implementations.
Abstract: This study presents the algorithm and architecture of the decimal floating-point (DFP) antilogarithmic converter, based on the digit-recurrence algorithm with selection by rounding. The proposed approach can compute faithful DFP antilogarithmic results for any one of the three DFP formats specified in the IEEE 754-2008 standard. The proposed architecture is synthesised with an STM 90-nm standard cell library and the results show that the critical path delay and the number of clock cycles of the proposed Decimal64 antilogarithmic converter are 1.26 ns (28.0 FO4) and 19, respectively, and the total hardware complexity is 29325 NAND2 gates. The delay estimation results show that the proposed architecture offers significantly lower latency than recently published high-performance decimal CORDIC implementations.
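The digit-recurrence idea can be sketched in software. This is a hedged illustration, not the paper's hardware algorithm: it uses binary floats rather than IEEE 754-2008 decimal significands, and early digit selections may fall outside a signed decimal digit, whereas the hardware keeps each e_j bounded via proper scaling and range reduction. At step j a digit e_j is selected by rounding the scaled residual, and 10**x is accumulated as a product of factors (1 + e_j * 10**-j).

```python
import math

def decimal_antilog(x, steps=10):
    """Software sketch of a digit-recurrence antilogarithm (10**x)
    with selection by rounding.  Illustrative only (see lead-in)."""
    assert 0.0 <= x < 1.0
    ln10 = math.log(10.0)
    remaining = x      # residual exponent still to be realised
    product = 1.0      # running product of the factors (1 + e_j * 10**-j)
    for j in range(1, steps + 1):
        # selection by rounding: log10(1 + e*10**-j) ~= e * 10**-j / ln(10)
        e = round(remaining * ln10 * 10 ** j)
        factor = 1.0 + e * 10.0 ** (-j)
        remaining -= math.log10(factor)   # account for the chosen factor
        product *= factor
    return product
```

Each iteration shrinks the residual, so roughly one decimal digit of accuracy is gained per step once the residual is small.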

Journal ArticleDOI
TL;DR: The analysis, implementation and simulation indicate that the hardwired networks perform significantly better than soft networks.
Abstract: It is well known that any logical functionality can be implemented using the reconfigurability in field-programmable gate arrays (FPGAs). However, the reconfigurability is traded against reduced functional performance, increased cost and increased configuration overheads. Hardwiring the interconnect fabric is gaining notice as an alternative solution to these problems. In this article, the authors first show that hardwired built-in crossbars can improve the performance of inter-processor communication. They conduct an analysis of functional performance, cost and configuration cost for soft and hard crossbar (SBAR and HBAR) interconnects, applying a queuing model to compare the soft and hard interconnects. A motion JPEG (MJPEG) case study suggests that the HBAR achieves significantly better throughput at lower cost than the SBAR. Second, the authors demonstrate the effectiveness of hardwired network-on-chip (NoC) in FPGAs. Considering the AEthereal NoC, an analysis is conducted to compare hard and soft NoCs. The analysis, implementation and simulation all indicate that hardwired networks perform significantly better than soft networks.

Journal ArticleDOI
TL;DR: This study shows a performance evaluation of the GAP architecture with different array dimensions as well as its performance using a simplified interconnection network and shows a maximum performance drawback of 10% for only a particular configuration and a single benchmark.
Abstract: In the billion transistor era only a few architectural approaches propose new paths to improve the execution of conventional sequential instruction streams. Many legacy applications could profit from processors that are able to speed up the execution of sequential applications beyond the performance of current superscalar processors. The Grid arithmetic logic unit (ALU) Processor (GAP) accelerates conventional sequential instruction streams without the need for recompilation. The GAP comprises a processor front-end similar to that of a superscalar processor, extended by a configuration unit and a two-dimensional array of functional units that forms the execution unit. Instruction sequences are mapped dynamically into the array by the configuration unit so that they form the dataflow graph of the sequence. This study shows a performance evaluation of the GAP architecture with different array dimensions as well as its performance using a simplified interconnection network. GAP outperforms an out-of-order superscalar processor by up to a factor of 2 with a complete crossbar interconnect between two array rows. Reducing the interconnection network to the minimum shows a maximum performance drawback of 10%, and only for one particular configuration and a single benchmark. In general, the slowdown is less than 2% for the minimum interconnect (two buses) and about 0.02% if three interconnection buses are used.

Journal ArticleDOI
TL;DR: A novel, arithmetic module-based BIST architecture for two-pattern testing (ABAS) is presented that exercises arithmetic modules to generate two-pattern tests; the hardware overhead required is by far the lowest of all schemes that have been presented for the same purpose in the open literature.
Abstract: Built-in self-test (BIST) techniques use test pattern-generation and response-verification operations, reducing the need for external testing. BIST techniques that use arithmetic modules existing in the circuit (accumulators, counters etc.) to perform the test-generation and response-verification operations have been proposed in the open literature. Two-pattern tests are exercised to detect complementary metal oxide semiconductor (CMOS) stuck-open faults and to assure correct temporal circuit operation at clock speed (delay fault testing). In this study, a novel, arithmetic module-based BIST architecture for two-pattern testing (ABAS) is presented that exercises arithmetic modules to generate two-pattern tests; provided such modules are available, the hardware overhead required by the presented scheme is by far the lowest of all schemes presented for the same purpose in the open literature.
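The accumulator-based idea can be sketched in a few lines of Python. This is a hedged illustration: the width, increment and pair ordering are invented here, and the paper's actual selection of the additive constant is not reproduced. An n-bit accumulator that repeatedly adds an odd constant steps through all 2**n states, and each pair of consecutive states serves as a two-pattern test: the first vector initialises the circuit, the second launches the transition.

```python
def accumulator_two_pattern_tests(width, increment, count):
    """Sketch of accumulator-driven two-pattern test generation.

    An n-bit accumulator repeatedly adding an odd constant cycles
    through all 2**n states; each consecutive state pair (V1, V2)
    is applied as a two-pattern test.
    """
    assert increment % 2 == 1, "odd increment gives a full-period sequence"
    mask = (1 << width) - 1
    acc = 0
    pairs = []
    for _ in range(count):
        nxt = (acc + increment) & mask
        pairs.append((acc, nxt))   # (initialisation vector, launch vector)
        acc = nxt
    return pairs
```

For example, a 4-bit accumulator with increment 5 produces 16 distinct initialisation vectors before the sequence repeats, so the pattern pairs cover every accumulator state without any dedicated test-pattern generator.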

Journal ArticleDOI
TL;DR: A protocol is proposed to increase the throughput of internal cores in the latency-insensitive systems when there are several repetitive structures and shows area reduction for the majority of simulated systems.
Abstract: Latency-insensitive design (LID) is a correct-by-construction methodology for system-on-chip design that prevents multiple iterations in synchronous system design. However, one problem in LID is system throughput reduction. In this study, a protocol is proposed to increase the throughput of internal cores in latency-insensitive systems when there are several repetitive structures. The protocol is validated by checking latency equivalence on various system graphs. A shell wrapper to implement the protocol is described and the superimposed logic gates for the shell wrapper are formulated. Simulation is performed for 12 randomly generated systems and four actual systems. The simulation results confirm the protocol's accuracy and show 57% throughput improvement on average compared with the scheduling-based methodology. The protocol also shows area reduction for the majority of simulated systems.

Journal ArticleDOI
TL;DR: The authors describe the problems/solutions of supporting the semantics of recursion (single/multiple, direct/arbitrarily indirect) in synthesis.
Abstract: Behavioural synthesis is the process of automatically translating an abstract specification to a physical realisation - silicon. The endpoints of this process are accelerating apart (behavioural descriptions become more abstract, DSM silicon becomes less willing to behave as Boolean circuits), but there is still work outstanding in the middle ground. Recursion allows the elegant expression of complicated systems, and is supported by many languages (software and hardware). The electronic design automation (EDA) tool designers' task is to support the semantics of a language (both simulation and synthesis). Although recursive descriptions can always be re-cast into non-recursive iterative forms, if a language supports a construct, a user should be able to utilise it (the authors offer no opinion on the relative wisdom of using recursion or iteration). The authors describe the problems/solutions of supporting the semantics of recursion (single/multiple, direct/arbitrarily indirect) in synthesis. The hardware synthesised can be smaller and faster than that obtained by reformulating the description. It is dangerous to conclude too much from this - recursion requires a stack and a heap (plus managers). In software, these are taken for granted ('free' resources that do not feature in footprint metrics); in hardware, every resource needed must be explicitly created.
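The trade-off the abstract alludes to can be made concrete in software. The following is a hedged Python sketch (Fibonacci is an invented example, not one from the paper): the recursive form is the elegant description a designer would write, while the re-cast form makes explicit the stack that a synthesis tool must materialise in hardware, together with its management logic.

```python
def fib_recursive(n):
    """Multiple direct recursion, as a designer might write it."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_explicit_stack(n):
    """The same computation with the call stack made explicit,
    mirroring the storage (stack plus manager) that a synthesis
    tool must create when supporting recursion in hardware."""
    stack = [n]      # pending call arguments
    total = 0        # accumulated result of finished base cases
    while stack:
        k = stack.pop()
        if k < 2:
            total += k
        else:
            stack.append(k - 1)   # the two recursive calls become
            stack.append(k - 2)   # two pending stack entries
    return total
```

In software the list `stack` is a 'free' resource; in hardware the equivalent memory, its pointer logic and overflow handling must all be created explicitly, which is why synthesised recursion carries costs that do not show up in software footprint metrics.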