
Showing papers in "IET Computers and Digital Techniques" in 2011


Journal ArticleDOI
TL;DR: This study presents a flow linking a design-time design space explorer coupled with platform simulators at two abstraction levels, with a fast and lightweight priority-based heuristic integrated in the run-time manager to select near-optimal application configurations.
Abstract: Nowadays, owing to unpredictable changes of the environment and workload variation, optimally running multiple applications in terms of quality, performance and power consumption on embedded multi-core platforms is a huge challenge. A lightweight run-time manager, linked with an automated design-time exploration and incorporated in the host processor of the platform, is required to dynamically and efficiently configure the applications according to the available platform resources (e.g. processing elements, memories, communication bandwidth), for minimising the cost (e.g. power consumption), while satisfying the constraints (e.g. deadlines). This study presents a flow linking a design-time design space explorer, coupled with platform simulators at two abstraction levels, with a fast and lightweight priority-based heuristic integrated in the run-time manager to select near-optimal application configurations. To illustrate its feasibility and the very low complexity of the run-time selection, the proposed flow is used to manage the processors and clock frequencies of a multiple-stream MPEG4 encoder chip dedicated to automotive cognitive safety applications.

53 citations
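
As a rough illustration of how such a run-time manager can combine design-time exploration results with a lightweight selection heuristic, the sketch below greedily gives each application the cheapest design-time operating point that meets its deadline, under a shared processor budget. The data structures, the tightest-deadline-first priority rule and all numbers are hypothetical stand-ins, not the paper's actual heuristic.

```python
# Hypothetical sketch of priority-based run-time configuration selection.
from dataclasses import dataclass

@dataclass
class OperatingPoint:          # one configuration found at design time
    procs: int                 # processing elements required
    power_mw: float            # estimated power cost
    exec_ms: float             # estimated execution time

def select_configs(apps, proc_budget):
    """apps: {name: (deadline_ms, [OperatingPoint, ...])} -> {name: OperatingPoint}"""
    chosen, used = {}, 0
    # Priority rule (assumed): serve the tightest deadline first.
    for name, (deadline, points) in sorted(apps.items(), key=lambda a: a[1][0]):
        feasible = [p for p in points if p.exec_ms <= deadline]
        if not feasible:
            raise ValueError(f"{name}: no configuration meets its deadline")
        # Cheapest feasible point: fewest processors, then lowest power.
        best = min(feasible, key=lambda p: (p.procs, p.power_mw))
        chosen[name] = best
        used += best.procs
    if used > proc_budget:
        raise ValueError("platform resources exhausted")
    return chosen
```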


Journal ArticleDOI
TL;DR: Improvements such as lower latency, higher bandwidth, lower power consumption, smaller form factor, lower cost and heterogeneous integration of disparate functionalities become possible in the next generation of electronic products with the realisation of 3D ICs.
Abstract: Various integration schemes and key enabling technologies for wafer-level three-dimensional integrated circuits (3D IC) are reviewed and discussed. Stacking orientations (face up or face down), methods of wafer bonding (metallic, dielectric or hybrid), formation of through-silicon via (TSV) (via first, via middle or via last) and singulation level (wafer-to-wafer or chip-to-wafer) are options for 3D IC integration schemes. Key enabling technologies, such as alignment, Cu-Cu bonding and TSV fabrication, are described as well. Improvements such as lower latency, higher bandwidth, lower power consumption, smaller form factor, lower cost and heterogeneous integration of disparate functionalities become possible in the next generation of electronic products with the realisation of 3D ICs.

43 citations


Journal ArticleDOI
TL;DR: The results show that the use of LFSRs simplifies the design of the multiplier architecture, reducing area while retaining high performance compared with related work.
Abstract: This work presents novel multipliers for Montgomery multiplication defined on binary fields GF(2^m). Unlike state-of-the-art Montgomery multipliers, this work uses a linear feedback shift register (LFSR) as the main building block. The authors studied different architectures for bit-serial and digit-serial Montgomery multipliers using the LFSR and the Montgomery factors x^m and x^(m-1). The proposed multipliers cover different classes of irreducible polynomials: general polynomials, all-one polynomials, pentanomials and trinomials. The results show that the use of LFSRs simplifies the design of the multiplier architecture, reducing area resources while retaining high performance compared with related works.

42 citations
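
The LFSR-based hardware realises, in essence, the classical bit-serial Montgomery recurrence over GF(2^m): each iteration adds one partial product and performs a reduce-and-shift step, which is exactly what an LFSR computes. The behavioural model below shows that textbook recurrence, not the paper's specific architectures; polynomials are Python ints whose bit i is the coefficient of x^i.

```python
def mont_mul_gf2m(a, b, f, m):
    """Bit-serial Montgomery multiplication in GF(2^m): returns a*b*x^(-m) mod f.
    f must be irreducible with constant term 1, e.g. the trinomial
    x^233 + x^74 + 1."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:        # accumulate partial product a_i * b
            c ^= b
        if c & 1:               # make c divisible by x (f_0 = 1 guarantees this)
            c ^= f
        c >>= 1                 # exact division by x: the LFSR shift step
    return c

# Toy check in GF(2^3) with f = x^3 + x + 1 (0b1011):
print(bin(mont_mul_gf2m(0b110, 0b011, 0b1011, 3)))  # -> 0b110 = (x^2+x)(x+1)*x^-3
```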


Journal ArticleDOI
TL;DR: Results presented in this work highlight the need for thermal and electrical co-design in multi-strata microelectronics, and for reconciling manufacturing and design considerations in order to develop practical design tools for 3D ICs.
Abstract: Although the stacking of multiple strata to produce three-dimensional (3D) integrated circuits (ICs) improves interconnect length and hence reduces power and latency, it also exacerbates the thermal management challenge owing to the increased power density. There is a need for design tools to understand and optimise the trade-off between electrical and thermal design at the device and block levels. This study presents results from thermal-electrical co-optimisation for block-level floorplanning in a multi-die 3D IC under various manufacturing and physical design constraints. A method for temperature computation based on linearity of the governing energy equation is presented. This method is combined with previously reported electrical delay models for 3D ICs to simultaneously optimise both the maximum temperature and the interconnect length. It is shown that co-optimisation of thermal and electrical objectives results in a floorplan that is attractive from both perspectives. Physical design constraints arising from cost-effective 3D manufacturing – such as fully or partly identical dies using reciprocal design symmetry (RDS), differentiated technology in each die and thinned dies/wafers – are discussed and their impact on the thermal-electrical co-optimisation is investigated. In some cases, the cheapest manufacturing choice for each layer, such as using identical dies, may not result in an optimal thermal and electrical design. Results presented in this work highlight the need for thermal and electrical co-design in multi-strata microelectronics, and for reconciling manufacturing and design considerations in order to develop practical design tools for 3D ICs.

28 citations
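
The temperature computation exploits linearity of the governing equation: the steady-state rise at any block is the superposition of contributions from every block's power, so with a precomputed thermal-resistance matrix each candidate floorplan costs only a matrix-vector product. A minimal sketch, with made-up values:

```python
import numpy as np

# R[i][j]: temperature rise at block i per watt dissipated in block j (K/W).
# Values are illustrative, not from the paper.
R = np.array([[8.0, 2.0, 1.0],
              [2.0, 8.0, 2.0],
              [1.0, 2.0, 8.0]])
P = np.array([1.2, 0.4, 0.9])        # power vector of a candidate floorplan (W)
T_ambient = 45.0                     # deg C
T = T_ambient + R @ P                # superposition: one matrix-vector product
print(T.max())                       # max temperature, one co-optimisation objective
```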


Journal ArticleDOI
TL;DR: This study proposes a secure multiprocessor architecture to prevent side channel attacks, based on a dual-core algorithmic balancing technique, where two identical cores are used to foil a side channel attack.
Abstract: Side channel attackers observe external manifestations of internal computations in an embedded system to predict the encryption key employed. The ability to examine such external manifestations (power dissipation or electromagnetic emissions) is a major threat to secure embedded systems. This study proposes a secure multiprocessor architecture to prevent side channel attacks, based on a dual-core algorithmic balancing technique, where two identical cores are used. Both cores use a single clock and encrypt simultaneously, with one core executing the original encryption, whereas the second executes the complementary encryption. This effectively balances the crucial information from the power profile (note that it is the information and not the power profile itself), hiding the actual key from the adversary attempting an attack based on differential power analysis (DPA). The two cores normally execute different tasks, but will encrypt together to foil a side channel attack. The authors show that, when our technique is applied, DPA fails on the most common block ciphers, data encryption standard (DES) and advanced encryption standard (AES), leaving the attacker with little useful information with which to perpetrate an attack.

28 citations
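
The toy model below illustrates the balancing principle under a first-order Hamming-weight power model: when one core processes a value while the other processes its bitwise complement, the combined weight of every intermediate is constant, so the data dependence that DPA correlates against disappears. This shows only the underlying idea, not the processor design.

```python
W = 8                                   # register width in bits

def hw(x):                              # Hamming weight: a first-order power proxy
    return bin(x).count("1")

for x in (0x00, 0x5A, 0xFF):            # sample intermediate values
    comp = x ^ 0xFF                     # the complementary core's value
    assert hw(x) + hw(comp) == W        # combined "power" is data-independent
print("combined Hamming weight is always", W)
```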


Journal ArticleDOI
TL;DR: The experimental results show that savings in components and switching activity are achieved in most of the benchmarks tested, compared with recently published research.
Abstract: In this study, a new approach using a multi-objective genetic algorithm (MOGA) is proposed to determine the optimal state assignment with reduced area and power dissipation for completely and incompletely specified sequential circuits. The goal is to find the best assignments, which reduce the component count and switching activity. The MOGA employs a Pareto ranking scheme and produces a set of state assignments which are optimal in both objectives. The ESPRESSO tool is used to optimise the combinational parts of the sequential circuits. Experimental results are given using a personal computer with an Intel CPU of 2.4 GHz and 2 GB RAM. The algorithm is implemented using C++ and fully tested with benchmark examples. The experimental results show that savings in components and switching activity are achieved in most of the benchmarks tested, compared with recently published research.

23 citations
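
The Pareto ranking step at the heart of such a MOGA can be sketched as non-dominated filtering over the two objectives. The candidate assignments and their (component count, switching activity) scores below are invented for illustration:

```python
def pareto_front(candidates):
    """candidates: list of (assignment, components, switching); keep the
    assignments that no other candidate beats on both objectives."""
    front = []
    for a in candidates:
        dominated = any(b[1] <= a[1] and b[2] <= a[2] and
                        (b[1] < a[1] or b[2] < a[2]) for b in candidates)
        if not dominated:
            front.append(a)
    return front

cands = [("S0=00,S1=01,S2=10", 14, 0.42),
         ("S0=00,S1=11,S2=01", 12, 0.55),
         ("S0=01,S1=10,S2=00", 16, 0.38),
         ("S0=10,S1=01,S2=11", 16, 0.60)]   # dominated: worse on both counts
print(pareto_front(cands))                  # the last candidate is filtered out
```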


Journal ArticleDOI
TL;DR: An efficient test algorithm is proposed, called March-PCM, to test for special failure modes, known as disturbs, as well as other PCM-specific faults, and its performance is compared with some previously developed test algorithms.
Abstract: Chalcogenide-based phase change memory (PCM) is a type of non-volatile memory that will most likely replace the currently widespread flash memory. Current research on PCM targets the feasibility of integrating such memory technology into the currently used complementary metal oxide semiconductor (CMOS) process, as well as its reliability. Such studies identified special failure modes, known as disturbs, as well as other PCM-specific faults. In this study, the authors identify these failures, analyse their behaviours and develop fault primitives/models that describe these faults accurately and effectively. In addition, the authors propose an efficient test algorithm, called March-PCM, to test for these faults and compare its performance with some previously developed test algorithms.

19 citations
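
A March algorithm is a sequence of elements, each pairing an address order with read/write operations applied at every cell. The harness below shows that general structure on a behavioural memory model; the classic March C- sequence serves as a stand-in, since the actual March-PCM element sequence is the paper's contribution.

```python
def run_march(mem, elements):
    """elements: list of (order, ops); ops like "w1" (write 1) or "r0"
    (read, expect 0). Returns "pass" or a fault report."""
    n = len(mem)
    for order, ops in elements:
        addrs = range(n) if order == "up" else range(n - 1, -1, -1)
        for a in addrs:
            for op in ops:
                if op[0] == "w":
                    mem[a] = int(op[1])
                elif mem[a] != int(op[1]):
                    return f"fault at cell {a}: read {mem[a]}, expected {op[1]}"
    return "pass"

march_c_minus = [("up", ["w0"]), ("up", ["r0", "w1"]), ("up", ["r1", "w0"]),
                 ("down", ["r0", "w1"]), ("down", ["r1", "w0"]), ("down", ["r0"])]
print(run_march([0] * 16, march_c_minus))   # fault-free memory -> "pass"
```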


Journal ArticleDOI
TL;DR: The paper shows that the CPOG model is a very convenient formalism for efficient representation of processor instruction sets and provides a ground for a concise formulation of several encoding problems, which are reducible to the Boolean satisfiability (SAT) problem and can be efficiently solved by modern SAT solvers.
Abstract: There is a critical need for design automation in microarchitectural modelling and synthesis. One of the areas which lacks the necessary automation support is the synthesis of instruction codes targeting various design optimality criteria. This paper aims to fill this gap by providing a set of formal methods and a software tool for synthesis of instruction codes given the description of a processor as a set of instructions. The method is based on the conditional partial order graph (CPOG) model, which is a formalism for efficient specification and synthesis of microcontrollers. It describes a system as a functional composition of its behavioural scenarios, or instructions, each of them being a partial order of events. In order to distinguish instructions within a CPOG, they are given different encodings represented with Boolean vectors. The size and latency of the final microcontroller depend significantly on the chosen encodings; thus efficient synthesis of instruction codes is essential. The paper shows that the CPOG model is a very convenient formalism for efficient representation of processor instruction sets. It provides a ground for a concise formulation of several encoding problems, which are reducible to the Boolean satisfiability (SAT) problem and can be efficiently solved by modern SAT solvers.

18 citations
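
The CPOG idea itself fits in a few lines: a single graph whose vertices (events) and arcs (orderings) carry Boolean conditions over the opcode bits, so that projecting the graph onto a concrete opcode yields that instruction's partial order. The two-bit example "machine" below is hypothetical:

```python
# Conditions are predicates over the opcode bits (x1, x0).
vertices = {
    "fetch":  lambda x1, x0: True,
    "alu":    lambda x1, x0: not x1,
    "memory": lambda x1, x0: bool(x1),
    "write":  lambda x1, x0: True,
}
arcs = {
    ("fetch", "alu"):    lambda x1, x0: not x1,
    ("fetch", "memory"): lambda x1, x0: bool(x1),
    ("alu", "write"):    lambda x1, x0: not x1,
    ("memory", "write"): lambda x1, x0: bool(x1),
}

def project(x1, x0):
    """Restrict the CPOG to one opcode: the instruction's partial order."""
    events = [v for v, cond in vertices.items() if cond(x1, x0)]
    order = [e for e, cond in arcs.items() if cond(x1, x0)]
    return events, order

print(project(0, 0))   # ALU instruction: fetch -> alu -> write
print(project(1, 0))   # memory instruction: fetch -> memory -> write
```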


Journal ArticleDOI
TL;DR: This work proposes a hazard-free majority voter design for the triple-modular redundancy fault-tolerance design paradigm, which enters an output-holding state to preserve the output value when transient errors may be sensitised to its inputs.
Abstract: N-modular redundancy (NMR) is the simplest and most effective fault-tolerant design method for integrated circuits, where N copies of a circuit are employed and a majority voter produces the voted output. Asynchronous circuits, however, exhibit various characteristics that limit the applicability of NMR. Specifically, the hazard-free property of the output in these circuits must be preserved when hardware providing fault tolerance, such as a majority voter, is added. In this work, we first demonstrate that a typical majority voter design would fail to preserve the hazard-free property of its response. We then propose a hazard-free majority voter design for the triple-modular redundancy fault-tolerance design paradigm, which enters an output-holding state to preserve the output value when transient errors may be sensitised to its inputs. By exploring various conditions to exit from the output-holding state, we describe several extensions of the voter into an NMR one, each yielding a distinct implementation with different tolerance characteristics and area cost. We generalise this extension based on the exit condition and analyse the associated tolerance capability of the extended NMR voter. Finally, the proposed hazard-free voter is simulated using HSPICE, and detailed area cost formulations are derived for the proposed voter designs.

18 citations
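
A behavioural model of the output-holding idea: a plain two-level majority gate can glitch while a transient sits on one input, so this voter passes a value through only when all replicas agree and otherwise holds its previous output. This sketches the described behaviour, not the transistor-level design:

```python
class HoldingVoter:
    def __init__(self):
        self.out = 0                      # currently held output value

    def step(self, a, b, c):
        if a == b == c:                   # unanimous: safe to pass through
            self.out = a
        # otherwise stay in the output-holding state: no transition, no hazard
        return self.out

v = HoldingVoter()
print([v.step(*t) for t in [(0, 0, 0), (1, 0, 0), (1, 1, 1), (1, 0, 1)]])
# -> [0, 0, 1, 1]: the transients in samples 2 and 4 never reach the output
```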


Journal ArticleDOI
TL;DR: This study presents an FPGA-based distributed architecture for solving the single-source shortest-path problem in a fast and efficient manner based on the Bellman-Ford algorithm adapted to facilitate early termination of computation.
Abstract: There exist several practical applications that require high-speed shortest-path computations. In many situations, especially in embedded applications, a field programmable gate array (FPGA)-based accelerator for computing the shortest paths can help to achieve high performance at low cost. This study presents an FPGA-based distributed architecture for solving the single-source shortest-path problem in a fast and efficient manner. The proposed architecture is based on the Bellman-Ford algorithm adapted to facilitate early termination of computation. One of the novelties of the architecture is that it does not involve any centralised control and the processing elements (PEs), which are identical in construction, operate in perfect synchronisation with each other. The functional correctness of the design has been verified through simulations and also in actual hardware. It has been shown that the implementation on a Xilinx Virtex-5 FPGA is more than twice as fast as a software implementation of the algorithm on a high-end general-purpose processor that runs at an order-of-magnitude faster clock. The speed-up offered by the design can be further improved by adopting an interconnection topology that maximises the data transfer rate among the PEs.

16 citations
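
A software reference for the algorithm the PEs implement: Bellman-Ford with the early-termination adaptation, stopping as soon as a complete relaxation pass changes no distance instead of always running |V| - 1 passes:

```python
import math

def bellman_ford_early(n, edges, src):
    """edges: list of (u, v, weight); returns shortest distances from src."""
    dist = [math.inf] * n
    dist[src] = 0
    for _ in range(n - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:     # relax edge u -> v
                dist[v] = dist[u] + w
                changed = True
        if not changed:                   # early termination: converged
            break
    return dist

print(bellman_ford_early(4, [(0, 1, 2), (1, 2, 3), (0, 2, 6), (2, 3, 1)], 0))
# -> [0, 2, 5, 6]
```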


Journal ArticleDOI
TL;DR: The authors offer a decimal division scheme that takes advantage of the best design options of D1 and D2 with due modifications that significantly enhance the division speed and removes the rounding cycle by cost-free auto-rounding.
Abstract: The authors study previous major contributions to digit recurrence decimal division hardware and focus on techniques for improving the performance of quotient digit selection (QDS) as the most complex part. In particular, Design D1 uses the digit set [-5, 5] for quotient digits. Another design (D2) uses mixed binary/decimal carry-save manipulation of the few most significant digits of partial remainders. Motivated by successful combined arithmetic algorithms such as hybrid adders, the authors offer a decimal division scheme that takes advantage of the best design options of D1 and D2, with due modifications that significantly enhance the division speed. In particular, they configure the architectures of the QDS and partial remainder computation paths in favour of reduced, balanced latencies for both. Furthermore, they remove the rounding cycle by cost-free auto-rounding, which is an exclusive advantage of the digit set [-5, 5]. The authors of D1 and D2 have used logical effort (LE) and circuit synthesis to evaluate their dividers, respectively. Therefore, for a fair comparison, the authors evaluate the proposed design (D3) with both methods. The LE-based D3/D1 comparison shows 21% more speed at the cost of 6% more area, whereas the synthesis-based D3/D2 comparison results in 46% less latency and 23% less area.
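
The digit-recurrence core with the signed digit set [-5, 5] can be sketched as follows: each step scales the partial remainder by 10, selects a quotient digit that keeps the remainder bounded, and subtracts. Exact rational arithmetic stands in for the carry-save hardware, and the rounding-based selection below is a simplification of the paper's QDS logic; representability with this digit set requires |x| <= (5/9)*d.

```python
from fractions import Fraction

def decimal_divide(x, d, digits):
    """Radix-10 recurrence w <- 10*w - q*d with quotient digits in [-5, 5].
    Requires |x| <= (5/9)*d so every digit stays representable."""
    w, q = Fraction(x), []
    for _ in range(digits):
        w *= 10
        qj = max(-5, min(5, round(w / Fraction(d))))   # simplified digit selection
        q.append(qj)
        w -= qj * Fraction(d)
    return q

q = decimal_divide(Fraction(13, 100), Fraction(3, 10), 6)
print(q)                                    # -> [4, 3, 3, 3, 3, 3]
value = sum(qi * Fraction(10) ** -(i + 1) for i, qi in enumerate(q))
print(float(value))                         # -> 0.433333, i.e. 0.13 / 0.3
```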

Journal ArticleDOI
TL;DR: The authors show that HARD can be configured to achieve both performance and power improvements and compare HARD to an alternative dynamic scheduler and a static scheduler to provide better understanding.
Abstract: The authors introduce a history-aware, resource-based dynamic (or simply HARD) scheduler for heterogeneous chip multi-processors (CMPs). HARD relies on recording application resource utilisation and throughput to adaptively change cores for applications during runtime. The authors show that HARD can be configured to achieve both performance and power improvements and compare HARD to an alternative dynamic scheduler and a static scheduler to provide better understanding.

Journal ArticleDOI
TL;DR: A programmable and configurable motion estimation (ME) processor capable of performing ME across several state-of-the-art video codecs that include multiple tools to improve the accuracy of the calculated motion vectors.
Abstract: This study presents a programmable and configurable motion estimation (ME) processor capable of performing ME across several state-of-the-art video codecs that include multiple tools to improve the accuracy of the calculated motion vectors. The core can be programmed using a C-style syntax optimised to implement arbitrary block matching algorithms and configured with different execution units depending on the selected codec, the available inter-coding options and required performance. This flexibility means that the core can support the latest video codecs such as H.264, VC-1 and AVS at high-definition resolutions and frame rates. The configuration and programming phases are supported by an integrated development environment that includes a compiler and profiling tools enabling a designer without specific hardware knowledge to optimise the microarchitecture for the selected codec standard and motion search technique leading to a highly efficient implementation.

Journal ArticleDOI
TL;DR: This work addresses the state encoding problem with the objective of minimising peak current in synchronous finite state machine (FSM) circuits, together with an efficient SAT-based heuristic that solves the state re-encoding problem for minimising switching power without deteriorating the minimum peak current.
Abstract: As silicon process technology advances, chip reliability becomes more and more important. One of the critical factors that affect chip reliability is the peak current in the circuit. In particular, high current peaks at the time of state transition in synchronous finite state machine (FSM) circuits often make the circuits very unstable in execution. This work addresses the state encoding problem with the objective of minimising peak current in FSMs. Previous power-aware state encoding algorithms, whose primary objective is to reduce the switching activity of the state register, either do not address peak current at all or treat it only as a secondary objective, which severely limits the search space of state encodings for minimising peak current. The proposed algorithm, called SAT-pc, instead places the importance on reliability, that is, peak current. Specifically, the authors solve two important state encoding problems in two phases: (Phase 1) the authors present a solution to the problem of state encoding for directly minimising peak current, by formulating it as a SAT problem with pseudo-Boolean expressions, which leads to a full exploration of the search space; (Phase 2) the authors then propose an efficient SAT-based heuristic to solve the state re-encoding problem for minimising switching power without deteriorating the minimum peak current obtained in Phase 1. Through experimentation using MCNC benchmarks, it is shown that SAT-pc is able to reduce the peak current by 51% and 35%, compared to POW3 [4], which minimises the switching power only, and POW3 [14] + [24], which minimises the switching power and then peak current, respectively.
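
The cost model behind such encoders is easy to make concrete: a transition from state s to t flips HD(code(s), code(t)) state flip-flops simultaneously, so the maximum Hamming distance over all transitions is a proxy for peak current, and the frequency-weighted sum a proxy for switching power. A sketch with a hypothetical three-state FSM:

```python
def hd(a, b):                    # Hamming distance between two state codes
    return bin(a ^ b).count("1")

def evaluate(encoding, transitions):
    """encoding: {state: code}; transitions: [(src, dst, rel_freq), ...];
    returns (peak-current proxy, switching-power proxy)."""
    peak = max(hd(encoding[s], encoding[t]) for s, t, _ in transitions)
    power = sum(f * hd(encoding[s], encoding[t]) for s, t, f in transitions)
    return peak, power

trans = [("A", "B", 0.5), ("B", "C", 0.3), ("B", "A", 0.2)]
print(evaluate({"A": 0b00, "B": 0b01, "C": 0b11}, trans))   # (1, 1.0)
print(evaluate({"A": 0b00, "B": 0b11, "C": 0b01}, trans))   # (2, 1.7): worse peak
```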

Journal ArticleDOI
TL;DR: Through-silicon-via (TSV)-based 3D integration technology will allow integration of diverse functionality to realise energy-efficient and affordable compact systems that will continue to deliver higher performance.
Abstract: Promise of form-factor reduction and hybrid process integration by three-dimensional (3D)-stacked integrated circuits (3DICs) has spurred interest in both academia and industry. In this study, through-silicon-via (TSV)-based 3D integration is discussed from a microprocessor centric view. The authors present the challenges faced by technology scaling and provide 3D integration as a possible solution. The applications for 3DICs are discussed with details of a few prototypes. The issues and challenges associated with 3D integration technologies are also addressed. TSV-based 3D integration technology will allow integration of diverse functionality to realise energy-efficient and affordable compact systems that will continue to deliver higher performance.

Journal ArticleDOI
TL;DR: The authors present an optimal solution based on an integer linear programming model as well as two polynomial-time heuristic solutions for wrapper optimisation in 3D ICs based on through-silicon vias (TSVs) for vertical interconnects.
Abstract: System-on-chip (SOC) designs comprising a number of embedded cores are widespread in today's integrated circuits. Embedded core-based design is likely to be equally popular for three-dimensional integrated circuits (3D ICs), the manufacture of which has become feasible in recent years. 3D integration offers a number of advantages over traditional 2D technologies, such as the reduction in the average interconnect length, higher performance, lower interconnect power consumption and smaller IC footprint. Despite recent advances in 3D fabrication and design methods, no attempt has been made thus far to design a 1500-style test wrapper for an embedded core that spans multiple layers in a 3D SOC. This study addresses wrapper optimisation in 3D ICs based on through-silicon vias (TSVs) for vertical interconnects. The authors' objective is to minimise the scan-test time for a core under constraints on the total number of TSVs available for testing. The authors present an optimal solution based on an integer linear programming model as well as two polynomial-time heuristic solutions. Simulation results are presented for embedded cores from the ITC 2002 SOC test benchmarks.

Journal ArticleDOI
TL;DR: The history index of correct computation (HICC) is examined in a recursive and non-recursive fault-tolerant approach at the bit and module levels to identify reliable blocks on-the-fly and forward their computation results, while ignoring results from unreliable blocks.
Abstract: Future nano-scale devices are expected to shrink to ever smaller dimensions, to operate at low voltages and high frequencies, to be more sensitive to environmental influences and to be characterised by high dynamic fault rates and defect densities. Fundamentally new fault-tolerant architectures are required in order to produce reliable systems that will operate correctly. Simple replication of micro-architecture blocks will no longer suffice, as all replicated blocks will have faults. The history index of correct computation (HICC) is examined in a recursive and non-recursive fault-tolerant approach at the bit and module levels to identify reliable blocks on-the-fly and forward their computation results, while ignoring results from unreliable blocks. Simulation results show that recursive and non-recursive HICC offers the best resilience to faults when faults are non-uniformly distributed among redundant blocks. A correct computation rate of 99% is achieved using the recursive HICC when decision units at the bit and module levels are fault free, despite an average fault injection rate of 20%, compared to a 68% correct computation rate for the recursive triple modular redundancy voter. When faults are injected everywhere in the design, the non-recursive HICC supports the best correct computation percentage. The effects of circuit size and history indices are also examined and discussed.
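
The history-index principle can be sketched behaviourally: each redundant block keeps a saturating score of past agreement with the consensus, the consensus is weighted by those scores, and a persistently disagreeing block decays until it is effectively ignored rather than merely out-voted. The update rule and counter depth below are illustrative assumptions, not the paper's exact HICC definition:

```python
from collections import Counter

class HistorySelector:
    def __init__(self, n_blocks, depth=8):
        self.score = [depth] * n_blocks     # saturating agreement counters
        self.depth = depth

    def step(self, outputs):
        tally = Counter()                   # consensus weighted by track record
        for out, s in zip(outputs, self.score):
            tally[out] += s
        winner = tally.most_common(1)[0][0]
        for i, out in enumerate(outputs):   # update each block's history index
            self.score[i] = (min(self.depth, self.score[i] + 1) if out == winner
                             else max(0, self.score[i] - 1))
        return winner

sel = HistorySelector(3)
# Block 2 has gone faulty; its score decays while the consensus stays correct:
print([sel.step(o) for o in [(1, 1, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]])
# -> [1, 0, 1, 1]
```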

Journal ArticleDOI
TL;DR: The simulation result shows that MRAM stacking can provide competitive instruction-per-cycle (IPC) performance with a large reduction in power consumption when MRAM is compared against its static random access memory (SRAM) and dynamic random access memory (DRAM) counterparts.
Abstract: Magnetic random access memory (MRAM) has been considered as a promising memory technology because of its attractive properties such as non-volatility, fast access, zero standby leakage and high density. Although integrating MRAM with complementary metal-oxide-semiconductor (CMOS) logic may incur extra manufacturing cost because of the hybrid magnetic-CMOS fabrication process, it is feasible and cost-effective to fabricate MRAM and CMOS logic separately and then integrate them using 3D stacking. In this work, we first studied the MRAM properties and built an MRAM cache model in terms of performance, energy and area. Using this model, we evaluated the impact of stacking MRAM caches atop microprocessor cores and compared MRAM against its static random access memory (SRAM) and dynamic random access memory (DRAM) counterparts. Our simulation result shows that MRAM stacking can provide competitive instruction-per-cycle (IPC) performance with a large reduction in power consumption.

Journal ArticleDOI
TL;DR: This study outlines an approach for reducing the dynamic power consumption of a class of fast algorithms by minimising the index space separation, which allows the generation of field programmable gate array (FPGA) implementations with reduced power consumption.
Abstract: Dynamic power consumption is very dependent on interconnect, so clever mapping of digital signal processing algorithms to parallelised realisations with data locality is vital. This is a particular problem for fast algorithm implementations, where typically designers will have sacrificed circuit structure for efficiency in software implementation. This study outlines an approach for reducing the dynamic power consumption of a class of fast algorithms by minimising the index space separation; this allows the generation of field programmable gate array (FPGA) implementations with reduced power consumption. It is shown how a 50% reduction in relative index space separation results in measured power gains of 36% and 37% over a Cooley–Tukey fast Fourier transform (FFT)-based solution, for actual power measurements on a Xilinx Virtex-II FPGA implementation and circuit measurements on a Xilinx Virtex-5 implementation, respectively. The authors show the generality of the approach by applying it to a number of other fast algorithms, namely the discrete cosine, discrete Hartley and Walsh–Hadamard transforms.

Journal ArticleDOI
TL;DR: A floating-point synthetic aperture radar processor that achieves a power efficiency of 18.0 mW/GFlop in simulation through the use of three-dimensional (3D) integration and reconfiguration of the data path is presented.
Abstract: In this study, the authors present a floating-point synthetic aperture radar (SAR) processor that achieves a power efficiency of 18.0 mW/GFlop in simulation through the use of three-dimensional (3D) integration and reconfiguration of the data path. The reconfiguration reduces the number of arithmetic units required in every processing element (PE) from 24 down to 10. The processor uses a 3D integrated memory that reduces the memory power consumption by 70% when compared to a 2D memory. The system processes a SAR image using a two-tier 3D integrated PE, which when compared to an equivalent 2D PE decreases the power consumed in the interconnect of each PE by 15.5% and the footprint by 49.2%, and allows the PE to operate 7.1% faster in simulation. Furthermore, the authors show how the 3D aspects of the processor can be realised using 2D tools in conjunction with the proposed through-silicon via assignment algorithm.

Journal ArticleDOI
TL;DR: This study presents a novel method for implementing any m-of-n-encoded function block using ‘bounded gates’, where any gate may be decomposed without violating indication by successively decomposing the input encoding into smaller unordered codes.
Abstract: Self-timed circuits present an attractive solution to the problem of process variation. However, implementing self-timed combinational logic is complex and expensive. As there are no external timing references, data must be encoded within an unordered (DI) encoding and the outputs of functions must indicate to the environment that transitions on inputs and internal signals have taken place. Mapping large function blocks into cell-libraries is extremely difficult as decomposing gates introduces new signals which may violate indication. This study presents a novel method for implementing any m-of-n-encoded function block using ‘bounded gates’, where any gate may be decomposed without violating indication. This is achieved by successively decomposing the input encoding into smaller unordered codes. The study presents algorithms to determine and quantify potential re-encodings. An exact branch and bound approach to the solution is shown, but the complexity of determining unordered encodings restricts the size of function blocks that may be decomposed. To overcome this problem, an approach has been proposed that uses algebraic extraction techniques to efficiently determine and quantify potential encodings. The results of the synthesis procedures are demonstrated on a range of combinational function blocks.
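
A small helper makes the starting point concrete: every codeword of an m-of-n code has exactly m ones, so no codeword can strictly contain another. This is the unordered (DI) property that the decomposition procedure must preserve when it re-encodes inputs into smaller codes:

```python
from itertools import combinations

def m_of_n_codewords(m, n):
    """All codewords of the m-of-n code, as sets of asserted wire indices."""
    return [frozenset(c) for c in combinations(range(n), m)]

def is_unordered(code):
    # Unordered: no codeword is a strict subset of another.
    return all(not a < b for a in code for b in code)

two_of_four = m_of_n_codewords(2, 4)
print(len(two_of_four), is_unordered(two_of_four))   # -> 6 True
```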

Journal ArticleDOI
TL;DR: The authors address the problem of reducing power dissipation of the instruction bus by reordering the instructions in basic blocks without increasing the execution time or the code size, while maintaining the original functionality of the programme.
Abstract: Execution time is no longer the only target to achieve when designing programmes for today's and next-generation CMOS-based digital systems. One needs to also consider reducing power dissipation. Buses contribute to the power dissipation during the execution of a given programme, since instructions and/or operands have to be fetched from memory. Reducing power dissipation in buses has been addressed in the literature. In this study, the authors address the problem of reducing power dissipation of the instruction bus by reordering the instructions in basic blocks without increasing the execution time or the code size, and while maintaining the original functionality of the programme. The authors target embedded processors having a Harvard architecture. They focus on solving this problem for programmes developed at the assembly level, since at that level the machine code can be obtained by simply running an assembler, which allows an accurate computation of switching activities on the instruction bus by considering each pair of instructions. The authors formulate this problem as an integer linear programme (ILP), and they provide two heuristics. Experimental results have shown that the proposed approach can reduce switching activities. The ILP has reduced switching activities by as much as 38%. One of the two proposed heuristics has always resulted in reducing switching activities, and its relative savings are within an average of 5% of the optimum produced using the ILP.
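
The quantity being minimised is simple to state: the switching activity of a basic block is the sum of Hamming distances between the encodings of consecutively fetched instructions. The sketch below computes that cost and applies a naive nearest-neighbour reordering; unlike the paper's ILP and heuristics, it ignores data dependences, and the 16-bit opcodes are hypothetical:

```python
def hd(a, b):
    return bin(a ^ b).count("1")

def bus_switching(block):
    """Total bit flips on the instruction bus across the basic block."""
    return sum(hd(block[i], block[i + 1]) for i in range(len(block) - 1))

def greedy_reorder(block):
    """Keep the first instruction fixed, then repeatedly append the remaining
    instruction whose encoding is closest to the last one fetched."""
    rest, order = list(block[1:]), [block[0]]
    while rest:
        nxt = min(rest, key=lambda w: hd(order[-1], w))
        rest.remove(nxt)
        order.append(nxt)
    return order

block = [0x8B45, 0x03C2, 0x8B4D, 0x0FAF, 0x8B55]
print(bus_switching(block), bus_switching(greedy_reorder(block)))   # -> 27 17
```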

Journal ArticleDOI
TL;DR: The authors improve their former work by presenting algorithms for identifying delay transitions and inserting gyroscopes for specifications having a much more general structure and are now able to synthesise controllers from real-life specifications.
Abstract: Logic synthesis of speed independent circuits based on signal transition graph (STG) decomposition is a promising approach to tackle complexity problems like state-space explosion. Unfortunately, decomposition can result in components that in isolation have irreducible complete state coding conflicts. In earlier work, the authors showed how to resolve such conflicts by introducing internal communication between components, but only for very restricted specification structures. Here, they improve their former work by presenting algorithms for identifying delay transitions and inserting gyroscopes for specifications having a much more general structure. Thus, the authors are now able to synthesise controllers from real-life specifications. For all algorithms, they present correctness proofs and show their successful application to benchmarks, including very complex STGs arising in the context of control resynthesis.

Journal ArticleDOI
TL;DR: This work presents an efficient approach to model check safety properties expressed in PSL (IEEE Std 1850 Property Specification Language), an industrial property specification language, and handles a larger syntactic subset of PSL safety properties than earlier translations.
Abstract: Safety properties are an important class of properties, as in the industrial use of model checking, a large majority of the properties to be checked are safety properties. This work presents an efficient approach to model check safety properties expressed in PSL (IEEE Std 1850 Property Specification Language), an industrial property specification language. The approach can also be used as a sound but incomplete bug-hunting tool for general (non-safety) PSL properties, and it will detect exactly the finite counterexamples that are the informative bad prefixes for the PSL formulae in question. The presented technique is inspired by the temporal testers approach of Pnueli and co-authors, but unlike theirs, the proposed approach is aimed at finding finite counterexamples to properties. The new approach presented here handles a larger syntactic subset of PSL safety properties than earlier translations for PSL safety subsets and has been implemented on top of the open source NuSMV 2 model checker. The experimental results show the approach to be a quite competitive model checking approach when compared to a state-of-the-art implementation of PSL model checking.

Journal ArticleDOI
TL;DR: Advanced and dynamic calibration techniques for maximising the link performance of parallel source-synchronous interfaces are introduced and demonstrated in this study, using as a case study a 533-MHz DDR2 SDRAM memory interface implemented in 90-nm standard CMOS.
Abstract: Advanced and dynamic calibration techniques for maximising the link performance of parallel source-synchronous interfaces are introduced and demonstrated in this study, using as a case study a 533-MHz DDR2 SDRAM memory interface implemented in a 90-nm standard complementary metal-oxide-semiconductor (CMOS) process; most of the techniques have also been validated at 800 MHz. A novel dynamic strobe masking system (DSMS) has also been employed which, in contrast to traditional techniques, adjusts the length of the masking signal dynamically in real time, based on the incoming strobe. Furthermore, optimal data capture is achieved by employing a fast bit-deskew calibration engine, and a novel I/O calibration scheme is also included. Post-layout simulation results demonstrate that the dynamic calibration and skew compensation techniques employed improve the timing margin while providing advanced robustness over process, voltage and temperature variations.

Journal ArticleDOI
TL;DR: A straightforward filtering mechanism is introduced, which results in a more energy-efficient design than past techniques, using less and simpler hardware, and provides new opportunities for extra types of filtering, which lead to higher energy savings.
Abstract: In the last few years, many researchers have focused their efforts on the field of low-power processor design. Several works in this area have dealt with the logic that enforces correct memory-based dependences – the load-store queue (LSQ) – a rather energy-consuming structure, since many accesses are performed in an associative fashion. Among these proposals, some manage to reduce this resource's energy consumption by avoiding unnecessary lookups. In this context, the authors introduce a straightforward filtering mechanism, which results in a more energy-efficient design than past techniques, using less and simpler hardware. In addition, both the new scheme and some previous approaches are tested on the widespread x86 architecture. This microarchitectural model provides new opportunities for extra types of filtering, which lead to higher energy savings. On average, the authors' proposal filters up to 75% of the associative accesses to the load queue, 56% to the store queue and 42% to the dependence predictor with a reduced amount of hardware – less than 100 bytes. According to their energy model, this means a dynamic energy saving of more than 39% over a conventional LSQ.
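
One generic way to filter associative lookups, sketched under assumptions (the paper's mechanism is its own, simpler design): keep a small presence vector over hashed store addresses, so a load whose bucket count is zero provably has no matching store in the queue and can skip the associative search entirely.

```python
FILTER_SIZE = 64                           # small counter array, tens of bytes

class StoreQueueFilter:
    def __init__(self):
        self.counts = [0] * FILTER_SIZE

    def _bucket(self, addr):
        return (addr >> 3) % FILTER_SIZE   # hash on the 8-byte block address

    def insert(self, addr):                # a store enters the store queue
        self.counts[self._bucket(addr)] += 1

    def remove(self, addr):                # the store commits and leaves
        self.counts[self._bucket(addr)] -= 1

    def may_match(self, addr):             # False => skip the associative lookup
        return self.counts[self._bucket(addr)] > 0

f = StoreQueueFilter()
f.insert(0x1000)
print(f.may_match(0x1000), f.may_match(0x2008))   # -> True False
```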

Journal ArticleDOI
TL;DR: A novel metric called crosstalk error rate is developed, which can be a valuable tool for designers to predict crosstalk effects and explore interconnect design techniques in order to achieve the target performance with minimum overheads.
Abstract: The impact of crosstalk noise on the resilience of on-chip communication links in the presence of parametric variations is investigated. A novel metric called crosstalk error rate is developed which can be a valuable tool for designers to predict the crosstalk effects and explore interconnect design techniques in order to achieve the target performance with minimum overheads. Closed-form expressions of crosstalk error rate are presented. This metric is used to compare different crosstalk avoidance methods in the 90 nm technology.

Journal ArticleDOI
TL;DR: Token cages improve the performance of join controllers that use the early-evaluation firing rule and half-buffer retiming allows the creation of input queues by relocating one of the latches of the elastic buffer which follows the join controller.
Abstract: Synchronous elastic circuits borrow the tolerance of computation and communication latencies from the asynchronous design style. The datapath is made elastic by turning registers into elastic buffers and adding a control layer that uses synchronous handshake signals and join/fork controllers. Join elements are the objective of two improvements discussed in this study. Half-buffer retiming allows the creation of input queues by relocating one of the latches of the elastic buffer which follows the join controller. Token cages improve the performance of join controllers that use the early-evaluation firing rule. Their effect on throughput is discussed by means of examples representative of typical topologies, simulations with synthetic benchmarks and a realistic microarchitecture. Area and power costs of the control logic and the possible impact on the datapath are evaluated, based on the results of logic synthesis experiments on a 45 nm CMOS technology.

Journal ArticleDOI
TL;DR: This study proposes a novel asynchronous dispatching (AD) algorithm for general three-stage Clos networks that avoids the contention in central modules using a state feedback scheme and outperforms the throughput of CRRD in behavioural simulations.
Abstract: Clos networks provide theoretically optimal solution to build high-radix switches. Dynamically reconfiguring a three-stage Clos network is more difficult in asynchronous circuits than in synchronous circuits. This study proposes a novel asynchronous dispatching (AD) algorithm for general three-stage Clos networks. It is compared with the classic synchronous concurrent round-robin dispatching (CRRD) algorithm in unbuffered Clos networks. The AD algorithm avoids the contention in central modules using a state feedback scheme and outperforms the throughput of CRRD in behavioural simulations. Two asynchronous Clos networks using the AD algorithm are implemented and compared with a synchronous Clos network using the CRRD algorithm. The asynchronous Clos scheduler is smaller than its synchronous counterpart. Synchronous Clos networks achieve higher throughput than asynchronous Clos networks because asynchronous Clos networks cannot hide the arbitration latency and their data paths are slow. The asynchronous Clos scheduler consumes significantly lower power than the synchronous scheduler and the asynchronous Clos network using bundled-data data switches shows the best power efficiency in all implementations.

Journal ArticleDOI
TL;DR: The proposed reconfigurable architecture achieves a large computational capability while still providing a high degree of flexibility; it can be specified in a high-level language and also provides increased hardware resource usage.
Abstract: The development of multiple communication standards and services has created the need for a flexible and efficient computational platform for baseband signal processing. Using a set of heterogeneous reconfigurable execution units (RCEUs) and a homogeneous control mechanism, the proposed reconfigurable architecture achieves a large computational capability while still providing a high degree of flexibility. Software tools and a library of commonly used algorithms are also proposed in this paper to provide a convenient framework for hardware generation and algorithm mapping. In this way, the architecture can be specified in a high-level language, and it also provides increased hardware resource usage. Finally, we evaluate the system's performance on representative algorithms, specifically a 32-tap finite impulse response (FIR) filter and a 256-point fast Fourier transform (FFT), and compare them with commercial digital signal processor (DSP) chips as well as with other reconfigurable and multi-core architectures.