
Showing papers in "IET Computers and Digital Techniques" in 2011


Journal ArticleDOI
TL;DR: This study presents a flow linking a design-time design space explorer coupled with platform simulators at two abstraction levels, with a fast and lightweight priority-based heuristic integrated in the run-time manager to select near-optimal application configurations.
Abstract: Nowadays, owing to unpredictable changes of the environment and workload variation, optimally running multiple applications in terms of quality, performance and power consumption on embedded multi-core platforms is a huge challenge. A lightweight run-time manager, linked with an automated design-time exploration and incorporated in the host processor of the platform, is required to dynamically and efficiently configure the applications according to the available platform resources (e.g. processing elements, memories, communication bandwidth), for minimising the cost (e.g. power consumption), while satisfying the constraints (e.g. deadlines). This study presents a flow linking a design-time design space explorer, coupled with platform simulators at two abstraction levels, with a fast and lightweight priority-based heuristic integrated in the run-time manager to select near-optimal application configurations. To illustrate its feasibility and the very low complexity of the run-time selection, the proposed flow is used to manage the processors and clock frequencies of a multiple-stream MPEG4 encoder chip dedicated to automotive cognitive safety applications.

53 citations
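
As a rough illustration of how such a run-time manager can combine design-time exploration results with a lightweight selection heuristic, the sketch below greedily gives each application the cheapest design-time operating point that meets its deadline, under a shared processor budget. The data structures, the tightest-deadline-first priority rule and all numbers are hypothetical stand-ins, not the paper's actual heuristic.

```python
# Hypothetical sketch of priority-based run-time configuration selection.
from dataclasses import dataclass

@dataclass
class OperatingPoint:          # one configuration found at design time
    procs: int                 # processing elements required
    power_mw: float            # estimated power cost
    exec_ms: float             # estimated execution time

def select_configs(apps, proc_budget):
    """apps: {name: (deadline_ms, [OperatingPoint, ...])} -> {name: OperatingPoint}"""
    chosen, used = {}, 0
    # Priority rule (assumed): serve the tightest deadline first.
    for name, (deadline, points) in sorted(apps.items(), key=lambda a: a[1][0]):
        feasible = [p for p in points if p.exec_ms <= deadline]
        if not feasible:
            raise ValueError(f"{name}: no configuration meets its deadline")
        # Cheapest feasible point: fewest processors, then lowest power.
        best = min(feasible, key=lambda p: (p.procs, p.power_mw))
        chosen[name] = best
        used += best.procs
    if used > proc_budget:
        raise ValueError("platform resources exhausted")
    return chosen
```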


Journal ArticleDOI
TL;DR: Improvements such as lower latency, higher bandwidth, lower power consumption, smaller form factor, lower cost and heterogeneous integration of disparate functionalities become possible in the next generation of electronic products with the realisation of 3D ICs.
Abstract: Various integration schemes and key enabling technologies for wafer-level three-dimensional integrated circuits (3D IC) are reviewed and discussed. Stacking orientations (face up or face down), methods of wafer bonding (metallic, dielectric or hybrid), formation of through-silicon via (TSV) (via first, via middle or via last) and singulation level (wafer-to-wafer or chip-to-wafer) are options for 3D IC integration schemes. Key enabling technologies, such as alignment, Cu-Cu bonding and TSV fabrication, are described as well. Improvements such as lower latency, higher bandwidth, lower power consumption, smaller form factor, lower cost and heterogeneous integration of disparate functionalities become possible in the next generation of electronic products with the realisation of 3D ICs.

43 citations


Journal ArticleDOI
TL;DR: The results show that the use of LFSRs simplifies the design of the multiplier architecture, reducing area while retaining high performance compared with related work.
Abstract: This work presents novel multipliers for Montgomery multiplication defined on binary fields GF(2^m). Unlike state-of-the-art Montgomery multipliers, this work uses a linear feedback shift register (LFSR) as the main building block. The authors studied different architectures for bit-serial and digit-serial Montgomery multipliers using the LFSR and the Montgomery factors x^m and x^(m-1). The proposed multipliers cover different classes of irreducible polynomials: general polynomials, all-one polynomials, pentanomials and trinomials. The results show that the use of LFSRs simplifies the design of the multiplier architecture, reducing area resources while retaining high performance compared with related works.

42 citations
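
The LFSR-based hardware realises, in essence, the classical bit-serial Montgomery recurrence over GF(2^m): each iteration adds one partial product and performs a reduce-and-shift step, which is exactly what an LFSR computes. The behavioural model below shows that textbook recurrence, not the paper's specific architectures; polynomials are Python ints whose bit i is the coefficient of x^i.

```python
def mont_mul_gf2m(a, b, f, m):
    """Bit-serial Montgomery multiplication in GF(2^m): returns a*b*x^(-m) mod f.
    f must be irreducible with constant term 1, e.g. the trinomial
    x^233 + x^74 + 1."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:        # accumulate partial product a_i * b
            c ^= b
        if c & 1:               # make c divisible by x (f_0 = 1 guarantees this)
            c ^= f
        c >>= 1                 # exact division by x: the LFSR shift step
    return c

# Toy check in GF(2^3) with f = x^3 + x + 1 (0b1011):
print(bin(mont_mul_gf2m(0b110, 0b011, 0b1011, 3)))  # -> 0b110 = (x^2+x)(x+1)*x^-3
```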


Journal ArticleDOI
TL;DR: Results presented in this work highlight the need for thermal and electrical co-design in multi-strata microelectronics, and for reconciling manufacturing and design considerations in order to develop practical design tools for 3D ICs.
Abstract: Although the stacking of multiple strata to produce three-dimensional (3D) integrated circuits (ICs) improves interconnect length and hence reduces power and latency, it also exacerbates the thermal management challenge owing to the increased power density. There is a need for design tools to understand and optimise the trade-off between electrical and thermal design at the device and block levels. This study presents results from thermal-electrical co-optimisation for block-level floorplanning in a multi-die 3D IC under various manufacturing and physical design constraints. A method for temperature computation based on linearity of the governing energy equation is presented. This method is combined with previously reported electrical delay models for 3D ICs to simultaneously optimise both the maximum temperature and the interconnect length. It is shown that co-optimisation of thermal and electrical objectives results in a floorplan that is attractive from both perspectives. Physical design constraints arising from cost-effective 3D manufacturing – such as fully or partly identical dies using reciprocal design symmetry (RDS), differentiated technology in each die and thinned dies/wafers – are discussed and their impact on the thermal-electrical co-optimisation is investigated. In some cases, the cheapest manufacturing choice for each layer, such as using identical dies, may not result in an optimal thermal and electrical design. Results presented in this work highlight the need for thermal and electrical co-design in multi-strata microelectronics, and for reconciling manufacturing and design considerations in order to develop practical design tools for 3D ICs.

28 citations
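
The temperature computation exploits linearity of the governing equation: the steady-state rise at any block is the superposition of contributions from every block's power, so with a precomputed thermal-resistance matrix each candidate floorplan costs only a matrix-vector product. A minimal sketch, with made-up values:

```python
import numpy as np

# R[i][j]: temperature rise at block i per watt dissipated in block j (K/W).
# Values are illustrative, not from the paper.
R = np.array([[8.0, 2.0, 1.0],
              [2.0, 8.0, 2.0],
              [1.0, 2.0, 8.0]])
P = np.array([1.2, 0.4, 0.9])        # power vector of a candidate floorplan (W)
T_ambient = 45.0                     # deg C
T = T_ambient + R @ P                # superposition: one matrix-vector product
print(T.max())                       # max temperature, one co-optimisation objective
```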


Journal ArticleDOI
TL;DR: This study proposes a secure multiprocessor architecture to prevent side channel attacks, based on a dual-core algorithmic balancing technique, where two identical cores are used to foil a side channel attack.
Abstract: Side channel attackers observe external manifestations of internal computations in an embedded system to predict the encryption key employed. The ability to examine such external manifestations (power dissipation or electromagnetic emissions) is a major threat to secure embedded systems. This study proposes a secure multiprocessor architecture to prevent side channel attacks, based on a dual-core algorithmic balancing technique, where two identical cores are used. Both cores use a single clock and encrypt simultaneously, with one core executing the original encryption, whereas the second executes the complementary encryption. This effectively balances the crucial information from the power profile (note that it is the information and not the power profile itself), hiding the actual key from the adversary attempting an attack based on differential power analysis (DPA). The two cores normally execute different tasks, but will encrypt together to foil a side channel attack. The authors show that, when our technique is applied, DPA fails on the most common block ciphers, data encryption standard (DES) and advanced encryption standard (AES), leaving the attacker with little useful information with which to perpetrate an attack.

28 citations
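
The toy model below illustrates the balancing principle under a first-order Hamming-weight power model: when one core processes a value while the other processes its bitwise complement, the combined weight of every intermediate is constant, so the data dependence that DPA correlates against disappears. This shows only the underlying idea, not the processor design.

```python
W = 8                                   # register width in bits

def hw(x):                              # Hamming weight: a first-order power proxy
    return bin(x).count("1")

for x in (0x00, 0x5A, 0xFF):            # sample intermediate values
    comp = x ^ 0xFF                     # the complementary core's value
    assert hw(x) + hw(comp) == W        # combined "power" is data-independent
print("combined Hamming weight is always", W)
```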


Journal ArticleDOI
TL;DR: The experimental results show that savings in components and switching activity are achieved in most of the benchmarks tested, compared with recently published research.
Abstract: In this study, a new approach using a multi-objective genetic algorithm (MOGA) is proposed to determine the optimal state assignment with reduced area and power dissipation for completely and incompletely specified sequential circuits. The goal is to find the best assignments, which reduce the component count and switching activity. The MOGA employs a Pareto ranking scheme and produces a set of state assignments which are optimal in both objectives. The ESPRESSO tool is used to optimise the combinational parts of the sequential circuits. Experimental results are given using a personal computer with an Intel CPU of 2.4 GHz and 2 GB RAM. The algorithm is implemented using C++ and fully tested with benchmark examples. The experimental results show that savings in components and switching activity are achieved in most of the benchmarks tested, compared with recently published research.

23 citations
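
The Pareto ranking step at the heart of such a MOGA can be sketched as non-dominated filtering over the two objectives. The candidate assignments and their (component count, switching activity) scores below are invented for illustration:

```python
def pareto_front(candidates):
    """candidates: list of (assignment, components, switching); keep the
    assignments that no other candidate beats on both objectives."""
    front = []
    for a in candidates:
        dominated = any(b[1] <= a[1] and b[2] <= a[2] and
                        (b[1] < a[1] or b[2] < a[2]) for b in candidates)
        if not dominated:
            front.append(a)
    return front

cands = [("S0=00,S1=01,S2=10", 14, 0.42),
         ("S0=00,S1=11,S2=01", 12, 0.55),
         ("S0=01,S1=10,S2=00", 16, 0.38),
         ("S0=10,S1=01,S2=11", 16, 0.60)]   # dominated: worse on both counts
print(pareto_front(cands))                  # the last candidate is filtered out
```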


Journal ArticleDOI
TL;DR: An efficient test algorithm is proposed, called March-PCM, to test for special failure modes, known as disturbs, as well as other PCM-specific faults, and its performance is compared with some previously developed test algorithms.
Abstract: Chalcogenide-based phase change memory (PCM) is a type of non-volatile memory that will most likely replace the currently widespread flash memory. Current research on PCM targets the feasibility of integrating such memory technology into the currently used complementary metal oxide semiconductor (CMOS) process, as well as its reliability. Such studies identified special failure modes, known as disturbs, as well as other PCM-specific faults. In this study, the authors identify these failures, analyse their behaviours and develop fault primitives/models that describe these faults accurately and effectively. In addition, the authors propose an efficient test algorithm, called March-PCM, to test for these faults and compare its performance with some previously developed test algorithms.

19 citations
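
A March algorithm is a sequence of elements, each pairing an address order with read/write operations applied at every cell. The harness below shows that general structure on a behavioural memory model; the classic March C- sequence serves as a stand-in, since the actual March-PCM element sequence is the paper's contribution.

```python
def run_march(mem, elements):
    """elements: list of (order, ops); ops like "w1" (write 1) or "r0"
    (read, expect 0). Returns "pass" or a fault report."""
    n = len(mem)
    for order, ops in elements:
        addrs = range(n) if order == "up" else range(n - 1, -1, -1)
        for a in addrs:
            for op in ops:
                if op[0] == "w":
                    mem[a] = int(op[1])
                elif mem[a] != int(op[1]):
                    return f"fault at cell {a}: read {mem[a]}, expected {op[1]}"
    return "pass"

march_c_minus = [("up", ["w0"]), ("up", ["r0", "w1"]), ("up", ["r1", "w0"]),
                 ("down", ["r0", "w1"]), ("down", ["r1", "w0"]), ("down", ["r0"])]
print(run_march([0] * 16, march_c_minus))   # fault-free memory -> "pass"
```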


Journal ArticleDOI
TL;DR: The paper shows that the CPOG model is a very convenient formalism for efficient representation of processor instruction sets and provides a ground for a concise formulation of several encoding problems, which are reducible to the Boolean satisfiability (SAT) problem and can be efficiently solved by modern SAT solvers.
Abstract: There is a critical need for design automation in microarchitectural modelling and synthesis. One of the areas which lacks the necessary automation support is the synthesis of instruction codes targeting various design optimality criteria. This paper aims to fill this gap by providing a set of formal methods and a software tool for synthesis of instruction codes given the description of a processor as a set of instructions. The method is based on the conditional partial order graph (CPOG) model, which is a formalism for efficient specification and synthesis of microcontrollers. It describes a system as a functional composition of its behavioural scenarios, or instructions, each of them being a partial order of events. In order to distinguish instructions within a CPOG, they are given different encodings represented with Boolean vectors. The size and latency of the final microcontroller depend significantly on the chosen encodings; thus efficient synthesis of instruction codes is essential. The paper shows that the CPOG model is a very convenient formalism for efficient representation of processor instruction sets. It provides a ground for a concise formulation of several encoding problems, which are reducible to the Boolean satisfiability (SAT) problem and can be efficiently solved by modern SAT solvers.

18 citations
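
The CPOG idea itself fits in a few lines: a single graph whose vertices (events) and arcs (orderings) carry Boolean conditions over the opcode bits, so that projecting the graph onto a concrete opcode yields that instruction's partial order. The two-bit example "machine" below is hypothetical:

```python
# Conditions are predicates over the opcode bits (x1, x0).
vertices = {
    "fetch":  lambda x1, x0: True,
    "alu":    lambda x1, x0: not x1,
    "memory": lambda x1, x0: bool(x1),
    "write":  lambda x1, x0: True,
}
arcs = {
    ("fetch", "alu"):    lambda x1, x0: not x1,
    ("fetch", "memory"): lambda x1, x0: bool(x1),
    ("alu", "write"):    lambda x1, x0: not x1,
    ("memory", "write"): lambda x1, x0: bool(x1),
}

def project(x1, x0):
    """Restrict the CPOG to one opcode: the instruction's partial order."""
    events = [v for v, cond in vertices.items() if cond(x1, x0)]
    order = [e for e, cond in arcs.items() if cond(x1, x0)]
    return events, order

print(project(0, 0))   # ALU instruction: fetch -> alu -> write
print(project(1, 0))   # memory instruction: fetch -> memory -> write
```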


Journal ArticleDOI
TL;DR: This work proposes a hazard-free majority voter design for the triple-modular redundancy fault-tolerance design paradigm, which enters an output-holding state to preserve the output value when transient errors may be sensitised to its inputs.
Abstract: N-modular redundancy (NMR) is the simplest and most effective fault-tolerant design method for integrated circuits, where N copies of a circuit are employed and a majority voter produces the voted output. Asynchronous circuits, however, exhibit various characteristics that limit the applicability of NMR. Specifically, the hazard-free property of the output in these circuits must be preserved when hardware providing fault tolerance, such as a majority voter, is added. In this work, we first demonstrate that a typical majority voter design would fail to preserve the hazard-free property of its response. We then propose a hazard-free majority voter design for the triple-modular redundancy fault-tolerance design paradigm, which enters an output-holding state to preserve the output value when transient errors may be sensitised to its inputs. By exploring various conditions to exit from the output-holding state, we describe several extensions of the voter into an NMR one, each yielding a distinct implementation with different tolerance characteristics and area cost. We generalise this extension based on the exit condition and analyse the associated tolerance capability of the extended NMR voter. Finally, the proposed hazard-free voter is simulated using HSPICE, and detailed area cost formulations are derived for the proposed voter designs.

18 citations
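
A behavioural model of the output-holding idea: a plain two-level majority gate can glitch while a transient sits on one input, so this voter passes a value through only when all replicas agree and otherwise holds its previous output. This sketches the described behaviour, not the transistor-level design:

```python
class HoldingVoter:
    def __init__(self):
        self.out = 0                      # currently held output value

    def step(self, a, b, c):
        if a == b == c:                   # unanimous: safe to pass through
            self.out = a
        # otherwise stay in the output-holding state: no transition, no hazard
        return self.out

v = HoldingVoter()
print([v.step(*t) for t in [(0, 0, 0), (1, 0, 0), (1, 1, 1), (1, 0, 1)]])
# -> [0, 0, 1, 1]: the transients in samples 2 and 4 never reach the output
```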


Journal ArticleDOI
TL;DR: This study presents an FPGA-based distributed architecture for solving the single-source shortest-path problem in a fast and efficient manner based on the Bellman-Ford algorithm adapted to facilitate early termination of computation.
Abstract: There exist several practical applications that require high-speed shortest-path computations. In many situations, especially in embedded applications, a field programmable gate array (FPGA)-based accelerator for computing the shortest paths can help to achieve high performance at low cost. This study presents an FPGA-based distributed architecture for solving the single-source shortest-path problem in a fast and efficient manner. The proposed architecture is based on the Bellman-Ford algorithm adapted to facilitate early termination of computation. One of the novelties of the architecture is that it does not involve any centralised control and the processing elements (PEs), which are identical in construction, operate in perfect synchronisation with each other. The functional correctness of the design has been verified through simulations and also in actual hardware. It has been shown that the implementation on a Xilinx Virtex-5 FPGA is more than twice as fast as a software implementation of the algorithm on a high-end general-purpose processor that runs at an order-of-magnitude faster clock. The speed-up offered by the design can be further improved by adopting an interconnection topology that maximises the data transfer rate among the PEs.

16 citations
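
A software reference for the algorithm the PEs implement: Bellman-Ford with the early-termination adaptation, stopping as soon as a complete relaxation pass changes no distance instead of always running |V| - 1 passes:

```python
import math

def bellman_ford_early(n, edges, src):
    """edges: list of (u, v, weight); returns shortest distances from src."""
    dist = [math.inf] * n
    dist[src] = 0
    for _ in range(n - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:     # relax edge u -> v
                dist[v] = dist[u] + w
                changed = True
        if not changed:                   # early termination: converged
            break
    return dist

print(bellman_ford_early(4, [(0, 1, 2), (1, 2, 3), (0, 2, 6), (2, 3, 1)], 0))
# -> [0, 2, 5, 6]
```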


Journal ArticleDOI
TL;DR: The authors offer a decimal division scheme that takes advantage of the best design options of D1 and D2 with due modifications that significantly enhance the division speed and removes the rounding cycle by cost-free auto-rounding.
Abstract: The authors study previous major contributions to digit recurrence decimal division hardware and focus on techniques for improving the performance of quotient digit selection (QDS) as the most complex part. In particular, Design D1 uses the digit set [-5, 5] for quotient digits. Another design (D2) uses mixed binary/decimal carry-save manipulation of the few most significant digits of partial remainders. Motivated by successful combined arithmetic algorithms such as hybrid adders, the authors offer a decimal division scheme that takes advantage of the best design options of D1 and D2, with due modifications that significantly enhance the division speed. In particular, they configure the architectures of the QDS and partial remainder computation paths in favour of reduced, balanced latencies for both. Furthermore, they remove the rounding cycle by cost-free auto-rounding, which is an exclusive advantage of the digit set [-5, 5]. The authors of D1 and D2 have used logical effort (LE) and circuit synthesis to evaluate their dividers, respectively. Therefore, for a fair comparison, the authors evaluate the proposed design (D3) with both methods. The LE-based D3/D1 comparison shows 21% more speed at the cost of 6% more area, whereas the synthesis-based D3/D2 comparison results in 46% less latency and 23% less area.
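
The digit-recurrence core with the signed digit set [-5, 5] can be sketched as follows: each step scales the partial remainder by 10, selects a quotient digit that keeps the remainder bounded, and subtracts. Exact rational arithmetic stands in for the carry-save hardware, and the rounding-based selection below is a simplification of the paper's QDS logic; representability with this digit set requires |x| <= (5/9)*d.

```python
from fractions import Fraction

def decimal_divide(x, d, digits):
    """Radix-10 recurrence w <- 10*w - q*d with quotient digits in [-5, 5].
    Requires |x| <= (5/9)*d so every digit stays representable."""
    w, q = Fraction(x), []
    for _ in range(digits):
        w *= 10
        qj = max(-5, min(5, round(w / Fraction(d))))   # simplified digit selection
        q.append(qj)
        w -= qj * Fraction(d)
    return q

q = decimal_divide(Fraction(13, 100), Fraction(3, 10), 6)
print(q)                                    # -> [4, 3, 3, 3, 3, 3]
value = sum(qi * Fraction(10) ** -(i + 1) for i, qi in enumerate(q))
print(float(value))                         # -> 0.433333, i.e. 0.13 / 0.3
```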

Journal ArticleDOI
TL;DR: The authors show that HARD can be configured to achieve both performance and power improvements and compare HARD to an alternative dynamic scheduler and a static scheduler to provide better understanding.
Abstract: The authors introduce a history-aware, resource-based dynamic (or simply HARD) scheduler for heterogeneous chip multi-processors (CMPs). HARD relies on recording application resource utilisation and throughput to adaptively change cores for applications during runtime. The authors show that HARD can be configured to achieve both performance and power improvements and compare HARD to an alternative dynamic scheduler and a static scheduler to provide better understanding.

Journal ArticleDOI
TL;DR: A programmable and configurable motion estimation (ME) processor capable of performing ME across several state-of-the-art video codecs that include multiple tools to improve the accuracy of the calculated motion vectors.
Abstract: This study presents a programmable and configurable motion estimation (ME) processor capable of performing ME across several state-of-the-art video codecs that include multiple tools to improve the accuracy of the calculated motion vectors. The core can be programmed using a C-style syntax optimised to implement arbitrary block matching algorithms and configured with different execution units depending on the selected codec, the available inter-coding options and required performance. This flexibility means that the core can support the latest video codecs such as H.264, VC-1 and AVS at high-definition resolutions and frame rates. The configuration and programming phases are supported by an integrated development environment that includes a compiler and profiling tools enabling a designer without specific hardware knowledge to optimise the microarchitecture for the selected codec standard and motion search technique leading to a highly efficient implementation.

Journal ArticleDOI
TL;DR: This work addresses the state encoding problem with the objective of minimising peak current in synchronous finite state machine (FSM) circuits, together with an efficient SAT-based heuristic that solves the state re-encoding problem for minimising switching power without deteriorating the minimum peak current.
Abstract: As silicon process technology advances, chip reliability becomes more and more important. One of the critical factors that affect chip reliability is the peak current in the circuit. In particular, high current peaks at the time of state transition in synchronous finite state machine (FSM) circuits often make the circuits very unstable in execution. This work addresses the state encoding problem with the objective of minimising peak current in FSMs. Previous power-aware state encoding algorithms, whose primary objective is to reduce the switching activity of the state register, either do not address peak current at all or treat it only as a secondary objective, which severely limits the search space of state encodings for minimising peak current. The proposed algorithm, called SAT-pc, instead places the importance on reliability, that is, peak current. Specifically, the authors solve two important state encoding problems in two phases: (Phase 1) the authors present a solution to the problem of state encoding for directly minimising peak current, by formulating it as a SAT problem with pseudo-Boolean expressions, which leads to a full exploration of the search space; (Phase 2) the authors then propose an efficient SAT-based heuristic to solve the state re-encoding problem for minimising switching power without deteriorating the minimum peak current obtained in Phase 1. Through experimentation using MCNC benchmarks, it is shown that SAT-pc is able to reduce the peak current by 51% and 35%, compared to POW3 [4], which minimises the switching power only, and POW3 [14] + [24], which minimises the switching power and then peak current, respectively.
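
The cost model behind such encoders is easy to make concrete: a transition from state s to t flips HD(code(s), code(t)) state flip-flops simultaneously, so the maximum Hamming distance over all transitions is a proxy for peak current, and the frequency-weighted sum a proxy for switching power. A sketch with a hypothetical three-state FSM:

```python
def hd(a, b):                    # Hamming distance between two state codes
    return bin(a ^ b).count("1")

def evaluate(encoding, transitions):
    """encoding: {state: code}; transitions: [(src, dst, rel_freq), ...];
    returns (peak-current proxy, switching-power proxy)."""
    peak = max(hd(encoding[s], encoding[t]) for s, t, _ in transitions)
    power = sum(f * hd(encoding[s], encoding[t]) for s, t, f in transitions)
    return peak, power

trans = [("A", "B", 0.5), ("B", "C", 0.3), ("B", "A", 0.2)]
print(evaluate({"A": 0b00, "B": 0b01, "C": 0b11}, trans))   # (1, 1.0)
print(evaluate({"A": 0b00, "B": 0b11, "C": 0b01}, trans))   # (2, 1.7): worse peak
```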

Journal ArticleDOI
TL;DR: Through-silicon-via (TSV)-based 3D integration technology will allow integration of diverse functionality to realise energy-efficient and affordable compact systems that will continue to deliver higher performance.
Abstract: Promise of form-factor reduction and hybrid process integration by three-dimensional (3D)-stacked integrated circuits (3DICs) has spurred interest in both academia and industry. In this study, through-silicon-via (TSV)-based 3D integration is discussed from a microprocessor centric view. The authors present the challenges faced by technology scaling and provide 3D integration as a possible solution. The applications for 3DICs are discussed with details of a few prototypes. The issues and challenges associated with 3D integration technologies are also addressed. TSV-based 3D integration technology will allow integration of diverse functionality to realise energy-efficient and affordable compact systems that will continue to deliver higher performance.

Journal ArticleDOI
TL;DR: The authors present an optimal solution based on an integer linear programming model as well as two polynomial-time heuristic solutions for wrapper optimisation in 3D ICs based on through-silicon vias (TSVs) for vertical interconnects.
Abstract: System-on-chip (SOC) designs comprising a number of embedded cores are widespread in today's integrated circuits. Embedded core-based design is likely to be equally popular for three-dimensional integrated circuits (3D ICs), the manufacture of which has become feasible in recent years. 3D integration offers a number of advantages over traditional 2D technologies, such as the reduction in the average interconnect length, higher performance, lower interconnect power consumption and smaller IC footprint. Despite recent advances in 3D fabrication and design methods, no attempt has been made thus far to design a 1500-style test wrapper for an embedded core that spans multiple layers in a 3D SOC. This study addresses wrapper optimisation in 3D ICs based on through-silicon vias (TSVs) for vertical interconnects. The authors' objective is to minimise the scan-test time for a core under constraints on the total number of TSVs available for testing. The authors present an optimal solution based on an integer linear programming model as well as two polynomial-time heuristic solutions. Simulation results are presented for embedded cores from the ITC 2002 SOC test benchmarks.

Journal ArticleDOI
TL;DR: The history index of correct computation (HICC) is examined in a recursive and non-recursive fault-tolerant approach at the bit and module levels to identify reliable blocks on-the-fly and forward their computation results, while ignoring results from unreliable blocks.
Abstract: Future nano-scale devices are expected to shrink to ever smaller dimensions, to operate at low voltages and high frequencies, to be more sensitive to environmental influences and to be characterised by high dynamic fault rates and defect densities. Fundamentally new fault-tolerant architectures are required in order to produce reliable systems that will operate correctly. Simple replication of micro-architecture blocks will no longer suffice, as all replicated blocks will have faults. The history index of correct computation (HICC) is examined in a recursive and non-recursive fault-tolerant approach at the bit and module levels to identify reliable blocks on-the-fly and forward their computation results, while ignoring results from unreliable blocks. Simulation results show that recursive and non-recursive HICC offers the best resilience to faults when faults are non-uniformly distributed among redundant blocks. A correct computation rate of 99% is achieved using the recursive HICC when decision units at the bit and module levels are fault free, despite an average fault injection rate of 20%, compared to a 68% correct computation rate for the recursive triple modular redundancy voter. When faults are injected everywhere in the design, the non-recursive HICC supports the best correct computation percentage. The effects of circuit size and history indices are also examined and discussed.
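
The history-index principle can be sketched behaviourally: each redundant block keeps a saturating score of past agreement with the consensus, the consensus is weighted by those scores, and a persistently disagreeing block decays until it is effectively ignored rather than merely out-voted. The update rule and counter depth below are illustrative assumptions, not the paper's exact HICC definition:

```python
from collections import Counter

class HistorySelector:
    def __init__(self, n_blocks, depth=8):
        self.score = [depth] * n_blocks     # saturating agreement counters
        self.depth = depth

    def step(self, outputs):
        tally = Counter()                   # consensus weighted by track record
        for out, s in zip(outputs, self.score):
            tally[out] += s
        winner = tally.most_common(1)[0][0]
        for i, out in enumerate(outputs):   # update each block's history index
            self.score[i] = (min(self.depth, self.score[i] + 1) if out == winner
                             else max(0, self.score[i] - 1))
        return winner

sel = HistorySelector(3)
# Block 2 has gone faulty; its score decays while the consensus stays correct:
print([sel.step(o) for o in [(1, 1, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]])
# -> [1, 0, 1, 1]
```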

Journal ArticleDOI
TL;DR: The simulation result shows that MRAM stacking can provide competitive instruction-per-cycle (IPC) performance with a large reduction in power consumption when MRAM is compared against its static random access memory (SRAM) and dynamic random access memory (DRAM) counterparts.
Abstract: Magnetic random access memory (MRAM) has been considered as a promising memory technology because of its attractive properties such as non-volatility, fast access, zero standby leakage and high density. Although integrating MRAM with complementary metal-oxide-semiconductor (CMOS) logic may incur extra manufacturing cost because of the hybrid magnetic-CMOS fabrication process, it is feasible and cost-effective to fabricate MRAM and CMOS logic separately and then integrate them using 3D stacking. In this work, we first studied the MRAM properties and built an MRAM cache model in terms of performance, energy and area. Using this model, we evaluated the impact of stacking MRAM caches atop microprocessor cores and compared MRAM against its static random access memory (SRAM) and dynamic random access memory (DRAM) counterparts. Our simulation result shows that MRAM stacking can provide competitive instruction-per-cycle (IPC) performance with a large reduction in power consumption.

Journal ArticleDOI
TL;DR: This study outlines an approach for reducing the dynamic power consumption of a class of fast algorithms by minimising the index space separation, which allows the generation of field programmable gate array (FPGA) implementations with reduced power consumption.
Abstract: Dynamic power consumption is very dependent on interconnect, so clever mapping of digital signal processing algorithms to parallelised realisations with data locality is vital. This is a particular problem for fast algorithm implementations, where typically designers will have sacrificed circuit structure for efficiency in software implementation. This study outlines an approach for reducing the dynamic power consumption of a class of fast algorithms by minimising the index space separation; this allows the generation of field programmable gate array (FPGA) implementations with reduced power consumption. It is shown how a 50% reduction in relative index space separation results in measured power gains of 36% and 37% over a Cooley–Tukey fast Fourier transform (FFT)-based solution, for actual power measurements on a Xilinx Virtex-II FPGA implementation and circuit measurements on a Xilinx Virtex-5 implementation, respectively. The authors show the generality of the approach by applying it to a number of other fast algorithms, namely the discrete cosine, discrete Hartley and Walsh–Hadamard transforms.

Journal ArticleDOI
TL;DR: A floating-point synthetic aperture radar processor that achieves a power efficiency of 18.0 mW/GFlop in simulation through the use of three-dimensional (3D) integration and reconfiguration of the data path is presented.
Abstract: In this study, the authors present a floating-point synthetic aperture radar (SAR) processor that achieves a power efficiency of 18.0 mW/GFlop in simulation through the use of three-dimensional (3D) integration and reconfiguration of the data path. The reconfiguration reduces the number of arithmetic units required in every processing element (PE) from 24 down to 10. The processor uses a 3D integrated memory that reduces the memory power consumption by 70% when compared to a 2D memory. The system processes a SAR image using a two-tier 3D integrated PE, which when compared to an equivalent 2D PE decreases the power consumed in the interconnect of each PE by 15.5% and the footprint by 49.2%, and allows the PE to operate 7.1% faster in simulation. Furthermore, the authors show how the 3D aspects of the processor can be realised using 2D tools in conjunction with the proposed through-silicon via assignment algorithm.

Journal ArticleDOI
TL;DR: This study presents a novel method for implementing any m-of-n-encoded function block using ‘bounded gates’, where any gate may be decomposed without violating indication by successively decomposing the input encoding into smaller unordered codes.
Abstract: Self-timed circuits present an attractive solution to the problem of process variation. However, implementing self-timed combinational logic is complex and expensive. As there are no external timing references, data must be encoded within an unordered (DI) encoding and the outputs of functions must indicate to the environment that transitions on inputs and internal signals have taken place. Mapping large function blocks into cell-libraries is extremely difficult as decomposing gates introduces new signals which may violate indication. This study presents a novel method for implementing any m-of-n-encoded function block using ‘bounded gates’, where any gate may be decomposed without violating indication. This is achieved by successively decomposing the input encoding into smaller unordered codes. The study presents algorithms to determine and quantify potential re-encodings. An exact branch and bound approach to the solution is shown, but the complexity of determining unordered encodings restricts the size of function blocks that may be decomposed. To overcome this problem, an approach has been proposed that uses algebraic extraction techniques to efficiently determine and quantify potential encodings. The results of the synthesis procedures are demonstrated on a range of combinational function blocks.
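
A small helper makes the starting point concrete: every codeword of an m-of-n code has exactly m ones, so no codeword can strictly contain another. This is the unordered (DI) property that the decomposition procedure must preserve when it re-encodes inputs into smaller codes:

```python
from itertools import combinations

def m_of_n_codewords(m, n):
    """All codewords of the m-of-n code, as sets of asserted wire indices."""
    return [frozenset(c) for c in combinations(range(n), m)]

def is_unordered(code):
    # Unordered: no codeword is a strict subset of another.
    return all(not a < b for a in code for b in code)

two_of_four = m_of_n_codewords(2, 4)
print(len(two_of_four), is_unordered(two_of_four))   # -> 6 True
```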

Journal ArticleDOI
TL;DR: The authors address the problem of reducing power dissipation of the instruction bus by reordering the instructions in basic blocks without increasing the execution time or the code size, while maintaining the original functionality of the programme.
Abstract: Execution time is no longer the only target to achieve when designing programmes for today's and next-generation CMOS-based digital systems. One needs to also consider reducing power dissipation. Buses contribute to the power dissipation during the execution of a given programme, since instructions and/or operands have to be fetched from memory. Reducing power dissipation in buses has been addressed in the literature. In this study, the authors address the problem of reducing power dissipation of the instruction bus by reordering the instructions in basic blocks without increasing the execution time or the code size, and while maintaining the original functionality of the programme. The authors target embedded processors having a Harvard architecture. They focus on solving this problem for programmes developed at the assembly level, since at that level the machine code can be obtained by simply running an assembler, which allows an accurate computation of switching activities on the instruction bus by considering each pair of instructions. The authors formulate this problem as an integer linear programme (ILP), and they provide two heuristics. Experimental results have shown that the proposed approach can reduce switching activities. The ILP has reduced switching activities by as much as 38%. One of the two proposed heuristics has always resulted in reducing switching activities, and its relative savings are within an average of 5% of the optimum produced using the ILP.
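
The quantity being minimised is simple to state: the switching activity of a basic block is the sum of Hamming distances between the encodings of consecutively fetched instructions. The sketch below computes that cost and applies a naive nearest-neighbour reordering; unlike the paper's ILP and heuristics, it ignores data dependences, and the 16-bit opcodes are hypothetical:

```python
def hd(a, b):
    return bin(a ^ b).count("1")

def bus_switching(block):
    """Total bit flips on the instruction bus across the basic block."""
    return sum(hd(block[i], block[i + 1]) for i in range(len(block) - 1))

def greedy_reorder(block):
    """Keep the first instruction fixed, then repeatedly append the remaining
    instruction whose encoding is closest to the last one fetched."""
    rest, order = list(block[1:]), [block[0]]
    while rest:
        nxt = min(rest, key=lambda w: hd(order[-1], w))
        rest.remove(nxt)
        order.append(nxt)
    return order

block = [0x8B45, 0x03C2, 0x8B4D, 0x0FAF, 0x8B55]
print(bus_switching(block), bus_switching(greedy_reorder(block)))   # -> 27 17
```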

Journal ArticleDOI
TL;DR: The authors improve their former work by presenting algorithms for identifying delay transitions and inserting gyroscopes for specifications having a much more general structure and are now able to synthesise controllers from real-life specifications.
Abstract: Logic synthesis of speed independent circuits based on signal transition graph (STG) decomposition is a promising approach to tackle complexity problems like state-space explosion. Unfortunately, decomposition can result in components that in isolation have irreducible complete state coding conflicts. In earlier work, the authors showed how to resolve such conflicts by introducing internal communication between components, but only for very restricted specification structures. Here, they improve their former work by presenting algorithms for identifying delay transitions and inserting gyroscopes for specifications having a much more general structure. Thus, the authors are now able to synthesise controllers from real-life specifications. For all algorithms, they present correctness proofs and show their successful application to benchmarks, including very complex STGs arising in the context of control resynthesis.

Journal ArticleDOI
TL;DR: This work presents an efficient approach to model check safety properties expressed in PSL (IEEE Std 1850 Property Specification Language), an industrial property specification language, and handles a larger syntactic subset of PSL safety properties than earlier translations.
Abstract: Safety properties are an important class of properties, as in the industrial use of model checking, a large majority of the properties to be checked are safety properties. This work presents an efficient approach to model check safety properties expressed in PSL (IEEE Std 1850 Property Specification Language), an industrial property specification language. The approach can also be used as a sound but incomplete bug-hunting tool for general (non-safety) PSL properties, and it will detect exactly the finite counterexamples that are the informative bad prefixes for the PSL formulae in question. The presented technique is inspired by the temporal testers approach of Pnueli and co-authors, but unlike theirs, the proposed approach is aimed at finding finite counterexamples to properties. The new approach presented here handles a larger syntactic subset of PSL safety properties than earlier translations for PSL safety subsets and has been implemented on top of the open source NuSMV 2 model checker. The experimental results show the approach to be a quite competitive model checking approach when compared to a state-of-the-art implementation of PSL model checking.

Journal ArticleDOI
TL;DR: Advanced and dynamic calibration techniques for maximising the link performance of parallel source-synchronous interfaces are introduced and demonstrated in this study, using as a case study a 533-MHz DDR2 SDRAM memory interface implemented in 90-nm standard CMOS.
Abstract: Advanced and dynamic calibration techniques for maximising the link performance of parallel source-synchronous interfaces are introduced and demonstrated in this study, using as a case study a 533-MHz DDR2 SDRAM memory interface implemented in a 90-nm standard complementary metal-oxide-semiconductor (CMOS) process; most of the techniques have also been validated at 800 MHz. A novel dynamic strobe masking system (DSMS) has also been employed which, in contrast to traditional techniques, adjusts the length of the masking signal dynamically in real time, based on the incoming strobe. Furthermore, optimal data capture is achieved by employing a fast bit-deskew calibration engine, and a novel I/O calibration scheme is also included. Post-layout simulation results demonstrate that the dynamic calibration and skew compensation techniques employed improve the timing margin while providing advanced robustness over process, voltage and temperature variations.

Journal ArticleDOI
TL;DR: A straightforward filtering mechanism is introduced, which results in a more energy-efficient design than past techniques, using less and simpler hardware, and provides new opportunities for extra types of filtering, which lead to higher energy savings.
Abstract: In the last few years, many researchers have focused their efforts on the field of low-power processor design. Several works in this area have dealt with the logic that enforces correct memory-based dependences – the load-store queue (LSQ) – a rather energy-consuming structure, since many accesses are performed in an associative fashion. Among these proposals, some manage to reduce this resource's energy consumption by avoiding unnecessary lookups. In this context, the authors introduce a straightforward filtering mechanism, which results in a more energy-efficient design than past techniques, using less and simpler hardware. In addition, both the new scheme and some previous approaches are tested on the widespread x86 architecture. This microarchitectural model provides new opportunities for extra types of filtering, which lead to higher energy savings. On average, the authors' proposal filters up to 75% of the associative accesses to the load queue, 56% to the store queue and 42% to the dependence predictor with a reduced amount of hardware – less than 100 bytes. According to their energy model, this means a dynamic energy saving of more than 39% over a conventional LSQ.
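
One generic way to filter associative lookups, sketched under assumptions (the paper's mechanism is its own, simpler design): keep a small presence vector over hashed store addresses, so a load whose bucket count is zero provably has no matching store in the queue and can skip the associative search entirely.

```python
FILTER_SIZE = 64                           # small counter array, tens of bytes

class StoreQueueFilter:
    def __init__(self):
        self.counts = [0] * FILTER_SIZE

    def _bucket(self, addr):
        return (addr >> 3) % FILTER_SIZE   # hash on the 8-byte block address

    def insert(self, addr):                # a store enters the store queue
        self.counts[self._bucket(addr)] += 1

    def remove(self, addr):                # the store commits and leaves
        self.counts[self._bucket(addr)] -= 1

    def may_match(self, addr):             # False => skip the associative lookup
        return self.counts[self._bucket(addr)] > 0

f = StoreQueueFilter()
f.insert(0x1000)
print(f.may_match(0x1000), f.may_match(0x2008))   # -> True False
```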

Journal ArticleDOI
TL;DR: A novel metric called crosstalk error rate is developed, which can be a valuable tool for designers to predict crosstalk effects and explore interconnect design techniques in order to achieve the target performance with minimum overheads.
Abstract: The impact of crosstalk noise on the resilience of on-chip communication links in the presence of parametric variations is investigated. A novel metric called crosstalk error rate is developed which can be a valuable tool for designers to predict the crosstalk effects and explore interconnect design techniques in order to achieve the target performance with minimum overheads. Closed-form expressions of crosstalk error rate are presented. This metric is used to compare different crosstalk avoidance methods in the 90 nm technology.

Journal ArticleDOI
TL;DR: Token cages improve the performance of join controllers that use the early-evaluation firing rule and half-buffer retiming allows the creation of input queues by relocating one of the latches of the elastic buffer which follows the join controller.
Abstract: Synchronous elastic circuits borrow the tolerance of computation and communication latencies from the asynchronous design style. The datapath is made elastic by turning registers into elastic buffers and adding a control layer that uses synchronous handshake signals and join/fork controllers. Join elements are the objective of two improvements discussed in this study. Half-buffer retiming allows the creation of input queues by relocating one of the latches of the elastic buffer which follows the join controller. Token cages improve the performance of join controllers that use the early-evaluation firing rule. Their effect on throughput is discussed by means of examples representative of typical topologies, simulations with synthetic benchmarks and a realistic microarchitecture. Area and power costs of the control logic and the possible impact on the datapath are evaluated, based on the results of logic synthesis experiments on a 45 nm CMOS technology.

Journal ArticleDOI
TL;DR: This study proposes a novel asynchronous dispatching (AD) algorithm for general three-stage Clos networks that avoids the contention in central modules using a state feedback scheme and outperforms the throughput of CRRD in behavioural simulations.
Abstract: Clos networks provide theoretically optimal solution to build high-radix switches. Dynamically reconfiguring a three-stage Clos network is more difficult in asynchronous circuits than in synchronous circuits. This study proposes a novel asynchronous dispatching (AD) algorithm for general three-stage Clos networks. It is compared with the classic synchronous concurrent round-robin dispatching (CRRD) algorithm in unbuffered Clos networks. The AD algorithm avoids the contention in central modules using a state feedback scheme and outperforms the throughput of CRRD in behavioural simulations. Two asynchronous Clos networks using the AD algorithm are implemented and compared with a synchronous Clos network using the CRRD algorithm. The asynchronous Clos scheduler is smaller than its synchronous counterpart. Synchronous Clos networks achieve higher throughput than asynchronous Clos networks because asynchronous Clos networks cannot hide the arbitration latency and their data paths are slow. The asynchronous Clos scheduler consumes significantly lower power than the synchronous scheduler and the asynchronous Clos network using bundled-data data switches shows the best power efficiency in all implementations.

Journal ArticleDOI
TL;DR: The proposed reconfigurable architecture achieves a large computational capability while still providing a high degree of flexibility; it can be specified in a high-level language and also provides increased hardware resource usage.
Abstract: The development of multiple communication standards and services has created the need for a flexible and efficient computational platform for baseband signal processing. Using a set of heterogeneous reconfigurable execution units (RCEUs) and a homogeneous control mechanism, the proposed reconfigurable architecture achieves a large computational capability while still providing a high degree of flexibility. Software tools and a library of commonly used algorithms are also proposed in this paper to provide a convenient framework for hardware generation and algorithm mapping. In this way, the architecture can be specified in a high-level language, and it also provides increased hardware resource usage. Finally, we evaluate the system's performance on representative algorithms, specifically a 32-tap finite impulse response (FIR) filter and a 256-point fast Fourier transform (FFT), and compare them with commercial digital signal processor (DSP) chips as well as with other reconfigurable and multi-core architectures.