
Showing papers in "IEEE Transactions on Computers in 1996"


Journal ArticleDOI
TL;DR: This approach is based on the conjecture that the following problem is NP-complete: given an OBDD G representing f and a size bound s, does there exist an OBDD G* (respecting an arbitrary variable ordering) representing f with at most s nodes?
Abstract: Ordered binary decision diagrams (OBDDs) are a useful representation of Boolean functions if a good variable ordering is known. Variable orderings are computed by heuristic algorithms and then improved with local search and simulated annealing algorithms. This approach is based on the conjecture that the following problem is NP-complete: given an OBDD G representing f and a size bound s, does there exist an OBDD G* (respecting an arbitrary variable ordering) representing f with at most s nodes? The paper proves this conjecture.

599 citations


Journal ArticleDOI
TL;DR: The proposed method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known, and it is easy to incorporate this method in silicon compilation or logic synthesis tools.
Abstract: This paper presents a method and an algorithm for the generation of a parallel multiplier optimized for speed. The method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known. Most importantly, it is easy to incorporate this method in silicon compilation or logic synthesis tools. The parallel multiplier produced by the proposed method outperforms other schemes used for comparison in our experiment, and it uses the minimal number of cells in the partial product reduction tree. These findings are tested on design examples simulated in 1 μm CMOS ASIC technology.
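The paper's delay-driven construction is not spelled out in the abstract, but the object it optimizes, the partial product reduction tree, is easy to illustrate. The sketch below is illustrative only, not the paper's algorithm (`reduction_stages` is a hypothetical helper): it counts the carry-save stages a Wallace/Dadda-style tree needs to compress n partial-product rows down to two using 3:2 counters.

```python
# Sketch: stage count for reducing n partial-product rows with 3:2 counters
# (carry-save adders), as in Wallace/Dadda-style reduction trees. This shows
# the reduction-tree idea only; the paper's delay-optimized algorithm differs.

def reduction_stages(n: int) -> int:
    """Number of 3:2-counter stages needed to compress n rows to 2."""
    stages = 0
    while n > 2:
        # Each stage maps groups of 3 rows to 2 (sum + carry); leftovers pass through.
        n = 2 * (n // 3) + (n % 3)
        stages += 1
    return stages

for rows in (4, 8, 16, 32):
    print(rows, "rows ->", reduction_stages(rows), "stages")
```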

370 citations


Journal ArticleDOI
TL;DR: Experimental tests on graph problems with published solutions showed that the new genetic algorithms performed comparably to or better than the multistart Kernighan-Lin algorithm and the simulated annealing algorithm.
Abstract: Hybrid genetic algorithms (GAs) for the graph partitioning problem are described. The algorithms include a fast local improvement heuristic. One of the novel features of these algorithms is the schema preprocessing phase, which improves the GAs' space-searching capability and, in turn, their performance. Experimental tests on graph problems with published solutions showed that the new genetic algorithms performed comparably to or better than the multistart Kernighan-Lin algorithm and the simulated annealing algorithm. Analyses of some special classes of graphs are also provided, showing the usefulness of schema preprocessing and supporting the experimental results.
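For readers unfamiliar with the local improvement step such hybrid GAs embed, here is a minimal sketch of the cut-size objective and one greedy pass of balanced pairwise swaps. It is a stand-in for, not a reproduction of, the paper's Kernighan-Lin-style heuristic and schema preprocessing; `cut_size` and `one_swap_pass` are illustrative names.

```python
# Sketch: cut-size objective and one greedy swap pass for balanced
# bipartitioning. Illustrative only.
import itertools

def cut_size(edges, side):
    """side maps node -> 0/1; returns the number of edges crossing the cut."""
    return sum(1 for u, v in edges if side[u] != side[v])

def one_swap_pass(edges, side):
    """One greedy pass: keep any balanced pairwise swap that lowers the cut."""
    best = cut_size(edges, side)
    for u, v in itertools.combinations(list(side), 2):
        if side[u] != side[v]:                       # swap preserves balance
            side[u], side[v] = side[v], side[u]
            c = cut_size(edges, side)
            if c < best:
                best = c                             # keep the improving swap
            else:
                side[u], side[v] = side[v], side[u]  # undo
    return best

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
side = {0: 0, 1: 0, 2: 1, 3: 1}
print(one_swap_pass(edges, side), side)
```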

340 citations


Journal ArticleDOI
TL;DR: In this paper, an Address Resolution Buffer (ARB) is proposed for dynamic reordering of memory references in the sequential instruction stream; it supports disambiguation of memory reference addresses in a decentralized manner.
Abstract: To exploit instruction level parallelism, it is important not only to execute multiple memory references per cycle, but also to reorder memory references, especially to execute loads before stores that precede them in the sequential instruction stream. To guarantee correctness of execution in such situations, memory reference addresses have to be disambiguated. This paper presents a novel hardware mechanism, called an Address Resolution Buffer (ARB), for performing dynamic reordering of memory references. The ARB supports the following features: (1) dynamic memory disambiguation in a decentralized manner, (2) multiple memory references per cycle, (3) out-of-order execution of memory references, (4) unresolved loads and stores, (5) speculative loads and stores, and (6) memory renaming. The paper presents the results of a simulation study that we conducted to verify the efficacy of the ARB for a superscalar processor. The paper also shows the ARB's application in a multiscalar processor.
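As rough intuition for the mechanism (not the ARB's actual hardware organization), the toy model below hashes addresses to banks and records per-address load/store entries tagged with instruction sequence numbers; a store that finds an already-executed, logically later load to the same address flags a violation. All names here are hypothetical.

```python
# Toy model in the spirit of an ARB: each bank row records, per address,
# which sequence numbers have issued loads or stores. A store that finds an
# already-executed load with a *later* sequence number signals that the load
# was mis-speculated and must be squashed. (Greatly simplified.)

class ToyARB:
    def __init__(self, banks=4):
        self.banks = [dict() for _ in range(banks)]  # addr -> list of (seq, kind)

    def _row(self, addr):
        return self.banks[addr % len(self.banks)]

    def load(self, seq, addr):
        self._row(addr).setdefault(addr, []).append((seq, "load"))

    def store(self, seq, addr):
        row = self._row(addr).setdefault(addr, [])
        violation = any(k == "load" and s > seq for s, k in row)
        row.append((seq, "store"))
        return violation  # True -> squash and re-execute the later load

arb = ToyARB()
arb.load(seq=7, addr=0x40)          # load executed early, out of order
print(arb.store(seq=3, addr=0x40))  # older store arrives later -> True
```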

266 citations


Journal ArticleDOI
TL;DR: A bit-parallel structure for a multiplier with low complexity in Galois fields is introduced, and a complete set of primitive field polynomials for composite fields is provided which perform modulo reduction with low complexity.
Abstract: A bit-parallel structure for a multiplier with low complexity in Galois fields is introduced. The multiplier operates over composite fields GF((2^n)^m), with k = nm. The Karatsuba-Ofman algorithm (A. Karatsuba and Y. Ofman, 1963) is investigated and applied to the multiplication of polynomials over GF(2^n). It is shown that this operation has a complexity of order O(k^(log2 3)) under certain constraints regarding k. A complete set of primitive field polynomials for composite fields is provided which perform modulo reduction with low complexity. As a result, multipliers for fields GF(2^k) up to k=32 with low gate counts and low delays are listed. The architectures are highly modular and thus well suited for VLSI implementation.
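The Karatsuba-Ofman step itself is standard: splitting each operand into two halves lets three half-size products replace four. A minimal sketch for polynomials over GF(2), packed into Python integers (XOR is coefficient addition), is shown below; the modulo reduction by the field polynomial and the paper's composite-field optimizations are omitted.

```python
# Sketch: Karatsuba-Ofman multiplication of polynomials over GF(2), with
# coefficients packed into Python ints (XOR = polynomial addition).

def gf2_mul_school(a: int, b: int) -> int:
    """Carry-less (GF(2)[x]) multiplication by shift-and-XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf2_mul_karatsuba(a: int, b: int, n: int) -> int:
    """Multiply polynomials of degree < n (n a power of two) over GF(2)."""
    if n <= 4:
        return gf2_mul_school(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    p0 = gf2_mul_karatsuba(a0, b0, h)
    p2 = gf2_mul_karatsuba(a1, b1, h)
    pm = gf2_mul_karatsuba(a0 ^ a1, b0 ^ b1, h)  # 3 subproducts instead of 4
    return p0 ^ ((pm ^ p0 ^ p2) << h) ^ (p2 << n)

a, b = 0b10011011, 0b11001010
print(gf2_mul_karatsuba(a, b, 8) == gf2_mul_school(a, b))  # True
```

Recursing on three half-size products instead of four is what yields the O(k^(log2 3)) complexity quoted in the abstract.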

202 citations


Journal ArticleDOI
TL;DR: An algorithm for GF(2^m) multiplication/division is presented and a new, more generalized definition of duality is proposed, from which the bit-serial Berlekamp multiplier is derived and shown to be a specific case of a more general class of multipliers.
Abstract: In this paper an algorithm for GF(2^m) multiplication/division is presented and a new, more generalized definition of duality is proposed. From these the bit-serial Berlekamp multiplier is derived and shown to be a specific case of a more general class of multipliers. Furthermore, it is shown that hardware-efficient, bit-parallel dual basis multipliers can also be designed. These multipliers have a regular structure, are easily extended to different GF(2^m), and are hence suitable for VLSI implementations. As in the bit-serial case, these bit-parallel multipliers can also be hardwired to carry out constant multiplication. These constant multipliers have reduced hardware requirements and are also simple to design. In addition, the multiplication/division algorithm also allows a bit-serial systolic finite field divider to be designed. This divider is modular, independent of the defining irreducible polynomial for the field, and easily expanded to different GF(2^m), and its longest delay path is independent of m.

143 citations


Journal ArticleDOI
TL;DR: The (preemptive) distance-constrained task system model is proposed, which can serve as a more intuitive and adequate scheduling model for "repetitive" task executions; an efficient scheduling scheme for the model is designed, and a schedulability condition for the scheduling scheme is derived.
Abstract: In hard real-time systems, each task must not only be functionally correct but also meet its timing constraints. A common approach to characterizing hard real-time tasks with repetitive requests is the periodic task model. In the periodic task model, every task needs to be executed once during each of its periods. The execution of a task in one period is independent of the execution of the same task in another period. Hence, the executions of the same task in two consecutive periods may be right next to each other, or at the far ends of the two periods. While the periodic task model can serve as a simple paradigm for scheduling tasks with repetitive requests, it may not be suitable for all real-time applications. For example, in some real-time systems, the temporal distance between the finishing times of any two consecutive executions of the same task must be less than or equal to a given value. In other words, each execution of a task has a deadline relative to the finishing time of the previous execution of the same task. Scheduling algorithms designed for the periodic task model may not provide efficient solutions for tasks with temporal distance constraints. We propose the (preemptive) distance-constrained task system model, which can serve as a more intuitive and adequate scheduling model for "repetitive" task executions. We design an efficient scheduling scheme for the model, and derive a schedulability condition for the scheduling scheme. We also discuss how to apply the scheduling scheme to real-time sporadic task scheduling and to real-time communications.
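The distance constraint itself is easy to state in code: for each task, consecutive finishing times must be no more than the task's distance bound apart. Below is a minimal trace checker illustrating the constraint (the paper's scheduling scheme and schedulability condition are not reproduced; `meets_distance_constraints` is a hypothetical helper).

```python
# Sketch: verifying the temporal-distance constraint on a trace of task
# completions. For each task, consecutive finishing times must be no more
# than its distance bound apart.

def meets_distance_constraints(trace, bound):
    """trace: list of (finish_time, task_id); bound: task_id -> max distance."""
    last = {}
    for t, task in sorted(trace):
        if task in last and t - last[task] > bound[task]:
            return False
        last[task] = t
    return True

trace = [(2, "A"), (5, "A"), (9, "A"), (4, "B"), (11, "B")]
print(meets_distance_constraints(trace, {"A": 4, "B": 7}))  # True
```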

141 citations


Journal ArticleDOI
TL;DR: A gate-level transient fault simulation environment, developed based on realistic fault models, is described; it can be used for any transient fault that can be modeled as a transient pulse of some width.
Abstract: Mixed analog and digital mode simulators have been available for accurate α-particle-induced transient fault simulation. However, they are not fast enough to simulate a large number of transient faults on a relatively large circuit in a reasonable amount of time. In this paper, we describe a gate-level transient fault simulation environment which has been developed based on realistic fault models. Although the environment was developed for α-particle-induced transient faults, the methodology can be used for any transient fault which can be modeled as a transient pulse of some width. The simulation environment uses a gate-level timing fault simulator as well as a zero-delay parallel fault simulator. The timing fault simulator uses logic-level models of the actual transient fault phenomenon and latch operation to accurately propagate the fault effects to the latch outputs, after which point the zero-delay parallel fault simulator is used to speed up the simulation without any loss in accuracy. The environment is demonstrated on a set of ISCAS-89 sequential benchmark circuits.
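One common gate-level abstraction behind such simulators, shown here only to illustrate the fault model (the paper's timing models are more detailed), is inertial filtering: a transient pulse propagates through a gate only if its width exceeds the gate's inertial delay.

```python
# Sketch: inertial filtering of a transient pulse along a gate path. A pulse
# passes through a gate only if its width exceeds the gate's inertial delay,
# and is otherwise filtered out (widths in arbitrary units). Illustrative
# model only; simulator details differ.

def propagate(pulse_width, path_inertial_delays, attenuation=0.0):
    for d in path_inertial_delays:
        if pulse_width <= d:
            return 0.0              # pulse filtered: too narrow to propagate
        pulse_width -= attenuation  # optional narrowing per gate
    return pulse_width

print(propagate(3.0, [1.0, 2.0, 2.5], attenuation=0.2))  # survives, narrowed
print(propagate(1.5, [1.0, 2.0, 2.5]))                   # filtered at 2nd gate
```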

140 citations


Journal ArticleDOI
TL;DR: This study examines the performance of two of the most promising RAID architectures, the mirrored array and the rotated parity array; it proposes several scheduling policies for the mirrored array and a new data layout, group-rotate declustering, and compares their performance with each other and in combination with other data layout schemes.
Abstract: In today's computer systems, the disk I/O subsystem is often identified as the major bottleneck to system performance. One proposed solution is the so-called redundant array of inexpensive disks (RAID). We examine the performance of two of the most promising RAID architectures, the mirrored array and the rotated parity array. First, we propose several scheduling policies for the mirrored array and a new data layout, group-rotate declustering, and compare their performance with each other and in combination with other data layout schemes. We observe that a policy that routes reads to the disk with the smallest number of requests provides the best performance, especially when the load on the I/O system is high. Second, through a combination of simulation and analysis, we compare the performance of this mirrored array architecture to the rotated parity array architecture. This latter study shows that: 1) given the same storage capacity (approximately double the number of disks), the mirrored array considerably outperforms the rotated parity array; and 2) given the same number of disks, the mirrored array still outperforms the rotated parity array in most cases, even for applications where I/O requests are for large amounts of data. The only exception occurs when the I/O size is very large, most of the requests are writes, and most of these writes perform full stripe write operations.
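The winning read policy is simple enough to sketch directly: route each read to the replica disk with the fewest outstanding requests, breaking ties randomly (`route_read` is a hypothetical helper).

```python
# Sketch: shortest-queue read routing for a mirrored array: send each read
# to the replica disk with the fewest queued requests.
import random

def route_read(queues):
    """queues: list of per-disk outstanding-request counts; returns disk index."""
    m = min(queues)
    return random.choice([i for i, q in enumerate(queues) if q == m])  # tie-break

queues = [4, 1, 3]
d = route_read(queues)
queues[d] += 1
print(d, queues)   # the read goes to disk 1
```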

137 citations


Journal ArticleDOI
TL;DR: Using information on the state of each node's neighbors, an adaptive fault-tolerant deadlock-free routing scheme for n-dimensional meshes and hypercubes with only two virtual channels per physical link is developed.
Abstract: We present an adaptive deadlock-free routing algorithm which decomposes a given network into two virtual interconnection networks, VIN1 and VIN2. VIN1 supports deterministic deadlock-free routing, and VIN2 supports fully-adaptive routing. Whenever a channel in VIN1 or VIN2 is available, it can be used to route a message. Each node is identified to be in one of three states: safe, unsafe, and faulty. The unsafe state is used for deadlock-free routing, and an unsafe node can still send and receive messages. When nodes become faulty/unsafe, some channels in VIN2 around the faulty/unsafe nodes are used as the detours of those channels in VIN1 passing through the faulty/unsafe nodes, i.e., the adaptability in VIN2 is transformed to support fault-tolerant deadlock-free routing. Using information on the state of each node's neighbors, we have developed an adaptive fault-tolerant deadlock-free routing scheme for n-dimensional meshes and hypercubes with only two virtual channels per physical link. In an n-dimensional hypercube, any pattern of faulty nodes can be tolerated as long as the number of faulty nodes is no more than [n/2]. The maximum number of faulty nodes that can be tolerated is 2^(n-1), which occurs when all faulty nodes can be encompassed in an (n-1)-cube. In an n-dimensional mesh, we use a more general fault model, called a disconnected rectangular block. Any arbitrary pattern of faulty nodes can be modeled as a rectangular block after finding both unsafe and disabled nodes (which are then treated as faulty nodes). This concept can also be applied to k-ary n-cubes with four virtual channels, two in VIN1 and the other two in VIN2. Finally, we present simulation results for both hypercubes and 2-dimensional meshes by using various workloads and fault patterns.

119 citations


Journal ArticleDOI
TL;DR: This work proposes the first reported learning-automaton-based solution to the uniform graph partitioning problem, believed to be the fastest algorithm reported to date.
Abstract: Given a graph G, we intend to partition its nodes into two sets of equal size so as to minimize the sum of the cost of the edges having end points in different sets. This problem, called the uniform graph partitioning problem, is known to be NP-complete. We propose the first reported learning-automaton-based solution to the problem. We compare this new solution to various reported schemes, such as B.W. Kernighan and S. Lin's (1970) algorithm, and two excellent recent heuristic methods proposed by E. Rolland et al. (1994; 1992), an extended local search algorithm and a genetic algorithm. The current automaton-based algorithm outperforms all the other schemes. We believe that it is the fastest algorithm reported to date. Additionally, our solution can also be adapted for the graph partitioning problem in which the edge costs are not constant but random variables whose distributions are unknown.
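The paper's partitioning automaton is not reproduced here, but the generic linear reward-inaction (L_RI) update that learning-automaton solutions build on is easy to sketch: a rewarded action gains probability mass, while a penalty leaves the probability vector unchanged. A minimal sketch under that assumption:

```python
# Sketch: generic linear reward-inaction (L_RI) automaton update. The paper's
# partitioning-specific automaton (actions, environment, reward signal) is
# more involved; this shows only the learning rule.

def l_ri_update(p, action, rewarded, lam=0.1):
    """p: action-probability vector; lam: learning rate in (0, 1)."""
    if rewarded:
        for i in range(len(p)):
            p[i] = (1 - lam) * p[i] + (lam if i == action else 0.0)
    return p

print(l_ri_update([0.5, 0.5], action=0, rewarded=True))  # ~[0.55, 0.45]
```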

Journal ArticleDOI
TL;DR: This work proposes an alternative approach to system diagnosis, called t/k-diagnosability, that allows an upper-bounded number of units to be diagnosed incorrectly, and shows that a substantial increase in the degree of diagnosability is achieved at the cost of a comparably small number of incorrectly diagnosed units.
Abstract: The classical diagnosability approach has its limitation when dealing with large fault sets in large multiprocessor systems. This is due to the limited diagnosability of large multiprocessor systems connected using regular interconnection structures. We propose an alternative approach to system diagnosis that allows an upper-bounded number of units to be diagnosed incorrectly. This measure is called t/k-diagnosability. Using this new measure, it is possible to increase the degree of diagnosability of large systems considerably. The t/k-diagnosis guarantees that all the faulty units (processors) in a system are detected (provided the number of faulty units does not exceed t) while at most k units are incorrectly diagnosed. We provide necessary and sufficient conditions for t/k-diagnosability and discuss their implications. To demonstrate the power of this approach, we analyze the diagnosability of large systems connected as hypercubes, star graphs, and meshes. It is shown that a substantial increase in the degree of diagnosability of these structures is achieved, compared with the degree of diagnosability achieved using the classical diagnosability approach, at the cost of a comparably small number of incorrectly diagnosed units.

Journal ArticleDOI
TL;DR: The theoretical analysis shows that this transparent BIST technique does not decrease the fault coverage for modeled faults, behaves better for unmodeled ones, and does not increase aliasing with respect to the initial test algorithm.
Abstract: I present the theoretical aspects of a technique called transparent BIST for RAMs. This technique applies to any RAM test algorithm and transforms it into a transparent one. The appeal of transparent test algorithms is that testing preserves the contents of the RAM. The transparent test algorithm is then used to implement a transparent BIST. This kind of BIST is very suitable for periodic testing of RAMs. The theoretical analysis shows that this transparent BIST technique does not decrease the fault coverage for modeled faults, behaves better for unmodeled ones, and does not increase aliasing with respect to the initial test algorithm. Furthermore, transparent BIST involves only slightly higher area overhead with respect to standard BIST. Thus, transparent BIST becomes more attractive than standard BIST, since it can be used for both fabrication testing and periodic testing.
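The core transformation can be illustrated on a march-like test element: instead of writing fixed backgrounds, each cell's own initial content s is used (read s, write ~s, read ~s, write s), so the RAM ends up unchanged while the reads feed a signature. This is a simplified sketch of the idea, not the paper's general transformation or its aliasing analysis; the signature function here is a toy.

```python
# Sketch: a transparent march-like element over a byte-wide RAM. The RAM's
# contents are restored at the end, and the read values are compacted into
# a (toy) signature that a BIST controller would compare against a
# precomputed reference.

def transparent_element(ram):
    signature = 0
    for addr in range(len(ram)):
        s = ram[addr];   signature ^= s << (addr % 8)   # read s
        ram[addr] = s ^ 0xFF                            # write ~s
        r = ram[addr];   signature ^= r << (addr % 8)   # read ~s
        ram[addr] = r ^ 0xFF                            # write s (restore)
    return signature

ram = [0x12, 0xA5, 0x00, 0xFF]
before = list(ram)
sig = transparent_element(ram)
assert ram == before   # contents preserved: the test is transparent
print(hex(sig))
```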

Journal ArticleDOI
TL;DR: Both recovery time and performance degradation during recovery are substantially reduced in clustered RAID; moreover, these gains can be achieved using fairly small C/G ratios.
Abstract: A Redundant Array of Independent Disks (RAID) of G disks provides protection against single disk failures by adding one parity block for each G-1 data blocks. In a clustered RAID, the G data/parity blocks are distributed over a cluster of C disks (C>G), thus reducing the additional load on each disk due to a single disk failure. However, most methods proposed for implementing such a mapping do not work for general C and G values. In this paper, we describe a fast mapping algorithm based on almost-random permutations. An analytical model is constructed, based on the queue with a permanent customer, to predict recovery time and read/write performance. The accuracy of the results derived from this model is validated by comparing with simulations. Our analysis shows that clustered RAID is significantly more tolerant of disk failure than the basic RAID scheme. Both recovery time and performance degradation during recovery are substantially reduced in clustered RAID; moreover, these gains can be achieved using fairly small C/G ratios.
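The shape of the mapping is easy to sketch: each stripe's G blocks are placed on G of the C disks chosen through a stripe-indexed permutation, so reconstruction load after a failure spreads over the whole cluster. The sketch below uses a seeded shuffle as a stand-in for the paper's almost-random permutations, which are faster and have provable balance; `stripe_layout` is a hypothetical helper.

```python
# Sketch: spreading stripes of G data+parity blocks over C > G disks with a
# keyed pseudorandom permutation (clustered RAID). Illustrative mapping only.
import random

def stripe_layout(stripe_no, C, G):
    disks = list(range(C))
    random.Random(stripe_no).shuffle(disks)  # stand-in for the paper's permutation
    return disks[:G]                         # disks holding this stripe's G blocks

for s in range(4):
    print("stripe", s, "-> disks", stripe_layout(s, C=7, G=4))
```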

Journal ArticleDOI
TL;DR: This work presents methods to minimize fixed polarity Reed-Muller expressions (FPRMs) using ordered functional decision diagrams (OFDDs), and investigates the close relation between both representations and uses efficient algorithms on OFDDs for exact and heuristic minimization.
Abstract: We present methods to minimize fixed polarity Reed-Muller expressions (FPRMs), i.e., two-level fixed polarity AND/EXOR canonical representations of Boolean functions, using ordered functional decision diagrams (OFDDs). We investigate the close relation between both representations and use efficient algorithms on OFDDs for exact and heuristic minimization of FPRMs. In contrast to previously published methods, our algorithm can also handle circuits with several outputs. Experimental results on large benchmarks are given to show the efficiency of our approach.
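As background, the fixed polarity with all variables uncomplemented (the positive-polarity Reed-Muller form) can be computed from a truth table by the standard XOR butterfly, sketched below. An FPRM fixes one polarity per variable; the paper's contribution is searching the 2^n polarities efficiently on OFDDs rather than on truth tables, which this sketch does not attempt.

```python
# Sketch: positive-polarity Reed-Muller coefficients via the standard XOR
# "butterfly" over a truth table of length 2^n.

def pprm(truth):
    """truth: list of 0/1 of length 2^n; returns the Reed-Muller spectrum."""
    c = truth[:]
    n = len(c).bit_length() - 1
    step = 1
    for _ in range(n):
        for i in range(len(c)):
            if i & step:
                c[i] ^= c[i ^ step]   # XOR-decomposition along one variable
        step <<= 1
    return c

# f(x2, x1, x0) = x0 XOR (x1 AND x2); truth table indexed by (x2 x1 x0)
truth = [0, 1, 0, 1, 0, 1, 1, 0]
print(pprm(truth))  # nonzero at index 1 (x0) and index 6 (x1*x2)
```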

Journal ArticleDOI
TL;DR: This work considers a real time task model where a task receives a "reward" that depends on the amount of service received prior to its deadline, and observes that the best performance is exhibited by a two level policy.
Abstract: We consider a real time task model where a task receives a "reward" that depends on the amount of service received prior to its deadline. The reward of the task is assumed to be an increasing function of the amount of service that it receives, i.e., the task has the property that it receives increasing reward with increasing service (IRIS). We focus on the problem of online scheduling of a random arrival sequence of IRIS tasks on a single processor with the goal of maximizing the average reward accrued per task and per unit time. We describe and evaluate several policies for this system through simulation and through a comparison with an unachievable upper bound. We observe that the best performance is exhibited by a two level policy where the top level algorithm is responsible for allocating the amount of service to tasks and the bottom level algorithm, using the earliest deadline first (EDF) rule, is responsible for determining the order in which tasks are executed. Furthermore, the performance of this policy approaches the theoretical upper bound in many cases. We also show that the average number of preemptions of a task under this two level policy is very small.
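The bottom level of the two-level policy is plain EDF, which is easy to sketch for tasks whose service allotments the top level has already fixed; the top-level allotment algorithm, the paper's main contribution, is not reproduced here, and `edf_order` is a hypothetical helper.

```python
# Sketch: earliest-deadline-first ordering of ready tasks whose service
# amounts have already been allotted. Non-preemptive toy version.
import heapq

def edf_order(tasks):
    """tasks: (deadline, service) pairs, all ready; run earliest deadline first."""
    q = list(tasks)
    heapq.heapify(q)                     # min-heap keyed on deadline
    t, schedule = 0, []
    while q:
        deadline, service = heapq.heappop(q)
        t += service
        schedule.append((deadline, t, t <= deadline))  # (deadline, finish, met?)
    return schedule

print(edf_order([(10, 3), (5, 2), (8, 1)]))
```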

Journal ArticleDOI
TL;DR: By applying a specific, parameterized model of workload locality, this work is able to derive a closed form solution for the optimal size of each hierarchy level, and finds that money spent on an n level hierarchy is spent in a fixed proportion until another level is added.
Abstract: Memory hierarchies have long been studied by many means: system building, trace driven simulation, and mathematical analysis. Yet little help is available for the system designer wishing to quickly size the different levels in a memory hierarchy to a first order approximation. We present a simple analysis for providing this practical help and some unexpected results and intuition that come out of the analysis. By applying a specific, parameterized model of workload locality, we are able to derive a closed form solution for the optimal size of each hierarchy level. We verify the accuracy of this solution against exhaustive simulation with two case studies: a three level I/O storage hierarchy and a three level processor cache hierarchy. In all but one case, the configuration recommended by the model performs within 5% of optimal. One result of our analysis is that the first place to spend money is the cheapest (rather than the fastest) cache level, particularly with small system budgets. Another is that money spent on an n level hierarchy is spent in a fixed proportion until another level is added.

Journal ArticleDOI
TL;DR: A simple hardware design for buddy-system allocation that takes advantage of the speed of a pure combinational-logic implementation and uses memory more efficiently than the standard software approach is presented.
Abstract: Object-oriented programming languages tend to allocate and deallocate blocks of memory very frequently. The growing popularity of these languages increases the importance of high-performance memory allocation. For speed and simplicity in memory allocation, the buddy system has been the method of choice for nearly three decades. A software realization incurs the overhead of internal fragmentation and of memory traffic due to splitting and coalescing memory blocks. This paper presents a simple hardware design for buddy-system allocation that takes advantage of the speed of a pure combinational-logic implementation. Two binary trees formed by ANDing and ORing propagate information about the allocation status of blocks and subblocks. They implement a nonbacktracking search for the address of the first free block that is large enough to satisfy a request. Although the buddy system may allocate a block that is much larger than the requested size, the logic that finds a free block can be augmented by a "bit-flipper" to relinquish the unused portion at the end of the block. This effectively eliminates internal fragmentation. Simulation results show that the buddy system modified in this way uses less memory in most, though not all, programs than the unmodified buddy system. Hence, the hardware buddy-system allocator is faster and uses memory more efficiently than the standard software approach.
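A pure-software stand-in for the flavor of those trees: an OR-tree over an allocation bit-vector says, for every aligned block at every level, whether anything inside it is in use, so finding the first free block of a given size is a scan of one tree level (the hardware does this combinationally and without backtracking). Names below are illustrative.

```python
# Sketch: OR-tree over an allocation bit-vector for buddy-style search.
# levels[k][j] tells whether aligned block j of size 2^k contains any
# allocated word; the first free block of size 2^k is the first 0 in
# levels[k].

def build_or_tree(bits):
    levels = [bits[:]]                       # level 0: one bit per minimal block
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[2*i] | prev[2*i + 1] for i in range(len(prev) // 2)])
    return levels

def first_free(levels, k):
    for j, busy in enumerate(levels[k]):
        if not busy:
            return j * (1 << k)              # start address of first free block
    return None

bits = [1, 0, 0, 0, 1, 1, 0, 0]              # 8 minimal blocks, some allocated
levels = build_or_tree(bits)
print(first_free(levels, 1))                 # first free aligned size-2 block -> 2
```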

Journal ArticleDOI
TL;DR: Another fault-tolerant routing algorithm, which requires only a constant number of virtual networks (five) in wormhole routing to ensure deadlock freedom for a hypercube of any size, is also presented.
Abstract: We investigate fault-tolerant routing which aims at finding feasible minimum paths in a faulty hypercube. The concept of unsafe node and its extension are used in our scheme. A set of stringent criteria is proposed to identify the possibly bad candidates for forwarding a message. As a result, the number of such undesirable nodes is reduced without sacrificing the functionality of the mechanism. Furthermore, the notion of degree of unsafeness for classifying the unsafe nodes is introduced to facilitate the design of efficient routing algorithms which rely on having each node keep the states of its nearest neighbors. We show that a feasible path of length no more than the Hamming distance between the source and the destination plus four can always be established by the routing algorithm as long as the hypercube is not fully unsafe. The issue of deadlock freedom is also addressed in this research. More importantly, another fault-tolerant routing algorithm, which requires only a constant number of virtual networks (five) in wormhole routing to ensure deadlock freedom for a hypercube of any size, is presented in this paper.

Journal ArticleDOI
TL;DR: Performance results indicate that phased logic tends to be tolerant of logic delay imbalances and has predictable worst-case timing behavior; it also has the potential to shorten the design cycle by reducing timing complexities.
Abstract: Phased logic is proposed as a solution to the increasing problem of timing complexity in digital design. It is a delay-insensitive design methodology that seeks to restore the separation between logical and physical design by eliminating the need to distribute low-skew clock signals and carefully balance propagation delays. However, unlike other methodologies that avoid clocks, phased logic supports the cyclic, deterministic behavior of the synchronous design paradigm. This permits the designer to rely chiefly on current experience and CAD tools to create phased logic systems. Marked graph theory is used as a framework for governing the interaction of phased logic gates that operate directly on Level-Encoded two-phase Dual-Rail (LEDR) signals. A synthesis algorithm is developed for converting clocked systems to phased logic systems and is applied to benchmark examples. Performance results indicate that phased logic tends to be tolerant of logic delay imbalances and has predictable worst-case timing behavior. Although phased logic requires additional circuitry, it has the potential to shorten the design cycle by reducing timing complexities.

Journal ArticleDOI
TL;DR: This paper presents new, fast hardware for computing the exponential function, sine, and cosine by using low-precision arithmetic components to approximate high precision computations, and to correct very quickly the approximation error periodically.
Abstract: This paper presents new, fast hardware for computing the exponential function, sine, and cosine. The main new idea is to use low-precision arithmetic components to approximate high-precision computations, and then to correct the approximation error very quickly and periodically, so that the effect is high-precision computation at near low-precision speed. The algorithm used in the paper is a nontrivial modification of the well-known CORDIC algorithm, and might be applicable to the computation of functions other than the ones presented.
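For reference, the textbook CORDIC rotation that the paper's scheme modifies is sketched below; the low-precision datapaths and the periodic error correction that constitute the paper's contribution are not shown.

```python
# Sketch: classic CORDIC rotation mode for sine/cosine. Each iteration
# rotates by +/- atan(2^-i), driving the residual angle z toward zero; the
# aggregate scale factor K is pre-applied to the initial vector.
import math

def cordic_sin_cos(theta, n=32):
    angles = [math.atan(2.0 ** -i) for i in range(n)]
    K = 1.0
    for i in range(n):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))    # aggregate scale factor
    x, y, z = K, 0.0, theta
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0                    # rotate toward z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x                                        # (sin, cos)

s, c = cordic_sin_cos(0.5)
print(s, math.sin(0.5), c, math.cos(0.5))
```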

Journal ArticleDOI
TL;DR: It is proved that, due to the lack of additional operations, DCORDIC compares favorably with the previously known redundant methods in terms of latency and computational complexity.
Abstract: The CORDIC algorithm is a well-known iterative method for the efficient computation of vector rotations, and trigonometric and hyperbolic functions. Basically, CORDIC performs a vector rotation which is not a perfect rotation, since the vector is also scaled by a constant factor. This scaling has to be compensated for following the CORDIC iteration. Since CORDIC implementations using conventional number systems are relatively slow, current research has focused on solutions employing redundant number systems which make a much faster implementation possible. The problem with these methods is that either the scale factor becomes variable, making additional operations necessary to compensate for the scaling, or additional iterations are necessary compared to the original algorithm. In contrast we developed transformations of the usual CORDIC algorithm which result in a constant scale factor redundant implementation without additional operations. The resulting "Differential CORDIC Algorithm" (DCORDIC) makes use of on-line (most significant digit first redundant) computation. We derive parallel architectures for the radix-2 redundant number systems and present some implementation results based on logic synthesis of VHDL descriptions produced by a DCORDIC VHDL generator. We finally prove that, due to the lack of additional operations, DCORDIC compares favorably with the previously known redundant methods in terms of latency and computational complexity.

Journal ArticleDOI
TL;DR: n-1 directed edge-disjoint spanning trees are constructed on the star network to derive a near-optimal single node broadcasting algorithm, as well as fault-tolerant algorithms for the single node and multinode broadcasting and for the single node and multinode scattering problems.
Abstract: Data communication and fault tolerance are important issues in parallel computers in which the processors are interconnected according to a specific topology. One way to achieve fault-tolerant interprocessor communication is by exploiting the disjoint paths that exist between pairs of source and destination nodes. We construct n-1 directed edge-disjoint spanning trees on the star network. These spanning trees are used to derive a near-optimal single node broadcasting algorithm, and fault-tolerant algorithms for the single node and multinode broadcasting, and for the single node and multinode scattering problems. Broadcasting is the distribution of the same group of messages from one processor to all the other processors. Scattering is the distribution of distinct groups of messages from one processor to all the other processors. We consider broadcasting and scattering from a single processor of the network and simultaneously from all processors of the network. The single node broadcasting algorithm offers a speedup of n-1, for a large number of messages, over the straightforward algorithm that uses a single shortest-path spanning tree. Fault tolerance is achieved by transmitting the same messages through a number of edge-disjoint spanning trees. The fault-tolerant algorithms operate successfully in the presence of up to n-2 faulty nodes or edges in the network. No prior knowledge of the faulty nodes or edges is required. All of the algorithms operate under the store-and-forward, all-port communication model.

Journal ArticleDOI
TL;DR: A new method for detecting groups of symmetric variables of completely specified Boolean functions using the canonical Generalized Reed-Muller forms and a set of signatures that allow for detecting symmetries of any number of inputs simultaneously.
Abstract: In this paper, we present a new method for detecting groups of symmetric variables of completely specified Boolean functions. The canonical Generalized Reed-Muller (GRM) forms are used as a powerful analysis tool. To reduce the search space we have developed a set of signatures that allow us to identify quickly sets of potentially symmetric variables. Our approach allows for detecting symmetries of any number of inputs simultaneously. Totally symmetric functions can be detected very quickly. The traditional definitions of symmetry have also been extended to include more types. This extension has the advantage of grouping input variables into more classes. Experiments have been performed on MCNC benchmark cases and the results verify the efficiency of our method.
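The brute-force definition that the GRM signatures let the method avoid is worth seeing once: variables x_i and x_j are (classically) symmetric in f when swapping them leaves the function unchanged on every input. A direct, exponential-time check, for illustration only:

```python
# Sketch: brute-force test of classical symmetry between two variables of a
# Boolean function given as a callable on n-bit integer minterm indices.
# Signature-based methods prune most pairs without this exponential scan.

def symmetric(f, n, i, j):
    """Test invariance of f under swapping input bits i and j."""
    for m in range(1 << n):
        bi, bj = (m >> i) & 1, (m >> j) & 1
        swapped = m & ~((1 << i) | (1 << j)) | (bj << i) | (bi << j)
        if f(m) != f(swapped):
            return False
    return True

maj = lambda m: int(bin(m).count("1") >= 2)   # 3-input majority: fully symmetric
print(symmetric(maj, 3, 0, 2))                # True
```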

Journal ArticleDOI
TL;DR: It is shown that an implementation of the sigmoid generator outperforms existing schemes in both precision and speed, using a bit-serial pipelined implementation.
Abstract: A piecewise second-order approximation scheme is proposed for computing the sigmoid function. The scheme provides high performance with low implementation cost; thus, it is suitable for hardwired, cost-effective neural emulators. It is shown that an implementation of the sigmoid generator outperforms, in both precision and speed, existing schemes, using a bit-serial pipelined implementation. The proposed generator requires one multiplication, no look-up table, and no addition. It has been estimated that the sigmoid output is generated with a maximum computation delay of 21 bit-serial machine cycles, representing a speedup of 1.57 to 2.23 over other proposals.
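A representative piecewise second-order approximation of this kind (one squaring plus shifts, with saturation outside |x| < 4) is sketched below for illustration; the exact coefficients and the bit-serial arithmetic decomposition in the paper may differ.

```python
# Sketch: a piecewise quadratic sigmoid approximation of the general kind
# the paper implements in hardware. One squaring per evaluation; divisions
# by 4 and 2 are shifts in fixed point. Illustrative coefficients only.
import math

def sigmoid_approx(x):
    if x <= -4.0:
        return 0.0
    if x >= 4.0:
        return 1.0
    if x < 0.0:
        return 0.5 * (1.0 + x / 4.0) ** 2        # quadratic segment, x in [-4, 0)
    return 1.0 - 0.5 * (1.0 - x / 4.0) ** 2      # mirrored segment, x in [0, 4)

for x in (-2.0, 0.0, 1.0, 2.0):
    print(x, round(sigmoid_approx(x), 3), round(1 / (1 + math.exp(-x)), 3))
```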

Journal ArticleDOI
TL;DR: This paper presents a BIST strategy for high-performance datapath architectures that uses the functionality of existing hardware, is entirely integrated with the circuit under test, and results in at-speed testing with no performance degradation and no area overhead.
Abstract: Existing built-in self-test (BIST) strategies require the use of specialized test pattern generation hardware which introduces significant area overhead and performance degradation. In this paper, we propose an entirely new approach to generating test patterns. The method is based on adders widely available in datapath architectures used in digital signal processing circuits and general-purpose processors. The resultant test patterns, generated by continuously accumulating a constant value, provide complete state coverage on subspaces of contiguous bits. This new test generation scheme, along with the recently introduced accumulator-based compaction scheme (Rajski and Tyszer, 1993), facilitates a BIST strategy for high-performance datapath architectures that uses the functionality of existing hardware, is entirely integrated with the circuit under test, and results in at-speed testing with no performance degradation and no area overhead.
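The generation idea itself fits in a few lines: the datapath's own adder repeatedly accumulates a constant increment modulo 2^n, and an odd increment makes the accumulator sweep all 2^n states. A minimal sketch (`accumulate_patterns` is a hypothetical helper):

```python
# Sketch: accumulator-based test pattern generation: repeatedly add a
# constant increment modulo 2^n. With an odd increment, gcd(inc, 2^n) = 1,
# so the accumulator visits all 2^n states (complete state coverage on the
# accumulated word).

def accumulate_patterns(increment, n, count):
    acc, mask = 0, (1 << n) - 1
    out = []
    for _ in range(count):
        acc = (acc + increment) & mask     # the existing datapath adder does this
        out.append(acc)
    return out

pats = accumulate_patterns(increment=5, n=4, count=16)
print(pats, len(set(pats)) == 16)          # odd increment -> all 16 states
```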

Journal ArticleDOI
TL;DR: A new algorithm to compute the performability distribution is proposed; its computational complexity is polynomial and it deals only with nonnegative numbers bounded by one, a property that allows truncation steps to be determined and the execution time of the algorithm to be improved.
Abstract: We propose, in this paper, a new algorithm to compute the performability distribution. Its computational complexity is polynomial and it deals only with nonnegative numbers bounded by one. This important property allows us to determine truncation steps and thus to improve the execution time of the algorithm.

Journal ArticleDOI
TL;DR: The performance figures obtained indicate that, in a wide class of applications requiring a high degree of fault tolerance, software-implemented fail-silent nodes constructed simply by utilizing standard "off-the-shelf" components are an attractive alternative to their hardware-implemented counterparts, which do require special-purpose hardware components such as fault-tolerant clocks, comparators, and bus interface circuits.
Abstract: A fail-silent node is a self-checking node that either functions correctly or stops functioning after an internal failure is detected. Such a node can be constructed from a number of conventional processors. In a software-implemented fail-silent node, the nonfaulty processors of the node need to execute message order and comparison protocols to "keep in step" and check each other, respectively. In this paper, the design and implementation of efficient protocols for a two-processor fail-silent node are described in detail. The performance figures obtained indicate that, in a wide class of applications requiring a high degree of fault tolerance, software-implemented fail-silent nodes constructed simply by utilizing standard "off-the-shelf" components are an attractive alternative to their hardware-implemented counterparts, which do require special-purpose hardware components such as fault-tolerant clocks, comparators, and bus interface circuits.
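The comparison half of the idea can be sketched as follows: both processors compute the outgoing message independently, digests are exchanged and compared, and the node emits nothing on a mismatch. This toy omits the message-order protocol and the actual inter-processor exchange; all names are illustrative.

```python
# Sketch: the comparison protocol of a software fail-silent node. Two
# independent computations of the same output are digested and compared;
# any disagreement makes the node fall silent instead of emitting a
# possibly corrupted message.
import hashlib

def fail_silent_send(compute_a, compute_b, payload):
    msg_a, msg_b = compute_a(payload), compute_b(payload)
    da = hashlib.sha256(msg_a).digest()
    db = hashlib.sha256(msg_b).digest()
    if da != db:
        return None            # internal failure detected: node falls silent
    return msg_a               # both agree: safe to emit

ok = fail_silent_send(lambda p: p * 2, lambda p: p * 2, b"x")
bad = fail_silent_send(lambda p: p * 2, lambda p: p + b"!", b"x")
print(ok, bad)                 # b'xx' None
```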

Journal ArticleDOI
TL;DR: A vision for "self-monitoring" hardware/software whose reliability is augmented through embedded suites of run-time correctness checkers and correctors suitable for monitoring the multiplication and division functionalities of an arbitrary arithmetic processor and seamlessly correcting erroneous output which may occur for any reason during the lifetime of the chip.
Abstract: We review the field of result-checking and suggest that it be extended to a methodology for enforcing hardware/software reliability. We thereby formulate a vision for "self-monitoring" hardware/software whose reliability is augmented through embedded suites of run-time correctness checkers. In particular, we suggest that embedded checkers and correctors may be employed to safeguard against arithmetic errors such as that which has bedeviled the Intel Pentium Microprocessor. We specify checkers and correctors suitable for monitoring the multiplication and division functionalities of an arbitrary arithmetic processor and seamlessly correcting erroneous output which may occur for any reason during the lifetime of the chip.
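In the spirit of result-checking, a multiplier's output can be screened at run time by verifying a * b ≡ c modulo randomly chosen small moduli, repeating to drive the escape probability down. This is a generic checker in the flavor the paper advocates, not the paper's specific multiply/divide checkers and correctors; `check_product` is a hypothetical helper.

```python
# Sketch: probabilistic run-time check of a multiplication result. A wrong
# result c differs from a*b by some nonzero error e; a random modulus m
# misses the error only when m divides e, so repeated trials make escape
# exponentially unlikely.
import random

def check_product(a, b, c, trials=8):
    for _ in range(trials):
        m = random.randrange(3, 1 << 16) | 1        # random odd modulus
        if (a % m) * (b % m) % m != c % m:
            return False                            # caught a wrong result
    return True                                     # probably correct

# Operands made famous by the Pentium FDIV episode, here just as test data.
print(check_product(4195835, 3145727, 4195835 * 3145727))       # True
print(check_product(4195835, 3145727, 4195835 * 3145727 + 4))   # almost surely False
```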

Journal ArticleDOI
TL;DR: An integrated system for synthesizing self-recovering microarchitectures, called SYNCERE, incorporates detection constraints by ensuring that two copies of the computation are executed on disjoint hardware.
Abstract: We describe an integrated system for synthesizing self-recovering microarchitectures called SYNCERE. In the SYNCERE model for self-recovery, transient faults are detected using duplication and comparison, while recovery from transient faults is accomplished via checkpointing and rollback. SYNCERE initially inserts checkpoints subject to designer-specified recovery time constraints. Subsequently, SYNCERE incorporates detection constraints by ensuring that two copies of the computation are executed on disjoint hardware. Towards ameliorating the dedicated hardware required for the original and duplicate computations, SYNCERE imposes intercopy hardware disjointness at a sub-computation level instead of at the overall computation level. The overhead is further moderated by restructuring the pliable input representation of the computation. SYNCERE has successfully derived numerous self-recovering microarchitectures. Towards validating the methodology for designing fault-tolerant VLSI ICs, we carried out a physical design of a self-recovering 16-point FIR filter.