
Showing papers in "IEEE Transactions on Computers in 1992"


Journal ArticleDOI
TL;DR: In this article, the diagnosability and diagnosis problems for a self-diagnosable multiprocessor system where processors compare the results of tasks performed by other processors in the system are analyzed.
Abstract: The diagnosability and diagnosis problems for a model introduced by J. Maeng and M. Malek (1981) of a self-diagnosable multiprocessor system where processors compare the results of tasks performed by other processors in the system are analyzed. A set of criteria is given for determining whether the faulty processors in the system can be diagnosed on the basis of the comparisons, and a polynomial-time algorithm is presented to identify the faulty units of such a system on the basis of the comparison results when the system is known to be diagnosable.

343 citations


Journal ArticleDOI
TL;DR: Manetho is a new transparent rollback-recovery protocol for long-running distributed computations that achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead.
Abstract: Manetho is a new transparent rollback-recovery protocol for long-running distributed computations. It uses a novel combination of antecedence graph maintenance, uncoordinated checkpointing, and sender-based message logging. Manetho simultaneously achieves the advantages of pessimistic message logging, namely limited rollback and fast output commit, and the advantage of optimistic message logging, namely low failure-free overhead. These advantages come at the expense of a complex recovery scheme.

322 citations


Journal ArticleDOI
TL;DR: A unified framework for simulating Markovian models of highly dependable systems is presented and it is shown that a variance reduction technique called importance sampling can be used to speed up the simulation by many orders of magnitude over standard simulation.
Abstract: The authors present a unified framework for simulating Markovian models of highly dependable systems. It is shown that a variance reduction technique called importance sampling can be used to speed up the simulation by many orders of magnitude over standard simulation. This technique can be combined very effectively with regenerative simulation to estimate measures such as steady-state availability and mean time to failure. Moreover, it can be combined with conditional Monte Carlo methods to quickly estimate transient measures such as reliability, expected interval availability, and the distribution of interval availability. The authors show the effectiveness of these methods by using them to simulate large dependability models. They discuss how these methods can be implemented in a software package to compute both transient and steady-state measures simultaneously from the same sample run.

225 citations
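To make the speedup concrete, here is a minimal sketch of importance sampling for a rare failure event, assuming a toy model of independent component failures (the parameters and setup are illustrative, not the paper's Markovian models): failures are drawn with an inflated probability and every trial is reweighted by the likelihood ratio, so the estimator stays unbiased while rare samples become common.

```python
import random

def rare_failure_probability(p_fail=1e-4, p_bias=0.5, n_comp=3, trials=100_000):
    """Estimate P(all n_comp components fail) by sampling each failure
    with the inflated probability p_bias and multiplying in the
    likelihood ratio true/biased for every draw (failure biasing)."""
    total = 0.0
    for _ in range(trials):
        weight, all_failed = 1.0, True
        for _ in range(n_comp):
            failed = random.random() < p_bias
            weight *= (p_fail / p_bias) if failed else ((1 - p_fail) / (1 - p_bias))
            all_failed &= failed
        if all_failed:
            total += weight
    return total / trials  # true value here: p_fail**n_comp = 1e-12
```

Standard simulation would need on the order of 10^12 trials to see this event even once; under the biased measure it occurs in roughly one trial in eight.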


Journal ArticleDOI
TL;DR: A design for buffers that provide non-FIFO message handling and efficient storage allocation for variable-size packets, using linked lists managed by a simple on-chip controller, is presented and shown to outperform alternative buffers.
Abstract: Small n*n switches are key components of interconnection networks used in multiprocessors and multicomputers. The architecture of these n*n switches, particularly their internal buffers, is critical for achieving high-throughput low-latency communication with cost-effective implementations. Several buffer structures are discussed and compared in terms of implementation complexity, inter-switch handshaking requirements, and their ability to deal with variations in traffic patterns and message lengths. A design for buffers that provide non-FIFO message handling and efficient storage allocation for variable-size packets using linked lists managed by a simple on-chip controller is presented. The new buffer design is evaluated by comparing it to several alternative designs in the context of a multistage interconnection network. The modeling and simulation show that the new buffer outperforms alternative buffers and can thus be used to improve the performance of a wide variety of systems currently using less efficient buffers.

222 citations
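The linked-list organization can be sketched behaviorally: every queue draws fixed-size cells from one shared free pool, so storage follows the actual traffic mix instead of being statically split into per-port FIFOs, and any queue can be served independently (non-FIFO across outputs). The class below is an illustrative software model; the paper's on-chip controller manages the lists in hardware.

```python
from collections import deque

class SharedBuffer:
    """Per-output queues allocated dynamically from one cell pool."""
    def __init__(self, total_cells, n_outputs):
        self.free = total_cells                     # unused cells in the pool
        self.queues = [deque() for _ in range(n_outputs)]

    def enqueue(self, port, packet):
        if self.free == 0:
            return False                            # pool exhausted: drop or backpressure
        self.free -= 1
        self.queues[port].append(packet)
        return True

    def dequeue(self, port):
        if not self.queues[port]:
            return None
        self.free += 1                              # cell returns to the pool
        return self.queues[port].popleft()
```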


Journal ArticleDOI
TL;DR: The task allocation problem is addressed with the goal of maximizing the system reliability; a quantitative problem model, algorithms for optimal and suboptimal solutions, and simulation results are provided and discussed.
Abstract: For distributed systems, system reliability is defined as the probability that the system can run an entire task successfully. When the system's hardware configuration is fixed, the system reliability is mainly dependent on the software design. The task allocation problem is addressed with the goal of maximizing the system reliability. A quantitative problem model, algorithms for optimal and suboptimal solutions, and simulation results are provided and discussed.

218 citations


Journal ArticleDOI
Harold S. Stone, John Turek, Joel L. Wolf
TL;DR: An efficient combinatorial algorithm for determining the optimal steady-state allocation, which, in theory, could be used to reduce the length of the transient, is described and generalizes to multilevel cache memories.
Abstract: A model for studying the optimal allocation of cache memory among two or more competing processes is developed and used to show that, for the examples studied, the least recently used (LRU) replacement strategy produces cache allocations that are very close to optimal. It is also shown that when program behavior changes, LRU replacement moves quickly toward the steady-state allocation if it is far from optimal, but converges slowly as the allocation approaches the steady-state allocation. An efficient combinatorial algorithm for determining the optimal steady-state allocation, which, in theory, could be used to reduce the length of the transient, is described. The algorithm generalizes to multilevel cache memories. For multiprogrammed systems, a cache-replacement policy better than LRU replacement is given. The policy increases the memory available to the running process until the allocation reaches a threshold time beyond which the replacement policy does not increase the cache memory allocated to the running process.

212 citations
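One way to picture a steady-state allocation: if each process's hit count is a concave function of its cache share, handing blocks one at a time to the process with the largest marginal gain is optimal. The greedy sketch below rests on that concavity assumption and is not the paper's combinatorial algorithm, just a way to see what "optimal allocation" means here.

```python
import heapq

def allocate(hit_curves, total_blocks):
    """hit_curves[p][k] = expected hits for process p given k cache
    blocks (assumed concave and defined up to total_blocks). Assign
    each block to the process with the largest marginal gain."""
    alloc = [0] * len(hit_curves)
    heap = [(-(c[1] - c[0]), p) for p, c in enumerate(hit_curves)]
    heapq.heapify(heap)
    for _ in range(total_blocks):
        _, p = heapq.heappop(heap)                  # best marginal gain
        alloc[p] += 1
        k = alloc[p]
        if k + 1 < len(hit_curves[p]):
            heapq.heappush(heap, (-(hit_curves[p][k + 1] - hit_curves[p][k]), p))
    return alloc
```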


Journal ArticleDOI
TL;DR: The design of the 56-b significand adder used in the Advanced Micro Devices Am29050 microprocessor is described; it employs a novel method for combining carries which does not require the back propagation associated with carry lookahead and is not limited to radix-2 trees.
Abstract: The design of the 56-b significand adder used in the Advanced Micro Devices Am29050 microprocessor is described. Originally implemented in a 1-µm design-rule CMOS process, it evaluates 56-b sums in well under 4 ns. The adder employs a novel method for combining carries which does not require the back propagation associated with carry lookahead, and is not limited to radix-2 trees, as is the binary lookahead carry tree of R.P. Brent and H.T. Kung (1982). The adder also utilizes a hybrid carry lookahead-carry select structure which reduces the number of carries that need to be derived in the carry lookahead tree. This approach produces a circuit well suited for CMOS implementation because of its balanced load distribution and regular layout.

171 citations
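For readers unfamiliar with lookahead-style carry combination, the sketch below computes every carry in log2(width) levels by merging generate/propagate pairs, Kogge-Stone style. It illustrates the general parallel-prefix idea only; the Am29050 adder's carry tree and its hybrid lookahead/select structure are organized differently.

```python
def prefix_add(a, b, width=8):
    """Add two width-bit numbers with a parallel-prefix carry tree:
    per-bit generate/propagate signals are merged over doubling spans,
    so all carries are known after log2(width) combining levels."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]  # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]  # propagate
    dist = 1
    while dist < width:
        g = [g[i] | (p[i] & g[i - dist]) if i >= dist else g[i] for i in range(width)]
        p = [p[i] & p[i - dist] if i >= dist else p[i] for i in range(width)]
        dist *= 2
    carry_in = [0] + g[:-1]      # carry into bit i = group generate of bits 0..i-1
    out = 0
    for i in range(width):
        out |= ((((a >> i) ^ (b >> i)) ^ carry_in[i]) & 1) << i
    return out
```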


Journal ArticleDOI
TL;DR: A fault-tolerant communication scheme that facilitates near-optimal routing and broadcasting in hypercube computers subject to node failures is described, and it is shown that by only using 'feasible' paths that try to avoid unsafe nodes, routing and broadcasting can be substantially simplified.
Abstract: A fault-tolerant communication scheme that facilitates near-optimal routing and broadcasting in hypercube computers subject to node failures is described. The concept of an unsafe node is introduced to identify fault-free nodes that may cause communication difficulties. It is shown that by only using 'feasible' paths that try to avoid unsafe nodes, routing and broadcasting can be substantially simplified. A computationally efficient routing algorithm that uses local information is presented. It can route a message via a path of length no greater than p+2, where p is the minimum distance from the source to the destination, provided that not all nonfaulty nodes in the hypercube are unsafe. Broadcasting can be achieved under the same fault conditions with only one more time unit than the fault-free case. The problems posed by deadlock in faulty hypercubes are discussed, and deadlock-free implementations of the proposed communication schemes are presented.

162 citations
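A toy version of the feasible-path idea (names and details are assumptions, and the detour handling that bounds path length by p+2 is omitted): at each step, forward along any dimension that moves closer to the destination and whose neighbor is neither faulty nor unsafe.

```python
def feasible_route(src, dst, faulty, unsafe, n_dims):
    """Greedy hypercube routing using only local information: prefer
    neighbors that reduce the Hamming distance to dst while avoiding
    faulty and unsafe nodes (the destination itself is always allowed)."""
    node, path = src, [src]
    while node != dst:
        for d in range(n_dims):
            nxt = node ^ (1 << d)                   # neighbor across dimension d
            closer = bin(nxt ^ dst).count("1") < bin(node ^ dst).count("1")
            if closer and nxt not in faulty and (nxt == dst or nxt not in unsafe):
                node = nxt
                path.append(node)
                break
        else:
            raise RuntimeError("no feasible preferred neighbor; a detour is required")
    return path
```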


Journal ArticleDOI
TL;DR: It is shown that almost all functions are not sensitive to variable ordering, and a tight upper bound of (2^n/n)(2+ε) for the worst-case OBDD size is derived.
Abstract: The behavior of ordered binary decision diagrams (OBDD) for general Boolean functions is studied. A tight upper bound of (2^n/n)(2+ε) for the worst-case OBDD size is derived. Although the size of an OBDD is dependent on the ordering of decision variables, it is shown that almost all functions are not sensitive to variable ordering.

150 citations
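The shape of the bound follows from a standard counting sketch (not necessarily the paper's exact derivation): for any cut level j, the top n-j levels of an OBDD contain at most a complete binary tree of nodes, while the reduced nodes in the bottom j levels all represent distinct functions of at most j variables.

```latex
|\mathrm{OBDD}(f)| \;\le\;
  \underbrace{2^{\,n-j} - 1}_{\text{top } n-j \text{ levels}}
  \;+\;
  \underbrace{2^{\,2^{j}}}_{\text{distinct subfunctions of } \le j \text{ variables}}
  \qquad \text{for every } 0 \le j \le n .
```

Choosing j so that 2^j ≈ n - log2(n) makes both terms roughly 2^n/n, which is where the (2^n/n)(2+ε) worst-case bound comes from.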


Journal ArticleDOI
TL;DR: The cache invalidation patterns of several parallel applications are analyzed and a classification scheme for data objects found in parallel programs is proposed, indicating that cache line sizes in the 32-byte range yield the lowest data and invalidation traffic.
Abstract: The cache invalidation patterns of several parallel applications are analyzed. The results are based on multiprocessor simulations with 8, 16, and 32 processors. To provide deeper insight into the observed invalidation behavior, the invalidations observed in the simulations are linked to the high-level objects causing them in the programs. To predict what the invalidation patterns would look like beyond 32 processors, a classification scheme for data objects found in parallel programs is proposed. The classification scheme provides a powerful conceptual tool to reason about the invalidation patterns of parallel applications. Results indicate that it should be possible to scale well-written parallel programs to a large number of processors without an explosion in invalidation traffic. At the same time, the invalidation patterns are such that directory-based schemes with just a few pointers per entry can be very effective. The variations in invalidation behavior with different cache line sizes are discussed. The results indicate that cache line sizes in the 32-byte range yield the lowest data and invalidation traffic.

150 citations


Journal ArticleDOI
TL;DR: In this paper, a general synthesis method for efficiently implementing any family of Boolean functions over a set of variables, as a self-timed logic module, is proposed, and a formal proof of correctness is provided.
Abstract: The authors propose a general synthesis method for efficiently implementing any family of Boolean functions over a set of variables, as a self-timed logic module. Interval temporal logic is used to express the constraints that are formulated for the self-timed logic module. A method is provided for proving the correct behavior of the designed circuit, by showing that it obeys all the functional constraints. The resulting circuit is compared with alternative proposed self-timed methodologies. This approach is shown to require fewer gates than other methods. The proposed method is appropriate for automatic synthesis of self-timed systems. A formal proof of correctness is provided.

Journal ArticleDOI
TL;DR: An optimal on-line scheduler is given for a set of real-time tasks with one common deadline on m processors and it is shown that no optimal scheduler can exist for tasks with two distinct deadlines.
Abstract: An optimal on-line scheduler is given for a set of real-time tasks with one common deadline on m processors. It is shown that no optimal scheduler can exist for tasks with two distinct deadlines. Finally, an optimal on-line scheduler is given for situations where processors can go down unexpectedly.

Journal ArticleDOI
TL;DR: Structures for parallel multipliers of a class of fields GF(2^m) based on irreducible all one polynomials (AOP) and equally spaced polynomials (ESP) are presented, and it is shown that it is advantageous to use the ESP-based parallel multiplier.
Abstract: Structures for parallel multipliers of a class of fields GF(2^m) based on irreducible all one polynomials (AOP) and equally spaced polynomials (ESP) are presented. The structures are simple and modular, which is important for hardware realization. Relationships between an irreducible AOP and the corresponding irreducible ESPs have been exploited to construct ESP-based multipliers of large fields by a regular expansion of the basic modules of the AOP-based multiplier of a small field. Some features of the structures also enable a fast implementation of squaring and multiplication algorithms and therefore make fast exponentiation and inversion possible. It is shown that, if for a certain degree, an irreducible AOP as well as an irreducible ESP exist, then from the complexity point of view, it is advantageous to use the ESP-based parallel multiplier.
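As a concrete reference for the arithmetic involved, here is a plain software multiplier for GF(2^m): operands are polynomials over GF(2) with bit i holding the coefficient of x^i, multiplied carry-free and then reduced modulo the irreducible polynomial. This is a generic sketch, not the paper's AOP/ESP parallel structures; the degree-4 AOP in the comment is just an example modulus.

```python
def gf2m_mult(a, b, m, poly):
    """Multiply a and b in GF(2^m). Example: m = 4 with the all one
    polynomial x^4 + x^3 + x^2 + x + 1, i.e. poly = 0b11111
    (irreducible because 5 is prime and 2 is a primitive root mod 5)."""
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i                  # carry-free add of shifted rows
    for i in range(2 * m - 2, m - 1, -1):   # reduce the degree back below m
        if (prod >> i) & 1:
            prod ^= poly << (i - m)
    return prod
```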

Journal ArticleDOI
TL;DR: An adaptive algorithm for managing fully associative cache memories shared by several identifiable processes is presented and it is shown that such an increase in hit-ratio in a system with a heavy throughput of I/O requests can provide a significant decrease in disk response time.
Abstract: An adaptive algorithm for managing fully associative cache memories shared by several identifiable processes is presented. The on-line algorithm extends an earlier model due to H.S. Stone et al. (1989) and partitions the cache storage in disjoint blocks whose sizes are determined by the locality of the processes accessing the cache. Simulation results of traces for 32-MB disk caches show a relative improvement in the overall and read hit-ratios in the range of 1% to 2% over those generated by a conventional least recently used replacement algorithm. The analysis of a queuing network model shows that such an increase in hit-ratio in a system with a heavy throughput of I/O requests can provide a significant decrease in disk response time.

Journal ArticleDOI
TL;DR: A performance analysis shows that through the integration of agile sources or receivers, and wavelength division multiple access, systems can be developed with significant increases in performance yet at a reduction in communication subsystem complexity.
Abstract: A hypercube-based structure in which optical multiple access channels span the dimensional axes is introduced. This severely reduces the required degree, since only one I/O port is required per dimension. However, good performance is maintained through the high-capacity characteristics of optical communication. The reduction in degree is shown to have significant system complexity implications. Four star-coupled configurations are studied as the basis for the optical multiple access channels, three of which exhibit the optical self-routing characteristic. A performance analysis shows that through the integration of agile sources or receivers, and wavelength division multiple access, systems can be developed with significant increases in performance yet at a reduction in communication subsystem complexity.

Journal ArticleDOI
TL;DR: Three novel methods for realizing this class of reduced complexity single-precision multipliers are introduced and their performance analyzed.
Abstract: When two numbers are multiplied, a double-wordlength product is produced. In applications where only the single-precision product is required, the double-wordlength result is rounded to single-precision. Hence, in single-precision applications, it is not necessary to compute the least significant part of the product exactly. Instead, it is only necessary to estimate the carries generated in the computation of the least significant part that will ripple into the most significant part of the product. This will produce a single-precision multiplier with significantly reduced circuit complexity. Three novel methods for realizing this class of reduced complexity single-precision multipliers are introduced and their performance analyzed.
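A minimal sketch of the idea, assuming the simplest member of this family, a constant-correction truncated multiplier (the paper's three methods estimate the rippling carries more carefully): each partial-product row keeps only the columns that land in the upper half, and a constant stands in for the carries the dropped columns would have produced.

```python
def truncated_mult(a, b, n):
    """Estimate the upper n bits of the 2n-bit product a*b. The
    constant (n - 1) // 4 is the expected carry contribution of the
    discarded columns for uniformly random inputs (a rough, assumed
    correction)."""
    hi = 0
    for i in range(n):
        if (b >> i) & 1:
            hi += (a << i) >> n   # keep only columns n .. 2n-1 of this row
    return hi + (n - 1) // 4      # estimated carries from dropped columns
```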

Journal ArticleDOI
TL;DR: The authors propose a load-balancing algorithm that determines the optimal load for each host so as to minimize the overall mean job response time in a distributed computer system that consists of heterogeneous hosts.
Abstract: The authors propose a load-balancing algorithm that determines the optimal load for each host so as to minimize the overall mean job response time in a distributed computer system that consists of heterogeneous hosts. The algorithm is a simplified and easily understandable version of the single-point algorithm originally presented by A.N. Tantawi and D. Towsley (1985).
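For intuition, if every host is modeled as an M/M/1 queue, the optimal split equalizes the marginal delay mu_i/(mu_i - lambda_i)^2 across all hosts that receive any load, and a one-dimensional search finds the common value. This textbook sketch is only in the spirit of the parametric single-point approach; the paper's model and procedure differ in detail.

```python
from math import sqrt

def optimal_split(mu, lam_total):
    """Split arrival rate lam_total among M/M/1 hosts with service
    rates mu (requires lam_total < sum(mu)): bisect on the common
    marginal delay alpha, where host i receives mu_i - sqrt(mu_i/alpha)
    if that is positive and nothing otherwise."""
    def loads(alpha):
        return [max(0.0, m - sqrt(m / alpha)) for m in mu]
    lo, hi = 1e-12, 1e12
    for _ in range(200):             # total assigned load grows with alpha
        mid = (lo + hi) / 2
        if sum(loads(mid)) < lam_total:
            lo = mid
        else:
            hi = mid
    return loads(hi)
```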

Journal ArticleDOI
TL;DR: A distributed system-level diagnosis algorithm called Adaptive DSD is shown to minimize network resources, has led to a practical implementation, and is proven correct in that each fault-free node reaches an accurate, independent diagnosis of the fault conditions of the remaining nodes.
Abstract: The practical application and implementation of online distributed system-level diagnosis theory is documented. Proven distributed diagnosis algorithms are shown to be impractical in real systems due to high resource requirements. A distributed system-level diagnosis algorithm called Adaptive DSD is shown to minimize network resources and has resulted in a practical implementation. Adaptive DSD assumes a distributed network, in which network nodes can test other nodes and determine them to be faulty or fault-free. Tests are issued from each node adaptively and depend on the fault situation of the network. Test result reports are generated from test results and forwarded between nodes in the network. Adaptive DSD is proven correct in that each fault-free node reaches an accurate independent diagnosis of the fault conditions of the remaining nodes. No restriction is placed on the number of faulty nodes; any fault situation with any number of faulty nodes is diagnosed correctly. An implementation of the Adaptive DSD algorithm is described.

Journal ArticleDOI
TL;DR: A constant-factor redundant-CORDIC (CFR-CORDIC) scheme, where the scale factor is kept constant while an angle for plane rotations is computed, is developed and found to provide an execution time similar to that of redundant CORDIC with a variable scaling factor, with a significant saving in area.
Abstract: A constant-factor redundant-CORDIC (CFR-CORDIC) scheme, where the scale factor is kept constant while an angle for plane rotations is computed, is developed. The direction of rotation is determined from an estimate of the sign, and convergence is assured by suitably placed correcting iterations. The number of iterations in the CORDIC rotation unit is reduced by about 25% by expressing the direction of the rotation in radix-2 and radix-4, and conversion to conventional representation is done on the fly. The performance of CFR-CORDIC is estimated and compared with that of previously proposed schemes. It is found to provide an execution time similar to that of redundant CORDIC with a variable scaling factor, with a significant saving in area.
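For context, the baseline (non-redundant) CORDIC rotation that CFR-CORDIC accelerates looks as follows: shift-and-add micro-rotations steered by the sign of the residual angle, followed by compensation of the accumulated scale factor. A reference sketch only; the paper's contribution is keeping that factor constant under redundant arithmetic, where the sign is only estimated.

```python
import math

def cordic_rotate(x, y, angle, iters=32):
    """Rotate (x, y) by 'angle' (radians, |angle| < ~1.74). Each step
    uses only shifts and adds in fixed-point hardware; the final
    multiply undoes the known, data-independent scale factor."""
    z = angle
    for i in range(iters):
        d = 1 if z >= 0 else -1                 # direction of micro-rotation
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        z -= d * math.atan(2.0**-i)
    k = 1.0
    for i in range(iters):
        k /= math.sqrt(1 + 4.0**-i)             # accumulated magnitude per step
    return x * k, y * k
```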

Journal ArticleDOI
TL;DR: Two new schedulers are proposed which improve this 2/3 result to a new 0.7 density threshold and can be viewed as a generalization of the previously known schedulers, i.e., they can handle a larger class of pinwheel instances, including all instances schedulable by the previously known techniques.
Abstract: The pinwheel is a hard-real-time scheduling problem for scheduling satellite ground stations to service a number of satellites without data loss. Given a multiset of positive integers (instance) A=(a_1, ..., a_n), the problem is to find an infinite sequence (schedule) of symbols from {1, 2, ..., n} such that there is at least one symbol i within any interval of a_i symbols (slots). Not all instances A can be scheduled; for example, no 'successful' schedule exists for instances whose density is larger than 1. It has been shown that any instance whose density is less than 2/3 can always be scheduled. Two new schedulers are proposed which improve this 2/3 result to a new 0.7 density threshold. These two schedulers can be viewed as a generalization of the previously known schedulers, i.e., they can handle a larger class of pinwheel instances, including all instances schedulable by the previously known techniques.
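To make the problem concrete, here is a naive greedy scheduler, purely illustrative: it is not one of the paper's schedulers and does not achieve the 0.7 threshold. In every slot it serves the task whose window is closest to expiring and reports failure when some window is missed.

```python
def greedy_pinwheel(a, n_slots):
    """Serve, in each slot, the task whose remaining window is
    smallest; gap[j] counts the slots left before an interval of
    a[j] slots would contain no occurrence of j."""
    gap = list(a)
    schedule = []
    for _ in range(n_slots):
        i = min(range(len(a)), key=lambda j: gap[j])   # most urgent task
        schedule.append(i)
        gap[i] = a[i]                                  # serving i restarts its window
        for j in range(len(a)):
            if j != i:
                gap[j] -= 1
                if gap[j] <= 0:
                    raise RuntimeError(f"task {j} missed its window")
    return schedule
```

For example, greedy_pinwheel((2, 4), 8) returns [0, 0, 0, 1, 0, 0, 0, 1], but the heuristic already fails on the schedulable instance (2, 4, 4), which is exactly the kind of gap the paper's specialized schedulers close.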

Journal ArticleDOI
TL;DR: A systolic structure for bit-serial division over the field GF(2^m) is developed that avoids global data communication and dependence of the time-step duration on m, which is important for applications where the value of m is large.
Abstract: A systolic structure for bit-serial division over the field GF(2^m) is developed. Care is taken to avoid global data communication and any dependence of the time-step duration on m; this is important for applications where the value of m is large. The divider requires only three basic processors and one simple control signal, and its circuit and time complexities are proportional to m^2 and m, respectively. It does not depend on the irreducible polynomial and can be expanded easily. Moreover, with m additional simple processors, a bit-serial systolic multiplier is developed which uses part of the divider structure. This is advantageous from the implementation point of view, as both the divider and multiplier can be fabricated on a single chip, resulting in a reduction of area.

Journal ArticleDOI
TL;DR: An improved method which guarantees a constant scale factor when employing redundant addition schemes and an architecture with increased parallelism which considerably reduces the CORDIC latency time and the amount of hardware is described.
Abstract: Several methods for increasing the speed of the CORDIC algorithm are presented. First, an improved method which guarantees a constant scale factor when employing redundant addition schemes is developed. Then, an architecture with increased parallelism which considerably reduces the CORDIC latency time and the amount of hardware is described.

Journal ArticleDOI
TL;DR: A case study of the impact of transient faults on a microprocessor-based jet-engine controller is used to identify the critical fault propagation paths, the module most sensitive to fault propagation, and the module with the highest potential for causing external errors.
Abstract: FOCUS, a simulation environment for conducting fault-sensitivity analysis of chip-level designs, is described. The environment can be used to evaluate alternative design tactics at an early design stage. A range of user specified faults is automatically injected at runtime, and their propagation to the chip I/O pins is measured through the gate and higher levels. A number of techniques for fault-sensitivity analysis are proposed and implemented in the FOCUS environment. These include transient impact assessment on latch, pin and functional errors, external pin error distribution due to in-chip transients, charge-level sensitivity analysis, and error propagation models to depict the dynamic behavior of latch errors. A case study of the impact of transient faults on a microprocessor-based jet-engine controller is used to identify the critical fault propagation paths, the module most sensitive to fault propagation, and the module with the highest potential for causing external errors.

Journal ArticleDOI
TL;DR: By comparing synthetic traces with real traces of identical locality parameters, it is demonstrated that synthetic traces exhibit miss ratios and lifetime functions that compare well with those of the real traces they mimic, both in fully associative and in set-associative memories.
Abstract: Two techniques for producing synthetic address traces that produce good emulations of the locality of reference of real programs are presented. The first algorithm generates synthetic addresses by simulating a random walk in an infinite address-space with references governed by a hyperbolic probability law. The second algorithm is a refinement of the first in which the address space has a given finite size. The basic model for the random walk has two parameters that correspond to the working set size and the locality of reference. By comparing synthetic traces with real traces of identical locality parameters, it is demonstrated that synthetic traces exhibit miss ratios and lifetime functions that compare well with those of the real traces they mimic, both in fully associative and in set-associative memories.
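A loose sketch of the first algorithm's flavor, with an assumed heavy-tailed jump law standing in for the paper's hyperbolic probability law: addresses follow a random walk in which short jumps dominate (locality of reference) and rare long jumps relocate the working set.

```python
import random

def synthetic_trace(length, tail=2.0, start=0):
    """Random-walk address trace: jump lengths are Pareto-distributed
    (heavy-tailed), so most references stay near the current address
    while occasional large steps move the walk elsewhere. 'tail' is an
    assumed locality parameter, not the paper's notation."""
    addr, trace = start, []
    for _ in range(length):
        step = int(random.paretovariate(tail))      # >= 1, heavy-tailed
        addr += random.choice((-1, 1)) * step
        trace.append(addr)
    return trace
```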

Journal ArticleDOI
TL;DR: An efficient hierarchical clustering and allocation algorithm that drastically reduces the interprocess communications cost while observing lower and upper bounds of utilization for the individual processors is proposed and evaluated.
Abstract: The authors propose and evaluate an efficient hierarchical clustering and allocation algorithm that drastically reduces the interprocess communications cost while observing lower and upper bounds of utilization for the individual processors. They compare the algorithm with branch-and-bound-type algorithms that can produce allocations with minimal communication cost, and show a very encouraging time complexity/suboptimality tradeoff in favor of the algorithm, at least for a class of process clusters and their random combinations which it is believed occur naturally in distributed applications. The heuristic allocation is well suited for a changing environment, where processors may fail or be added to the system and where the workload patterns may change unpredictably and/or periodically.

Journal ArticleDOI
TL;DR: Based on the measurements from two DEC VAX-cluster multicomputer systems, the issue of correlated failures is addressed and two validated models, the c-dependent model and the p-dependent model, are developed to evaluate the dependability of systems with correlated failures.
Abstract: Based on the measurements from two DEC VAX-cluster multicomputer systems, the issue of correlated failures is addressed. In particular, the characteristics of correlated failures, their impact on dependability, and their modeling are discussed. It is found from the data that most correlated failures are related to errors in shared resources and propagate from one machine to another. Comparisons between measurement-based models and analytical models that assume failure independence show that the impact of correlated failures on dependability is significant. Two validated models, the c-dependent model and the p-dependent model, are developed to evaluate the dependability of systems with correlated failures.

Journal ArticleDOI
TL;DR: System and delay models necessary for the study of time performances of synchronous and asynchronous systems are developed and a mode of clocking that reduces the clock skew substantially is proposed and examined.
Abstract: Continuous advances in VLSI technology have made it possible to implement a system on a chip. One consequence of this is that the system will use a homogeneous technology for interconnections, gates, and synchronizers. Another consequence is that the system size and operation speed increase, which leads to increased problems with timing and synchronization. System and delay models necessary for the study of time performances of synchronous and asynchronous systems are developed. Clock skew is recognized as a key factor for the performance of synchronous systems. A mode of clocking that reduces the clock skew substantially is proposed and examined. Time penalty introduced by synchronizers is recognized as a key factor for the performance of asynchronous systems. This parameter is expressed in terms of system parameters. Different techniques and recommendations concerning performance improvement of synchronous and asynchronous systems are discussed.

Journal ArticleDOI
TL;DR: A novel general algorithm for signed number division in the residue number system (RNS) is presented that is simple, efficient, and practical for implementation on a real RNS divider.
Abstract: A novel general algorithm for signed number division in the residue number system (RNS) is presented. The parity checking technique used for sign and overflow detection in this algorithm is more efficient and practical than conventional methods. Sign magnitude arithmetic division is implemented using binary search. There is no restriction to the dividend and the divisor (except zero divisor), and no quotient estimation is necessary before the division is executed. Only simple operations are needed to accomplish this RNS division. All these characteristics have made the algorithm simple, efficient, and practical for implementation on a real RNS divider.
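For readers new to residue arithmetic, the sketch below shows only the representation the divider operates on: a number is held as its residues modulo pairwise-coprime moduli, each digit is processed independently (hence carry-free add and multiply), and magnitude comparisons during the binary search ultimately rest on a weighted reconstruction. The helpers are illustrative; the paper's parity-based sign and overflow detection is not reproduced here.

```python
from math import prod

MODULI = (3, 5, 7)                  # example pairwise-coprime moduli, range 0..104

def to_rns(x, moduli=MODULI):
    """Encode x as its residue digits."""
    return tuple(x % m for m in moduli)

def from_rns(res, moduli=MODULI):
    """Chinese-remainder reconstruction of the weighted value."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(res, moduli)) % M
```

For instance, to_rns(52) == (1, 2, 3) and from_rns((1, 2, 3)) == 52; adding or multiplying two RNS numbers acts digit-by-digit modulo each modulus, with no carries between digits.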

Journal ArticleDOI
TL;DR: It is shown that conditions of the type A+B=K can be evaluated without carry propagation, thus drastically reducing the computation time, and a circuit is proposed for the fast evaluation of such conditions that can significantly improve processor performance.
Abstract: The response time of parallel adders is mainly determined by the carry propagation delay. The evaluation of conditions of the type A+B=K is addressed. Although an addition is involved in the comparison, it is shown that it can be evaluated without carry propagation, thus drastically reducing the computation time. Dependencies produced by branches degrade the performance of pipelined computers. The evaluation of conditions is often one of the critical paths in the execution of branch instructions. A circuit is proposed for the fast evaluation of A+B=K conditions that can significantly improve processor performance.
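The carry-free trick can be stated in a few lines; this is one standard formulation of the condition, and the paper's circuit details differ. From the per-bit sum equation, the carry that must enter bit i for the sum to equal K is a_i xor b_i xor k_i, while the carry actually produced into bit i+1 depends only on position i, so the two bit vectors can be compared with no carry chain.

```python
def sum_equals(a, b, k, width=32):
    """Decide (a + b) mod 2**width == k without an adder.
    required: carry each bit needs, from s_i = a_i ^ b_i ^ c_i = k_i.
    produced: carry each position emits: a_i if a_i == b_i, else ~k_i."""
    mask = (1 << width) - 1
    required = (a ^ b ^ k) & mask
    produced = (((a & b) | ((a ^ b) & ~k)) << 1) & mask
    return required == produced
```

Here sum_equals(3, 5, 8) is True and sum_equals(3, 5, 9) is False; every term is a single level of bitwise logic followed by a wide comparison, with no carry propagation anywhere.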

Journal ArticleDOI
TL;DR: Results of simulation studies are presented that show that the rollback chip can virtually eliminate the state saving and rollback overheads that plague current software implementations of Time Warp.
Abstract: Existing approaches to implement state saving are not appropriate for large Time Warp programs. The authors propose a component called the rollback chip (RBC) that efficiently implements state saving. Such a component could be used in a programmable, special purpose parallel discrete event simulation engine based on Time Warp. The algorithms implemented by the rollback chip are described, as well as mechanisms that allow efficient implementation. Results of simulation studies are presented that show that the rollback chip can virtually eliminate the state saving and rollback overheads that plague current software implementations of Time Warp.