
Showing papers in "IEEE Transactions on Computers in 1986"


Journal ArticleDOI
TL;DR: In this paper, the authors present a data structure for representing Boolean functions and an associated set of manipulation algorithms, which have time complexity proportional to the sizes of the graphs being operated on, and hence are quite efficient as long as the graphs do not grow too large.
Abstract: In this paper we present a new data structure for representing Boolean functions and an associated set of manipulation algorithms. Functions are represented by directed, acyclic graphs in a manner similar to the representations introduced by Lee [1] and Akers [2], but with further restrictions on the ordering of decision variables in the graph. Although a function requires, in the worst case, a graph of size exponential in the number of arguments, many of the functions encountered in typical applications have a more reasonable representation. Our algorithms have time complexity proportional to the sizes of the graphs being operated on, and hence are quite efficient as long as the graphs do not grow too large. We present experimental results from applying these algorithms to problems in logic design verification that demonstrate the practicality of our approach.
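The reduction and apply ideas behind these graphs can be sketched in a few dozen lines. The toy class below is a hypothetical illustration, not Bryant's implementation: it hash-conses nodes so equal subfunctions share structure, and combines two graphs with a memoised Shannon expansion whose cost is bounded by the product of their sizes.

```python
class BDD:
    """Minimal reduced, ordered BDD sketch. Terminals are the ids 0 and 1;
    internal nodes are hash-consed triples (var, low, high)."""

    def __init__(self, num_vars):
        self.n = num_vars
        self.table = {}                 # unique table: (var, low, high) -> id
        self.nodes = {0: None, 1: None}
        self.next_id = 2

    def mk(self, var, low, high):
        # Reduction rule: a node whose branches coincide is redundant.
        if low == high:
            return low
        key = (var, low, high)
        if key not in self.table:
            self.table[key] = self.next_id
            self.nodes[self.next_id] = key
            self.next_id += 1
        return self.table[key]

    def var(self, i):
        # The function "variable i".
        return self.mk(i, 0, 1)

    def apply(self, op, u, v, memo=None):
        # Shannon expansion on the topmost variable of u and v, memoised.
        if memo is None:
            memo = {}
        if u in (0, 1) and v in (0, 1):
            return op(u, v)
        if (u, v) in memo:
            return memo[(u, v)]
        vu = self.nodes[u][0] if u > 1 else self.n
        vv = self.nodes[v][0] if v > 1 else self.n
        top = min(vu, vv)
        u0, u1 = (self.nodes[u][1], self.nodes[u][2]) if vu == top else (u, u)
        v0, v1 = (self.nodes[v][1], self.nodes[v][2]) if vv == top else (v, v)
        r = self.mk(top, self.apply(op, u0, v0, memo),
                    self.apply(op, u1, v1, memo))
        memo[(u, v)] = r
        return r
```

Because nodes are hash-consed, equivalent functions get identical ids, so equivalence checking (the logic-verification use case above) is a pointer comparison.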

9,021 citations


Journal ArticleDOI
Fortes
TL;DR: A technique for partitioning and mapping algorithms into VLSI systolic arrays is presented; both the mapping and the partitioning are based on algorithm transformations.
Abstract: A technique for partitioning and mapping algorithms into VLSI systolic arrays is presented in this paper. Algorithm partitioning is essential when the size of a computational problem is larger than the size of the VLSI array intended for that problem. Computational models are introduced for systolic arrays and iterative algorithms. First, we discuss the mapping of algorithms into arbitrarily large size VLSI arrays. This mapping is based on the idea of algorithm transformations. Then, we present an approach to algorithm partitioning which is also based on algorithm transformations. Our approach to the partitioning problem is to divide the algorithm index set into bands and to map these bands into the processor space. The partitioning and mapping technique developed throughout the paper is summarized as a six step procedure. A computer program implementing this procedure was developed and some results obtained with this program are presented.

490 citations


Journal ArticleDOI
Kim
TL;DR: The advantages and limitations of the proposed disk interleaving scheme are analyzed using the M/G/1 queueing model and compared to the conventional disk access mechanism.
Abstract: A group of disks may be interleaved to speed up data transfers in a manner analogous to the speedup achieved by main memory interleaving. Conventional disks may be used for interleaving by spreading data across disks and by treating multiple disks as if they were a single one. Furthermore, the rotation of the interleaved disks may be synchronized to simplify control and also to optimize performance. In addition, checksums may be placed on separate check-sum disks in order to improve reliability. In this paper, we study synchronized disk interleaving as a high-performance mass storage system architecture. The advantages and limitations of the proposed disk interleaving scheme are analyzed using the M/G/1 queueing model and compared to the conventional disk access mechanism.
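The M/G/1 analysis rests on the Pollaczek-Khinchine formula for mean queueing delay. The sketch below is an illustration of that formula only, not the paper's full model; it shows how interleaving, by shrinking the per-request transfer time, reduces both utilization and waiting.

```python
def mg1_wait(lam, es, es2):
    """Mean waiting time in queue for an M/G/1 server
    (Pollaczek-Khinchine): lam = arrival rate, es = E[S],
    es2 = E[S^2]. Requires utilization rho = lam*E[S] < 1."""
    rho = lam * es
    assert rho < 1, "queue is unstable"
    return lam * es2 / (2.0 * (1.0 - rho))

# Hypothetical numbers: 0.5 requests/s, deterministic 1 s transfers.
single = mg1_wait(0.5, 1.0, 1.0)
# 4-way interleaving cuts the transfer to 0.25 s (E[S^2] = 0.0625).
interleaved = mg1_wait(0.5, 0.25, 0.0625)
```

With deterministic service, 4-way interleaving here cuts mean waiting from 0.5 s to under 0.02 s, since both the numerator (E[S^2]) and the utilization shrink.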

332 citations


Journal ArticleDOI
TL;DR: The problem considered in this paper is the deterministic scheduling of tasks on a set of identical processors, but the model presented differs from the classical one by the requirement that certain tasks need more than one processor at a time for their processing.
Abstract: The problem considered in this paper is the deterministic scheduling of tasks on a set of identical processors. However, the model presented differs from the classical one by the requirement that certain tasks need more than one processor at a time for their processing. This assumption is especially justified in some microprocessor applications and its impact on the complexity of minimizing schedule length is studied. First we concentrate on the problem of nonpreemptive scheduling. In this case, polynomial-time algorithms exist only for unit processing times. We present two such algorithms of complexity O(n) for scheduling tasks requiring an arbitrary number of processors between 1 and k at a time where k is a fixed integer. The case for which k is not fixed is shown to be NP-complete. Next, the problem of preemptive scheduling of tasks of arbitrary length is studied. First an algorithm for scheduling tasks requiring one or k processors is presented. Its complexity depends linearly on the number of tasks. Then, the possibility of a linear programming formulation for the general case is analyzed.
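As a rough illustration of the model (not the paper's optimal O(n) algorithms), unit-time tasks that each demand several processors at once can be packed greedily into time slots:

```python
def pack_unit_tasks(sizes, m):
    """Greedy first-fit-decreasing packing of unit-time tasks onto m
    identical processors, where sizes[i] is the number of processors
    task i needs simultaneously. Returns the schedule length (number
    of unit-time slots used). A heuristic sketch, not the paper's
    exact algorithm."""
    slots = []                       # remaining capacity of each slot
    for need in sorted(sizes, reverse=True):
        for i, free in enumerate(slots):
            if free >= need:
                slots[i] -= need     # task fits into an existing slot
                break
        else:
            slots.append(m - need)   # open a new unit-time slot
    return len(slots)
```

For instance, tasks needing 3, 2, 2, and 1 processors on m = 4 pack into two slots ({3,1} and {2,2}), matching the total-work lower bound of 8/4 = 2.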

275 citations


Journal ArticleDOI
TL;DR: An automatic verification system for sequential circuits in which specifications are expressed in a propositional temporal logic; the system requires no user assistance and is quite fast, with experimental results showing that state machines with several hundred states can be checked for correctness in a matter of seconds.
Abstract: Verifying the correctness of sequential circuits has been an important problem for a long time. But lack of any formal and efficient method of verification has prevented the creation of practical design aids for this purpose. Since all the known techniques of simulation and prototype testing are time consuming and not very reliable, there is an acute need for such tools. In this paper we describe an automatic verification system for sequential circuits in which specifications are expressed in a propositional temporal logic. In contrast to most other mechanical verification systems, our system does not require any user assistance and is quite fast—experimental results show that state machines with several hundred states can be checked for correctness in a matter of seconds!
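Such a checker boils down to fixpoint computations over the state graph. Below is a minimal sketch of one of them, the set of states satisfying EF p ("p is reachable"), computed by backward reachability; this illustrates the idea only and is not the authors' system.

```python
def check_EF(transitions, labels, prop):
    """States satisfying EF prop: the least fixpoint of
    Z = {s : prop holds in s} ∪ pre(Z), computed by iterating
    backward reachability to convergence.
    transitions: dict state -> iterable of successor states.
    labels: dict state -> set of atomic propositions true there."""
    sat = {s for s, props in labels.items() if prop in props}
    changed = True
    while changed:
        changed = False
        for s, succs in transitions.items():
            if s not in sat and any(t in sat for t in succs):
                sat.add(s)
                changed = True
    return sat
```

Each pass adds only states with a successor already in the set, so the loop runs at most |S| times and the whole check is polynomial in the state graph, which is why hundreds of states are checked in seconds.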

217 citations


Journal ArticleDOI
Yamakawa, Miki
TL;DR: Nine basic fuzzy logic circuits employing p-ch and n-ch current mirrors are presented, and the fuzzy information processing hardware system design at a low cost with only one kind of master slice (semicustom fuzzy logic IC) is described.
Abstract: Nine basic fuzzy logic circuits employing p-ch and n-ch current mirrors are presented, and the fuzzy information processing hardware system design at a low cost with only one kind of master slice (semicustom fuzzy logic IC) is described. The fuzzy logic circuits presented here will be indispensable for a "fuzzy computer" in the near future.

180 citations


Journal ArticleDOI
TL;DR: In this paper, an approximation algorithm for systematically converting a stiff Markov chain into a nonstiff chain with a smaller state space is discussed, and the algorithm proceeds by further classifying fast states into fast recurrent subsets and a fast transient subset.
Abstract: An approximation algorithm for systematically converting a stiff Markov chain into a nonstiff chain with a smaller state space is discussed in this paper. After classifying the set of all states into fast and slow states, the algorithm proceeds by further classifying fast states into fast recurrent subsets and a fast transient subset. A separate analysis of each of these fast subsets is done and each fast recurrent subset is replaced by a single slow state while the fast transient subset is replaced by a probabilistic switch. After this reduction, the remaining small and nonstiff Markov chain is analyzed by a conventional technique.

175 citations


Journal ArticleDOI
TL;DR: The authors examine instruction issuing schemes for processors with multiple functional units, such as the CRAY-1, Cyber 205, and FPS 164, arguing that existing schemes do not issue enough instructions per cycle to fully utilize the functional units.
Abstract: Processors with multiple functional units, such as CRAY-1, Cyber 205, and FPS 164, have been used for high-end scientific computation tasks. Much effort has been put into increasing the throughput of such systems. One critical consideration in their design is the identification and implementation of a suitable instruction issuing scheme. Existing approaches do not issue enough instructions per machine cycle to fully utilize the functional units and realize the high-performance level achievable with these powerful execution resources.

164 citations


Journal ArticleDOI
TL;DR: The results show that only a certain class of cellular automata rules exhibit group characteristics based on rule multiplication, however, many other of these automata reveal groups based on permutations of their global states.
Abstract: The study of one-dimensional cellular automata exhibiting group properties is presented. The results show that only a certain class of cellular automata rules exhibit group characteristics based on rule multiplication. However, many others of these automata reveal groups based on permutations of their global states. It is further shown how these groups may be utilized in the design of modulo arithmetic units. The communication properties of cellular automata are observed to map favorably to optimal communication graphs for VLSI layouts. They exploit the implementation medium and properly address the physical limits on computational structures. Comparisons of cellular automata-based modulo arithmetic units with other VLSI algorithms are presented using area-time complexity measures.
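Rule 90 is the standard example of an additive rule of this kind: each cell's next state is the XOR of its two neighbours, so the global update is a linear map over GF(2), the property that underlies group structure and modulo-arithmetic designs. A small sketch (an illustration, not the paper's constructions):

```python
def rule90_step(cells):
    """One synchronous update of elementary CA rule 90 on a circular
    lattice of 0/1 cells: each cell becomes the XOR of its neighbours."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ cells[(i + 1) % n] for i in range(n)]
```

Additivity means the step commutes with cellwise XOR: evolving the XOR of two configurations gives the XOR of their evolutions, which is exactly the algebraic structure an arithmetic unit can exploit.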

152 citations


Journal ArticleDOI
TL;DR: A new model for parallel computations and parallel computer systems, based on data flow principles, is presented; uninterpreted data flow graphs can be used to model computer systems including data driven and parallel processors.
Abstract: In this paper, a new model for parallel computations and parallel computer systems that is based on data flow principles is presented. Uninterpreted data flow graphs can be used to model computer systems including data driven and parallel processors. A data flow graph is defined to be a bipartite graph with actors and links as the two vertex classes. Actors can be considered similar to transitions in Petri nets, and links similar to places. The nondeterministic nature of uninterpreted data flow graphs necessitates the derivation of liveness conditions.

142 citations


Journal ArticleDOI
De Souza E Silva, Gail
TL;DR: The distribution of cumulative operational time is calculated, which is the distribution of the total time during which the system was in operation over a finite observation period, based on the randomization technique.
Abstract: We consider repairable computer systems, those for which repair can be performed to put the system back in operation. The behavior of the system is assumed to be modeled as a homogeneous Markov process. We calculate numerically the distribution of cumulative operational time, which is the distribution of the total time during which the system was in operation over a finite observation period. The method is based on the randomization technique. The main advantages include the ability to specify error tolerances in advance, numerical stability, and simplicity of implementation. We also show that other quantities of interest can be calculated as a byproduct of the method without any significant extra computational effort.
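The randomization (uniformization) technique subordinates the Markov process to a Poisson process whose rate dominates every exit rate, then weights powers of the uniformized chain by Poisson probabilities. A bare-bones sketch for transient state probabilities follows; the paper computes the more involved cumulative-operational-time distribution, but the error-tolerance and stability advantages come from the same Poisson-weighted sum.

```python
import math

def uniformized_transient(Q, p0, t, eps=1e-9):
    """Transient distribution p(t) of a CTMC by uniformization:
    p(t) = sum_k Poisson(Lam*t; k) * p0 * P^k, with P = I + Q/Lam and
    Lam >= max exit rate. Q is a dense generator (lists of lists),
    p0 the initial distribution. Truncation error is at most eps."""
    n = len(Q)
    Lam = max(-Q[i][i] for i in range(n)) or 1.0
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / Lam for j in range(n)]
         for i in range(n)]
    v = list(p0)                       # v = p0 * P^k, as a row vector
    out = [0.0] * n
    k, weight, total = 0, math.exp(-Lam * t), 0.0
    while total < 1.0 - eps:
        for i in range(n):
            out[i] += weight * v[i]    # add the k-th Poisson term
        total += weight
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        k += 1
        weight *= Lam * t / k          # Poisson pmf recurrence
    return out
```

Truncating when the accumulated Poisson mass reaches 1 - eps is what lets the error tolerance be fixed in advance, as the abstract notes.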

Journal ArticleDOI
Reddy
TL;DR: It is shown that all single FET stuck-open faults, in a specific design using a single CMOS complex gate, are detectable by tests that remain valid in the presence of arbitrary circuit delays.
Abstract: In this paper, potential invalidation of stuck-open fault-detecting tests, derived by neglecting circuit delays and charge distribution in CMOS logic circuits, is studied. Several classes of circuits derived from sum of products and product of sums expressions for a given combinational logic function are investigated to determine the testability of FET stuck-open faults by tests which will remain valid in the presence of arbitrary circuit delays. Necessary and sufficient conditions for the existence of tests that will remain valid in the presence of arbitrary circuit delays are derived. Using these conditions, it is shown that all single FET stuck-open faults, in a specific design using a single CMOS complex gate, are detectable by tests that remain valid in the presence of arbitrary circuit delays. For several other realizations, methods to augment them to ensure detectability of all single FET stuck-open faults by tests that will remain valid in the presence of arbitrary circuit delays are proposed. It is observed that in many of the logic circuits investigated it is also possible to avoid test invalidation due to charge distribution.

Journal ArticleDOI
TL;DR: A key element (one is tempted to say the heart) of most digital systems is the clock, which determines the rate at which data are processed, and so should be made as small as possible, consistent with reliable operation.
Abstract: A key element (one is tempted to say the heart) of most digital systems is the clock. Its period determines the rate at which data are processed, and so should be made as small as possible, consistent with reliable operation.

Journal ArticleDOI
TL;DR: A dynamic programming algorithm is presented that ensures that backup, or contingency, schedules can be efficiently embedded within the original, "primary" schedule to ensure that hard deadlines continue to be met in the face of up to a given maximum number of processor failures.
Abstract: Multiprocessors used in life-critical real-time systems must recover quickly from failure. Part of this recovery consists of switching to a new task schedule that ensures that hard deadlines for critical tasks continue to be met. We present a dynamic programming algorithm that ensures that backup, or contingency, schedules can be efficiently embedded within the original, "primary" schedule to ensure that hard deadlines continue to be met in the face of up to a given maximum number of processor failures. Several illustrative examples are included.

Journal ArticleDOI
Chan, Saad
TL;DR: Several mappings of the mesh points onto the nodes of the cube, based on binary reflected Gray codes, are presented; the result is a communication-effective implementation of multigrid algorithms on the hypercube multiprocessor.
Abstract: This paper examines several ways of implementing multigrid algorithms on the hypercube multiprocessor. We consider both the standard multigrid algorithms and a concurrent version proposed by Gannon and Van Rosendale. We present several mappings of the mesh points onto the nodes of the cube. The main property of these mappings, which are based on binary reflected Gray codes, is that the distance between neighboring grid points remains constant from one grid level to another. This results in a communication effective implementation of multigrid algorithms on the hypercube multiprocessor.
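The Gray-code property behind these mappings can be checked directly: the binary reflected Gray code sends consecutive grid indices to hypercube nodes one bit apart, and (for the ranges checked below) indices a power of two apart to nodes exactly two bits apart, so neighbour distance does not grow as the grid coarsens. A small illustrative sketch, not the authors' code:

```python
def gray(i):
    # Binary reflected Gray code of i.
    return i ^ (i >> 1)

def hamming(a, b):
    # Number of bit positions in which a and b differ
    # (= hypercube distance between nodes a and b).
    return bin(a ^ b).count("1")
```

On a hypercube, Hamming distance is exactly the number of hops between nodes, so constant Hamming distance between grid neighbours at every level is what makes the multigrid communication cheap.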

Journal ArticleDOI
TL;DR: The measurements show that the probability of a CPU related error (the load hazard) increases nonlinearly with increasing workload; i.e., the CPU rapidly deteriorates as end points are reached.
Abstract: This paper proposes and validates a methodology to measure explicitly the increase in the risk of a processor error with increasing workload. By relating the occurrence of a CPU related error to the system activity just prior to the occurrence of an error, the approach measures the dynamic CPU workload/failure relationship. The measurements show that the probability of a CPU related error (the load hazard) increases nonlinearly with increasing workload; i.e., the CPU rapidly deteriorates as end points are reached. The load hazard is observed to be most sensitive to system CPU utilization, the I/O rate, and the interrupt rates. The results are significant because they indicate that it may not be useful to push a system close to its performance limits (the previously accepted operating goal) since what we gain in slightly improved performance is more than offset by the degradation in reliability. Importantly, they also indicate that conventional reliability models need to be reevaluated so as to take system workload explicitly into account.

Journal ArticleDOI
TL;DR: Heuristic and optimal solution procedures are developed and computational experience with these procedures is reported and Implications of the model for designing distributed systems are discussed.
Abstract: Design of distributed computer systems is a complex task requiring solutions for several difficult problems. Location of computing resources and databases in a wide-area network is one of these problems which has not yet been solved satisfactorily. Solution of this problem involves determining number and size of computer facilities and their locations, configuring databases and allocating these databases among computer facilities. An integer programming formulation of the problem is presented. Heuristic and optimal solution procedures are developed and computational experience with these procedures is reported. Implications of the model for designing distributed systems are discussed.

Journal ArticleDOI
Mukaidono
TL;DR: This correspondence describes the fundamental properties and representations of the regular ternary logic functions.
Abstract: A special group of ternary functions, called regular ternary logic functions, are defined. These functions are useful in switching theory, programming languages, algorithm theory, and many other fields—if we are concerned with the indefinite state in such fields. This correspondence describes the fundamental properties and representations of the regular ternary logic functions.

Journal ArticleDOI
Hagmann
TL;DR: This correspondence presents a method of performing crash recovery for database systems designed to provide fast transaction processing, to effectively use multiple processors, and to perform a fast restart after a crash.
Abstract: Database systems normally have been designed with the assumption that main memory is too small to hold the entire database. With the decreasing cost and increasing performance of semiconductor memories, future database systems may be constructed that keep most or all of the data for the database in main memory. The challenge in the design of these systems is to provide fast transaction processing, to effectively use multiple processors, and to perform a fast restart after a crash. This correspondence presents a method of performing crash recovery for these systems.

Journal ArticleDOI
Thomasian
TL;DR: An efficient algorithm is presented to determine the mean completion time and related performance measures for a task system: a set of tasks with precedence relationships in their execution sequence, such that the resulting graph is acyclic.
Abstract: This paper is concerned with the performance evaluation of a realistic model of parallel computations. We present an efficient algorithm to determine the mean completion time and related performance measures for a task system: a set of tasks with precedence relationships in their execution sequence, such that the resulting graph is acyclic. A queueing network (QN) is used to model tasks executing on a single or multicomputer system. In the case of multicomputer systems, we take into account the delay due to interprocess communication. A straightforward application of a QN solver to the problem is not possible due to variations in the state of the system (composition of tasks in execution). An accurate algorithm based on hierarchical decomposition is presented for solving task systems. At the higher level, the system behavior is specified by a Markov chain whose states correspond to the combination of tasks in execution. The state transition rate matrix for the Markov chain is triangular (since the task system graph was assumed to be acyclic) and can therefore be solved efficiently to compute the state probabilities and the task initiation/completion times. At the lower level, the transition rates among the states of the Markov chain are computed using a QN solver, which determines the throughput of the computer system for each system state. The model and the solution method can be used in performance evaluation of applications exhibiting concurrency in centralized/distributed systems where there are conflicting goals of load balancing and minimizing interprocess communication overhead.

Journal ArticleDOI
TL;DR: In this correspondence, anomalies of parallel branch-and-bound algorithms using the same search strategy as the corresponding serial algorithms are studied and sufficient conditions to guarantee no degradation in performance and necessary conditions for allowing parallelism to have a speedup greater than the number of processors are presented.
Abstract: A general technique that can be used to solve a wide variety of discrete optimization problems is the branch-and-bound algorithm. We have adapted and extended branch-and-bound algorithms for parallel processing. The computational efficiency of these algorithms depends on the allowance function, the data structure, and the search strategies. Anomalies owing to parallelism may occur. In this correspondence, anomalies of parallel branch-and-bound algorithms using the same search strategy as the corresponding serial algorithms are studied. Sufficient conditions to guarantee no degradation in performance due to parallelism and necessary conditions for allowing parallelism to have a speedup greater than the number of processors are presented.

Journal ArticleDOI
Iyer, Donatiello, Heidelberger
TL;DR: For large mission times and Markovian models, it is shown that known limit theorems lead to an asymptotic normal distribution for the aggregate reward.
Abstract: Performability, a composite measure of performance and reliability, may be interpreted as the probability density function of the aggregate reward obtained from a system during its mission time. For large mission times we show that known limit theorems lead to an asymptotic normal distribution for the aggregate reward. For finite mission times and Markovian models we obtain the expressions for all moments of performability and give recursions to compute coefficients involved in the expressions. We illustrate the use of the results through an example of a multiple processor computer system.

Journal ArticleDOI
TL;DR: It is shown that four complete problems for P (nonsparse versions of unification, path system accessibility, monotone circuit value, and ordered depth-first search) are parallelizable.
Abstract: Previous theoretical work in computational complexity has suggested that any problem which is log-space complete for P is not likely in NC, and thus not parallelizable. In practice, this is not the case. To resolve this paradox, we introduce new complexity classes PC and PC* that capture the practical notion of parallelizability we discuss in this paper. We show that four complete problems for P (nonsparse versions of unification, path system accessibility, monotone circuit value, and ordered depth-first search) are parallelizable. That is, their running times are O(E + V) on a sequential RAM and O(E/P + V log P) on an EXCLUSIVE-READ EXCLUSIVE-WRITE Parallel RAM with P processors where V and E are the numbers of vertices and edges in the input instance of the problem. These problems are in PC and PC*, since an appropriate choice of P can speed up their sequential running times by a factor of Θ(P). Several interesting open questions are raised regarding these new parallel complexity classes PC and PC*. Unification is particularly important because it is a basic operation in theorem proving, in type inference algorithms, and in logic programming languages such as Prolog. A fast parallel implementation of Prolog is needed for software development in the Fifth Generation project.

Journal ArticleDOI
Manoj Kumar, J. R. Jump
TL;DR: Lower and upper bounds on the solution of this recurrence relation are derived and one possible implementation for crossbar switches that could be used in unbuffered delta networks is discussed.
Abstract: The throughput of unbuffered shuffle-exchange networks (also known as delta networks) is related to the arrival rate by a quadratic recurrence relation. Lower and upper bounds on the solution of this recurrence relation are derived in this paper. Two approaches for improving the throughput of unbuffered delta networks are investigated. The first approach combines multiple delta subnetworks of size N × N each, in parallel, to obtain a network of size N × N. Three policies used to distribute the incoming packets between the subnetworks are discussed and the relative effect of each on the throughput is compared. The second approach replaces each link of the simple delta networks by K parallel links (K = 2, 4, ...). The throughput of such networks is analyzed and one possible implementation for crossbar switches that could be used in these networks is discussed. The throughput of such networks with four parallel links is almost equal to the throughput of crossbars.
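For 2 × 2 crossbars, the quadratic recurrence referred to is the classic per-stage load equation p' = 1 - (1 - p/2)^2: a switch output is busy unless both inputs fail to request it. The sketch below iterates the standard recurrence and is an illustration only, not the paper's bounds.

```python
def delta_throughput(p, stages):
    """Probability that an output of an n-stage unbuffered delta network
    built from 2x2 crossbars carries a packet in a cycle, given that
    each network input carries one with probability p (uniform traffic,
    independent requests assumed)."""
    for _ in range(stages):
        p = 1.0 - (1.0 - p / 2.0) ** 2
    return p
```

Iterating shows the throughput degradation the bounds in the paper quantify: at full load the per-link rate drops from 1.0 to 0.75 after one stage and keeps falling with network depth.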

Journal ArticleDOI
TL;DR: A faster algorithm for finding a maximum independent set in a graph is presented and is an improved version of the one by Tarjan and Trojanowski.
Abstract: A faster algorithm for finding a maximum independent set in a graph is presented. The algorithm is an improved version of the one by Tarjan and Trojanowski [7]. A technique to further accelerate this algorithm is also described.
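The flavour of such algorithms is exhaustive branching on a vertex: either it is excluded, or it is included and its neighbours discarded. The naive recursion below shows that shape only; it lacks the careful case analysis that gives Tarjan-Trojanowski and this paper their improved exponential bounds.

```python
def max_independent_set(adj):
    """Maximum independent set by simple two-way branching.
    adj: dict vertex -> set of neighbours (undirected graph).
    Exponential time; a sketch of the branching idea, not the
    paper's algorithm."""
    def mis(vertices):
        if not vertices:
            return set()
        # Branch on a highest-degree vertex to prune faster.
        v = max(vertices, key=lambda u: len(adj[u] & vertices))
        without_v = mis(vertices - {v})            # v excluded
        with_v = {v} | mis(vertices - {v} - adj[v])  # v included
        return with_v if len(with_v) > len(without_v) else without_v
    return mis(set(adj))
```

The improved algorithms shrink the branching tree by treating low-degree vertices with special cases; the recursion skeleton stays the same.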

Journal ArticleDOI
Rich
TL;DR: Techniques of storing multiple bits of information in a single memory location are reviewed and the peripheral circuitry required to distinguish between the states stored in the memory areas is discussed.
Abstract: Techniques of storing multiple bits of information in a single memory location are reviewed. Any of several states can be stored in ROM's by adjusting the threshold voltage or the size of a particular memory device. In dynamic RAM's, this can be achieved by varying the charge stored on the cell capacitor. The peripheral circuitry required to distinguish between the states stored in the memory areas is discussed.

Journal ArticleDOI
TL;DR: This paper presents a graph-theoretic model for determining upper and lower bounds on the number of checks needed for achieving concurrent fault detection and location, and estimates the overhead in time and the number of processors required for such a scheme.
Abstract: An important consideration in the design of high-performance multiple processor systems should be ensuring the correctness of results computed by such complex systems, which are extremely prone to transient and intermittent failures. The detection and location of faults and errors concurrently with normal system operation can be achieved through the application of appropriate on-line checks on the results of the computations. This is the domain of algorithm-based fault tolerance, which deals with low-cost system-level fault-tolerance techniques to produce reliable computations in multiple processor systems, by tailoring the fault-tolerance techniques toward specific algorithms. This paper presents a graph-theoretic model for determining upper and lower bounds on the number of checks needed for achieving concurrent fault detection and location. The objective is to estimate the overhead in time and the number of processors required for such a scheme. Faults in processors, errors in the data, and checks on the data to detect and locate errors are represented as a tripartite graph. Bounds on the time and processor overhead are obtained by considering a series of subproblems. First, using some crude concepts for t-fault detection and t-fault location, bounds on the maximum size of the error patterns that can arise from such fault patterns are obtained. Using these results, bounds are derived on the number of checks required for error detection and location. Some numerical results are derived from a linear programming formulation.
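A concrete instance of such data checks is the checksum-matrix scheme of Huang and Abraham, commonly used to motivate algorithm-based fault tolerance. The sketch below illustrates the kind of check the model counts (not this paper's graph-theoretic construction): append a column-checksum row to A and a row-checksum column to B, and the checksums of the product locate a single erroneous entry.

```python
def matmul_with_checksum(A, B):
    """Multiply square matrices A and B after augmenting them with
    checksums (Huang-Abraham style): the last row of the result holds
    column sums and the last column holds row sums."""
    n = len(A)
    Ac = A + [[sum(col) for col in zip(*A)]]      # add column-checksum row
    Br = [row + [sum(row)] for row in B]          # add row-checksum column
    return [[sum(Ac[i][k] * Br[k][j] for k in range(n))
             for j in range(n + 1)] for i in range(n + 1)]

def check(C):
    """Return (row, col) of a detected inconsistency, or None if the
    row and column checksums of the product all agree."""
    n = len(C) - 1
    bad_row = next((i for i in range(n)
                    if abs(sum(C[i][:n]) - C[i][n]) > 1e-9), None)
    bad_col = next((j for j in range(n)
                    if abs(sum(C[i][j] for i in range(n)) - C[n][j]) > 1e-9),
                   None)
    return None if bad_row is None and bad_col is None else (bad_row, bad_col)
```

A single corrupted entry fails exactly one row check and one column check, so their intersection both detects and locates the error, the detection/location distinction the paper's bounds are about.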

Journal ArticleDOI
TL;DR: A simulation study of the CRAY X-MP interleaved memory system with attention focused on steady state performance for sequences of vector operations, identifying the occurrence of linked conflicts, repeating sequences of conflicts between two or more vector streams that result in reduced steadyState performance.
Abstract: One of the significant differences between the CRAY X-MP and its predecessor, the CRAY-1S, is a considerably increased memory bandwidth for vector operations. Up to three vector streams in each of the two processors may be active simultaneously. These streams contend for memory banks as well as data paths. All memory conflicts are resolved dynamically by the memory system. This paper describes a simulation study of the CRAY X-MP interleaved memory system with attention focused on steady state performance for sequences of vector operations. Because it is more amenable to analysis, we first study the interaction of vector streams issued from a single processor. We identify the occurrence of linked conflicts, repeating sequences of conflicts between two or more vector streams that result in reduced steady state performance. Both worst case and average case performance measures are given. The discussion then turns to dual processor interactions. Finally, based on our simulations, possible modifications to the CRAY X-MP memory system are proposed and compared. These modifications are intended to eliminate or reduce the effects of linked conflicts.
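A back-of-the-envelope view of why access patterns matter in such an interleaved memory (a textbook observation, not the paper's simulator): a vector stream with a given address stride touches only num_banks / gcd(stride, num_banks) distinct banks, so strides sharing a large factor with the bank count concentrate references on few banks and stall on bank busy time.

```python
from math import gcd

def banks_touched(stride, num_banks):
    """Number of distinct banks visited by an arbitrarily long vector
    stream of the given stride in a num_banks-way interleaved memory."""
    return num_banks // gcd(stride, num_banks)
```

With 16 banks, stride 1 cycles through all 16 banks while stride 8 hammers just 2; odd strides always touch all 16, which is why they behave well even before conflicts between multiple streams (the linked conflicts studied above) are considered.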

Journal ArticleDOI
Aggarwal
TL;DR: The problem of finding the maximum of a set of values stored one per processor on a two-dimensional array of processors with a time-shared global bus is considered and the algorithm given by Bokhari is shown to be optimal, within a multiplicative constant, for this network and for other d-dimensional arrays.
Abstract: The problem of finding the maximum of a set of values stored one per processor on a two-dimensional array of processors with a time-shared global bus is considered. The algorithm given by Bokhari is shown to be optimal, within a multiplicative constant, for this network and for other d-dimensional arrays. We generalize this model and demonstrate optimal bounds for finding the maximum of a set of values stored in a d-dimensional array with k time-shared global buses.

Journal ArticleDOI
TL;DR: An algorithm for merging k sorted lists of n/k elements using k processors is presented, and its worst case complexity is proved to be 2n, regardless of the number of processors, while neglecting the cost arising from possible conflicts on the broadcast channel.
Abstract: The paper addresses ways in which one can use "broadcast communication" in distributed algorithms and the relevant issues of design and complexity. We present an algorithm for merging k sorted lists of n/k elements using k processors and prove its worst case complexity to be 2n, regardless of the number of processors, while neglecting the cost arising from possible conflicts on the broadcast channel. We also show that this algorithm is optimal under single-channel broadcast communication. In a variation of the algorithm, we show that by using an extra local memory of O(k) the number of broadcasts is reduced to n. When the algorithm is used for sorting n elements with k processors, where each processor first sorts its own list and the lists are then merged, it has a complexity of O(n/k log(n/k) + n), and is thus asymptotically optimal for large n. We also discuss the cost incurred by the channel access scheme and prove that resolving conflicts whenever k processors are involved introduces a cost factor of at least log k.
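In a shared-memory setting the same k-way merge can be sketched with a heap standing in for the broadcast channel: at each step the globally smallest head element is selected and "broadcast" (appended to the output). This illustrates k-way merging generally, not the paper's broadcast algorithm or its conflict analysis.

```python
import heapq

def kway_merge(lists):
    """Merge k sorted lists into one sorted list using a min-heap of the
    current head of each list. Each of the n total elements is pushed
    and popped once, for O(n log k) work."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)   # globally smallest head
        out.append(val)
        if j + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out
```

The paper's 2n bound counts broadcasts rather than comparisons, but the structure is the same: each element is announced exactly once as it leaves the front of its list.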