
Showing papers in "IEEE Transactions on Computers in 1993"


Journal ArticleDOI
TL;DR: An object recognition system based on the dynamic link architecture, an extension to classical artificial neural networks (ANNs), is presented and the implementation on a transputer network achieved recognition of human faces and office objects from gray-level camera images.
Abstract: An object recognition system based on the dynamic link architecture, an extension to classical artificial neural networks (ANNs), is presented. The dynamic link architecture exploits correlations in the fine-scale temporal structure of cellular signals to group neurons dynamically into higher-order entities. These entities represent a rich structure and can code for high-level objects. To demonstrate the capabilities of the dynamic link architecture, a program was implemented that can recognize human faces and other objects from video images. Memorized objects are represented by sparse graphs, whose vertices are labeled by a multiresolution description in terms of a local power spectrum, and whose edges are labeled by geometrical distance vectors. Object recognition can be formulated as elastic graph matching, which is performed here by stochastic optimization of a matching cost function. The implementation on a transputer network achieved recognition of human faces and office objects from gray-level camera images. The performance of the program is evaluated by a statistical analysis of recognition results from a portrait gallery comprising images of 87 persons.

1,973 citations


Journal ArticleDOI
TL;DR: It turns out that SWNs allow the representation of any color function in a structured form, so that any unconstrained high-level net can be transformed into a well-formed net.
Abstract: The class of stochastic well-formed colored nets (SWNs) was defined as a syntactic restriction of stochastic high-level nets. The interest of introducing restrictions in the model definition is the possibility of exploiting the symbolic reachability graph (SRG) to reduce the complexity of Markovian performance evaluation with respect to classical Petri net techniques. It turns out that SWNs allow the representation of any color function in a structured form, so that any unconstrained high-level net can be transformed into a well-formed net. Moreover, most constructs useful for the modeling of distributed computer systems and architectures directly match the "well-formed" restriction, without any need for transformation. A nontrivial example of the usefulness of the technique in the performance modeling and evaluation of multiprocessor architectures is included.

340 citations


Journal ArticleDOI
TL;DR: The author introduces a scheme to generate carry bits with block-carry-in 1 from the carries of a block with block-carry-in 0 to derive a more area-efficient implementation for both the carry-select and parallel-prefix adders.
Abstract: The carry-select and conditional-sum adders require carry-chain evaluations in each block for both values of the block carry-in, 0 and 1. The author introduces a scheme to generate the carry bits for block-carry-in 1 from the carries of a block with block-carry-in 0. This scheme is then applied to carry-select and parallel-prefix adders to derive a more area-efficient implementation in both cases. The proposed carry-select scheme is assessed relative to carry-ripple, classical carry-select, and carry-skip adders. The analytic evaluation is done with respect to the gate-count model for area and gate-delay units for time.

263 citations
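
The identity behind the scheme can be sketched in a few lines (my own illustration; the function name is hypothetical, not from the paper): for a block with propagate bits p[i] and generate bits g[i], the carries for block-carry-in 1 follow from the carry-in-0 carries as c1[i] = c0[i] OR (p[0] AND ... AND p[i-1]), so the second carry chain of a classical carry-select block is unnecessary.

```python
def block_carries(a_bits, b_bits):
    # Carries of one adder block, LSB first. The carry-in-0 chain is
    # rippled normally; the carry-in-1 carries are then derived as
    # c1[i] = c0[i] | (p[0] & ... & p[i-1]) -- no second carry chain.
    n = len(a_bits)
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate
    c0 = [0] * (n + 1)
    for i in range(n):
        c0[i + 1] = g[i] | (p[i] & c0[i])
    c1 = [1] * (n + 1)
    prefix = 1
    for i in range(n):
        prefix &= p[i]
        c1[i + 1] = c0[i + 1] | prefix
    return c0, c1
```

In hardware the prefix-AND of propagates is far cheaper than a full second carry chain, which is where the area saving comes from.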


Journal ArticleDOI
TL;DR: Algorithms that are efficient for solving a variety of problems involving graphs and digitized images are introduced that are asymptotically superior to those previously obtained for the mesh, the mesh with multiple broadcasting, the mesh with multiple buses, the mesh-of-trees, and the pyramid computer.
Abstract: The mesh with reconfigurable bus is presented as a model of computation. The reconfigurable mesh captures salient features from a variety of sources, including the CAAPP, CHiP, polymorphic-torus network, and bus automaton. It consists of an array of processors interconnected by a reconfigurable bus system that can be used to dynamically obtain various interconnection patterns between the processors. A variety of fundamental data-movement operations for the reconfigurable mesh are introduced. Based on these operations, algorithms that are efficient for solving a variety of problems involving graphs and digitized images are also introduced. The algorithms are asymptotically superior to those previously obtained for the aforementioned reconfigurable architectures, as well as to those previously obtained for the mesh, the mesh with multiple broadcasting, the mesh with multiple buses, the mesh-of-trees, and the pyramid computer. The power of reconfigurability is illustrated by solving some problems, such as the exclusive OR, more efficiently on the reconfigurable mesh than is possible on the parallel random-access machine (PRAM).

261 citations


Journal ArticleDOI
TL;DR: A theoretical analysis of error due to finite precision computation was undertaken to determine the necessary precision for successful forward retrieving and back-propagation learning in a multilayer perceptron.
Abstract: Through parallel processing, low precision fixed point hardware can be used to build a very high speed neural network computing engine where the low precision results in a drastic reduction in system cost. The reduced silicon area required to implement a single processing unit is taken advantage of by implementing multiple processing units on a single piece of silicon and operating them in parallel. The important question which arises is how much precision is required to implement neural network algorithms on this low precision hardware. A theoretical analysis of error due to finite precision computation was undertaken to determine the necessary precision for successful forward retrieving and back-propagation learning in a multilayer perceptron. This analysis can easily be further extended to provide a general finite precision analysis technique by which most neural network algorithms under any set of hardware constraints may be evaluated.

245 citations
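
The precision question can also be explored empirically. The toy sketch below is my own illustration, not the paper's analysis: it quantizes every operand and every sum of a tiny forward pass to fixed point and measures how the output deviation shrinks as fractional bits are added.

```python
import math
import random

def quantize(v, frac_bits):
    # Round to the nearest multiple of 2^-frac_bits (fixed point).
    scale = 1 << frac_bits
    return round(v * scale) / scale

def forward(x, w1, w2, frac_bits=None):
    # One-hidden-layer forward pass; with frac_bits set, every operand
    # and every accumulated sum is quantized, mimicking low-precision
    # fixed-point hardware.
    q = (lambda v: quantize(v, frac_bits)) if frac_bits is not None else (lambda v: v)
    h = [math.tanh(q(sum(q(xi) * q(w) for xi, w in zip(x, col)))) for col in w1]
    return [math.tanh(q(sum(q(hi) * q(w) for hi, w in zip(h, col)))) for col in w2]

def max_error(frac_bits, seed=7):
    # Largest output deviation from the full-precision forward pass
    # for one random input and random weights.
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(4)]
    w1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
    w2 = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
    exact = forward(x, w1, w2)
    quant = forward(x, w1, w2, frac_bits)
    return max(abs(a - b) for a, b in zip(exact, quant))
```

A theoretical analysis like the paper's predicts where this curve flattens, i.e. the fewest bits that still give correct retrieval and learning.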


Journal ArticleDOI
TL;DR: Hardware is described for implementing the fast modular multiplication algorithm developed by P.L. Montgomery (1985), showing that this algorithm is up to twice as fast as the best currently available and is more suitable for alternative architectures.
Abstract: Hardware is described for implementing the fast modular multiplication algorithm developed by P.L. Montgomery (1985). Comparison with previous techniques shows that this algorithm is up to twice as fast as the best currently available and is more suitable for alternative architectures. The gain in speed arises from the faster clock that results from simpler combinational logic.

238 citations
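
For reference, a bit-serial software sketch of Montgomery's method (assumptions: odd modulus n, radix R = 2^k, inputs below n; the function name is mine). The point is that the reduction uses only shifts and additions, never a division by n:

```python
def montgomery_multiply(a, b, n, k):
    # Bit-serial Montgomery multiplication: returns a * b * 2^-k mod n.
    # Requires n odd and a, b < n; R = 2^k is the Montgomery radix.
    acc = 0
    for i in range(k):
        acc += ((a >> i) & 1) * b   # add in the next partial product
        if acc & 1:                 # make the sum even ...
            acc += n
        acc >>= 1                   # ... so halving is exact mod n
    return acc - n if acc >= n else acc
```

Ordinary residues are first mapped to Montgomery form (multiplied by R mod n); long chains of multiplications, as in RSA exponentiation, amortize that conversion, which is why the technique suits the systolic design in the next paper as well.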


Journal ArticleDOI
TL;DR: A systolic array for modular multiplication is presented, based on the ideally suited algorithm of P.L. Montgomery (1985); its main use would be where many consecutive multiplications are done, as in RSA cryptosystems.
Abstract: A systolic array for modular multiplication is presented using the ideally suited algorithm of P.L. Montgomery (1985). Throughput is one modular multiplication every clock cycle, with a latency of 2n+2 cycles for multiplicands having n digits. Its main use would be where many consecutive multiplications are done, as in RSA cryptosystems.

229 citations


Journal ArticleDOI
TL;DR: A dependability evaluation method based on fault injection is described; it establishes the link between the experimental evaluation of the fault tolerance process and the fault occurrence process, and the interactions between the two processes are analyzed.
Abstract: The authors describe a dependability evaluation method based on fault injection that establishes the link between the experimental evaluation of the fault tolerance process and the fault occurrence process. The main characteristics of a fault injection test sequence aimed at evaluating the coverage of the fault tolerance process are presented. Emphasis is given to the derivation of experimental measures. The various steps by which the fault occurrence and fault tolerance processes are combined to evaluate dependability measures are identified and their interactions are analyzed. The method is illustrated by an application to the dependability evaluation of the distributed fault-tolerant architecture of the Esprit Delta-4 Project.

227 citations



Journal ArticleDOI
TL;DR: After a brief survey of the CORDIC algorithm, some new results are given that allow fast and easy signed-digit implementation of CORDIC without modifying the basic iteration step.
Abstract: After a brief survey of the CORDIC algorithm, some new results that allow fast and easy signed-digit implementation of CORDIC, without modifying the basic iteration step, are given. A slight modification would make it possible to use a carry-save representation of numbers, instead of a signed-digit one. The method, called the branching CORDIC method, consists of performing in parallel two classic CORDIC rotations. It gives a constant normalization factor. An online implementation of the algorithm is proposed with an online delay equal to 5 for the sine and cosine functions.

171 citations
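
For readers unfamiliar with the basic iteration that branching CORDIC (and the later CORDIC papers in this issue) builds on, here is a plain rotation-mode sketch. Double-precision floats stand in for the fixed-point shift-and-add hardware; convergence requires |theta| ≤ Σ atan(2^-i) ≈ 1.743 rad.

```python
import math

def cordic_sin_cos(theta, iterations=40):
    # Rotation-mode circular CORDIC: each step rotates by +-atan(2^-i)
    # using only "shifts" (multiplications by 2^-i) and adds; the
    # constant scale factor K is folded into the initial x.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    k = 1.0
    for i in range(iterations):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = k, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0   # steer the residual angle to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y   # (cos theta, sin theta)
```

Because each d is +1 or -1, the product of the per-step scale factors is constant, which is the property the branching method preserves while choosing digits redundantly.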


Journal ArticleDOI
TL;DR: The design of a modular standard basis inversion for Galois fields GF(2^m) based on Euclid's algorithm for computing the greatest common divisor of two polynomials is presented, resulting in an AT-complexity of O(m^2).
Abstract: The design of a modular standard basis inversion for Galois fields GF(2^m) based on Euclid's algorithm for computing the greatest common divisor of two polynomials is presented. The asymptotic complexity is linear with m in both computation time and area requirement, thus resulting in an AT-complexity of O(m^2). This is a significant improvement over the best previous proposal, which only achieves an AT-complexity of O(m^3).
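
A software sketch of the underlying idea, inversion in GF(2^m) via the extended Euclidean algorithm over GF(2) (polynomials encoded as bit masks; function names are mine, and the hardware in the paper is of course far more structured than this):

```python
def gf2_poly_divmod(a, b):
    # Divide polynomial a by b over GF(2); polynomials are bit masks,
    # bit i representing x^i. Returns (quotient, remainder).
    q = 0
    db = b.bit_length()
    while a.bit_length() >= db:
        shift = a.bit_length() - db
        q ^= 1 << shift
        a ^= b << shift
    return q, a

def gf2m_inverse(a, modulus):
    # Inverse of a != 0 in GF(2^m) = GF(2)[x]/(modulus), modulus
    # irreducible. Maintains the invariant s1 * a == r1 (mod modulus),
    # so when r1 reaches 1, s1 is the inverse.
    r0, r1 = modulus, a
    s0, s1 = 0, 1
    while r1 != 1:
        q, r = gf2_poly_divmod(r0, r1)
        r0, r1 = r1, r
        prod = 0                      # carry-less product q * s1
        qq, ss = q, s1
        while qq:
            if qq & 1:
                prod ^= ss
            qq >>= 1
            ss <<= 1
        s0, s1 = s1, s0 ^ prod        # GF(2): subtraction is XOR
    return s1
```

With the AES field GF(2^8) modulo x^8+x^4+x^3+x+1 (0x11B), for example, the inverse of 0x53 is 0xCA, the classic S-box worked example.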

Journal ArticleDOI
TL;DR: An optical communication structure is proposed for multiprocessor arrays that exploits the high communication bandwidth of optical waveguides; time-division multiplexing of messages has the same effect as message pipelining on optical waveguides.
Abstract: An optical communication structure is proposed for multiprocessor arrays which exploits the high communication bandwidth of optical waveguides. The structure takes advantage of two properties of optical signal transmissions on waveguides, namely, unidirectional propagation and predictable propagation delays per unit length. Because of these two properties, time-division multiplexing (TDM) of messages has the same effect as message pipelining on optical waveguides. Two TDM approaches are proposed, and the combination of the two is used in the design of the optical communication structure. Analysis and simulation results are given to demonstrate the communication effectiveness of the system. A clock distribution method is proposed to address potential synchronization problems. Feasibility issues with current and future technologies are discussed.

Journal ArticleDOI
TL;DR: By removing the redundancy, a modified parallel multiplier is presented which is modular and has a lower circuit complexity.
Abstract: A Massey-Omura parallel multiplier of finite fields GF(2^m) contains m identical blocks whose inputs are cyclically shifted versions of one another. It is shown that for fields GF(2^m) generated by irreducible all one polynomials, a portion of the block is independent of the input cyclic shift; hence, the multiplier contains redundancy. By removing the redundancy, a modified parallel multiplier is presented which is modular and has a lower circuit complexity.

Journal ArticleDOI
TL;DR: It is shown that this method has better performance in terms of minimizing the number of classification errors than the squared error minimization method used in backpropagation.
Abstract: A pattern classification method called neural tree networks (NTNs) is presented. The NTN consists of neural networks connected in a tree architecture. The neural networks are used to recursively partition the feature space into subregions. Each terminal subregion is assigned a class label which depends on the training data routed to it by the neural networks. The NTN is grown by a learning algorithm, as opposed to multilayer perceptrons (MLPs), where the architecture must be specified before learning can begin. A heuristic learning algorithm based on minimizing the L1 norm of the error is used to grow the NTN. It is shown that this method has better performance in terms of minimizing the number of classification errors than the squared error minimization method used in backpropagation. An optimal pruning algorithm is given to enhance the generalization of the NTN. Simulation results are presented on Boolean function learning tasks and a speaker-independent vowel recognition task. The NTN compares favorably to both neural networks and decision trees.

Journal ArticleDOI
TL;DR: A branch target buffer (BTB) can reduce the performance penalty of branches in pipelined processors by predicting the path of the branch and caching information used by the branch; the design question addressed is how to achieve maximum performance with a limited number of bits allocated to the BTB implementation.
Abstract: A branch target buffer (BTB) can reduce the performance penalty of branches in pipelined processors by predicting the path of the branch and caching information used by the branch. Two major issues in the design of BTBs that achieve maximum performance with a limited number of bits allocated to the BTB implementation are discussed. The first is BTB management. A method for discarding branches from the BTB is examined. This method discards the branch with the smallest expected value for improving performance; it outperforms the least recently used (LRU) strategy by a small margin, at the cost of additional complexity. The second issue is the question of what information to store in the BTB. A BTB entry can consist of one or more of the following: branch tag, prediction information, the branch target address, and instructions at the branch target. Various BTB designs, with one or more of these fields, are evaluated and compared.
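
A minimal BTB model makes the entry fields concrete. This is my own sketch (direct-mapped, a 2-bit counter per entry), not the paper's design: the expected-value replacement policy and target-instruction caching are not modeled.

```python
class BranchTargetBuffer:
    # Direct-mapped BTB: each entry stores (tag, 2-bit counter, target).
    # predict() returns the cached target on a taken-predicted hit,
    # else None (fall through / predict not taken).
    def __init__(self, entries=16):
        self.size = entries
        self.table = [None] * entries

    def predict(self, pc):
        entry = self.table[pc % self.size]
        if entry is None or entry[0] != pc // self.size:
            return None               # miss: no information cached
        tag, counter, target = entry
        return target if counter >= 2 else None

    def update(self, pc, taken, target):
        idx, tag = pc % self.size, pc // self.size
        entry = self.table[idx]
        if entry is None or entry[0] != tag:
            # Allocate, biasing the counter toward the observed outcome.
            self.table[idx] = (tag, 2 if taken else 1, target)
        else:
            counter = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)
            self.table[idx] = (tag, counter, target)
```

Each field in the tuple corresponds to one of the bit budgets the paper trades off: the tag, the prediction information, and the target address.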

Journal ArticleDOI
TL;DR: A greedy algorithm which takes only O(n^2) operations is developed to perform CORDIC angle recoding, and it is proven that this algorithm is able to reduce the total number of required elementary rotation angles by at least 50% without affecting the computational accuracy.
Abstract: The coordinate rotation digital computer (CORDIC), an iterative arithmetic algorithm for computing generalized vector rotations without performing multiplications, is discussed. For applications where the angle of rotation is known in advance, a method to speed up the execution of the CORDIC algorithm by reducing the total number of iterations is presented. This is accomplished by using a technique called angle recoding, which encodes the desired rotation angle as a linear combination of very few elementary rotation angles. Each of these elementary rotation angles takes one CORDIC iteration to compute. The fewer the elementary rotation angles, the fewer iterations are required. A greedy algorithm which takes only O(n^2) operations is developed to perform CORDIC angle recoding. It is proven that this algorithm is able to reduce the total number of required elementary rotation angles by at least 50% without affecting the computational accuracy.
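
The greedy idea can be sketched as follows (my own simplified illustration of angle recoding, not the paper's exact algorithm): repeatedly subtract the signed elementary angle ±atan(2^-i) closest to the residual until the residual falls below the precision target.

```python
import math

def angle_recoding(theta, n_bits=16):
    # Greedy angle recoding sketch: represent theta as a signed
    # combination of few elementary angles atan(2^-i), instead of
    # using every angle exactly once as in classic CORDIC.
    elementary = [math.atan(2.0 ** -i) for i in range(n_bits)]
    terms = []                 # list of (sign, i) pairs
    residual = theta
    tol = 2.0 ** -n_bits
    while abs(residual) > tol:
        # Pick the elementary angle closest to the residual magnitude.
        i = min(range(n_bits),
                key=lambda j: abs(abs(residual) - elementary[j]))
        sign = 1 if residual >= 0 else -1
        terms.append((sign, i))
        residual -= sign * elementary[i]
    return terms
```

Unlike classic CORDIC, the set of rotations now depends on theta, so the scale factor must be computed from the chosen terms; that is acceptable precisely because the angle is known in advance.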

Journal ArticleDOI
TL;DR: The authors give an online algorithm for computing a canonical signed digit representation of minimal Hamming weight for any integer n and show that E(K_r) ≈ (r-1)k/(r+1) as k → ∞.
Abstract: The authors give an online algorithm for computing a canonical signed digit representation of minimal Hamming weight for any integer n. Using combinatorial techniques, the probability distributions Pr(K_r = h), where K_r is taken to be a random variable on the uniform probability space of k-digit integers, are computed. Also, using a Markov chain analysis, it is shown that E(K_r) ≈ (r-1)k/(r+1) as k → ∞.
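
For r = 2 this canonical representation is the familiar non-adjacent form (NAF), computable online from the least significant digit; a short sketch (the function name is mine):

```python
def naf(n):
    # Non-adjacent form of n > 0: signed digits in {-1, 0, 1}, least
    # significant first, no two adjacent nonzero digits, and minimal
    # Hamming weight among all signed binary representations of n.
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d            # makes n divisible by 4: non-adjacency
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits
```

For example 7 = 8 - 1 becomes [-1, 0, 0, 1], weight 2 instead of the binary weight 3; the expected weight ≈ k/3 matches the paper's (r-1)k/(r+1) formula at r = 2.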

Journal ArticleDOI
TL;DR: Two approaches for tackling the numerical accuracy problem of fixed-point CORDIC are described and arguments to support the use of such an architecture in certain special-purpose arrays are presented.
Abstract: The coordinate rotation digital computer (CORDIC) algorithm is used in numerous special-purpose systems for real-time signal processing applications. An analysis of fixed-point CORDIC in the Y-reduction mode, which allows computation of the inverse tangent function, shows that unnormalized input values can result in large numerical errors. The authors describe two approaches for tackling the numerical accuracy problem. The first approach builds on a fixed-point CORDIC unit and eliminates the problem by including additional hardware for normalization. A method for integrating the normalization operation with the CORDIC iterations for efficient implementation in O(n^1.5) hardware is provided. The second solution to the accuracy problem is to use a floating-point CORDIC unit but reduce the implementation complexity by using a hybrid architecture. Arguments to support the use of such an architecture in certain special-purpose arrays are presented.

Journal ArticleDOI
TL;DR: An efficient sequential circuit automatic test generation algorithm based on PODEM that uses a nine-valued logic model is presented; it saves both the good and the faulty machine states after finding a test, to aid subsequent test generation.
Abstract: This paper presents an efficient sequential circuit automatic test generation algorithm. The algorithm is based on PODEM and uses a nine-valued logic model. Among the novel features of the algorithm are use of the Initial Timeframe Algorithm and correct implementation of a solution to the Previous State Information Problem. The Initial Timeframe Algorithm, one of the most important aspects of the test generator, determines the number of timeframes required to excite the fault for which a test is to be derived and the number of timeframes required to observe the excited fault. Correct determination of the number of timeframes in which the fault should be excited (activated) and observed saves the test generator from performing unnecessary search in the input space. Test generation is unidirectional, i.e., it is done strictly in forward time, and flip-flops in the initial timeframe are never assigned a state that needs to be justified later. The algorithm saves both the good and the faulty machine states after finding a test to aid in subsequent test generation. The Previous State Information Problem, which has often been ignored by existing test generators, is presented and discussed in the paper. Experimental results are presented to demonstrate the effectiveness of the algorithm.

Journal ArticleDOI
TL;DR: A novel scheme for utilizing the regular structure of three neighborhood additive cellular automata (CAs) for pseudoexhaustive test pattern generation is introduced.
Abstract: A novel scheme for utilizing the regular structure of three neighborhood additive cellular automata (CAs) for pseudoexhaustive test pattern generation is introduced. The vector space generated by a CA can be decomposed into several cyclic subspaces. A cycle corresponding to an m-dimensional cyclic subspace has been shown to pseudoexhaustively test an n-input circuit (n >= m). Such a cycle is shown to supply an (m-1)-bit exhaustive pattern including the all-zeros (m-1)-tuple. Schemes have been reported specifying how one or more subsets of (m-1) cell positions of an n-cell CA can be identified to generate exhaustive patterns in an m-dimensional cyclic subspace.
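
A three-neighborhood additive CA of the kind used here updates each cell to an XOR of its neighborhood; the usual hybrid mixes rule 90 (left XOR right) and rule 150 (left XOR self XOR right) cells. A small sketch (my own illustration, with null boundaries) showing the update and the additivity that makes the vector-space decomposition possible:

```python
def ca_step(state, rules):
    # One step of a three-neighborhood additive CA with null boundaries.
    # rules[i] is 90 (left XOR right) or 150 (left XOR self XOR right).
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        v = left ^ right
        if rules[i] == 150:
            v ^= state[i]
        nxt.append(v)
    return nxt
```

Because every rule is XOR-linear, step(a XOR b) = step(a) XOR step(b); the state transition is a linear map over GF(2), and its invariant cyclic subspaces are exactly the structure the pattern-generation scheme exploits.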

Journal ArticleDOI
TL;DR: A very fast Jacobi-like algorithm for the parallel solution of symmetric eigenvalue problems is proposed; although only linear convergence is obtained for the simplest version of the new algorithm, the overall operation count decreases dramatically.
Abstract: A very fast Jacobi-like algorithm for the parallel solution of symmetric eigenvalue problems is proposed. It becomes possible by not focusing on the realization of the Jacobi rotation with a CORDIC processor, but by applying approximate rotations and adjusting them to single steps of the CORDIC algorithm, i.e., only one angle of the CORDIC angle sequence defines the Jacobi rotation in each step. This angle can be determined by a few shift, add, and compare operations. Although only linear convergence is obtained for the simplest version of the new algorithm, the overall operation count (shifts and adds) decreases dramatically. A slow increase in the number of involved CORDIC angles during the runtime retains quadratic convergence.

Journal ArticleDOI
TL;DR: The addition of a new parameter, the block size, to the two existing parameters of the fault distribution is proposed, which allows the unification of the existing models and, at the same time, adds a whole range of medium-size clustering models.
Abstract: It has been recognized that the yield of fault-tolerant VLSI circuits depends on the size of the fault clusters. Consequently, models for yield analysis have been proposed for large-area clustering and small-area clustering, based on the two-parameter negative-binomial distribution. The addition of a new parameter, the block size, to the two existing parameters of the fault distribution is proposed. This parameter allows the unification of the existing models and, at the same time, adds a whole range of medium-size clustering models. Thus, the flexibility in choosing the appropriate yield model is increased. Methods for estimating the newly defined block size are presented and the approach is validated through simulation and empirical data.

Journal ArticleDOI
TL;DR: Undetectable and redundant faults in synchronous sequential circuits are analyzed and a distinction is drawn between undetectable faults and faults that are never manifested as output errors.
Abstract: Undetectable and redundant faults in synchronous sequential circuits are analyzed. A distinction is drawn between undetectable faults and faults that are never manifested as output errors. The latter are classified as redundant. It is shown that there are faults for which a test sequence does not exist; however, under certain initial conditions (or initial states) of the circuit, faulty behavior may be observed. Such faults are called partially detectable faults. A partially detectable fault is undetectable, but is not redundant, as it affects circuit operation under some conditions. The author observes that the notion of redundancy cannot be separated from the mode of operation of the circuit. Two modes of operation are considered, representative of common modes, called the synchronization mode and the free mode. Accordingly, the identification of redundant faults calls for different test generation strategies. Two test strategies to generate tests for detectable faults and partial tests for partially detectable faults are defined, called the restricted test strategy and the unrestricted test strategy.

Journal ArticleDOI
TL;DR: In the final implementation of the technique the extra modulus has been inserted in the set of moduli of the residue system, avoiding redundancy.
Abstract: A technique for number comparison in the residue number system is presented, and its theoretical validity is proved. The proposed solution is based on using a diagonal function to obtain a magnitude order of the numbers. In a first approach the function is computed using a suitable extra modulus. In the final implementation of the technique the extra modulus has been inserted in the set of moduli of the residue system, avoiding redundancy. The technique is compared with other approaches.
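
To see why comparison is hard in RNS, consider the straightforward baseline the diagonal-function technique is designed to avoid: fully reconstructing both integers by the Chinese remainder theorem and comparing them. A sketch of that baseline (my own illustration, not the paper's method):

```python
from functools import reduce

def rns_to_int(residues, moduli):
    # Chinese-remainder reconstruction of x from its residues modulo
    # the pairwise-coprime moduli. This full reconstruction is exactly
    # the expensive step the diagonal-function technique sidesteps.
    M = reduce(lambda a, b: a * b, moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

def rns_compare(a_res, b_res, moduli):
    # Returns -1, 0, or 1 as a < b, a == b, a > b.
    a, b = rns_to_int(a_res, moduli), rns_to_int(b_res, moduli)
    return (a > b) - (a < b)
```

The residue digits themselves carry no magnitude order, which is why RNS addition and multiplication are cheap but comparison needs either reconstruction or an auxiliary monotone function such as the paper's diagonal function.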

Journal ArticleDOI
TL;DR: The authors optimize the cost of the fault-tolerant architecture by adding exactly k spare processors (while tolerating up to k processor and/or link faults) and minimizing the maximum number of links per processor.
Abstract: This paper presents several techniques for tolerating faults in d-dimensional mesh and hypercube architectures. The approach consists of adding spare processors and communication links so that the resulting architecture will contain a fault-free mesh or hypercube in the presence of faults. The authors optimize the cost of the fault-tolerant architecture by adding exactly k spare processors (while tolerating up to k processor and/or link faults) and minimizing the maximum number of links per processor. For example, when the desired architecture is a d-dimensional mesh and k=1, they present a fault-tolerant architecture that has the same maximum degree as the desired architecture (namely, 2d) and has only one spare processor. They also present efficient layouts for fault-tolerant two- and three-dimensional meshes, and show how multiplexers and buses can be used to reduce the degree of fault-tolerant architectures. Finally, they give constructions for fault-tolerant tori, eight-connected meshes, and hexagonal meshes.

Journal ArticleDOI
TL;DR: It is shown that the diameter of an n-dimensional hypercube can only increase by an additive constant of 1 when (n-1) faulty processors are present and it is proven that all the n-cubes with a fault-diameter of (n+2) are isomorphic.
Abstract: It is shown that the diameter of an n-dimensional hypercube can only increase by an additive constant of 1 when (n-1) faulty processors are present. The concept of forbidden faulty sets guarantees the connectivity of the cube in the presence of up to (2n-3) faulty processors. It is shown that the diameter of the n-cube increases to (n+2) as a result of (2n-3) processor failures. It is also shown that only those nodes whose Hamming distance is (n-2) have the potential to be located at the two ends of the diameter of the damaged cube. It is proven that all the n-cubes with (2n-3) faulty processors and a fault-diameter of (n+2) are isomorphic. A generalization of the subject study is presented.
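
Results of this kind are easy to check by brute force on small cubes. The sketch below (my own, usable only for small n) labels hypercube nodes by integers whose bits are coordinates, removes a set of faulty nodes, and computes the surviving diameter by breadth-first search:

```python
from collections import deque

def damaged_diameter(n, faulty):
    # Diameter of the n-cube with the faulty nodes removed, by BFS from
    # every surviving node. Assumes the surviving graph is connected.
    nodes = [v for v in range(1 << n) if v not in faulty]
    best = 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for i in range(n):            # neighbors differ in one bit
                v = u ^ (1 << i)
                if v not in faulty and v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best
```

For n = 4, removing three neighbors of node 0 (an (n-1)-fault pattern) raises the diameter from 4 to at most 5, consistent with the additive-constant-1 result.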

Journal ArticleDOI
TL;DR: The authors study these graphs by proving the existence of Hamiltonian cycles in any arrangement graph and proving that an arrangement graph contains cycles of all lengths ranging between 3 and the size of the graph.
Abstract: Arrangement graphs have been proposed as an attractive interconnection topology for large multiprocessor systems. The authors study these graphs by proving the existence of Hamiltonian cycles in any arrangement graph. They also prove that an arrangement graph contains cycles of all lengths ranging between 3 and the size of the graph. They show that an arrangement graph can be decomposed into node disjoint cycles in many different ways.

Journal ArticleDOI
TL;DR: Systems that can be modeled as graphs, such that nodes represent the components and the edges represent the fault propagation between the components, are considered, and the problem of detecting multiple faults is shown to be NP-complete.
Abstract: Systems that can be modeled as graphs, such that nodes represent the components and the edges represent the fault propagation between the components, are considered. Some components are equipped with alarms that ring in response to faulty conditions. In these systems, two types of problem are studied: fault diagnosis and alarm placement. The fault diagnosis problems deal with computing the set of all potential failure sources that correspond to a set of ringing alarms. Single faults, where exactly one component can become faulty at any time, are primarily considered. Systems are classified into zero-time and non-zero-time systems on the basis of fault propagation time. The latter are further classified on the basis of knowledge of propagation times. For each of these classes algorithms are presented for single fault diagnosis. The problem of detecting multiple faults is shown to be NP-complete. An alarm placement problem that requires a single fault to be uniquely diagnosed is examined.

Journal ArticleDOI
TL;DR: An accumulator-based compaction (ABC) scheme for parallel compaction of test responses is introduced and it is proven that the asymptotic coverage drop in ABC with binary adders is 2^-k, where k is the number of bits in the adder that the fault can reach.
Abstract: An accumulator-based compaction (ABC) scheme for parallel compaction of test responses is introduced. The asymptotic and transient coverage drop introduced by accumulators with binary and 1's complement adders is studied using Markov chain models. It is proven that the asymptotic coverage drop in ABC with binary adders is 2^-k, where k is the number of bits in the adder that the fault can reach. In ABC with 1's complement adders, the asymptotic coverage drop for a fairly general class of faults is (2^n - 1)^-1, where n is the total number of bits. The analysis of transient behavior relates the coverage drop with the probability of fault injection, the size of the accumulator, and the length of the test experiment. The process is characterized by damping factors derived for various values of these parameters.
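
A toy model makes the 2^-k aliasing figure tangible. This is my own simplified simulation, not the paper's Markov-chain analysis: responses are summed in a k-bit binary accumulator, a single random error is injected, and aliasing occurs only when the substituted word happens to equal the original modulo 2^k.

```python
import random

def compact(responses, k):
    # Accumulator-based compaction with a k-bit binary adder: the test
    # responses are summed modulo 2^k and only the final signature is
    # compared against the fault-free one.
    acc = 0
    for r in responses:
        acc = (acc + r) & ((1 << k) - 1)
    return acc

def aliasing_rate(k, trials=20000, length=100, seed=1):
    # Monte-Carlo estimate of the probability that a stream with one
    # injected error still produces the fault-free signature.
    rng = random.Random(seed)
    alias = 0
    for _ in range(trials):
        good = [rng.randrange(1 << k) for _ in range(length)]
        bad = list(good)
        bad[rng.randrange(length)] = rng.randrange(1 << k)  # inject
        if compact(bad, k) == compact(good, k):
            alias += 1
    return alias / trials
```

For k = 4 the estimate comes out near 1/16, matching the asymptotic 2^-k coverage drop for binary adders.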

Journal ArticleDOI
TL;DR: The authors show that with load forwarding, the three types of code expanding optimizations jointly improve the performance of small caches and have little effect on large caches.
Abstract: Shows that code expanding optimizations have strong and nonintuitive implications on instruction cache design. Three types of code expanding optimizations are studied in this paper: instruction placement, function inline expansion, and superscalar optimizations. Overall, instruction placement reduces the miss ratio of small caches. Function inline expansion improves the performance for small cache sizes, but degrades the performance of medium caches. Superscalar optimizations increase the miss ratio for all cache sizes. However, they also increase the sequentiality of instruction access so that a simple load forwarding scheme effectively cancels the negative effects. Overall, the authors show that with load forwarding, the three types of code expanding optimizations jointly improve the performance of small caches and have little effect on large caches.