
Showing papers on "Degree of parallelism published in 1989"


Book ChapterDOI
01 Nov 1989
TL;DR: It is hypothesized that trends in mass storage technology are making database machines that attempt to exploit a high degree of parallelism to enhance performance an idea whose time has passed.
Abstract: In this paper we take a critical look at the future of the field of database machines. We hypothesize that trends in mass storage technology are making database machines that attempt to exploit a high degree of parallelism to enhance performance an idea whose time has passed.

123 citations


Patent
28 Dec 1989
TL;DR: In this paper, a real-time Robotic Controller and Simulator (RRCS) with an MIMD-SIMD parallel architecture for interfacing with an external host computer provides a high degree of parallelism in computation for robotics control and simulation.
Abstract: A Real-time Robotic Controller and Simulator (RRCS) with an MIMD-SIMD parallel architecture for interfacing with an external host computer provides a high degree of parallelism in computation for robotics control and simulation. A host processor receives instructions from, and transmits answers to, the external host computer. A plurality of SIMD microprocessors, each SIMD processor being an SIMD parallel processor, is capable of exploiting fine-grain parallelism and is able to operate asynchronously to form an MIMD architecture. Each SIMD processor comprises an SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. A system bus connects the host processor to the plurality of SIMD microprocessors and a common clock provides a continuous sequence of clock pulses. A ring structure interconnects the plurality of SIMD microprocessors and is connected to the clock for providing clock pulses to the SIMD microprocessors and provides a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.

72 citations


Journal ArticleDOI
TL;DR: A simple but powerful method for solving the transient stability problem with a high degree of parallelism is implemented and can significantly increase computational efficiency in a parallel-processing environment.
Abstract: A simple but powerful method for solving the transient stability problem with a high degree of parallelism is implemented. The transient stability problem is seen as a coupled set of nonlinear algebraic and differential equations. By applying a discretization method such as the trapezoidal rule, the overall algebraic-differential set of equations is transformed into a unique algebraic problem at each time step. A solution that considers every time step, not in a sequential way, but concurrently, is suggested. The solution of this set of equations with a relaxation-type indirect method gives rise to a highly parallel algorithm. The parallelism consists of a parallelism in space (that is, in the equations at each time step) and a parallelism in time. Another characteristic of the algorithm is that the time step can be changed between iterations using a nested iteration multigrid technique, from a coarse time grid to the desired fine time grid, to enhance the convergence of the algorithm. The method has been tested on various size power systems, for various solution time periods, and various types of disturbances. It is shown that the method has good convergence properties and can significantly increase computational efficiency in a parallel-processing environment.
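
For concreteness, a minimal sketch of the parallel-in-time relaxation on a scalar toy equation (the right-hand side, step sizes, and sweep limit below are illustrative assumptions, not the paper's power-system model): every trapezoidal-rule equation is updated from the previous sweep's values, so all time steps can be relaxed concurrently rather than marched sequentially.

import numpy as np

def f(y):
    return -y  # toy stand-in for the power-system right-hand side

def parallel_in_time_trapezoidal(y0, h, n_steps, sweeps=200, tol=1e-12):
    y = np.full(n_steps + 1, y0, dtype=float)  # initial guess on the whole time grid
    for _ in range(sweeps):
        y_old = y.copy()
        # every step is updated from the previous sweep only, so all
        # n_steps trapezoidal equations relax independently (parallel in time)
        y[1:] = y_old[:-1] + 0.5 * h * (f(y_old[:-1]) + f(y_old[1:]))
        if np.max(np.abs(y - y_old)) < tol:
            break
    return y

y = parallel_in_time_trapezoidal(y0=1.0, h=0.01, n_steps=100)
print(y[-1], np.exp(-1.0))  # trapezoidal answer at t = 1 vs. exact e^{-1}

Each relaxation sweep is embarrassingly parallel across the time grid; the paper accelerates the convergence of exactly this kind of iteration with its nested multigrid-in-time technique.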

71 citations


Journal ArticleDOI
TL;DR: A modular and flexible architecture that realizes a parallel algorithm for real-time image template matching is described, which is especially suitable for applications in which adjustments of the dimension of the search area are constantly required.
Abstract: A modular and flexible architecture that realizes a parallel algorithm for real-time image template matching is described. Symmetrically permuted template data (SPTD) are employed in this algorithm to obtain a processing structure with a high degree of parallelism and pipelining, reduce the number of memory accesses to a minimum, and eliminate the use of delay elements that render the dimension of search area to be processed unchangeable. The inherent temporal parallelism and spatial parallelism of the algorithm are fully exploited in developing the hardware architecture. The latter, which is mainly constructed from two types of basic cells, exhibits a high degree of modularity and regularity. The architecture is especially suitable for applications in which adjustments of the dimension of the search area are constantly required. A hardware prototype has been constructed using standard integrated circuits for moving-object detection and interframe motion estimation. It is capable of operating on a search area of size up to 256×256 pixels in real time.
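
As a point of reference, a minimal software sketch of the computation the architecture implements, plain block matching by sum of absolute differences over a search area (the symmetrically permuted template data and the two basic cell types are hardware-level refinements not reproduced here, and the array sizes are illustrative):

import numpy as np

def match(search_area, template):
    H, W = search_area.shape
    h, w = template.shape
    best, best_pos = None, None
    # every candidate position (i, j) is independent of the others,
    # which is the parallelism the hardware exploits
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            sad = np.abs(search_area[i:i+h, j:j+w] - template).sum()
            if best is None or sad < best:
                best, best_pos = sad, (i, j)
    return best_pos, best

rng = np.random.default_rng(0)
area = rng.integers(0, 256, (32, 32)).astype(int)
tmpl = area[10:18, 5:13].copy()   # plant the template at position (10, 5)
print(match(area, tmpl))          # ((10, 5), 0)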

36 citations


Journal ArticleDOI
TL;DR: This paper presents a database model and its associated architecture, based on the principles of data-driven computation, which allow the model to be mapped onto a computer architecture consisting of large numbers of independent disk units and processing elements.
Abstract: In recent years, a number of database machines consisting of large numbers of parallel processing elements have been proposed. Unfortunately, there are two main limitations in database processing that prevent a high degree of parallelism; these are the available I/O bandwidth of the underlying storage devices and the concurrency control mechanisms necessary to guarantee data integrity. The main problem with conventional approaches is the lack of a computational model capable of utilizing the potential of any significant number of processing elements and storage devices and, at the same time, preserving the integrity of the database. This paper presents a database model and its associated architecture, which is based on the principles of data-driven computation. According to this model, the database is represented as a network in which each node is conceptually an independent, asynchronous processing element, capable of communicating with other nodes by exchanging messages along the network arcs. To answer a query, one or more such messages, called tokens, are created and injected into the network. These then propagate asynchronously through the network in search of results satisfying the given query. The asynchronous nature of processing permits the model to be mapped onto a computer architecture consisting of large numbers of independent disk units and processing elements. This increases both the available I/O bandwidth as well as the processing potential of the machine. At the same time, new concurrency control and error recovery mechanisms are necessary to cope with the resulting parallelism.
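
A toy rendering of the token mechanism (the graph, attribute values, and query below are invented for illustration): a query token is injected at a node and propagates along the arcs, with each node matching and forwarding independently; the sequential queue here stands in for the asynchronous, parallel firing of nodes.

from collections import deque

# node -> (attribute value, outgoing arcs); a tiny stand-in database network
graph = {
    "a": ("red", ["b", "c"]),
    "b": ("blue", ["d"]),
    "c": ("red", ["d"]),
    "d": ("red", []),
}

def query(start, predicate):
    results, seen = [], {start}
    tokens = deque([start])          # the injected query token(s)
    while tokens:                    # each pop models one node firing
        node = tokens.popleft()
        value, arcs = graph[node]
        if predicate(value):
            results.append(node)     # a result satisfying the query
        for n in arcs:
            if n not in seen:
                seen.add(n)
                tokens.append(n)     # forward the token along the arc
    return results

print(query("a", lambda v: v == "red"))   # ['a', 'c', 'd']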

24 citations


Journal ArticleDOI
TL;DR: Discusses the implementation of a residue arithmetic circuit using multiple-valued bidirectional current-mode MOS technology, and designs and fabricates the mod-7 three-operand multiply adder as an integrated circuit based on 10-µm CMOS technology.
Abstract: Discusses the implementation of a residue arithmetic circuit using multiple-valued bidirectional current-mode MOS technology. Each residue digit is represented by a new multiple-valued coding suitable for highly parallel computation. By this coding, mod m_i multiplication can be performed simply by a shift operation. In mod m_i addition, radix-5 signed-digit (SD) arithmetic is employed for a high degree of parallelism and multiple-operand addition, so that high-speed arithmetic operations can be achieved. Finally, the mod-7 three-operand multiply adder is designed and fabricated as an integrated circuit based on 10-µm CMOS technology.
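
The digit coding and current-mode circuits are hardware techniques, but the residue-arithmetic semantics they implement can be sketched in a few lines (the moduli and operands below are illustrative; the fabricated circuit is the mod-7 channel of such a system):

MODULI = (7, 11, 13)   # example moduli; channels are mutually independent

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_multiply_add(a, b, c):
    # one independent (hence parallel) three-operand multiply-add per channel
    return tuple((ai * bi + ci) % m for ai, bi, ci, m in zip(a, b, c, MODULI))

def from_rns(r):
    # Chinese-remainder reconstruction by search (adequate for a demo)
    M = 7 * 11 * 13
    return next(x for x in range(M) if to_rns(x) == tuple(r))

a, b, c = 25, 31, 14
print(from_rns(rns_multiply_add(to_rns(a), to_rns(b), to_rns(c))), a * b + c)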

19 citations


Journal ArticleDOI
J. Roos
01 Apr 1989
TL;DR: A single chip VLSI support processor has been designed that provides predictable and uniformly low overhead for the entire semantics of a rendezvous, so that the powerful real-time constructs of Ada can be used freely without performance degradation.
Abstract: Task synchronization in Ada causes excessive run-time overhead due to the complex semantics of the rendezvous. To demonstrate that the speed can be increased by two orders of magnitude by using special-purpose hardware, a single-chip VLSI support processor has been designed. By providing predictable and uniformly low overhead for the entire semantics of a rendezvous, the powerful real-time constructs of Ada can be used freely without performance degradation. The key to high performance is the set of primitive operations implemented in hardware. Each operation is complex enough to replace a considerable amount of code, yet is designed to execute with a minimum of communication overhead. Task control blocks are stored on-chip, as are headers for entry, delay and ready queues. All necessary scheduling is integrated in the operations. Delays are handled completely on-chip using an internal real-time clock. A multilevel design strategy, based on silicon compilation, made it possible to run actual Ada programs on a functional emulator of the chip and use the results to verify the detailed design. A high degree of parallelism and pipelining together with an elaborate internal addressing scheme has reduced the number of clock cycles needed to perform each operation. Using 2-µm CMOS, the processor can run at 20 MHz. A complex rendezvous, including the calling sequence and all necessary scheduling, can be performed in less than 15 µs.

18 citations


Journal ArticleDOI
TL;DR: A parallel algorithm for extracting the roots of a polynomial is presented. It is based on Graeffe's method, which is rarely used in serial implementations because it is slower than many common serial algorithms, but which is particularly well suited to parallel implementation.
Abstract: A parallel algorithm for extracting the roots of a polynomial is presented. The algorithm is based on Graeffe's method, which is rarely used in serial implementations because it is slower than many common serial algorithms, but is particularly well suited to parallel implementation. Graeffe's method is an iterative technique, and parallelism is used to reduce the execution time per iteration. A high degree of parallelism is possible, and only simple interprocessor communication is required. For a degree-n polynomial executed on an (n+1)-processor SIMD machine, each iteration in the parallel algorithm has an arithmetic complexity of approximately 2n and a communications overhead of n. In general, the arithmetic speedup is on the order of p/2 for a p-processor implementation.
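
A compact sketch of the root-squaring iteration itself, written serially here (the paper distributes each iteration's coefficient updates across the n+1 processors); the test polynomial and iteration count are illustrative:

import numpy as np

def graeffe_step(c):
    # c: coefficients, highest degree first; returns the polynomial whose
    # roots are the squares of the roots of c, using q(x^2) = +/- p(x) p(-x)
    n = len(c) - 1
    pm = c * np.where(np.arange(n + 1) % 2 == 0, 1.0, -1.0)  # p(-x), up to sign
    prod = np.convolve(c, pm)   # only even powers survive in the product
    return prod[::2] / prod[0]

def root_magnitudes(c, iters=6):
    c = np.asarray(c, dtype=float) / c[0]
    for _ in range(iters):
        c = graeffe_step(c)
    # |r_i| ~ (|c_i| / |c_{i-1}|)^(1 / 2^iters) for well-separated magnitudes
    ratios = np.abs(c[1:]) / np.abs(c[:-1])
    return ratios ** (1.0 / 2 ** iters)

# p(x) = x^3 - 6x^2 + 11x - 6 has roots 1, 2, 3
print(sorted(root_magnitudes([1, -6, 11, -6])))   # approx [1.0, 2.0, 3.0]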

16 citations


Journal ArticleDOI
TL;DR: A graph-theoretic approach is used to derive asymptotically optimal algorithms for parallel Gaussian elimination on SIMD/MIMD computers with a shared memory system, which evidences the high degree of parallelism that can be achieved.

15 citations


Journal ArticleDOI
TL;DR: To search for the best preconditioner on a parallel machine, one has to consider the tradeoffs between fast convergence rate and a high degree of parallelism, as well as the architecture of the target parallel computer.
Abstract: This paper presents the results of the Connection Machine implementation of a number of preconditioners for the preconditioned conjugate gradient method. The preconditioners implemented include those based on the incomplete LU factorization, the modified incomplete LU factorization, the symmetric successive overrelaxation, and others such as several polynomial preconditioners and the hierarchical basis preconditioner. Results based on numerical experiments show that both the degree of parallelism inherent in a preconditioner and its convergence rate improvement play important roles in the overall execution-time performance on parallel computers. Factors that affect the performance of the preconditioners are also discussed. We conclude that to search for the best preconditioner on a parallel machine, we have to consider the tradeoffs between fast convergence rate and a high degree of parallelism, as well as the architecture of the target parallel computer.
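
A minimal sketch (not the paper's Connection Machine code) of the tradeoff being measured: one conjugate-gradient loop with two interchangeable preconditioners, Jacobi (trivially parallel, weak acceleration) and a truncated Neumann-series polynomial in the Jacobi-scaled matrix (still fully parallel, stronger acceleration). The test matrix, tolerance, and polynomial degree are assumptions:

import numpy as np

def pcg(A, b, apply_M_inv, tol=1e-8, max_iter=500):
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_M_inv(r)
    p = z.copy()
    rz = r @ z
    for k in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            return x, k + 1
        z = apply_M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

n = 200
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Laplacian test matrix
b = np.ones(n)
d = np.diag(A)

jacobi = lambda r: r / d
def polynomial(r, terms=4):
    # truncated Neumann series: M^{-1} ~ sum_j (I - D^{-1}A)^j D^{-1};
    # only matrix-vector products, so each application is parallel
    z = r / d
    acc = z.copy()
    for _ in range(terms - 1):
        z = z - (A @ z) / d
        acc += z
    return acc

for name, M in [("jacobi", jacobi), ("polynomial", polynomial)]:
    _, iters = pcg(A, b, M)
    print(name, iters)   # the polynomial preconditioner needs fewer iterations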

11 citations


01 Jan 1989
TL;DR: A novel partitioning strategy for maximizing the degree of parallelism in structural analysis and design is outlined; the strategy was implemented on the CRAY X-MP/4 and the Alliant FX/8 computers.
Abstract: A review is given of the recent advances in computer technology that are likely to impact structural analysis and design. The computational needs for future structures technology are described. The characteristics of new and projected computing systems are summarized. Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism. The strategy is designed for computers with a shared memory and a small number of powerful processors (or a small number of clusters of medium-range processors). It is based on approximating the response of the structure by a combination of symmetric and antisymmetric response vectors, each obtained using a fraction of the degrees of freedom of the original finite element model. The strategy was implemented on the CRAY X-MP/4 and the Alliant FX/8 computers. For nonlinear dynamic problems on the CRAY X-MP with four CPUs, it resulted in an order of magnitude reduction in total analysis time, compared with the direct analysis on a single-CPU CRAY X-MP machine.
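
The symmetric/antisymmetric splitting can be sketched on a one-dimensional stand-in for the finite element model (the matrix and load are illustrative): because the stiffness matrix commutes with the midplane reflection, the two partial responses are independent (hence computable in parallel), and each is symmetric or antisymmetric, so each needs only about half the degrees of freedom.

import numpy as np

n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # toy symmetric stiffness
b = np.arange(1.0, n + 1)                              # arbitrary load vector

P = np.fliplr(np.eye(n))          # reflection about the structure's midplane
b_sym, b_anti = (b + P @ b) / 2, (b - P @ b) / 2
# the two reduced problems are independent, so they can be solved concurrently
u_sym = np.linalg.solve(A, b_sym)     # satisfies P u = u
u_anti = np.linalg.solve(A, b_anti)   # satisfies P u = -u
print(np.allclose(u_sym + u_anti, np.linalg.solve(A, b)))               # True
print(np.allclose(P @ u_sym, u_sym), np.allclose(P @ u_anti, -u_anti))  # True True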

Proceedings Article
11 Dec 1989
TL;DR: The method exploits the high degree of parallelism available with cellular automata while retaining important features of the method of characteristics; it yields high numerical accuracy and extends naturally to adaptive meshes and domain decomposition methods for perturbed conservation laws.
Abstract: We present a new method for computing solutions of conservation laws based on the use of cellular automata with the method of characteristics. The method exploits the high degree of parallelism available with cellular automata and retains important features of the method of characteristics. It yields high numerical accuracy and extends naturally to adaptive meshes and domain decomposition methods for perturbed conservation laws. We describe the method and its implementation for a Dirichlet problem with a single conservation law for the one-dimensional case. Numerical results for the one-dimensional law with the classical Burgers nonlinearity or the Buckley-Leverett equation show good numerical accuracy outside the neighborhood of the shocks. The error in the area of the shocks is of the order of the mesh size. The algorithm is well suited for execution on both massively parallel computers and vector machines. We present timing results for an Alliant FX/8, Connection Machine Model 2, and CRAY X-MP.
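
A minimal sketch of the characteristics component for the one-dimensional inviscid Burgers equation, u_t + u u_x = 0 (the initial data, grid, and time are illustrative, and the cellular-automaton machinery and shock handling are not reproduced): the value at each grid point is found by tracing its characteristic back via u(x,t) = u0(x - t u(x,t)), and every point can be solved independently, which is the per-cell locality a cellular automaton exploits.

import numpy as np

def u0(x):
    return 0.5 + 0.3 * np.sin(2 * np.pi * x)   # smooth initial data

def burgers_characteristics(x, t, iters=50):
    u = u0(x)                    # first guess at the characteristic speed
    for _ in range(iters):       # fixed-point iteration, valid before the shock
        u = u0(x - t * u)        # all grid points updated in parallel
    return u

x = np.linspace(0.0, 1.0, 401)
t = 0.3                          # earlier than the first shock (~0.53 here)
u = burgers_characteristics(x, t)
print(np.max(np.abs(u - u0(x - t * u))))   # tiny residual confirms convergence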

Proceedings ArticleDOI
Mats Brorsson
03 Jan 1989
TL;DR: A decentralized scheme for virtual memory management on MIMD (multiple-instruction, multiple-data) multiprocessors with shared memory has been developed, using a variant of Denning's working-set page replacement algorithm in which each process owns a page list.
Abstract: A decentralized scheme for virtual memory management on MIMD (multiple-instruction, multiple-data) multiprocessors with shared memory has been developed. Control and data structures are kept local to the processing elements (PEs), which reduces the global traffic and makes a high degree of parallelism possible. Each of the PEs in the target architecture consists of a processor and part of the shared memory and is connected to the others by a common bus. The traditional approach, based on replication or sharing of data structures, is not suitable in this case, when the number of PEs is of the magnitude of 100; this is due to the excessive global traffic caused by consistency or mutual-exclusion protocols. A variant of Denning's working-set page replacement algorithm is used, in which each process owns a page list. Shared pages are not present in more than one list, and it is shown that this will not increase the page fault rate in most cases.
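
A minimal sketch of the per-process page list with a Denning-style working-set window (the class shape and window size are assumptions; shared pages simply live in at most one owner's list, as in the scheme):

from collections import OrderedDict

class ProcessPager:
    def __init__(self, window):
        self.window = window       # working-set window, in references
        self.clock = 0             # this process's private virtual time
        self.pages = OrderedDict() # page -> last reference time, oldest first
        self.faults = 0

    def reference(self, page):
        self.clock += 1
        if page not in self.pages:
            self.faults += 1       # page fault: bring the page in
        self.pages.pop(page, None)
        self.pages[page] = self.clock
        # drop pages whose last use fell outside the working-set window
        while self.pages:
            oldest, t = next(iter(self.pages.items()))
            if self.clock - t <= self.window:
                break
            self.pages.popitem(last=False)

p = ProcessPager(window=3)
for pg in [1, 2, 3, 1, 4, 1, 2]:
    p.reference(pg)
print(p.faults, list(p.pages))   # 5 faults; current working set [4, 1, 2]
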

Proceedings ArticleDOI
01 Dec 1989
TL;DR: The thesis is that evaluation methods can be viewed as implementing a choice of sideways information propagation graphs, or sips, which determines the set of goals and facts that must be evaluated, and a formal measure of parallelism is provided to make this comparison.
Abstract: There is a tension between the objectives of avoiding irrelevant computation and extracting parallelism, in that a computational step used to restrict another must precede the latter. Our thesis, following [3], is that evaluation methods can be viewed as implementing a choice of sideways information propagation graphs, or sips, which determines the set of goals and facts that must be evaluated. Two evaluation methods that implement the same sips can then be compared to see which obtains a greater degree of parallelism, and we provide a formal measure of parallelism to make this comparison.

Journal ArticleDOI
TL;DR: The paper stresses the constraints that occam's staticity places on the support, and describes a possible support for a parallel object model, called PO, for a massively parallel architecture; the support has been implemented in occam for an architecture based on several Transputers.

Journal ArticleDOI
TL;DR: It is shown from theoretical and experimental results that a 40% degree of parallelism is optimal for this algorithm, yielding an effective speedup of more than 70 times over the sequential implementation on a VAX 11/785 running Unix.

01 Jul 1989
TL;DR: It turns out that parallelism affords only limited opportunity for reducing the computing time with linear multi-stage multi-value methods, and parallel one-step methods offer no speedup over serial one-step methods for the standard linear test problem.
Abstract: Numerical methods for ordinary initial value problems that do not depend on special properties of the system are usually found in the class of linear multi-stage multi-value methods, first formulated by J.C. Butcher. Among these the explicit methods are easiest to implement. For these reasons there has been considerable research activity devoted to generating methods of this class which utilize independent function evaluations that can be performed in parallel. Each such group of concurrent function evaluations can be regarded as a stage of the method. It turns out that parallelism affords only limited opportunity for reducing the computing time with such methods. This is most evident for the simple linear homogeneous constant-coefficient test problem, which is essentially a matter of approximating the exponential by an algebraic function. For a given number of stages and a given number of saved values, parallelism offers a somewhat enlarged set of algebraic functions from which to choose. However, there is absolutely no benefit in having the degree of parallelism (number of processors) exceed the number of saved values of the method. Thus, in particular, parallel one-step methods offer no speedup over serial one-step methods for the standard linear test problem. Because of the fairly remarkable modelling ability of the standard test problem, one should not expect much better speedups for general nonlinear problems.
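
A toy illustration of the degenerate case behind this claim, assuming a one-step method with a single round of concurrent stages: on the linear test problem y' = λy, stages that run in parallel can use only the one saved value, so they collapse to the same evaluation and the stability polynomial stays first degree, exactly as for forward Euler (the weights below are arbitrary assumptions).

lam, h, y = -1.0, 0.1, 1.0
f = lambda u: lam * u

# two stages evaluated concurrently: neither may depend on the other,
# so on the linear test problem both equal f(y_n)
k1 = f(y)
k2 = f(y)
y_parallel = y + h * (0.5 * k1 + 0.5 * k2)

# forward Euler: one processor, one stage
y_euler = y + h * f(y)

print(y_parallel, y_euler)   # identical: R(z) = 1 + z in both cases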

Journal ArticleDOI
TL;DR: In this paper, the authors implement a parallel Prolog interpreter made up of a set of parallel processes communicating according to a message-passing protocol, and evaluate the actual degree of parallelism exploited by the execution model and the efficiency of the resolution algorithm used.

Proceedings ArticleDOI
22 Mar 1989
TL;DR: The authors present an AND-parallelism detection scheme which does not order literals or analyze data dependency, and find that this scheme generally uses the highest degree of parallelism.
Abstract: The authors present an AND-parallelism detection scheme which does not order literals or analyze data dependency. The simple concept of variable holding allows the system to utilize AND-parallel literals concurrently without causing binding conflicts. The run-time support of this scheme is minimal: groundness test, independence test, and variable holding. Without extra runtime processing, this scheme also supports an option mode declaration, which provides its programmers with greater flexibility to specify parallel execution. The execution graphs generated by this scheme are compared with the graphs generated by other models. The authors find that this scheme generally uses the highest degree of parallelism.
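
A minimal sketch of the two run-time checks the scheme relies on; the term representation (capitalized strings as variables, tuples as compound terms) is an assumption for illustration:

def variables(term):
    # collect the variables occurring in a term
    if isinstance(term, str) and term[:1].isupper():
        return {term}
    if isinstance(term, tuple):
        vs = set()
        for arg in term[1:]:   # term[0] is the functor
            vs |= variables(arg)
        return vs
    return set()

def ground(term):              # groundness test: no unbound variables
    return not variables(term)

def independent(t1, t2):       # independence test: no shared variables
    return variables(t1).isdisjoint(variables(t2))

g1 = ("append", "Xs", ("cons", 1, "nil"), "Ys")
g2 = ("length", "Zs", "N")
print(ground(g1), independent(g1, g2))   # False True -> safe to run in AND-parallel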

15 Dec 1989
TL;DR: It is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure, and are well suited to be implemented on a single-instruction-stream multiple-data-stream (SIMD) computer with a reconfigurable interconnection network.
Abstract: The kinematics, dynamics, Jacobian, and their corresponding inverse computations are six essential problems in the control of robot manipulators. Efficient parallel algorithms for these computations are discussed and analyzed. Their characteristics are identified and a scheme on the mapping of these algorithms to a reconfigurable parallel architecture is presented. Based on the characteristics including type of parallelism, degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirement, it is shown that most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. Moreover, they are well suited to be implemented on a single-instruction-stream multiple-data-stream (SIMD) computer with a reconfigurable interconnection network. The model of a reconfigurable dual-network SIMD machine with internal direct feedback is introduced. A systematic procedure to map these computations to the proposed machine is presented. A new scheduling problem for SIMD machines is investigated and a heuristic algorithm, called neighborhood scheduling, that reorders the processing sequence of subtasks to reduce the communication time is described. Mapping results of a benchmark algorithm are illustrated and discussed.

Journal ArticleDOI
TL;DR: In this paper optical implementations of Boolean matrix operations are proposed for manipulating the constraint matrices to perform forward checking and thereby increase the search efficiency.
Abstract: Many problems in artificial intelligence are intractable owing to the exponential growth of the solution space with problem size. Often these problems can benefit from heuristic search or forward-checking techniques, which attempt to prune the search space down to a manageable size before or during the actual search procedure. Many interesting search problems can be formulated as consistent labeling problems in which the initial problem information is given in the form of a set of binary constraints, for which Boolean matrices are a natural data representation. In this paper optical implementations of Boolean matrix operations are proposed for manipulating the constraint matrices to perform forward checking and thereby increase the search efficiency. The high degree of parallelism afforded by using optical techniques and the relatively low accuracy requirements of Boolean matrix operations suggest that optical techniques are well matched to this problem.
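
A minimal sketch of forward checking with Boolean constraint matrices, executed electronically here where the paper proposes optical matrix operations (the constraint network, a small all-different problem, is illustrative):

import numpy as np

# R[(i, j)][a, b] is True iff label a for unit i is compatible with
# label b for unit j; here every pair of units must differ
n_units, n_labels = 3, 3
neq = ~np.eye(n_labels, dtype=bool)
R = {(i, j): neq for i in range(n_units) for j in range(n_units) if i != j}

def search(domains, assignment):
    i = len(assignment)
    if i == n_units:
        return assignment
    for a in np.flatnonzero(domains[i]):
        pruned = [d.copy() for d in domains]
        ok = True
        for j in range(i + 1, n_units):
            # forward check: AND the future domain with row a of the
            # constraint matrix, a Boolean operation parallel over labels
            pruned[j] &= R[(i, j)][a]
            if not pruned[j].any():
                ok = False         # domain wipe-out: prune this branch
                break
        if ok:
            result = search(pruned, assignment + [int(a)])
            if result is not None:
                return result
    return None

domains = [np.ones(n_labels, dtype=bool) for _ in range(n_units)]
print(search(domains, []))         # a consistent labeling, e.g. [0, 1, 2]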

01 Jan 1989
TL;DR: Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism on multiprocessor computers with a shared memory.
Abstract: Recent advances in computer technology that are likely to impact computational mechanics are reviewed. The technical needs for computational mechanics technology are outlined. The major features of new and projected computing systems, including supersystems, parallel processing machines, special-purpose computing hardware, and small systems are described. Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism on multiprocessor computers with a shared memory.

Proceedings ArticleDOI
23 May 1989
TL;DR: The design of a high-speed complex signal processor is presented, based on a novel multiplier whose hardware implementation is shown to be characterized by simplicity, a high degree of parallelism, regularity, and modularity.
Abstract: The design of a high-speed complex signal processor is presented. It is based on a novel multiplier whose hardware implementation is shown to be characterized by simplicity, a high degree of parallelism, regularity, and modularity. The multiplier design is made possible by recent advances in the theory of performing polynomial multiplication in modular rings with reduced complexity. This latter development is based on the polynomial residue number system (PRNS). While traditional parallel complex multiplication requires four real multiplications, the proposed scheme is based on decomposing the process into eight smaller concurrent processes. If p is the performance of each of the four processors of the traditional technique and h is the investment in hardware required for its realization, the performance of each of the processors of the proposed method is 4p and its hardware investment is h/16.
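
The polynomial-residue idea can be sketched at a small scale (q = 13 and j = 5 are illustrative choices; the paper's eight-way decomposition is a finer split in the same spirit): over Z_q with q ≡ 1 (mod 4), x^2 + 1 factors as (x - j)(x + j), so one complex product mod q becomes two independent scalar products that can run concurrently.

q = 13
j = 5                          # j*j = 25 = -1 (mod 13)
inv2 = pow(2, q - 2, q)        # modular inverses via Fermat's little theorem
invj = pow(j, q - 2, q)

def to_prns(a, b):             # residues of a + b*x modulo (x - j) and (x + j)
    return (a + b * j) % q, (a - b * j) % q

def from_prns(u, v):           # Chinese-remainder reconstruction
    a = (u + v) * inv2 % q
    b = (u - v) * inv2 * invj % q
    return a, b

a, b = 3, 7                    # the complex operand 3 + 7i
c, d = 2, 5                    # the complex operand 2 + 5i
u1, v1 = to_prns(a, b)
u2, v2 = to_prns(c, d)
prod = from_prns(u1 * u2 % q, v1 * v2 % q)         # two independent multiplies
direct = ((a * c - b * d) % q, (a * d + b * c) % q)
print(prod, direct)            # both (10, 3): (3+7i)(2+5i) = -29+29i = 10+3i mod 13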

Journal ArticleDOI
TL;DR: The software requirements for a drift chamber that is difficult to calibrate by measurements alone are reviewed; fast algorithms and a high degree of parallelism are found to be necessary to reach the accuracy demanded.

Patent
20 Apr 1989
TL;DR: In this paper, the authors propose to use a task level higher than the arithmetic level to carry out parallel processing, based on the data-driven principle, in order to effectively form a parallel computer suited to large-scale calculation.
Abstract: PURPOSE: To effectively form a parallel computer suited to large-scale calculation by using a task level higher than the arithmetic level to carry out the parallel processes. CONSTITUTION: Plural element processors 2 are connected to a mutual connection network 1. Each processor 2 consists of a token matching unit 3, a local memory part 4, a processor 5 and a fan-out unit 6. An application program having a satisfactory degree of parallelism at the task level is then processed based on the data-driven principle, without applying a large load to the network 1 or the token matching unit 3. In this way, a parallel computer suited to large-scale calculation is effectively formed.

Journal ArticleDOI
TL;DR: This paper proposes an efficient testing algorithm for VLSI/WSI arrays which reduces the testing time considerably and requires little hardware overhead.

Dissertation
01 Jan 1989
TL;DR: An alternative concept to a VLSI architecture, the Soft-Systolic Simulation System (SSSS), is introduced and developed as a working model of a virtual machine with the power to simulate hard systolic arrays and more general forms of concurrency such as the SIMD and MIMD models of computation.
Abstract: Systolic arrays have proved to be well suited for Very Large Scale Integration (VLSI) technology since they: consist of a regular network of simple processing cells; use local communication between the processing cells only; and exploit a maximal degree of parallelism. However, systolic arrays have one main disadvantage compared with other parallel computer architectures: they are special-purpose architectures only capable of executing one algorithm; e.g., a systolic array designed for sorting cannot be used to perform matrix multiplication. Several approaches have been made to make systolic arrays more flexible, in order to be able to handle different problems on a single systolic array. In this thesis an alternative concept to a VLSI architecture, the Soft-Systolic Simulation System (SSSS), is introduced and developed as a working model of a virtual machine with the power to simulate hard systolic arrays and more general forms of concurrency such as the SIMD and MIMD models of computation. The virtual machine includes a processing element consisting of a soft-systolic processor implemented in the virtual machine language. The processing element considered here is a very general element which allows the choice of a wide range of arithmetic and logical operators and allows the simulation of a wide class of algorithms, but in principle extra processing cells can be added to form a library, and this library can be tailored to individual needs. The virtual machine chosen for this implementation is the Instruction Systolic Array (ISA). The ISA has a number of interesting features: firstly, it has been used to simulate all SIMD algorithms and many MIMD algorithms by a simple program transformation technique; further, the ISA can also simulate the so-called wavefront processor algorithms, as well as many hard systolic algorithms. The ISA removes the need for the broadcasting of data which is a feature of SIMD algorithms (limiting the size of the machine and its cycle time) and also presents a fairly simple communication structure for MIMD algorithms. The model of systolic computation developed from the VLSI approach to systolic arrays is such that the processing surface is fixed, as are the processing elements or cells by virtue of their being embedded in the processing surface. The VLSI approach therefore freezes instructions and hardware relative to the movement of data, while the virtual machine and soft-systolic programming retain the constructions of VLSI for array design features such as regularity, simplicity and local communication, but allow the movement of instructions with respect to data. Data can be frozen into the structure with instructions moving systolically. Alternatively, both the data and instructions can move systolically around the virtual processors (which are deemed fixed relative to the underlying architecture). The ISA is implemented in OCCAM programs whose execution and output implicitly confirm the correctness of the design. The soft-systolic preparation comprises the usual operating system facilities for the creation and modification of files during the development of new programs and ISA processor elements. We allow any concurrent high-level language to be used to model the soft-systolic program. Consequently the Replicating Instruction Systolic Array Language (RISAL) was devised to provide a very primitive program environment to the ISA, but one adequate for testing.
RISAL accepts instructions in an assembler-like form, but is fairly permissive about the format of statements, subject of course to syntax. The RISAL compiler transforms the soft-systolic program description (RISAL) into a form suitable for the virtual machine (simulating the algorithm) to run. Finally, we conclude that the principles mentioned here can form the basis for a soft-systolic simulator using an orthogonally connected mesh of processors. The wide range of algorithms which the ISA can simulate makes it suitable for a virtual simulating grid.
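
A minimal sketch of the instruction-systolic principle the thesis builds on, reduced to one dimension (the instruction set and array size are illustrative): the data stay fixed in the cells while the instruction stream moves systolically across them, one cell per beat.

def isa_run(data, program, beats):
    n = len(data)
    pipe = [None] * n                      # instruction currently at each cell
    for t in range(beats):
        # the instruction stream shifts one cell to the right each beat
        pipe = [program[t] if t < len(program) else None] + pipe[:-1]
        for i, instr in enumerate(pipe):
            if instr is not None:
                data[i] = instr(data[i])   # each cell executes its instruction
    return data

inc = lambda x: x + 1
dbl = lambda x: 2 * x
# every cell applies inc and then dbl, skewed in time across the array
print(isa_run([1, 2, 3, 4], [inc, dbl], beats=5))   # [4, 6, 8, 10]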