
Showing papers on "Degree of parallelism published in 1987"


01 Jan 1987
TL;DR: A multiprocessor-based, distributed simulation testbed is described that was designed to facilitate controlled experimentation with distributed simulation algorithms and demonstrated that message population and the degree to which processes can look ahead in simulated time play critical roles in the performance of distributed simulators using these algorithms.
Abstract: Although many distributed simulation strategies have been developed, to date little empirical data is available to evaluate their performance. A multiprocessor-based, distributed simulation testbed is described that was designed to facilitate controlled experimentation with distributed simulation algorithms. Using this testbed, the performance of simulation strategies using deadlock avoidance and deadlock detection and recovery techniques was examined under various synthetic workloads. The distributed simulators were compared with a uniprocessor-based event list implementation. Results of a series of experiments are reported that demonstrate that message population and the degree to which processes can look ahead in simulated time play critical roles in the performance of distributed simulators using these algorithms. An avalanche phenomenon was observed in the deadlock detection and recovery simulators as message population was increased, and was found to be a necessary condition for achieving good performance. It is demonstrated that these distributed simulation algorithms can provide significant speedups over sequential event list implementations for some workloads, even in the presence of only a moderate amount of parallelism and many feedback loops. However, a moderate to high degree of parallelism was not sufficient to guarantee good performance for all workloads that were tested.
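The lookahead mentioned above is what lets a conservative (deadlock-avoiding) simulator make progress: each logical process can promise that it will send no message earlier than its current simulated time plus its lookahead. The toy sketch below illustrates only that idea; it is not the testbed described in the paper, and the class and method names are hypothetical.

```python
class LogicalProcess:
    """Toy logical process for a conservative distributed simulator."""

    def __init__(self, name, lookahead):
        self.name = name
        self.lookahead = lookahead   # minimum delay added to anything forwarded
        self.clock = 0.0             # local simulated time

    def earliest_output_time(self):
        # Lower bound on the timestamp of any future message from this process.
        return self.clock + self.lookahead

    def null_message(self):
        # In deadlock-avoidance schemes, a null message carries only this bound,
        # letting neighbours advance safely up to that simulated time.
        return (self.name, self.earliest_output_time())
```

Neighbouring processes may execute every event with a timestamp below that bound, so a larger lookahead lets more events proceed in parallel, consistent with the experimental findings summarized above.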

134 citations


Patent
05 Jun 1987
TL;DR: In this article, a computer architecture for analyzing automatic image understanding problems is described, which can efficiently perform a wide spectrum of tasks ranging from low level or iconic processing to high level or symbolic processing tasks.
Abstract: A computer architecture is disclosed for analyzing automatic image understanding problems. The architecture is designed so that it can efficiently perform a wide spectrum of tasks ranging from low level or iconic processing to high level or symbolic processing tasks. A first level (12) of image processing elements is provided for operating on the image matrix on a pixel per processing element basis. A second level (14) of processing elements is provided for operating on a plurality of pixels associated with a given array of first level processing elements. A third level (16) of processing elements is designed to instruct the first and second level processing elements, as well as for operating on a larger segment of the matrix. A host computer (18) is provided that directly communicates with at least each third level processing element. A high degree of parallelism is provided so that information can be readily transferred within the architecture at high speeds.

64 citations


Journal ArticleDOI
TL;DR: A modular architecture for adaptive multichannel lattice algorithms is presented; it requires no matrix computations and has a regular structure, which significantly simplifies its implementation compared to the multichannel (matrix) version of the same algorithms.
Abstract: A modular architecture for adaptive multichannel lattice algorithms is presented. This architecture requires no matrix computations and has a regular structure, which significantly simplifies its implementation as compared to the multichannel (matrix) version of the same algorithms. Because the suggested architecture exhibits a high degree of parallelism and local communication, it is well suited for implementation in dedicated (VLSI) hardware. The derivation of this modular architecture demonstrates a powerful principle for modular decomposition of multichannel recursions into systolic-array-like architectures. The scope of applicability of this principle extends beyond multichannel lattice (and related least-squares) algorithms to other algorithms involving matrix computations, such as multiplication, factorization, and inversion.

52 citations


Journal ArticleDOI
01 Nov 1987
TL;DR: An alternative decomposition of a tridiagonal matrix is analysed which has the property that both the decomposition and the subsequent solution process can be done in two parallel parts; it is equivalent to the two-sided Gaussian elimination algorithm.
Abstract: We analyse an alternative decomposition for a tridiagonal matrix which has the property that the decomposition as well as the subsequent solution process can be done in two parallel parts. This decomposition is equivalent to the two-sided Gaussian elimination algorithm that has been discussed by Babuska. In the context of parallel computing a similar approach has been suggested by Joubert and Cloete. The computational complexity of this alternative decomposition is the same as for the standard decomposition and a remarkable aspect is that it often leads to slightly more accurate solutions than the standard process does. The algorithm can be combined with recursive doubling or cyclic reduction in order to increase the degree of parallelism and vectorizability.
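For illustration, the sketch below is a minimal NumPy version of such a two-sided ("twisted") decomposition: elimination proceeds from both ends toward a meeting row m, and both the elimination and the back-substitution split into two independent halves. It is a generic reconstruction of the standard algorithm under those assumptions, not code from the paper; the function name and interface are made up here.

```python
import numpy as np

def solve_tridiag_twisted(a, b, c, d, m=None):
    """Two-sided ('twisted') elimination for a tridiagonal system
    a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]  (a[0], c[-1] unused).
    Rows below and above the meeting index m form two independent halves,
    both in the elimination and in the back-substitution."""
    a, c = np.asarray(a, dtype=float), np.asarray(c, dtype=float)
    b, d = np.array(b, dtype=float), np.array(d, dtype=float)  # working copies
    n = len(b)
    m = n // 2 if m is None else m

    # Parallel part 1: forward elimination from the top down to row m-1.
    for i in range(1, m):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]

    # Parallel part 2: backward elimination from the bottom up to row m+1.
    for i in range(n - 2, m, -1):
        w = c[i] / b[i + 1]
        b[i] -= w * a[i + 1]
        d[i] -= w * d[i + 1]

    # The sweeps meet at row m, which is solved by folding in its neighbours.
    bm, dm = b[m], d[m]
    if m > 0:
        bm -= a[m] * c[m - 1] / b[m - 1]
        dm -= a[m] * d[m - 1] / b[m - 1]
    if m < n - 1:
        bm -= c[m] * a[m + 1] / b[m + 1]
        dm -= c[m] * d[m + 1] / b[m + 1]
    x = np.empty(n)
    x[m] = dm / bm

    # Back-substitution, again two independent halves.
    for i in range(m - 1, -1, -1):
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i]
    for i in range(m + 1, n):
        x[i] = (d[i] - a[i] * x[i - 1]) / b[i]
    return x
```

Assigning the two halves to two processors yields the factor-of-two parallelism discussed above; as the abstract notes, recursive doubling or cyclic reduction can then be applied to raise the degree of parallelism further.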

30 citations


Proceedings ArticleDOI
01 Jun 1987
TL;DR: A unified resource management and execution control mechanism for data flow machines that integrates load control, depth-first execution control, cache memory control and a load balancing mechanism is presented.
Abstract: This paper presents a unified resource management and execution control mechanism for data flow machines. The mechanism integrates load control, depth-first execution control, cache memory control and a load balancing mechanism. All of these mechanisms are controlled by such basic information as the number of active state processes, Na. In data flow machines, synchronization among processes is an essential hardware function. Hence, Na can easily be detected by the hardware. Load control and depth-first execution control make it possible to execute a program with a designated degree of parallelism and in depth-first order. A cache memory of data flow processors in multiprocessing environments can be realized by using load and depth-first execution controls together with a deterministic replacement algorithm, i.e. replacement of only waiting-state processes. A new load balancing method called group load balancing is also presented to evaluate the above mentioned mechanisms in multiprocessor environments. These unified control mechanisms are evaluated on a register transfer level simulator for a list-processing oriented data flow machine.
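As a rough illustration of load control keyed to the active-process count Na (a toy software sketch of the general idea, not the machine's actual hardware mechanism; all names are hypothetical):

```python
from collections import deque

class DataflowScheduler:
    """Toy scheduler: activate new processes only while the active count Na
    stays below the designated degree of parallelism; otherwise stack the
    work and resume it depth-first (LIFO)."""

    def __init__(self, max_parallelism):
        self.max_parallelism = max_parallelism
        self.active = 0            # Na, assumed observable by the hardware
        self.deferred = deque()    # used as a stack -> depth-first resumption

    def spawn(self, process):
        if self.active < self.max_parallelism:
            self.active += 1
            return process         # run immediately
        self.deferred.append(process)
        return None                # deferred: parallelism stays bounded

    def finished(self):
        self.active -= 1
        if self.deferred:
            self.active += 1
            return self.deferred.pop()   # most recently deferred first
        return None
```

Treating the deferred pool as a stack keeps execution close to depth-first order while the number of simultaneously active processes never exceeds the designated degree of parallelism.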

26 citations


Journal ArticleDOI
TL;DR: An architecture based on distributed arithmetic, which eliminates the use of multipliers, is described and a minimum-cycle-time filter architecture, which features a high degree of parallelism and pipelining, is shown to have a throughput rate that is independent of the filter order.
Abstract: This paper introduces the problem of and presents some state-of-the-art approaches for high-speed digital image processing. An architecture based on distributed arithmetic, which eliminates the use of multipliers, is described. A minimum-cycle-time filter architecture, which features a high degree of parallelism and pipelining, is shown to have a throughput rate that is independent of the filter order. Furthermore, a new multiprocessing-element architecture is proposed. This leads to a filter structure which can be implemented using identical building blocks. A modular VLSI architecture based on the decomposition of the kernel matrix of a two-dimensional (2-D) transfer function is also presented. In this approach, a general 2-D transfer function is expanded in terms of low-order 2-D polynomials. Each one of these 2-D polynomials is then implemented by a VLSI chip using a bit-sliced technique. In addition, a class of nonlinear 2-D filters based on the extension of one-dimensional (1-D) quadratic digital filters is introduced. It is shown that with the use of matrix decomposition, these 2-D quadratic filters can be implemented using linear filters with some extra operations. Finally, comparisons are made among the different approaches in terms of cycle time, latency, hardware complexity, and modularity.
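Distributed arithmetic replaces the multipliers of an inner-product (FIR) computation with a lookup table of partial coefficient sums that is addressed bit-serially by the input samples. The sketch below is a minimal software model of that principle for two's-complement inputs; it is illustrative only and does not reproduce any of the specific architectures surveyed in the paper.

```python
def da_lut(h):
    """Precompute the 2**N partial sums of the N filter coefficients."""
    n = len(h)
    return [sum(h[k] for k in range(n) if addr & (1 << k))
            for addr in range(1 << n)]

def da_dot(h_lut, x, bits=8):
    """Compute sum_k h[k]*x[k] without multipliers, processing the inputs
    bit-serially; x[k] are two's-complement integers of width `bits`."""
    n_taps = len(h_lut).bit_length() - 1
    acc = 0.0
    for b in range(bits):
        # Gather bit b of every input sample into a single LUT address.
        addr = 0
        for k in range(n_taps):
            if (x[k] >> b) & 1:
                addr |= 1 << k
        weight = -(1 << b) if b == bits - 1 else (1 << b)   # MSB has negative weight
        acc += weight * h_lut[addr]
    return acc

# Example: da_dot(da_lut([0.5, -1.0, 2.0]), [3, -7, 12]) == 0.5*3 - 1.0*(-7) + 2.0*12
```

For an N-tap filter the table holds 2^N entries and one lookup plus shift-accumulate is performed per input bit, so the cycle count is set by the word length rather than by multiplier delays.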

25 citations


01 Jan 1987
TL;DR: A computational strategy is proposed for maximizing the degree of parallelism at different levels of the finite element analysis process including: 1) formulation level (through the use of mixed finite element models); 2) analysis level; 3) numerical algorithm level; and 4) implementation level.
Abstract: A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on parallel numerical algorithms, performance evaluation of machines and algorithms, and parallelism in finite element computations. A computational strategy is proposed for maximizing the degree of parallelism at different levels of the finite element analysis process including: 1) formulation level (through the use of mixed finite element models); 2) analysis level (through additive decomposition of the different arrays in the governing equations into the contributions to a symmetrized response plus correction terms); 3) numerical algorithm level (through the use of operator splitting techniques and application of iterative processes); and 4) implementation level (through the effective combination of vectorization, multitasking and microtasking, whenever available).

22 citations


Journal ArticleDOI
J. Sanz, E. Hinkle
TL;DR: This paper proposes some new pipeline configurations which achieve a remarkable degree of parallelism in the computation of projection data and, in fact, of many other geometrical descriptors of digital images.
Abstract: This paper deals with the problem of computing projections of digital images. The novelty of our contribution is that we present algorithms which are suitable for implementation in general purpose image processing and image analysis pipeline architectures. No random access of the image memory is necessary. We propose some new pipeline configurations which achieve a remarkable degree of parallelism in the computation of projection data and, in fact, of many other geometrical descriptors of digital images. Fast computation of projections of digital images is not only important for extracting geometrical information from images, it also makes possible performing a large number of operations on images in Radon space, thereby reducing two-dimensional problems to a series of one-dimensional problems.
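As a simple software analogue of the pipeline constraint (strictly sequential raster access, no random addressing of image memory), the sketch below accumulates row and column projections in a single scan; it is only an illustration of the access pattern, not one of the paper's pipeline configurations.

```python
import numpy as np

def raster_projections(image):
    """Row and column projections of a 2-D image, computed in one raster
    scan: pixels are consumed a scan line at a time, the way a pipeline
    image-processing architecture would deliver them."""
    rows, cols = image.shape
    row_proj = np.zeros(rows)
    col_proj = np.zeros(cols)
    for r in range(rows):          # one scan line per step, no random access
        line = image[r]
        row_proj[r] = line.sum()   # projection onto the vertical axis
        col_proj += line           # running projection onto the horizontal axis
    return row_proj, col_proj
```

Projections at other angles can be accumulated in a similar streaming fashion by offsetting the accumulation index as the scan advances.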

21 citations


Book ChapterDOI
08 Jun 1987
TL;DR: It turns out that certain well-known standard MG methods already contain a sufficiently high degree of parallelism, which is needed to exploit both the high MG-efficiency and the full computational power of modern supercomputers.
Abstract: Multigrid (MG) methods for partial differential equations (and for other important mathematical models in scientific computing) have turned out to be optimal on sequential computers. Clearly, one wants to apply them also on vector and parallel computers in order to exploit both the high MG-efficiency (compared to classical methods) and the full computational power of modern supercomputers. For this purpose, parallel MG methods are needed. It turns out that certain well-known standard MG methods (with RB and zebra-type relaxation, as described in [25]) already contain a sufficiently high degree of parallelism.
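The parallelism of RB (red-black) relaxation comes from the checkerboard colouring: every point of one colour depends only on points of the other colour, so each half-sweep can update its whole colour class simultaneously. A minimal sketch for the 2-D Poisson model problem is given below as an assumed example; it is not taken from [25].

```python
import numpy as np

def red_black_gauss_seidel(u, f, h, sweeps=1):
    """Red-black Gauss-Seidel sweeps for the 5-point discretisation of
    -Laplace(u) = f on a square grid (boundary values kept fixed in u).
    Within each half-sweep, all points of one colour are independent,
    so the whole colour class can be updated in parallel."""
    n = u.shape[0]
    for _ in range(sweeps):
        for colour in (0, 1):                       # 0 = red, 1 = black
            for i in range(1, n - 1):
                # first interior column of this colour in row i
                j0 = 1 + ((i + 1 + colour) % 2)
                u[i, j0:n-1:2] = 0.25 * (
                    u[i - 1, j0:n-1:2] + u[i + 1, j0:n-1:2] +
                    u[i, j0-1:n-2:2] + u[i, j0+1:n:2] +
                    h * h * f[i, j0:n-1:2])
    return u
```

Zebra relaxation is the line-wise analogue: whole grid lines of one colour are relaxed simultaneously, each requiring only a tridiagonal solve.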

17 citations


Proceedings ArticleDOI
01 Dec 1987
TL;DR: The performance effects of the combination of both hardware and software techniques are considered, and the “common parallelism” extracted by the two methods is examined using new metrics.
Abstract: It has been shown that parallelism is a very promising alternative for enhancing computer performance. Parallelism, however, introduces much complexity to the programming effort. This has led to the development of automatic concurrency extraction techniques. Prior work has demonstrated that static program restructuring via compiler-based techniques provides a large degree of parallelism to the target machine. Purely hardware-based extraction techniques (without software preprocessing) have also demonstrated significant (but lesser) degrees of parallelism. This paper considers the performance effects of the combination of both hardware and software techniques. The concurrency extracted from a given set of benchmarks by each technique separately, and together, is determined via simulations and/or analysis. The "common parallelism" extracted by the two methods is thus also considered, using new metrics. The analytic techniques for predicting the performance of specific programs are also described.

15 citations


Journal ArticleDOI
TL;DR: The architecture of the proposed system, its instruction set, and the control possibilities for bus control, communication among processors and reconfiguration are described in this contribution.

Journal ArticleDOI
Lothar Nowak
TL;DR: A novel control unit, which asynchronously controls instruction execution by tokens, allows the evaluation of very complex expressions without any reference to clock cycles.
Abstract: A high-performance, general purpose processor has been designed, using various technology independent methods to improve performance. Its structure offers a large degree of parallelism and is adjusted to the application. A novel control unit, which asynchronously controls instruction execution by tokens, allows the evaluation of very complex expressions without any reference to clock cycles. The main memory communicates via 4 ports with the processor and avoids a bottleneck in accessing data. The processor performance is measured and compared with several commercial systems.

Book ChapterDOI
15 Jun 1987
TL;DR: The principle of stream-parallelism is used to discuss a general method for AND-parallelism in executing logic programs written in Prolog; the method tries to achieve a higher degree of parallelism than other dynamic methods for clauses containing shared variables.
Abstract: The principle of stream-parallelism is used to discuss a general method for AND-parallelism in executing logic programs written in Prolog. The method is entirely transparent, it delivers solutions in the same order as Prolog, but tries to achieve a higher degree of parallelism than other dynamic methods for clauses containing shared variables.

Book ChapterDOI
E. A. M. Odijk
01 Mar 1987
TL;DR: This paper reports on the approach to highly parallel computers and applications followed at Philips Research Laboratories, Eindhoven, as subproject A of Esprit project 415, the Decentralized Object-Oriented Machine, DOOM.
Abstract: This paper surveys the concepts of the Parallel Object-Oriented Language POOL and a highly parallel, general purpose computer system for execution of programs in this language: the Decentralized Object-Oriented Machine, DOOM. It reports on the approach to highly parallel computers and applications followed at Philips Research Laboratories, Eindhoven, as subproject A of Esprit project 415. The first sections present a short overview of the goals and premises of the subproject. In Section 3 the programming language POOL and its characteristics are introduced. Section 4 presents an abstract machine model for the execution of POOL programs. Section 5 describes the architecture of the DOOM system. It is a collection of self-contained computers, connected by a direct, packet-switching network. The resident operating system kernels facilitate the execution of a multitude of communicating objects, perform local management and cooperate to perform system-wide resource management. In Section 6 we introduce the applications that are being designed to demonstrate the merits of the system. These symbolic applications will be shown to incorporate a high degree of parallelism. In the last section some conclusions will be drawn.

Journal ArticleDOI
TL;DR: A special-purpose VLSI chip for solving a linear programming problem is presented; it is structured as a mesh of trees and is designed to implement the well-known simplex algorithm.
Abstract: The use of a special-purpose VLSI chip for solving a linear programming problem is presented. The chip is structured as a mesh of trees and is designed to implement the well-known simplex algorithm. A high degree of parallelism is introduced in each pivot step, which can be carried out in O(log n) time using an m × n mesh of trees having an O(mn log m log³ n) area, where m − 1 and n − 1 are the number of constraints and variables, respectively. Two variants of the simplex algorithm are also considered: the two-phase method and the revised one. The proposed chip is intended as being a possible basic block for a VLSI operations research machine.
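The parallelism exploited by the chip lies inside each pivot step: once the pivot row is scaled, every remaining tableau entry is updated independently. The plain NumPy sketch below shows one pivot step on a dense tableau as an illustration of that independence; it does not attempt to show the mesh-of-trees mapping itself.

```python
import numpy as np

def pivot(T, r, c):
    """One simplex pivot on tableau T at pivot row r, pivot column c.
    After the pivot row is normalised, all remaining entries can be
    updated simultaneously -- the O(m*n) work that a mesh of trees
    performs in O(log n) time per step."""
    T = np.asarray(T, dtype=float).copy()
    T[r] /= T[r, c]                              # normalise the pivot row
    others = np.arange(T.shape[0]) != r
    T[others] -= np.outer(T[others, c], T[r])    # clear the pivot column
    return T
```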

Proceedings ArticleDOI
01 Dec 1987
TL;DR: The parallel iterative solution of systems arising from the finite element discretization of elliptic PDEs is the problem considered here; a red-black ordering is introduced to increase the degree of parallelism of the computation.
Abstract: The parallel iterative solution of systems arising from the finite element discretization of elliptic PDEs is the problem considered here. A red-black ordering is introduced to increase the degree of parallelism of the computation. Space-time domain expansion techniques are used to partition the computation for a proposed fixed-size VLSI architecture.

Proceedings ArticleDOI
01 Dec 1987
TL;DR: Regression models indicate strong nonlinear correlation between the degree of load imbalance and job speedup and a linear effect due to CPU/IO intensity, and locality of workload is shown to be a minor but significant effect.
Abstract: It has been well established that the performance of a parallel processor computer system is affected by many design alternatives and the underlying degree of parallelism in the workload. We look at the impact of workloads which load processors in the network unevenly to observe the performance degradation. We constrain the parallel processor architecture to the family of hypercube networks. Each node is loaded with some portion of the workload composed of CPU bursts and I/O, and allowed to run at its own pace until it completes. Since message transmission preempts node processing, communication between nodes complicates the concurrent operation of the network. We vary the degree of load balance, the processing node locality and the ratio of CPU burst time to message transmission time across a generic 16-node hypercube and use total processing time speedup as the performance criterion. Regression models indicate strong nonlinear correlation between the degree of load imbalance and job speedup and a linear effect due to CPU/IO intensity. The locality of workload is shown to be a minor but significant effect. The impact of the load balance, CPU/IO intensity and locality effects on algorithm decomposition is discussed.

01 Jan 1987
TL;DR: The idea is to employ a systematic approach in partitioning an existing program into a set of large grains such that the best performance in terms of total execution time is achieved by evaluating the tradeoff between parallelism and communication cost.
Abstract: There are relative advantages and disadvantages of small-grain and large-grain parallelism. It is well established that, for MIMD machines, small-grain parallelism is not recommended because of associated excessive interprocessor communication overhead. On the other hand, the large-grain approach does not provide an adequate degree of parallelism and may not provide necessary speedup. In our work, we have adopted an optimal-grain approach such that the parallelism obtained at small-grain level is retained while minimizing the communication overhead. The idea is to employ a systematic approach in partitioning an existing program into a set of large grains such that the best performance in terms of total execution time is achieved by evaluating the tradeoff between parallelism and communication cost. To do this effectively, we introduce a model which can accurately represent the possible communication between various computational units of a program, and can measure possible computational overlap between interacting computational units. The tradeoff between parallelism and communication cost leads to an improved performance. Based on this model, software packages have been developed to accept a program written in FORTRAN, to analyze its data dependency, and to partition the program into a set of large grains. Extensive experiments conducted on EISPACK subroutines show substantial improvement in execution time on MIMD machines.
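The heart of the tradeoff can be stated in a few lines: two computational units belong in one grain exactly when the communication cost of separating them outweighs the overlap gained by running them on different processors. The toy functions below use a deliberately simplified cost model (no scheduling or contention effects) with made-up names, just to make the comparison explicit; they are not the paper's partitioning model.

```python
def time_separate(t1, t2, comm):
    """Estimated finish time when the two units are placed in different
    grains: their computations overlap, but the result must be transferred."""
    return max(t1, t2) + comm

def time_merged(t1, t2):
    """Estimated finish time when both units are merged into one grain:
    no interprocessor message, but no overlap either."""
    return t1 + t2

def should_merge(t1, t2, comm):
    # Merge exactly when communication outweighs the overlap it buys.
    return time_merged(t1, t2) <= time_separate(t1, t2, comm)

# Example: should_merge(5, 5, 2) is False (overlap wins),
#          should_merge(5, 1, 4) is True  (communication dominates).
```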

Patent
16 May 1987
TL;DR: In this paper, the authors proposed a method to measure the degree of parallelism of a matter to be measured without turning over the same and without receiving the effect of dust etc., by arranging Fizeau type interferometers to the front and back sides of the matter in opposed relation to each other.
Abstract: PURPOSE:To make it possible to simultaneously and accurately measure the degree of parallelism of a matter to be measured without turning over the same and without receiving the effect of dust etc., by arranging Fizeau type interferometers to the front and back sides of the matter to be measured in opposed relation to each other. CONSTITUTION:In using the titled apparatus, opposed reference plates 16, 26 are set as a first stage while an interference fringe is observed so as to bring the degree of parallelism between both plates to zero. Next, a plane-parallel plate 30 being a matter to be measured is set at a predetermined position. In this setting, the parallel plane plate 30 is supported by a support mechanism constituted so as to be three-dimensionally movable on the common optical axis of first and second Fizeau type interferometers. In such a state that the plane-parallel plate 30 is held as it is, the degree of parallelism between said plate 26 and the back surface 32 of the matter to be measured is calculated using the second Fizeau type interferometer. By the above mentioned measuring procedure, the degree of parallelism of the front surface 31 and back surface 32 of the plane-parallel plate 30 can be easily measured.

Journal ArticleDOI
TL;DR: A review of the various approaches towards tolerating hardware faults in multiprocessor systems and an analysis of the state-of-the-art is given which points out the major aspects of fault-tolerance in such architectures.
Abstract: Multiprocessor systems which afford a high degree of parallelism are used in a variety of applications. The extremely stringent reliability requirement has made the provision of fault-tolerance an important aspect in the design of such systems. This paper presents a review of the various approaches towards tolerating hardware faults in multiprocessor systems. It emphasizes the basic concepts of fault-tolerant design and the various problems to be taken care of by the designer. An in-depth survey of the various models, techniques and methods for fault diagnosis is given. Further, we consider the strategies for fault-tolerance in specialized multiprocessor architectures which have the ability of dynamic reconfiguration and are suited to VLSI implementation. An analysis of the state-of-the-art is given which points out the major aspects of fault-tolerance in such architectures.

Journal ArticleDOI
TL;DR: A truncated version of interval arithmetic cyclic reduction is introduced to reduce the computation time; instead of actually truncating steps, they are replaced by easily computable intervals.
Abstract: In many numerical problems the solution of tridiagonal systems of equations consumes an important part of the computation time. For their efficient solution on vector or parallel computers the recursive Gauss algorithm has often to be replaced by a method with a higher degree of parallelism. Among other methods cyclic reduction has been widely discussed. In the present paper we discuss some aspects of the numerical treatment of tridiagonal systems with interval coefficients which arise, for example, as part of interval arithmetic Newton-like methods combined with a “fast Poisson solver” [8, 9]. We have discussed interval arithmetic cyclic reduction in [10]. Here we introduce a truncated version dedicated to reduce the computation time. In contrast to the non-interval case we have to preserve inclusion properties. Instead of really truncating steps, we replace them by easily computable intervals. In contrast to the non-interval case we can “truncate” both the reduction and the solution phase.
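For reference, the sketch below shows ordinary (non-interval) cyclic reduction for a tridiagonal system of size n = 2^k − 1; the interval version discussed in the paper has the same structure but carries interval coefficients and must preserve inclusion when steps are "truncated". The code is a generic textbook-style reconstruction, not the authors' implementation.

```python
import numpy as np

def cyclic_reduction(a, b, c, d):
    """Plain (non-interval) cyclic reduction for a tridiagonal system
    a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  size n = 2**k - 1.
    a[0] and c[-1] are ignored.  All updates within one level are
    independent of each other, which is the source of the parallelism."""
    n = len(b)
    a, b, c, d = (np.array(v, dtype=float) for v in (a, b, c, d))
    a[0] = c[-1] = 0.0
    levels = int(round(np.log2(n + 1)))

    # Reduction phase: each level eliminates every other remaining unknown.
    stride = 1
    for _ in range(levels - 1):
        for i in range(2 * stride - 1, n, 2 * stride):
            al = -a[i] / b[i - stride]
            ga = -c[i] / b[i + stride]
            b[i] += al * c[i - stride] + ga * a[i + stride]
            d[i] += al * d[i - stride] + ga * d[i + stride]
            a[i] = al * a[i - stride]
            c[i] = ga * c[i + stride]
        stride *= 2

    # Back-substitution phase: one independent batch of solves per level.
    x = np.zeros(n)
    while stride >= 1:
        for i in range(stride - 1, n, 2 * stride):
            xl = x[i - stride] if i - stride >= 0 else 0.0
            xr = x[i + stride] if i + stride < n else 0.0
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i]
        stride //= 2
    return x
```

All updates within one reduction or back-substitution level are mutually independent, which is what gives the method a higher degree of parallelism than the recursive Gauss (Thomas) algorithm.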

Journal ArticleDOI
TL;DR: An algorithm with a high degree of parallelism for the mean-squared-error parameter estimation of non-causal image models is presented; it can be performed on a linear array of M processors, with M being the order of the model.

Book ChapterDOI
01 Jul 1987
TL;DR: A message-passing approach is proposed to implement the deduction in Omega, and an extension to Common LISP is suggested to provide the necessary message-passing primitives.
Abstract: Omega is a description system for knowledge embedding which enables representation of knowledge in conceptual taxonomies. Reasoning on this knowledge can be carried out by a process called taxonomic reasoning, which is based on operations of traversing the lattice of descriptions. This process can be performed with a high degree of parallelism, by spreading the activities among the nodes of the lattice. Reasoning strategies expressed at the metalevel of Omega can be used to tailor deductions to specific applications. A message-passing approach is proposed to implement the deduction in Omega. An extension to Common LISP is suggested to provide the necessary message-passing primitives.

Proceedings ArticleDOI
01 Feb 1987
TL;DR: It is concluded that dynamic process allocation does not seem efficient for logic programs and a distributed computation model based on static process allocation is proposed, which improves locality, reduces the requirements of interprocess communication and provides opportunities for machine learning.
Abstract: One of the most appealing characteristics of logic programs is the natural and abundant non-determinism of execution. This non-determinism allows a non-conventional computer to pursue highly parallel computation. Considerable effort has been expended exploiting this potential. The AND/OR Process Model in Conery's dissertation [1] and Concurrent Prolog by Shapiro [2] are two famous pioneering efforts in this area. They are frequently referenced in the literature as bases of improvement and comparison. In these models and their successors, a logic program is solved by a set of conceptually tree-structured processes. Each process is assumed to have a separate copy of the whole program and dynamically spawns dependent processes to solve subgoals. Eventually, a process can solve its goal by simply matching the goal with a unit clause in the program and reporting the solution (or a failure message) to its parent process. While these models faithfully create dynamic processes to solve parallel literals and achieve a high degree of parallelism, they also suffer several difficulties. First of all, they are all based on the traditional view that a program is a stream of machine instructions. Notably, the knowledge base semantics of a logic program is not considered. A logic program could be modified at run-time, which is semantically understood as knowledge base maintenance. The assert/retract “predicates” in Prolog are simple but typical examples of knowledge base maintenance. For an intelligent system, knowledge base maintenance means learning. These models fail to support this requirement effectively. Moreover, more parallelism does not necessarily mean greater speed. In a multi-processing environment, the overhead of control and communication to distribute subtasks, to coordinate them and to collect results is typically very high. Only when the size of a subtask is larger than the overhead can we consider such distribution favorable. A logic program has poor locality. It also requires each process to remember a large amount of administrative information even after a subgoal is solved, and it demands excessive interprocess communication to reference variable binding information. These indicate that there is a large overhead associated with the parallel execution based on these models. But each subtask assigned to a process is comparatively simple: the unification between a goal and a clause is just a string matching process. We summarize these observations and conclude that dynamic process allocation does not seem efficient for logic programs. A distributed computation model based on static process allocation is proposed. A logic program is partitioned prior to execution, and those logically related program clauses are physically grouped together and allocated to the same process. Thus, the relationships among processes are actually relationships among their knowledge bases, which are static and known at the time of process allocation. Each process handles all the requests of knowledge base maintenance or knowledge deduction (normal execution) directed to its local knowledge base. Each process references only its local knowledge base and communicates with a few pre-determined processes whose knowledge bases are logically related. Newly derived knowledge in each process may be inserted into the local knowledge base to improve its performance in the future as a simple level of learning. Inconsistent information in a knowledge base may also be resolved locally as a deeper level of learning.
This model increases the power and intelligence of each process and thus improves locality, reduces the requirements of interprocess communication and provides opportunities for machine learning. A prototype system which simulates this static process allocation model for parallel interpretation of subset Prolog programs is being implemented in Ada. This prototype system may be used for further studying different schemes of program partitioning, allocation and the resulting influences on machine learning and system performance.