
Showing papers on "Degree of parallelism published in 1988"


Proceedings ArticleDOI
01 Jun 1988
TL;DR: The authors describe the HERCULES system and address the hardware description problem, behavioral synthesis, optimization using a method called the reference stack and the mapping of behavior onto a structure, allowing varying degrees of parallelism in the resulting hardware.
Abstract: This paper presents an approach to high-level synthesis of VLSI processors and systems. Synthesis consists of two phases: behavioral synthesis, which involves implementation-independent representations, and structural synthesis, which relates to the transformation of a behavior into an implementation. We describe HERCULES, a system for high-level synthesis developed at Stanford University. In particular, we address the hardware description problem, behavioral synthesis and optimization using a method called the reference stack, and the mapping of behavior onto a structure. We present a model for control based on sequencing graphs that supports multiple threads of execution flow, allowing varying degrees of parallelism in the resulting hardware. Results are then presented for three examples: the MC6502, the Intel 8251, and FRISC, a 16-bit microprocessor.

107 citations


Journal ArticleDOI
01 Apr 1988
TL;DR: This work proposes a modification of the partition method of Wang which reduces the amount of data transport considerably, without affecting the computational complexity and which has about the same degree of parallelism as the original version.
Abstract: The partition method of Wang, for the solution of tridiagonal linear systems, is analysed with regard to data transport between the processors of a parallel (local-memory) computer. We propose a modification which reduces the amount of data transport considerably without affecting the computational complexity, and which has about the same degree of parallelism as the original version. We also discuss the effects of this modification on a generalized version for banded systems. The parallel solution of a bidiagonal system is considered.
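To illustrate the partitioning idea in a simpler setting (a sketch only, not Wang's algorithm; the bidiagonal case keeps it short, and block layout and names are ours), the code below solves a lower-bidiagonal system b[i]*x[i] + a[i]*x[i-1] = d[i] in three phases: each block independently reduces its rows to the affine form x[i] = u[i] + v[i]*x[s-1], only one (u, v) pair per block has to cross a processor boundary, and the final fill-in is again fully parallel.

# Minimal sketch (not Wang's partition method itself): a lower-bidiagonal
# solve split into p blocks.  Phases 1 and 3 are independent per block;
# only phase 2 is sequential, and only one coefficient pair per block
# needs to be communicated, which is the data-transport saving at issue.
def block_bidiagonal_solve(a, b, d, p):
    n = len(d)
    bounds = [n * k // p for k in range(p + 1)]          # block boundaries
    u = [0.0] * n
    v = [0.0] * n
    # Phase 1 (parallel per block): express x[i] = u[i] + v[i]*x[s-1],
    # where s is the first index of the block (v = 0 for the first block).
    for k in range(p):
        s, e = bounds[k], bounds[k + 1]
        for i in range(s, e):
            if i == s:
                u[i], v[i] = d[i] / b[i], (-a[i] / b[i] if i > 0 else 0.0)
            else:
                u[i] = (d[i] - a[i] * u[i - 1]) / b[i]
                v[i] = -a[i] * v[i - 1] / b[i]
    # Phase 2 (sequential, only p steps): propagate the block boundary values.
    x = [0.0] * n
    prev = 0.0
    for k in range(p):
        e = bounds[k + 1] - 1
        prev = u[e] + v[e] * prev                        # last unknown of block k
        x[e] = prev
    # Phase 3 (parallel per block): fill in the interior unknowns.
    for k in range(p):
        s, e = bounds[k], bounds[k + 1]
        left = x[s - 1] if k > 0 else 0.0
        for i in range(s, e - 1):
            x[i] = u[i] + v[i] * left
    return x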

37 citations


Journal ArticleDOI
TL;DR: It is shown that convergence is achieved with a rate that does not depend on the polynomial degree of the spectral solution; the methods can be implemented effectively on multiprocessor systems due to their high degree of parallelism.
Abstract: We propose and analyze several block iteration preconditioners for the solution of elliptic problems by spectral collocation methods in a region partitioned into several rectangles. It is shown that convergence is achieved with a rate that does not depend on the polynomial degree of the spectral solution. The iterative methods presented here can be implemented effectively on multiprocessor systems due to their high degree of parallelism.
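For orientation, an iteration of this kind can be written as a preconditioned Richardson step (generic notation, not taken from the paper): with L_N the spectral collocation operator of polynomial degree N and P the block preconditioner,

u^{k+1} = u^{k} + \omega\, P^{-1}\bigl(f - L_N u^{k}\bigr),
\qquad \|u^{k} - u\| \le C\,\rho^{k}\,\|u^{0} - u\|,\quad \rho < 1,

where the claimed result is that the contraction factor ρ is bounded independently of N, and the block structure of P is what makes applying P^{-1} parallel across the rectangles.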

37 citations


Journal ArticleDOI
F.W. Burton1
TL;DR: The author's approach is to determine how much storage would have been available to a task in a sequential system, and to ensure that at least as much storage is available to the task when it is executed in a distributed system.
Abstract: Many parallel algorithms, particularly divide-and-conquer algorithms, may be structured as dynamic trees of tasks. In general, as parallelism increases, storage requirements also increase. A sequential program keeps storage requirements small by finishing one task (or procedure invocation) before going on to another, so that the entire task tree is never in existence at any one time. This corresponds to a depth-first expansion of a tree of tasks. On the other hand, a breadth-first expansion often produces a higher degree of parallelism, but also may produce an exponential growth in storage requirements. The resources (processors and memory) of a given system can be matched by alternating between depth-first and breadth-first expansion. The author's approach is to determine how much storage would have been available to a task in a sequential system, and to ensure that at least as much storage is available to the task when it is executed in a distributed system.
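A minimal sketch of the general idea (illustrative only, not Burton's scheme; the budget test and task representation are made up): expand the task tree breadth-first while a storage budget permits, otherwise fall back to depth-first expansion so the live frontier stays bounded.

# Illustrative only: alternate between breadth-first expansion (more
# parallelism, larger live frontier) and depth-first expansion (bounded
# storage), switching on a simple storage budget.  children() and is_leaf()
# stand in for an arbitrary divide-and-conquer step.
from collections import deque

def expand(root, children, is_leaf, max_live_tasks):
    frontier = deque([root])          # tasks that exist but are not finished
    results = []
    while frontier:
        if len(frontier) < max_live_tasks:
            task = frontier.popleft() # breadth-first: oldest task, widens the tree
        else:
            task = frontier.pop()     # depth-first: newest task, bounds storage
        if is_leaf(task):
            results.append(task)      # a real system would combine results here
        else:
            frontier.extend(children(task))
    return results

# Example: splitting an interval (start, length) until pieces are unit-sized.
split = lambda t: [(t[0], t[1] // 2), (t[0] + t[1] // 2, t[1] - t[1] // 2)]
done  = lambda t: t[1] <= 1
print(len(expand((0, 16), split, done, max_live_tasks=4)))   # -> 16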

27 citations


Proceedings ArticleDOI
11 Apr 1988
TL;DR: A constrained maximum-likelihood estimator is derived by incorporating a rotationally invariant roughness penalty proposed by I.J. Good (1981) into the likelihood functional, which leads to a set of nonlinear differential equations the solution of which is a spline-smoothing of the data.
Abstract: A constrained maximum-likelihood estimator is derived by incorporating a rotationally invariant roughness penalty proposed by I.J. Good (1981) into the likelihood functional. This leads to a set of nonlinear differential equations the solution of which is a spline-smoothing of the data. The nonlinear partial differential equations are mapped onto a grid via finite differences, and it is shown that the resulting computations possess a high degree of parallelism as well as locality in the data-passage, which allows an efficient implementation on a 48-by-48 mesh-connected array of NCR GAPP processors. The smooth reconstruction of the intensity functions of Poisson point processes is demonstrated in two dimensions.
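In this setting the estimate maximizes a penalized Poisson log-likelihood of roughly the following form (our paraphrase of a Good-type roughness penalty; α is the smoothing parameter and Ω the observation window, not notation from the paper):

\hat{\lambda} \;=\; \arg\max_{\lambda \ge 0}\; \sum_{i} \log \lambda(x_i)
\;-\; \int_{\Omega} \lambda(x)\,dx
\;-\; \alpha \int_{\Omega} \bigl|\nabla \sqrt{\lambda(x)}\bigr|^{2}\,dx ,

whose Euler-Lagrange condition is the nonlinear differential equation that gets discretized on the processor grid.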

16 citations


Journal ArticleDOI
TL;DR: A novel partitioning strategy is outlined for maximizing the degree of parallelism on computers with a small number of powerful processors.
Abstract: A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on mechanisms for parallel processing, construction and implementation of parallel numerical algorithms, performance evaluation of parallel processing machines and numerical algorithms, and parallelism in finite element computations. A novel partitioning strategy is outlined for maximizing the degree of parallelism on computers with a small number of powerful processors.

15 citations


Journal ArticleDOI
17 May 1988
TL;DR: The GREEDY network, a new interconnection network (IN) for tightly coupled multiprocessors (TCMs), is presented, and an original and cost-effective hardware synchronization mechanism is proposed; together they allow a very high degree of parallelism to be achieved at execution time on a very large spectrum of loops.
Abstract: To satisfy the growing need for computing power, a high degree of parallelism will be necessary in future supercomputers. Up to the late 70s, supercomputers were either multiprocessors (SIMD-MIMD) or pipelined monoprocessors. Current commercial products combine these two levels of parallelism. Effective performance will depend on the spectrum of algorithms which is actually run in parallel. In a previous paper [Je86], we presented the DSPA processor, a pipeline processor which performs efficiently on a very large family of loops. In this paper, we present the GREEDY network, a new interconnection network (IN) for tightly coupled multiprocessors (TCMs). We then propose an original and cost-effective hardware synchronization mechanism. When DSPA processors are connected with a shared memory through a GREEDY network and synchronized by our synchronization mechanism, a very high degree of parallelism may be achieved at execution time on a very large spectrum of loops, including loops where the independence of successive iterations cannot be checked at compile time, as, e.g., loop 1: DO 1 I=1,N; 1 A(P(I)) = A(Q(I)).

8 citations


Journal ArticleDOI
TL;DR: Unification is a basic component of Prolog processing, but its parallel processing has not been well studied, because the number of arguments, which corresponds to the degree of unification parallelism, is small, and a consistency check operation is necessary after a parallel unification operation.
Abstract: Unification is a basic component of Prolog processing. However, its parallel processing has not been well studied, because the number of arguments, which corresponds to the degree of unification parallelism, is small, and a consistency check operation is necessary after a parallel unification operation. To address these issues, we have implemented the following ideas: (1) enhancing the degree of parallelism by decomposing a compound term into a functor and its arguments at compile time; (2) allocating the decomposed unification processing to multiple processor units (PUs) at run time; (3) decreasing the number of consistency checks by compile-time clustering and reducing the overhead by embedding the consistency check operations into the unification processing; and (4) stopping the operations of the other processors if a unification fails. To clarify the effect, we have developed and evaluated a Prolog processor on a multiprocessor system. The results show that, statically, the decomposition of compound terms makes the number of arguments 3.2 on average even after clustering, and that, dynamically: (1) the unification parallelism yields a 41 percent speedup, and the effect is evident even with a small number of processors; (2) the compile-time clustering makes the consistency check unnecessary; (3) the stop operation of processors running in parallel attains a 0.5 - 6 percent (and 10 percent for some problems) performance improvement; and (4) the processing of the clause head occupies 60 - 70 percent of dynamic microsteps and is an important target of parallel processing.
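A rough sketch of idea (1) plus the consistency check (illustrative Python, not the authors' hardware; term representation and names are ours): a compound term is decomposed into its functor and arguments so that the argument unifications could run on separate PUs, after which bindings produced independently for the same variable must be checked for consistency.

# Illustrative only: decompose a compound term into functor + arguments so the
# argument unifications could run on separate processor units; bindings made
# independently for the same variable are then checked for consistency.
def unify_arg(pattern, value, binding_out):
    """Unify one argument slot; variables are strings starting with '?'."""
    if isinstance(pattern, str) and pattern.startswith('?'):
        binding_out.append((pattern, value))
        return True
    return pattern == value

def unify_compound(functor, args, term):
    goal_functor, goal_args = term
    if goal_functor != functor or len(goal_args) != len(args):
        return None                               # head unification fails early
    bindings = []
    # In the proposed scheme each of these calls would run on its own PU.
    if not all(unify_arg(p, v, bindings) for p, v in zip(args, goal_args)):
        return None
    # Consistency check: the same variable must not receive two different values.
    env = {}
    for var, val in bindings:
        if var in env and env[var] != val:
            return None
        env[var] = val
    return env

# p(?X, g, ?X) unified with p(a, g, a) succeeds; with p(a, g, b) it fails
# only at the consistency check, after both argument unifications succeeded.
print(unify_compound('p', ['?X', 'g', '?X'], ('p', ['a', 'g', 'a'])))  # {'?X': 'a'}
print(unify_compound('p', ['?X', 'g', '?X'], ('p', ['a', 'g', 'b'])))  # None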

7 citations


Proceedings ArticleDOI
24 May 1988
TL;DR: A residue arithmetic circuit based on multiple-valued bidirectional current-mode MOS technology is proposed, using radix-five signed-digit full adders to obtain a high degree of parallelism and multiple-operand addition so that high-speed arithmetic operation can be achieved.
Abstract: A residue arithmetic circuit based on multiple-valued bidirectional current-mode MOS technology is proposed. Each residue digit is represented by a multiple-valued coding suitable for highly parallel computation. Using this coding, mod m_i multiplication can be performed simply by a shift operation. In mod m_i addition, radix-five signed-digit full adders are used to obtain a high degree of parallelism and multiple-operand addition, so that high-speed arithmetic operation can be achieved. A novel parallel scaling algorithm is discussed. A mod-seven three-operand multiply-adder is designed for an integrated circuit based on 10-μm CMOS technology.
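The digit-parallel character of residue arithmetic (independent of the circuit technology used in the paper; the moduli below are arbitrary examples) can be illustrated as follows.

# Residue number system illustration: each residue digit is processed
# independently, so additions and multiplications on the digits proceed
# fully in parallel with no carries between moduli.
MODULI = (5, 7, 8)                     # pairwise coprime, range = 5*7*8 = 280

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

def rns_mul(a, b):
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(r):                       # Chinese remainder reconstruction by search
    return next(x for x in range(280) if to_rns(x) == tuple(r))

a, b = 23, 9
print(from_rns(rns_add(to_rns(a), to_rns(b))))   # 32
print(from_rns(rns_mul(to_rns(a), to_rns(b))))   # 207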

6 citations


01 Jan 1988
TL;DR: A stabilized parallel algorithm for direct-form recursive filters is obtained using a new method of derivation in the Z domain; the algorithm is regular and modular, so very efficient VLSI architectures can be constructed to implement it.
Abstract: A stabilized parallel algorithm for direct-form recursive filters is obtained using a new method of derivation in the Z domain. The algorithm is regular and modular, so very efficient VLSI architectures can be constructed to implement it. The degree of parallelism in these implementations can be chosen freely, and is not restricted to be a power of two.
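The flavor of such a restructuring can be seen on a first-order recursion (a standard look-ahead transformation, shown only as an illustration; the paper treats general direct-form filters): unrolling y[n] = a*y[n-1] + x[n] by L steps gives y[n] = a^L*y[n-L] + sum_{k=0}^{L-1} a^k*x[n-k], so L interleaved recursions can run in parallel for any chosen L, not just a power of two.

# Look-ahead illustration for y[n] = a*y[n-1] + x[n] (zero initial state).
# The transformed recursion advances in steps of L, so L interleaved streams
# can be computed in parallel; L is a free choice (the "degree of parallelism").
def iir_sequential(a, x):
    y, prev = [], 0.0
    for xn in x:
        prev = a * prev + xn
        y.append(prev)
    return y

def iir_lookahead(a, x, L):
    n = len(x)
    y = [0.0] * n
    for n0 in range(n):
        prev = y[n0 - L] if n0 >= L else 0.0
        acc = (a ** L) * prev          # y[n0] = a^L*y[n0-L] + sum a^k*x[n0-k]
        for k in range(L):
            if n0 - k >= 0:
                acc += (a ** k) * x[n0 - k]
        y[n0] = acc
    return y

x = [1, 2, 0, -3, 4, 1, 0, 2]
print(iir_sequential(0.5, x))
print(iir_lookahead(0.5, x, L=3))      # identical output; any L >= 1 works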

5 citations


Book ChapterDOI
E. Clementi1, D. Logan1
01 Nov 1988
TL;DR: This chapter focuses on parallel processing with the loosely coupled array of processors system, which is a multiple instruction stream/multiple data stream (MIMD) system, in the form of a distributed network of nodes.
Abstract: Publisher Summary This chapter focuses on parallel processing with the loosely coupled array of processors system. The system reviewed in the chapter is a multiple instruction stream/multiple data stream (MIMD) system, in the form of a distributed network of nodes. The distributed nature of the system allows appropriate utilization of the parallelism of the code; that is, the “degree of parallelism” of the software is matched by the hardware. An important advantage of high-level parallelism, as in high-level programming languages, is portability. A high-level parallel code may be executed on any system with a rudimentary communication protocol. In addition, improvements at the instruction level can be independently pursued without disturbing the setup of the parallel algorithm. The loosely coupled array of processors (LCAP)-1 system, hosted by either an IBM 4341 or 4381, runs under the IBM Virtual Machines/System Product (VM/SP) operating system. VM/SP is a time-sharing system in which jobs run on virtual machines (VM) created by the system; these VMs simulate real computing systems.

Journal ArticleDOI
TL;DR: Several parallel algorithms for image edge relaxation on array processors with different numbers of processing elements (PEs) connected by a mesh or hypercube network are described.

Book ChapterDOI
01 Oct 1988
TL;DR: A semantic model for distributed real-time programs that preserves the basic properties of process autonomy and considers their nondeterministic execution in a dense time domain, and supports an arbitrary degree of parallelism.
Abstract: A semantic model for distributed real-time programs is proposed. The semantics is state-based and compositional. It preserves the basic properties of process autonomy and considers their nondeterministic execution in a dense time domain. The internal actions and communications of a command are treated in a uniform way to obtain a simple semantic domain. An ordering on this domain for information approximation is developed. The absence of global objects in the semantics of a command makes it possible for modular changes to adapt the model for different communication mechanisms and different execution environments. To illustrate this, we show how process executions can be modelled in an environment with limited processors. The proposed semantics models termination, failure, divergence, deadlock, and starvation, and supports an arbitrary degree of parallelism.

01 Apr 1988
TL;DR: A semantic model for developing and justifying specifications of communicating real-time processes is proposed and models termination, failure, divergence, deadlock, and starvation, and supports an arbitrary degree of parallelism.
Abstract: A semantic model for developing and justifying specifications of communicating real-time processes is proposed. The semantics is state-based and compositional. The basic semantic objects are timed-observations of values of program variables, where time is assumed to be in the domain of real numbers. The internal actions and communications of a command are treated in a uniform way to obtain a simple semantic domain. An ordering on this domain for information approximation is developed. The proposed semantics models termination, failure, divergence, deadlock, and starvation, and supports an arbitrary degree of parallelism.

Proceedings ArticleDOI
25 Oct 1988
TL;DR: This paper describes the problem of 2-D state-space filter algorithms and presents some high-speed VLSI implementations; comparisons among the different architectures are given in terms of hardware complexity, throughput rate, latency and efficiency.
Abstract: This paper describes the problem of 2-D state-space filter algorithms and presents some high-speed VLSI implementations. State-space filters are known to be capable of minimizing finite-word-length effects, but at the cost of an increased amount of computation. By exploiting concurrency in two-dimensional state-space systems, the following speed-up architectures are obtained. The local speed-up processors realize the matrix-vector multiplications and decrease the processing time for each pixel. The global speed-up structures in addition use the inherent spatial concurrency and decrease the total processing time in a global sense. These architectures feature a high degree of parallelism and pipelining. They can work on multiple columns or multiple lines of an image concurrently. The throughput rate can be up to one column or one line of the image per clock time. Another high-speed architecture, based on the 2-D block-state update technique, is then presented. It is shown that the throughput rate can be adjusted by varying the block size. Finally, comparisons among the different architectures are given in terms of hardware complexity, throughput rate, latency and efficiency.
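A common form of such a 2-D state-space (Roesser-type) model, given here only to fix ideas about the matrix-vector products being parallelized (the paper's exact model may differ), is

\begin{aligned}
\begin{bmatrix} x^{h}(i+1,j) \\ x^{v}(i,j+1) \end{bmatrix}
&= \begin{bmatrix} A_{1} & A_{2} \\ A_{3} & A_{4} \end{bmatrix}
   \begin{bmatrix} x^{h}(i,j) \\ x^{v}(i,j) \end{bmatrix}
 + \begin{bmatrix} B_{1} \\ B_{2} \end{bmatrix} u(i,j), \\
y(i,j) &= \begin{bmatrix} C_{1} & C_{2} \end{bmatrix}
   \begin{bmatrix} x^{h}(i,j) \\ x^{v}(i,j) \end{bmatrix} + D\,u(i,j),
\end{aligned}

where x^h and x^v are the horizontal and vertical state vectors; each pixel update is a set of matrix-vector products, which is what the local speed-up processors parallelize.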

Journal ArticleDOI
17 May 1988
TL;DR: An efficient mapping from a tree structure into a pipelined array of 2 log N stages is presented for processing an N×N image, and the identification of the information-growing property inherent in feature extraction algorithms allows bit-level concurrency to be exploited in the architectural design.
Abstract: Geometric feature extraction can be characterized as a computationally intensive task in the environment of real-time automated vision systems, requiring algorithms with a high degree of parallelism and pipelining under the raster-scan I/O constraint. Using divide-and-conquer techniques, many feature extraction algorithms have been formulated as a pyramid and then as a binary tree structure. An efficient mapping from a tree structure into a pipelined array of 2 log N stages is presented for processing an N×N image. In the proposed mapping structure, the identification of the information-growing property inherent in feature extraction algorithms allows us to exploit bit-level concurrency in the architectural design. Accordingly, the design of each staged pipelined processor is simplified. A single VLSI chip which can generate (p+1)(q+1) moments concurrently in real-time applications is reported. This chip possesses a hardware complexity of O(pq(p+q)log₂N), where p, q stand for the orthogonal orders of the moment. This hardware complexity is an improvement over other reported methods of O(pq(p+q)²log₂N).
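For reference, the quantities being generated are the standard two-dimensional geometric moments of the N×N image f (definition included here for convenience; the orders p, q run up to the chip's limits):

m_{pq} \;=\; \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} x^{p}\, y^{q}\, f(x,y),

and the separable weights x^p y^q allow the partial sums to be accumulated as the image arrives in raster order, which is what makes the pipelined, bit-serial organization possible.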

01 Jan 1988
TL;DR: In this article, a graph-theoretic approach is used to derive asymptotically optimal algorithms for parallel Gaussian elimination on SIMD/MIMD computers with a shared memory system.
Abstract: This paper uses a graph-theoretic approach to derive asymptotically optimal algorithms for parallel Gaussian elimination on SIMD/MIMD computers with a shared memory system. Given a problem of size n and using p = αn processors, where α ≤ 2/√43 ≈ 0.305, the asymptotically optimal algorithms have efficiency e_α = 1/(1 + α³) ≥ 0.972. This evidences the high degree of parallelism that can be achieved.
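As a quick arithmetic check of the bound as reconstructed above (the exact constants should be confirmed against the original paper):

\alpha = \frac{2}{\sqrt{43}} \approx 0.305, \qquad
e_{\alpha} = \frac{1}{1 + \alpha^{3}} \approx \frac{1}{1.0284} \approx 0.972 .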

Book ChapterDOI
01 Jan 1988
TL;DR: The field of Computer Vision is characterized by the need to process very large amounts of data in a time that, for many applications, is extremely short; hence the demand for computer arrays whose structure reflects the problem's structure, and for powerful tools that allow an optimal mapping of logical to physical architectures.
Abstract: The field of Computer Vision is characterized by the need to process very large amounts of data in a time that, for many applications, is extremely short. Moreover, many of the algorithms proposed for solving the so-called low-level vision tasks exhibit a high degree of parallelism, while, as one ascends to higher perception levels, it becomes necessary to resort to complex reasoning schemes, where information derived from a variety of sophisticated computations, involving sequential processing, has to be combined with a knowledge base. In any case, everything has to be performed fast enough to interact with changes in the real world. Hence the demand for computer arrays whose structure reflects the problem's structure, and for powerful tools that allow an optimal mapping of logical to physical architectures.

Journal ArticleDOI
TL;DR: This work describes a parallel rendering algorithm for shared memory MIMD machines which takes advantage of image coherence to reduce computation and examines the possible synchronization bottlenecks by statically assigning different numbers of CPUs to sections of the screen.
Abstract: Fractal surfaces are a useful modeling technique for terrain in computer graphics. Although an algorithm exists for ray tracing (Mandelbrot) fractal surfaces, the technique is computationally very expensive. The large degree of parallelism inherent in the problem suggests the use of parallel architectures for generating these images. We describe a parallel rendering algorithm for shared-memory MIMD machines which takes advantage of image coherence to reduce computation. This algorithm has, on a Sequent Balance 2100 with 20 processors, demonstrated a near-linear speedup. We examine the possible synchronization bottlenecks by statically assigning different numbers of CPUs to sections of the screen.
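The static screen assignment mentioned at the end can be pictured with a sketch like the following (illustrative only; the strip layout, CPU counts and round-robin dealing are ours, not the paper's):

# Illustrative static screen partitioning: divide the image into horizontal
# strips and assign an (uneven) number of CPUs to each strip; within a strip,
# scanlines are dealt out to that strip's CPUs round-robin.
def assign_scanlines(height, cpus_per_strip):
    strips = len(cpus_per_strip)
    rows_per_strip = height // strips
    schedule = {}                       # cpu id -> list of scanline indices
    cpu_base = 0
    for s, ncpu in enumerate(cpus_per_strip):
        rows = range(s * rows_per_strip, (s + 1) * rows_per_strip)
        for i, row in enumerate(rows):
            schedule.setdefault(cpu_base + i % ncpu, []).append(row)
        cpu_base += ncpu
    return schedule

# 3 strips of a 12-row screen; the busy middle strip gets 3 of the 6 CPUs.
for cpu, rows in sorted(assign_scanlines(12, [2, 3, 1]).items()):
    print(cpu, rows)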

Proceedings Article
09 Mar 1988
TL;DR: In this article, a high-performance general purpose processor was designed, using various technology independent methods to improve performance, including a control unit which asynchronously controls instruction execution by tokens, allowing the evaluation of very complex expressions without any reference to clock cycles.
Abstract: A high-performance, general purpose processor has been designed, using various technology independent methods to improve performance. Its structure offers a large degree of parallelism and is adjusted to the application. A novel control unit, which asynchronously controls instruction execution by tokens, allows the evaluation of very complex expressions without any reference to clock cycles. The main memory communicates via 4 ports with the processor and avoids a bottleneck in accessing data. The processor performance is measured and compared with several commercial systems.

Journal ArticleDOI
01 Oct 1988
TL;DR: DeSPOT is an algorithm for the dynamic distribution of such non-uniform tasks to achieve automatic load balancing on a distributed-memory hypercube multiprocessor, and its performance characteristics are described.
Abstract: Large-scale scientific applications such as weather modelling and continuous simulation require the orders-of-magnitude performance improvement available with the new generation of parallel vector supercomputers such as the Floating Point Systems T Series. Many of these applications exhibit a high degree of parallelism, much of which can be expressed as computational tasks which are of varying size and degrees of dependence on one another, and can be partially ordered for execution. DeSPOT (A Distributed Self-Scheduler for Partially Ordered Tasks) is an algorithm for the dynamic distribution of such non-uniform tasks to achieve automatic load balancing on a distributed-memory hypercube multiprocessor. This paper describes the DeSPOT algorithm and presents its performance characteristics for various test cases using timing results on the FPS T 20.
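The kind of self-scheduling of partially ordered, non-uniform tasks that such a scheduler performs can be sketched as follows (a generic list-scheduling illustration, not the DeSPOT algorithm itself; the task sizes, dependences and two-processor setting are made up):

# Generic illustration of self-scheduling partially ordered, non-uniform tasks:
# a task becomes ready when all its predecessors have finished, and an idle
# processor always grabs the next ready task (dynamic load balancing).
import heapq

def schedule(durations, deps, num_procs):
    remaining = {t: len(d) for t, d in deps.items()}
    ready = [t for t, c in remaining.items() if c == 0]
    succs = {t: [] for t in deps}
    for t, d in deps.items():
        for p in d:
            succs[p].append(t)
    running = []                             # heap of (finish_time, proc, task)
    free_procs, finish = list(range(num_procs)), {}
    clock = 0.0
    while ready or running:
        while ready and free_procs:          # idle processors grab ready tasks
            t, p = ready.pop(), free_procs.pop()
            heapq.heappush(running, (clock + durations[t], p, t))
        clock, p, t = heapq.heappop(running) # next task to complete
        finish[t] = clock
        free_procs.append(p)
        for s in succs[t]:                   # release successors of the finished task
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return finish

durations = {'a': 3, 'b': 1, 'c': 2, 'd': 4, 'e': 1}
deps = {'a': [], 'b': [], 'c': ['a'], 'd': ['a', 'b'], 'e': ['c', 'd']}
print(schedule(durations, deps, num_procs=2))   # e finishes at t = 8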

26 Sep 1988
TL;DR: Experiments show that the B-HIVE compiler produces more efficient code than existing techniques, and numerous enhancements are needed to improve the capabilities of the parallelizing compiler.
Abstract: The software models used in implementing the parallelizing compiler for the B-HIVE multiprocessor system are described. The various models and strategies used in the compiler development are: a flexible granularity model, which allows a compromise between two extreme granularity models; a communication model, which is capable of precisely describing interprocessor communication timings and patterns; a loop type detection strategy, which identifies different types of loops; a critical path with coloring scheme, which is a versatile scheduling strategy for any multicomputer with some associated communication costs; and a loop allocation strategy, which realizes optimum overlapped operation between computation and communication in the system. Using these models, several sample routines of the AIR3D package are examined and tested. It may be noted that the automatically generated code is highly parallelized to provide the maximum degree of parallelism, obtaining speedup on systems of up to 28 to 32 processors. A comparison of parallel code for both the existing and proposed communication models is performed, and the corresponding expected speedup factors are obtained. The experiments show that the B-HIVE compiler produces more efficient code than existing techniques. Work is progressing well on completing the final phase of the compiler. Numerous enhancements are needed to improve the capabilities of the parallelizing compiler.

01 Jan 1988
TL;DR: An efficient mapping from a tree structure into a pipelined array of 2 log N stages is presented for processing an N×N image, and the identification of the information-growing property inherent in feature extraction algorithms allows bit-level concurrency to be exploited in the architectural design.
Abstract: Geometric feature extraction can be characterized as a computationally intensive task in the environment of real-time automated vision systems, requiring algorithms with a high degree of parallelism and pipelining under the raster-scan I/O constraint. Using divide-and-conquer techniques, many feature extraction algorithms have been formulated as a pyramid and then as a binary tree structure. An efficient mapping from a tree structure into a pipelined array of 2 log N stages is presented for processing an N×N image. In the proposed mapping structure, the identification of the information-growing property inherent in feature extraction algorithms allows us to exploit bit-level concurrency in the architectural design. Accordingly, the design of each staged pipelined processor is simplified. A single VLSI chip which can generate (p+1)(q+1) moments concurrently in real-time applications is reported. This chip possesses a hardware complexity of O(pq(p+q)log₂N), where p, q stand for the orthogonal orders of the moment. This hardware complexity is an improvement over other reported methods of O(pq(p+q)²log₂N).

01 Jan 1988
TL;DR: The joint probabilistic data association (JPDA) algorithm has previously been reported to be suitable for the problem of tracking multiple targets in the presence of clutter, and an approximation of the JPDA is suggested in this paper.
Abstract: The joint probabilistic data association (JPDA) algorithm has previously been reported to be suitable for the problem of tracking multiple targets in the presence of clutter. Although it makes few assumptions and can handle many targets, the complexity of this algorithm increases rapidly with the number of targets and returns. An approximation of the JPDA is suggested in this paper. The proposed algorithm uses an analog computational network to solve the data association problem. The problem is viewed as that of optimizing a suitably chosen objective function. Simple neural network structures for the approximate minimization of such functions have been proposed by other researchers. The analog network used here offers a significant degree of parallelism and thus can compute the association probabilities more rapidly. Computer simulations indicate the ability of the algorithm to track many targets simultaneously in the presence of moderate-density clutter.
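Objective functions used for analog or neural minimization of an association problem are typically quadratic energies of roughly the following shape (a generic Hopfield-style illustration, not the paper's exact functional): with v_{tj} in [0,1] the strength of associating measurement j with target t and c_{tj} an association cost derived from the measurement likelihoods,

E(v) \;=\; \sum_{t,j} c_{tj}\, v_{tj}
\;+\; \frac{A}{2}\sum_{j}\Bigl(\sum_{t} v_{tj} - 1\Bigr)^{2}
\;+\; \frac{B}{2}\sum_{t}\Bigl(\sum_{j} v_{tj} - 1\Bigr)^{2},

where the penalty terms softly enforce that each measurement and each target participate in at most one association, and the analog network relaxes all v_{tj} in parallel toward a minimum of E.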