
Showing papers on "Degree of parallelism" published in 1991


Journal ArticleDOI
TL;DR: A novel domain decomposition approach for the parallel finite element solution of equilibrium equations is presented, which exhibits a degree of parallelism that is not limited by the bandwidth of the finite element system of equations.
Abstract: A novel domain decomposition approach for the parallel finite element solution of equilibrium equations is presented. The spatial domain is partitioned into a set of totally disconnected subdomains, each assigned to an individual processor. Lagrange multipliers are introduced to enforce compatibility at the interface nodes. In the static case, each floating subdomain induces a local singularity that is resolved in two phases. First, the rigid body modes are eliminated in parallel from each local problem and a direct scheme is applied concurrently to all subdomains in order to recover each partial local solution. Next, the contributions of these modes are related to the Lagrange multipliers through an orthogonality condition. A parallel conjugate projected gradient algorithm is developed for the solution of the coupled system of local rigid modes components and Lagrange multipliers, which completes the solution of the problem. When implemented on local memory multiprocessors, this proposed method of tearing and interconnecting requires less interprocessor communications than the classical method of substructuring. It is also suitable for parallel/vector computers with shared memory. Moreover, unlike parallel direct solvers, it exhibits a degree of parallelism that is not limited by the bandwidth of the finite element system of equations.
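
To make the solution strategy concrete, the following is a minimal dense NumPy sketch of a conjugate projected gradient iteration for an equality-constrained quadratic problem of the kind that arises for the interface multipliers: minimize 0.5·lam^T F lam − d^T lam subject to G^T lam = e. The matrices F and G and the vectors d and e are random stand-ins, not the authors' finite element operators; the projection step plays the role of the orthogonality condition on the rigid body modes.

```python
import numpy as np

def projected_cg(F, d, G, e, tol=1e-10, max_iter=200):
    """CG for: minimize 0.5*lam^T F lam - d^T lam  subject to  G^T lam = e.
    Every iterate stays feasible; search directions are projected onto null(G^T)."""
    GtG = G.T @ G
    project = lambda v: v - G @ np.linalg.solve(GtG, G.T @ v)
    lam = G @ np.linalg.solve(GtG, e)          # feasible starting multiplier
    r = d - F @ lam                            # residual (negative gradient)
    w = project(r)
    p = w.copy()
    rw = r @ w
    for _ in range(max_iter):
        Fp = F @ p
        alpha = rw / (p @ Fp)
        lam += alpha * p
        r -= alpha * Fp
        w = project(r)
        rw_new = r @ w
        if np.sqrt(abs(rw_new)) < tol:
            break
        p = w + (rw_new / rw) * p
        rw = rw_new
    return lam

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 40, 5
    A = rng.random((n, n))
    F = A @ A.T + n * np.eye(n)                # SPD stand-in for the interface operator
    G = rng.random((n, m))                     # stand-in for the rigid-body-mode matrix
    d, e = rng.random(n), rng.random(m)
    lam = projected_cg(F, d, G, e)
    grad = F @ lam - d
    proj_grad = grad - G @ np.linalg.solve(G.T @ G, G.T @ grad)
    print(np.allclose(G.T @ lam, e), np.linalg.norm(proj_grad) < 1e-6)   # expect: True True
```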

1,302 citations


Book
01 Jan 1991
TL;DR: The inadequacies of conventional parallel languages for programming multicomputers are identified, and a compiler that translates C* programs into C programs suitable for compilation and execution on a hypercube multicomputer is presented.
Abstract: The inadequacies of conventional parallel languages for programming multicomputers are identified. The C* language is briefly reviewed, and a compiler that translates C* programs into C programs suitable for compilation and execution on a hypercube multicomputer is presented. Results illustrating the efficiency of executing data-parallel programs on a hypercube multicomputer are reported. They show the speedup achieved by three hand-compiled C* programs executing on an N-Cube 3200 multicomputer. The first two programs, Mandelbrot set calculation and matrix multiplication, have a high degree of parallelism and a simple control structure. The C* compiler can generate relatively straightforward code with performance comparable to hand-written C code. Results for a C* program that performs Gaussian elimination with partial pivoting are also presented and discussed.
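
As a rough illustration of the data-parallel style that C* expresses (one virtual processor per grid point, with the same operation applied across the whole domain), here is a vectorized NumPy sketch of the Mandelbrot computation mentioned above; it is not the paper's C* code, and all parameters are arbitrary.

```python
import numpy as np

def mandelbrot(nx=400, ny=300, max_iter=100):
    """One complex value per grid point; every iteration applies the same update to all
    points that are still active, mimicking a data-parallel (SIMD-style) program."""
    xs = np.linspace(-2.0, 1.0, nx)
    ys = np.linspace(-1.2, 1.2, ny)
    c = xs[np.newaxis, :] + 1j * ys[:, np.newaxis]
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=np.int32)
    for _ in range(max_iter):
        active = np.abs(z) <= 2.0              # points whose orbit has not escaped yet
        z[active] = z[active] ** 2 + c[active]
        counts[active] += 1
    return counts

if __name__ == "__main__":
    img = mandelbrot()
    print(img.shape, int(img.max()))           # e.g. (300, 400) 100
```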

294 citations


Proceedings ArticleDOI
02 Dec 1991
TL;DR: It is concluded that under most circumstances, hardware partitioning is the best strategy for multiprogramming a multiprocessor, no matter how much parallelism applications employ or how frequently synchronization occurs.
Abstract: Many solutions have been proposed to the problem of multiprogramming a multiprocessor. However, each has limited applicability or fails to address an important source of overhead. In addition, there has been little experimental comparison of the various solutions in the presence of applications with varying degrees of parallelism and synchronization. The authors explore the tradeoffs between three different approaches to multiprogramming a multiprocessor: time-slicing, coscheduling, and dynamic hardware partitions. They implemented applications that vary in the degree of parallelism, and the frequency and type of synchronization. They show that in most cases coscheduling is preferable to time-slicing. They also show that although there are cases where coscheduling is beneficial, dynamic hardware partitions do no worse, and will often do better. They conclude that under most circumstances, hardware partitioning is the best strategy for multiprogramming a multiprocessor, no matter how much parallelism applications employ or how frequently synchronization occurs.

66 citations


Journal ArticleDOI
S. Doi1
TL;DR: Analytical and empirical studies are carried out, using 2D convection-diffusion model equations, to determine how the degree of parallelism affects the speed of convergence for these preconditioned methods.

54 citations


Journal ArticleDOI
Stefan Bondeli1
01 Jul 1991
TL;DR: A divide and conquer algorithm is described which solves linear tridiagonal systems with one right-hand side, is especially suited for parallel computers, and can be combined with recursive doubling, cyclic reduction, or Wang's partition method in order to increase the degree of parallelism and vectorizability.
Abstract: We describe a divide and conquer algorithm which solves linear tridiagonal systems with one right-hand side, especially suited for parallel computers. The algorithm is flexible, permits multiprocessing or a combination of vector and multiprocessor implementations, and is adaptable to a wide range of parallelism granularities. This algorithm can also be combined with recursive doubling, cyclic reduction or Wang's partition method, in order to increase the degree of parallelism and vectorizability. The divide and conquer method will be explained. Some results of time measurements on a CRAY X-MP/28, on an Alliant FX/8, and on a Sequent Symmetry S81b, as well as comparisons with the cyclic reduction algorithm and Gaussian elimination will be presented. Finally, numerical results are given.
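
The sketch below illustrates the divide-and-conquer idea in NumPy on a single tridiagonal system: the coupling between two halves is torn out, each half is solved independently (these solves are the parallel work), and a small rank-2 Woodbury correction recombines them. It is a generic tearing scheme under assumed diagonally dominant test data, not the authors' algorithm.

```python
import numpy as np

def thomas(lower, diag, upper, rhs):
    """Serial Thomas algorithm for one tridiagonal system (the per-subdomain kernel)."""
    n = len(diag)
    d = diag.astype(float)
    r = rhs.astype(float)
    for i in range(1, n):
        w = lower[i - 1] / d[i - 1]
        d[i] -= w * upper[i - 1]
        r[i] -= w * r[i - 1]
    x = np.empty(n)
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x

def tridiag_divide_and_conquer(lower, diag, upper, rhs):
    """Tear the system into two decoupled halves plus a rank-2 (Woodbury) correction.
    The two half-solves are independent, so they could run on separate processors."""
    n = len(diag)
    m = n // 2
    a, b = lower[m - 1], upper[m - 1]          # the two coupling entries that are torn out
    def solve_halves(full_rhs):
        out = np.empty(n)
        out[:m] = thomas(lower[:m - 1], diag[:m], upper[:m - 1], full_rhs[:m])
        out[m:] = thomas(lower[m:], diag[m:], upper[m:], full_rhs[m:])
        return out
    y = solve_halves(rhs)                      # A y = rhs with A = blockdiag(T1, T2)
    U = np.zeros((n, 2)); V = np.zeros((n, 2))
    U[m - 1, 0] = 1.0; V[m, 0] = b             # U V^T restores entry (m-1, m) = b
    U[m, 1] = 1.0;     V[m - 1, 1] = a         # ... and entry (m, m-1) = a
    Z = np.column_stack([solve_halves(U[:, 0]), solve_halves(U[:, 1])])
    correction = Z @ np.linalg.solve(np.eye(2) + V.T @ Z, V.T @ y)
    return y - correction                      # Woodbury formula

if __name__ == "__main__":
    n = 10
    rng = np.random.default_rng(0)
    diag = 4.0 + rng.random(n)                 # diagonally dominant, so both halves stay nonsingular
    lower = -1.0 + 0.1 * rng.random(n - 1)
    upper = -1.0 + 0.1 * rng.random(n - 1)
    rhs = rng.random(n)
    T = np.diag(diag) + np.diag(lower, -1) + np.diag(upper, 1)
    x = tridiag_divide_and_conquer(lower, diag, upper, rhs)
    print(np.allclose(T @ x, rhs))             # expect: True
```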

50 citations


Proceedings ArticleDOI
08 Jan 1991
TL;DR: Hector is a shared-memory multiprocessor with a hierarchical interconnection structure that has three important advantages: it uses only short transmission lines and allows implementations with simple, modular hardware, and the cost of a memory access grows incrementally with the distance between the processor and memory location.
Abstract: Hector is a shared-memory multiprocessor with a hierarchical interconnection structure that has three important advantages. First, it uses only short transmission lines and allows implementations with simple, modular hardware. This makes it scalable to match the needs of tomorrow's high speed microprocessors. Second, the cost and the overall bandwidth of the structure grow linearly with the number of processing modules. This makes Hector expandable to moderate sizes of up to 256 PMs, yet allows small-scale systems at a low cost. Finally, the cost of a memory access grows incrementally with the distance between the processor and memory location. This allows single threaded applications, applications with a small degree of parallelism, and applications with a high degree of locality in their memory accesses to exploit the low cost of localized memory accesses.

43 citations


Proceedings Article
24 Aug 1991
TL;DR: Massively parallel memory-based parsing departs radically from the traditional view, treating parsing as a memory-intensive process that can be sped up by massively parallel computing; the approach is promising for real-time parsing and bulk text processing.
Abstract: This paper discusses a radically new scheme of natural language processing called massively parallel memory-based parsing. Most parsing schemes are rule-based or principle-based, which involves extensive serial rule application. Thus, it is a time-consuming task which requires a few seconds or even a few minutes to complete the parsing of one sentence. Also, the degree of parallelism attained by mapping such a scheme on parallel computers is at most medium, so that the existing scheme cannot take advantage of massively parallel computing. The massively parallel memory-based parsing takes a radical departure from the traditional view. It views parsing as a memory-intensive process which can be sped up by massively parallel computing. Although we know of some studies in this direction, we have seen no report regarding implementation strategies on actual massively parallel machines, on performance, or on practicality assessment based on actual data. Thus, this paper focuses on discussion of the feasibility and problems of the approach based on actual massively parallel implementation using real data. The degree of parallelism attained in our model reaches a few thousand, and a performance of a few milliseconds per sentence has been accomplished. In addition, parsing time grows only linearly (or sublinearly) with the length of the input sentences. The experimental results show the approach is promising for real-time parsing and bulk text processing.

34 citations


Proceedings ArticleDOI
01 Apr 1991
TL;DR: This paper presents a formal mathematical framework which unifies the existing loop transformations and includes more general classes of loop transformations, which can extract more parallelism from a class of programs than the existing techniques.
Abstract: This paper presents a formal mathematical framework which unifies the existing loop transformations. This framework also includes more general classes of loop transformations, which can extract more parallelism from a class of programs than the existing techniques. We classify schedules into three classes: uniform, subdomain-variant, and statement-variant. Viewed in terms of the degree of parallelism to be gained by loop transformation, the schedules can also be classified as single-sequential level, multiple-sequential level, and mixed schedules. We also illustrate the usefulness of the more general loop transformation with an example program.
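
As a small, self-contained example of how a loop transformation exposes parallelism, the sketch below rewrites a doubly nested loop with loop-carried dependences into a wavefront (skewed) schedule in which all iterations on one anti-diagonal are independent; the example program and data are made up and are not from the paper.

```python
import numpy as np

def original(A):
    """Original loop nest: A[i, j] depends on A[i-1, j] and A[i, j-1] (fully serial order)."""
    n, m = A.shape
    for i in range(1, n):
        for j in range(1, m):
            A[i, j] = A[i - 1, j] + A[i, j - 1]
    return A

def wavefront(A):
    """Skewed schedule: all iterations with i + j == w are mutually independent,
    so each wavefront w could be executed in parallel."""
    n, m = A.shape
    for w in range(2, n + m - 1):                      # sequential outer "time" loop
        i_lo, i_hi = max(1, w - m + 1), min(n - 1, w - 1)
        for i in range(i_lo, i_hi + 1):                # parallelizable inner loop
            j = w - i
            A[i, j] = A[i - 1, j] + A[i, j - 1]
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.random((6, 7))
    print(np.allclose(original(base.copy()), wavefront(base.copy())))   # expect: True
```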

28 citations


Journal ArticleDOI
TL;DR: An energy function is constructed to reduce the synchronization requirements by using a reformulation of the log-likelihood function from the expectation maximization (EM) algorithm to change the global dependence in the energy function from the current estimate to the estimate generated during the last iteration.
Abstract: A method for implementing simulated annealing in parallel to speed up the execution of emission tomography (ET) image reconstruction is presented. A high degree of parallelism can be attained by using a parallel-acceptance partitioning strategy, in which perturbations to subsets of the estimate are evaluated in parallel. However because the point spread function in ET imaging systems is globally dependent, processors cannot update the current estimate independently. Consequently, processors must be synchronized each time a perturbation is accepted to avoid introducing error. This can produce excessive communications overhead, especially when the acceptance rate is high. An energy function is constructed to reduce the synchronization requirements by using a reformulation of the log-likelihood function from the expectation maximization (EM) algorithm. The approach is to change the global dependence in the energy function from the current estimate to the estimate generated during the last iteration. The synchronization requirements for guaranteed convergence are then significantly reduced from once per acceptance to once per iteration. This parallel implementation on 54 Inmos T800 transputers connected in a ring topology resulted in execution times that were almost 50 times faster than on a VAX 8600.
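
The sketch below illustrates only the synchronization pattern described above, on a made-up one-dimensional energy (a data-fit term plus a smoothness term), not the EM-based log-likelihood of the paper: each conceptual worker evaluates its perturbations against the frozen estimate from the previous iteration, and accepted moves are merged once per iteration rather than once per acceptance.

```python
import numpy as np

rng = np.random.default_rng(2)

def energy(x, data, beta=0.1):
    """Toy energy: data fit plus a smoothness term that couples neighbouring unknowns."""
    return np.sum((x - data) ** 2) + beta * np.sum(np.diff(x) ** 2)

def annealing_step(x_prev, data, T, n_workers=4, beta=0.1):
    """One 'iteration': every worker perturbs its own block of unknowns, evaluating the
    energy change against the frozen estimate x_prev; accepted moves are merged once,
    at the end of the iteration (one synchronization per iteration)."""
    x_new = x_prev.copy()
    blocks = np.array_split(np.arange(len(x_prev)), n_workers)
    for block in blocks:                       # conceptually parallel across workers
        for i in block:
            trial = x_prev.copy()
            trial[i] += rng.normal(scale=0.5)
            dE = energy(trial, data, beta) - energy(x_prev, data, beta)
            if dE < 0 or rng.random() < np.exp(-dE / T):
                x_new[i] = trial[i]            # record locally; merge after the sweep
    return x_new

if __name__ == "__main__":
    data = np.sin(np.linspace(0.0, 3.0, 64)) + 0.3 * rng.normal(size=64)
    x, T = np.zeros_like(data), 1.0
    print("initial energy:", round(energy(x, data), 2))
    for _ in range(200):
        x = annealing_step(x, data, T)
        T *= 0.97                              # geometric cooling schedule
    print("final energy:  ", round(energy(x, data), 2))
```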

23 citations


Journal ArticleDOI
TL;DR: The condition for stability of the system is first precisely specified, the degree of parallelism, expressed as the asymptotic average number of processors that work concurrently, is computed, and various design and simulation aspects concerning parallel processing systems are considered.
Abstract: The general problem of parallel (concurrent) processing is investigated from a queuing theoretic point of view. As a basic simple model, consider infinitely many processors that can work simultaneously, and a stream of arriving jobs, each carrying a processing time requirement. Upon arrival, a job is allocated to a processor and starts being executed, unless it is blocked by another one already in the system. Indeed, any job can be randomly blocked by any preceding one, in the sense that it cannot start being processed before the one that blocks it leaves. After execution, the job leaves the system. The arrival times, the processing times and the blocking structures of the jobs form a stationary and ergodic sequence. The random precedence constraints capture the essential operational characteristic of parallel processing and allow a unified treatment of concurrent processing systems from such diverse areas as parallel computation, database concurrency control, queuing networks, and flexible manufacturing systems. The above basic model includes the G/G/1 and G/G/∞ queuing systems as special extreme cases. Although there is an infinite number of processors, the precedence constraints induce a queuing phenomenon, which, depending on the loading conditions, can lead to stability or instability of the system. In this paper, the condition for stability of the system is first precisely specified. The asymptotic behavior, at large times, of the quantities associated with the performance of the system is then studied, and the degree of parallelism, expressed as the asymptotic average number of processors that work concurrently, is computed. Finally, various design and simulation aspects concerning parallel processing systems are considered, and the case of a finite number of processors is discussed. The results proved for the basic model are then extended to cover more complex and realistic parallel processing systems, where each job has a random internal structure of subtasks to be executed according to some internal precedence constraints.
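
A small Monte-Carlo sketch of this basic model is given below: Poisson arrivals, i.i.d. service times, and independent random blocking of each job by each earlier job still in the system. The degree of parallelism is estimated as total busy processor-time divided by the elapsed time. All rates are arbitrary, and the code illustrates the model rather than the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n_jobs=5000, lam=1.0, mean_service=0.6, p_block=0.05):
    """Monte-Carlo estimate of the degree of parallelism: jobs arrive in a Poisson stream,
    each job is independently blocked by each earlier job still in the system with
    probability p_block, and infinitely many processors are available."""
    arrivals = np.cumsum(rng.exponential(1.0 / lam, n_jobs))
    services = rng.exponential(mean_service, n_jobs)
    finish = np.empty(n_jobs)
    for j in range(n_jobs):
        # candidate blockers: earlier jobs that have not finished by this job's arrival
        pending = np.nonzero(finish[:j] > arrivals[j])[0]
        blockers = pending[rng.random(pending.size) < p_block]
        start = max(arrivals[j], finish[blockers].max() if blockers.size else 0.0)
        finish[j] = start + services[j]
    busy_time = services.sum()                 # total processor-time actually used
    span = finish.max() - arrivals[0]
    return busy_time / span                    # average number of concurrently working processors

if __name__ == "__main__":
    print(f"estimated degree of parallelism ~ {simulate():.2f}")
```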

19 citations


Journal Article
TL;DR: The basic volume manipulation, object segmentation, and graphics operations required of a 3D medical imaging machine are described and sample algorithms are presented and general trends for future developments in this field are delineated.
Abstract: This survey reviews three-dimensional (3D) medical imaging machines and 3D medical imaging operations. The survey is designed to provide a snapshot overview of the present state of computer architectures for 3D medical imaging. The basic volume manipulation, object segmentation, and graphics operations required of a 3D medical imaging machine are described and sample algorithms are presented. The architecture and 3D imaging algorithms employed in 11 machines which render medical images are assessed. The performance of the machines is compared across several dimensions, including image resolution, elapsed time to form an image, imaging algorithms employed in the machine, and the degree of parallelism employed in the architecture. The innovation in each machine, whether architectural or algorithmic, is described in detail. General trends for future developments in this field are delineated and an extensive bibliography is provided.

Journal ArticleDOI
TL;DR: Algorithms for recursive implementation of the eigendecomposition (ED) of the autocorrelation matrix and the singular value decomposition (SVD) of the data matrix are described, and the ED/SVD trade-off is discussed.

Journal ArticleDOI
01 Sep 1991
TL;DR: A reconfigurable dual-network SIMD machine with internal direct feedback that best matches the characteristics of efficient parallel algorithms for computing the kinematics, dynamics, and Jacobian of manipulators and their corresponding inverses is designed.
Abstract: Efficient parallel algorithms for computing the kinematics, dynamics, and Jacobian of manipulators and their corresponding inverses are discussed and analyzed, and their characteristics are identified based on the type and degree of parallelism, uniformity of the operations, fundamental operations, data dependencies, and communication requirements. Most of the algorithms for robotic computations possess highly regular properties and some common structures, especially the linear recursive structure. They are well-suited to be implemented on a single-instruction-stream-multiple-data-stream (SIMD) computer with reconfigurable interconnection networks. A reconfigurable dual-network SIMD machine with internal direct feedback that best matches these characteristics has been designed. To achieve high efficiency in the computation of robotics algorithms on the proposed parallel machine, a generalized cube interconnection network is proposed. A centralized network switch control scheme is developed to support the pipeline timing of this machine. To maintain high reliability in the overall system, a fault-tolerant generalized cube network is designed to improve the original network.

Proceedings ArticleDOI
T. Yamauchi1, T. Nakata1, N. Koike1, A. Ishizuka1, N. Nishiguchi1 
11 Nov 1991
TL;DR: The authors describe a novel parallel detailed router named PROTON (parallel router on a parallel machine) with various new features, which include a parallelized line search algorithm based on parallel breadth first search and extraction of a higher degree of parallelism by simultaneous routing of multiple nets using the result of the global router.
Abstract: The authors describe a novel parallel detailed router named PROTON (parallel router on a parallel machine) with various new features. These features include: a parallelized line search algorithm based on parallel breadth first search; extraction of a higher degree of parallelism by simultaneous routing of multiple nets using the result of the global router; a parallel router on a quasi-shared-memory based MIMD parallel machine; and a detailed router supporting multilayer channelless gate arrays with complex industrial design rules. PROTON is implemented on an MIMD parallel machine named Cenju, which consists of 64 microprocessors. In order to improve routing speed, PROTON incorporates two levels of parallelism, namely magnet parallelism and net level parallelism. A speedup of 43 times has been achieved using 64 processors for a medium-scale channelless gate array (1537*1790 grids, 12591 pin pairs).

Journal ArticleDOI
TL;DR: A new parallel computing solution of problems of coupled flows in soils is presented and an attempt is made to exploit a high degree of parallelism based on the use of a finite difference solution method.
Abstract: A new parallel computing solution of problems of coupled flows in soils is presented. Such phenomena occur when more than one transport process exists below ground as is the case, for example, for coupled chemical, electrical, heat, or moisture flow. A numerical solution of the governing differential equations is achieved via a new parallel computing algorithm, programmed using new parallel software and operated on new parallel hardware. An attempt is made to exploit a high degree of parallelism based on the use of a finite difference solution method. The specific problem considered in detail in the paper, as an example of one particular case that may be addressed, is that of coupled heat and moisture transfer in unsaturated soil. The validity of the results achieved is first discussed in relation to physically observed behavior. Qualitatively and quantitatively correct results are shown to have been obtained. The performance of the approach as an efficient parallel computing solution is then assessed. T...
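
The following toy sketch shows the kind of explicit finite difference update such a formulation leads to, here for two coupled one-dimensional diffusion fields with invented coupling coefficients; each grid point is updated from its neighbours only, which is what makes the scheme easy to distribute across processors with halo exchange.

```python
import numpy as np

# Minimal explicit finite-difference sketch for two coupled 1-D diffusion fields
# (say, temperature T and moisture M). All coefficients are assumed values, not the paper's.
nx, dx, dt = 101, 0.01, 1e-5
D_T, D_M = 1.0, 0.5            # self-diffusivities (assumed)
c_TM, c_MT = 0.1, 0.05         # cross-coupling coefficients (assumed)

T = np.zeros(nx); T[0] = 1.0   # heated left boundary
M = np.zeros(nx); M[-1] = 1.0  # wet right boundary

def laplacian(u):
    lap = np.zeros_like(u)
    lap[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
    return lap

for step in range(2000):
    lap_T, lap_M = laplacian(T), laplacian(M)
    T[1:-1] += dt * (D_T * lap_T + c_TM * lap_M)[1:-1]   # boundary values held fixed
    M[1:-1] += dt * (D_M * lap_M + c_MT * lap_T)[1:-1]

print(T[nx // 2], M[nx // 2])   # mid-domain values after 2000 explicit steps
```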

Journal ArticleDOI
01 Dec 1991
TL;DR: The benchmark generation environment PAR-Bench is described, which enables measurements of effects introduced by parallel programs running in a multiprogramming mode, and provides information about mutual influences of parallel programs and background load, enabling the evaluation of different multitasking implementations, operating systems, and computer hardware.
Abstract: While the efficiency of multitasking is proven for parallel programs running in a dedicated environment, this paper presents a new approach to the assessment of multitasking. It describes the benchmark generation environment PAR-Bench, which enables measurements of effects introduced by parallel programs running in a multiprogramming mode. The PAR-Bench system is implemented on Cray multiprocessor systems under the operating systems COS and UNICOS. Using PAR-Bench, the benchmark process is divided into two parts: In a first step, according to user-supplied parameters like MFLOPS rate, memory and I/O activities, CPU time etc., the PAR-Bench system generates synthetic benchmark programs by using the hardware performance monitor HPM. These programs can be used to simulate a given site's workload in a flexible way. In a second step, the system can be used to run this workload several times with varied parameters; the substantial work to be done is fixed; nevertheless, dynamic changes of program parameters such as memory size and priority, as well as variations of the degree of parallelism, are supported. The PAR-Bench system provides information about mutual influences of parallel programs and background load, enabling us to evaluate different multitasking implementations, different operating systems, and different computer hardware. Because of the abundance of data concerning characteristic program parameters and program timings, the graphical analysis system GRANSYS was developed as a further component of PAR-Bench, providing automatic analysis features for benchmark data, including data interpretation and visualization.

Journal ArticleDOI
TL;DR: A parallel-execution model that can concurrently exploit AND and OR parallelism in logic programs is presented, employing a combination of techniques in an approach to executing logic problems in parallel, making tradeoffs among number of processes, degree of parallelism, and combination bandwidth.
Abstract: A parallel-execution model that can concurrently exploit AND and OR parallelism in logic programs is presented. This model employs a combination of techniques in an approach to executing logic problems in parallel, making tradeoffs among number of processes, degree of parallelism, and combination bandwidth. For interpreting a nondeterministic logic program, this model (1) performs frame inheritance for newly created goals, (2) creates data-dependency graphs (DDGs) that represent relationships among the goals, and (3) constructs appropriate process structures based on the DDGs. (1) The use of frame inheritance serves to increase modularity. In contrast to most previous parallel models that have a large single process structure, frame inheritance facilitates the dynamic construction of multiple independent process structures, and thus permits further manipulation of each process structure. (2) The dynamic determination of data dependency serves to reduce computational complexity. In comparison to models that exploit brute-force parallelism and models that have fixed execution sequences, this model can reduce the number of unification and/or merging steps substantially. In comparison to models that exploit only AND parallelism, this model can selectively exploit demand-driven computation, according to the binding of the query and optional annotations. (3) The construction of appropriate process structures serves to reduce communication complexity. Unlike other methods that map DDGs directly onto process structures, this model can significantly reduce the number of data sent to a process and/or the number of communication channels connected to a process.

Proceedings ArticleDOI
30 Apr 1991
TL;DR: It is shown that the problem of time allocation in such a real-time application can be formulated and solved as a linear programming problem and an algorithm is given for constructing a multiprocessor schedule from the linear programming solution.
Abstract: In a real-time application that supports imprecise computation, each task is logically composed of a hard task and a soft task. The hard task must be completed before its deadline. The soft task is an optional task which may not be executed to completion if insufficient computational resources are available. In the presented model, each task may be parallelized and executed on multiple processors with a multiprocessing overhead which is assumed to be a linear function of the degree of parallelism. It is shown that the problem of time allocation in such a real-time application can be formulated and solved as a linear programming problem. An algorithm is given for constructing a multiprocessor schedule from the linear programming solution. This algorithm guarantees that the multiprocessing overhead generated in the multiprocessor schedule does not exceed a linear upper bound.
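
Below is a deliberately simplified linear-programming sketch in the spirit of the abstract, using scipy.optimize.linprog: the decision variables are the optional execution times, the overhead of parallel execution is linear in an assumed degree of parallelism for each task, and the constraints are a processor-time budget and per-task deadlines. The data and the exact formulation are invented for illustration and are not the paper's LP.

```python
import numpy as np
from scipy.optimize import linprog

# Toy data (all values assumed, not from the paper)
m = np.array([2.0, 1.0, 3.0])        # mandatory execution time of each task
o = np.array([4.0, 5.0, 2.0])        # maximum optional execution time
d = np.array([2, 4, 1], dtype=float) # assumed degree of parallelism per task
deadline = np.array([4.0, 3.0, 5.0]) # per-task deadlines
alpha, P, H = 0.1, 6, 5.0            # overhead per extra processor, processor count, horizon

# Decision variable x_i = optional time actually executed for task i.
cost = -np.ones_like(o)              # maximize total optional work => minimize -sum(x)

overhead = 1.0 + alpha * (d - 1.0)   # processor-time consumed per unit of work
A_ub = np.vstack([overhead,          # total processor-time stays within P*H
                  np.diag(1.0 / d)]) # elapsed time of each task stays within its deadline
b_ub = np.concatenate([[P * H - overhead @ m],
                       deadline - m / d])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, oi) for oi in o])
print("optional time granted per task:", np.round(res.x, 3))
```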

Proceedings ArticleDOI
11 Jun 1991
TL;DR: It is shown that there is a close relationship between the mappings of sequentially formulated algorithms onto different kinds of parallel architectures and a set of parameterized tools is defined that allows the uniformization of the mapping of sequential code onto parameterized parallel architectures.
Abstract: It is shown that there is a close relationship between the mappings of sequentially formulated algorithms onto different kinds of parallel architectures. The compilation of programs for these architectures shares common optimization features, such as a high degree of parallelism, a short execution time, and a high processor utilization, and also a common design trajectory. Given a problem formulation, parallelism is extracted. Equivalence transformations are applied for the purpose of optimization. In the mapping phase, resources are assigned to operations and a schedule is defined. In order to match a problem and an architecture of given size, hierarchical transformations are performed that partition a problem into problems of smaller size that are executed sequentially. Due to these similarities, design methods known for the design and optimization of processor arrays can be used to solve problems known from the design of vectorizing compilers for supercomputers and vice versa. For defining the tasks of a versatile compiler for massive parallel architectures (COMPAR), a set of parameterized tools is defined that allows the uniformization of the mapping of sequential code onto parameterized parallel architectures.

Journal ArticleDOI
11 Nov 1991
TL;DR: The authors show that, given an arbitrary composite relation, a fixed order composite relation can be constructed (under certain assumptions), where the primitive relations occur at most once in a predetermined order, such that the original relation holds between two machines if and only if the fixed order relation holds.
Abstract: The authors use string function theory to develop an efficient methodology for the verification of logic implementations against behavioral specifications. First, the authors define five primitive relations between string functions, other than strict automata equivalence, namely: don't care times, parallelism, encoding, input don't care and output don't care relations. These relations have attributes; for instance, the parallelism relation has an attribute corresponding to the degree of parallelism. For each of these primitive relations, the authors derive transformations on the specification and the implementation such that the relation holds between the specification and implementation if and only if the transformed circuits exhibit the same input/output behavior. This reduces the problem of verifying primitive relations to automata equivalence checking. They enlarge the set of relations between specifications and implementations by including arbitrary compositions of the five primitive relations. To reduce the cost of verifying such a composite relation, the authors show that, given an arbitrary composite relation, a fixed order composite relation can be constructed (under certain assumptions), where the primitive relations occur at most once in a predetermined order, such that the original relation holds between two machines if and only if the fixed order relation holds. For the fixed order composite relation, they again derive transformations on the specification and the implementation which reduce verifying the composite relation to performing one equivalence check. The end result is a sound and complete proof method for proving arbitrary compositions of relations by transforming the specification and the implementation and performing an equivalence check on the transformed finite state machines.

Journal ArticleDOI
TL;DR: A parallel algorithm for the ADI preconditioning is proposed, in which several tridiagonal systems that are traditionally solved sequentially are now solved concurrently and can be implemented in a multiprocessor architecture.

Proceedings ArticleDOI
01 Dec 1991
TL;DR: An elimination approach for processing OODBs is presented that allows more processors to operate concurrently on a query, thus allowing a higher degree of parallelism in query processing.
Abstract: The authors have shown previously (1989, 1991) that processing OODBs can be viewed as the manipulation of patterns of object associations. Parallel, multiple wavefront algorithms based on an identification approach for verifying association patterns have been introduced. The current paper presents an elimination approach for processing OODBs. The new approach allows more processors to operate concurrently on a query, thus allowing a higher degree of parallelism in query processing. A formal proof of the correctness of the new approach is given, and a parallel elimination algorithm for processing tree queries is presented. Some simulation results are also provided to compare the performance of the identification approach with the elimination approach.

Book ChapterDOI
01 Aug 1991
TL;DR: The characterization of the six basic robotics algorithms is tabulated for discussion and the results can be used to design better parallel architectures or a common architecture for the computation of these robotics algorithms.
Abstract: The kinematics, dynamics, Jacobian, and their corresponding inverses are six major computational tasks in the real-time control of robot manipulators. The parallel algorithms for these computations are examined and analyzed. They are characterized based on six well-defined features that have greatest effects on the execution of parallel algorithms. These features include type of parallelism, degree of parallelism (granularity), uniformity of operations, fundamental operations, data dependency, and communication requirement. It is found that the inverse dynamics, the forward dynamics, the forward kinematics and the forward Jacobian computations possess highly regular properties and they are all in homogeneous linear recursive form. The inverse Jacobian is essentially the problem of solving a system of linear equations. The closed-form solution of the inverse kinematics problem is obviously non-uniform and robot dependent. The iterative solution for the inverse kinematics problem seems uniform and the parallel portions of the algorithm involve the forward kinematics, the forward Jacobian, and the inverse Jacobian computations. Suitable algorithms for the six basic robotics computations are selected and parallelized to make use of their common features. The characterization of the six basic robotics algorithms is tabulated for discussion and the results can be used to design better parallel architectures or a common architecture for the computation of these robotics algorithms.


Journal ArticleDOI
TL;DR: A stabilized parallel algorithm for direct-form recursive filters is obtained, using a method of derivation in the Z domain, and it is shown how to reduce the number of multiplications compared to the number required in a naive implementation.
Abstract: A stabilized parallel algorithm for direct-form recursive filters is obtained, using a method of derivation in the Z domain. The degree of parallelism, stability, and complexity of the algorithm are examined. It is shown how to reduce the number of multiplications compared to the number required in a naive implementation. The algorithm is regular and modular, so very efficient VLSI architectures can be constructed to implement it. The degree of parallelism in these implementations can be chosen freely and is not restricted to be a power of two.
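
A first-order special case of the block-parallel idea is sketched below: each block's zero-state response can be computed independently (that is the parallel work), and a cheap serial pass propagates the carried state between blocks. The paper treats general direct-form recursive filters and their stability; this is only a minimal illustration of where the parallelism comes from.

```python
import numpy as np

def iir_serial(a, x):
    """Reference: y[n] = a*y[n-1] + x[n] computed serially."""
    y = np.empty_like(x)
    prev = 0.0
    for n in range(len(x)):
        prev = a * prev + x[n]
        y[n] = prev
    return y

def iir_block_parallel(a, x, n_blocks=4):
    """Block formulation: each block's zero-state response can be computed independently
    (in parallel); a short serial pass then propagates the carried state between blocks."""
    blocks = np.array_split(x, n_blocks)
    partial = [iir_serial(a, b) for b in blocks]   # independent, parallelizable work
    out, carry = [], 0.0
    for b, p in zip(blocks, partial):              # cheap serial fix-up across blocks
        k = np.arange(1, len(b) + 1)
        out.append(p + carry * a ** k)             # add the response to the incoming state
        carry = out[-1][-1]
    return np.concatenate(out)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    x = rng.normal(size=1000)
    print(np.allclose(iir_serial(0.9, x), iir_block_parallel(0.9, x)))   # expect: True
```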

01 Jan 1991
TL;DR: This paper discusses the use of software pipelining and loop unrolling for writing optimized assembler inner loops for matrix inner and outer products, which were able to operate at more than 90% and 70%, respectively, of the AP1000’s theoretical peak performance.
Abstract: The Basic Linear Algebra Subprogram (BLAS) library is widely used in many supercomputing applications, and is used to implement more extensive linear algebra subroutine libraries, such as LINPACK and LAPACK. To take advantage of the high degree of parallelism of architectures such as the Fujitsu AP1000, BLAS level 3 routines (matrix-matrix operations) are proposed. This project is concerned with implementing BLAS level 3 (BLAS-3) for single precision matrices on the AP1000, with emphasis on obtaining the highest possible performance, without significantly sacrificing numerical stability. This paper discusses the techniques used to achieve this goal, together with the underlying issues. The most important techniques were the use of software pipelining and loop unrolling for writing optimized assembler inner loops for matrix inner and outer products, which were able to operate at more than 90% and 70%, respectively, of the AP1000’s theoretical peak performance. The efficiency of cell communication using wormhole routing on the AP1000, especially the row/column broadcast, enabled a sustained performance of 80 to 90% of the theoretical peak for all the BLAS-3 routines. It also meant that many variations (using different communication schemes) for matrix multiplication have more or less equivalent performance. However, for future versions of the AP1000, optimizing communication must still be considered. Techniques for improving the performance for large matrices (partitioning, to improve cache utilization) and for small matrices (minimizing communication) are employed. The latter have been developed for general rectangular AP1000 configurations.
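
The cache-blocking (partitioning) technique mentioned for large matrices can be illustrated with a short NumPy sketch; the block size and matrix shapes are arbitrary, and a real BLAS-3 kernel would additionally unroll and software-pipeline the innermost loops in assembler, as described above.

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Cache-blocked (partitioned) matrix multiply: C is accumulated one bs x bs tile at a
    time so the working set of each partial product stays small. This sketch shows only
    the partitioning idea, not an optimized kernel."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                C[i0:i0+bs, j0:j0+bs] += A[i0:i0+bs, p0:p0+bs] @ B[p0:p0+bs, j0:j0+bs]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    A, B = rng.random((200, 150)), rng.random((150, 120))
    print(np.allclose(blocked_matmul(A, B), A @ B))   # expect: True
```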

Journal ArticleDOI
TL;DR: This thesis proves that transforming a program using the Magic Templates algorithm and then evaluating the fixpoint bottom-up provides a “most parallel” implementation for a given choice of sips, without taking resource constraints into account, and provides a formal measure of parallelism.
Abstract: There is a tension between the objectives of avoiding irrelevant computation and extracting parallelism, in that a computational step used to restrict another must precede the latter. Our thesis, following [3], is that evaluation methods can be viewed as implementing a choice of sideways information propagation graphs, or sips, which determines the set of goals and facts that must be evaluated. Two evaluation methods that implement the same sips can then be compared to see which obtains a greater degree of parallelism, and we provide a formal measure of parallelism to make this comparison. Using this measure, we prove that transforming a program using the Magic Templates algorithm and then evaluating the fixpoint bottom-up provides a “most parallel” implementation for a given choice of sips, without taking resource constraints into account. This result, taken in conjunction with earlier results from [3,27], which show that bottom-up evaluation performs no irrelevant computation and is sound and complete, suggests that a bottom-up approach to parallel evaluation of logic programs is very promising. A more careful analysis of the relative overheads in the top-down and bottom-up evaluation paradigms is needed, however, and we discuss some of the issues. The abstract model allows us to establish several results comparing other proposed parallel evaluation methods in the logic programming and deductive database literature, thereby showing some natural, and sometimes surprising, connections. We consider the limitations of the abstract model and of the proposed bottom-up evaluation method, including the inability of sips to describe certain evaluation methods, and the effect of resource constraints. Our results shed light on the limits of the sip paradigm of computation, which we extend in the process.
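
For readers unfamiliar with bottom-up evaluation, the following tiny sketch shows a semi-naive bottom-up fixpoint computation for a transitive-closure program (without the Magic Templates rewriting the paper relies on); each round joins only newly derived facts with the base relation, and the joins within a round are independent and could proceed in parallel.

```python
# Bottom-up (semi-naive) fixpoint evaluation of the Datalog program
#   path(X, Y) :- edge(X, Y).
#   path(X, Y) :- path(X, Z), edge(Z, Y).
# Each iteration joins only the newly derived facts (delta) with the base relation.

def transitive_closure(edges):
    path = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, w) for (x, y) in delta for (z, w) in edges if y == z}
        delta = new - path          # facts derived for the first time this round
        path |= delta
    return path

if __name__ == "__main__":
    edges = {(1, 2), (2, 3), (3, 4), (2, 5)}
    print(sorted(transitive_closure(edges)))
```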

Patent
22 Feb 1991
TL;DR: In this patent, a program to be executed is divided into plural processes so that the number of ranks is equalized, and the average degree of parallelism is determined for each divided process.
Abstract: PURPOSE: To equalize the load on the processors by allocating processors to each process in proportion to its degree of parallelism and by feeding generation data to drive the processors. CONSTITUTION: A program to be executed is divided into plural processes so that the number of ranks is equalized, and the average degree of parallelism is determined for each divided process. The number (m) of processors to be allocated to each process is then derived from that process's average degree of parallelism and the number of pipelines of the processors. In order to allocate the generation data successively to the m processors assigned to execute the process, out of the plural processors constituting the parallel processor, the output of the preceding process is rewritten so that it is distributed to any one of the 1st to m-th processors allocated to the process. When the final process is confirmed, the allocation of processors ends and execution proceeds generation by generation. In this way, the load on the processors can be equalized.
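
A minimal sketch of the allocation rule described in the abstract is given below, assuming a list of average degrees of parallelism per process; the rounding and correction logic is invented for illustration and is not taken from the patent.

```python
def allocate(avg_parallelism, total_processors):
    """Give each pipeline-stage process a processor count proportional to its average
    degree of parallelism, with at least one processor per process."""
    total = sum(avg_parallelism)
    alloc = [max(1, round(total_processors * p / total)) for p in avg_parallelism]
    # crude correction so the allocations sum exactly to the available processor count
    while sum(alloc) > total_processors:
        alloc[alloc.index(max(alloc))] -= 1
    while sum(alloc) < total_processors:
        alloc[alloc.index(min(alloc))] += 1
    return alloc

if __name__ == "__main__":
    print(allocate([8.0, 2.0, 4.0, 1.0], 16))   # -> [9, 2, 4, 1]
```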

22 Apr 1991
TL;DR: The authors present a method to exploit a second degree of parallelism, which aims to further increase the efficiency of the network, so that a balance between computation and communication on multiprocessors can be achieved.
Abstract: Image compression algorithms based on block transform coding have been adopted by the CCITT for coding of visual telephony. Recent developments in block transform coding show that by incorporating adaptive blocksizes, the efficiency of such coding methods, in terms of bit-rate and subjective image quality, is improved. The characteristics of this Scene Adaptive Transform Coding algorithm are introduced. The computational intensity and highly parallel nature of such algorithms motivate the use of a multiple processor network to execute the algorithms in close to real-time. The authors present a method to exploit a second degree of parallelism which aims to further increase the efficiency of the network. This second degree of concurrency is achieved by parallelising the functional algorithms. Pipelining methods are used to exploit the functional concurrency within the algorithms. By maintaining a suitable granularity of data partitioning and parallelising the functional algorithms, a balance between computation and communication on multiprocessors can be achieved. The performance of the proposed Pipeline-Tree Architecture (PTA) is compared with the commonly used Tree structure.

Proceedings ArticleDOI
01 Sep 1991
TL;DR: A graphics processor architecture with a high degree of parallelism connected to a distributed frame buffer that can be configured with an arbitrary number of identical, high level programmable processors operating in parallel is described.
Abstract: Interactive 3D graphics applications require significant arithmetic processing to meet the ever-increasing desire for higher image complexity and higher resolution in displayed images. This paper describes a graphics processor architecture with a high degree of parallelism connected to a distributed frame buffer. The architecture can be configured with an arbitrary number of identical, high level programmable processors operating in parallel. Within the architecture, an automatic load balancing mechanism is presented which distributes the processing load between the geometry and rendering sections. After the unique features of the architecture are described, the load balancing mechanism is analyzed and the increase in performance is demonstrated.