
Showing papers on "Degree of parallelism" published in 1995


Dissertation
01 Jan 1995
TL;DR: This work reinterprets multiobjective optimization with genetic algorithms as a sequence of decision-making problems interleaved with search steps, in order to accommodate previous work in the field, and develops a unified approach to multiple-objective and constraint handling with genetic algorithms.
Abstract: Genetic algorithms (GAs) are stochastic search techniques inspired by the principles of natural selection and natural genetics which have revealed a number of characteristics particularly useful for applications in optimization, engineering, and computer science, among other fields. In control engineering, they have found application mainly in problems involving functions difficult to characterize mathematically or known to present difficulties to more conventional numerical optimizers, as well as problems involving non-numeric and mixed-type variables. In addition, they exhibit a large degree of parallelism, making it possible to effectively exploit the computing power made available through parallel processing. Despite their early recognized potential for multiobjective optimization (almost all engineering problems involve multiple, often conflicting objectives), genetic algorithms have, for the most part, been applied to aggregations of the objectives in a single-objective fashion, like conventional optimizers. Although alternative approaches based on the notion of Pareto-dominance have been suggested, multiobjective optimization with genetic algorithms has received comparatively little attention in the literature. In this work, multiobjective optimization with genetic algorithms is reinterpreted as a sequence of decision making problems interleaved with search steps, in order to accommodate previous work in the field. A unified approach to multiple objective and constraint handling with genetic algorithms is then developed from a decision making perspective and characterized, with application to control system design in mind. Related genetic algorithm issues, such as the ability to maintain diverse solutions along the trade-off surface and responsiveness to on-line changes in decision policy, are also considered. The application of the multiobjective GA to three realistic problems in optimal controller design and non-linear system identification demonstrates the ability of the approach to concurrently produce many good compromise solutions in a single run, while making use of any preference information interactively supplied by a human decision maker. The generality of the approach is made clear by the very different nature of the two classes of problems considered.
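To make the Pareto-based machinery concrete: the ranking scheme associated with this line of work assigns each individual a rank of one plus the number of population members that dominate it. The sketch below is a minimal illustration of that idea on a toy two-objective minimization problem; the function names and the random population are illustrative assumptions, not the dissertation's code.

```python
import random

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_rank(population):
    """Rank each individual as 1 + the number of individuals dominating it."""
    return [1 + sum(dominates(q, p) for q in population if q is not p)
            for p in population]

# Toy two-objective population (both objectives minimized).
pop = [(random.random(), random.random()) for _ in range(8)]
for p, r in zip(pop, pareto_rank(pop)):
    print(f"f = ({p[0]:.2f}, {p[1]:.2f})   rank = {r}")
```

Individuals of rank 1 form the current non-dominated set; selection pressure toward lower ranks drives the population toward the trade-off surface.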

220 citations


Journal ArticleDOI
TL;DR: A new variant of the scheduling problem is attempted by investigating the scalability of the schedule length with the required number of processors; scheduling is performed partially at compile time and partially at run time, using a new concept, the threshold of a task.
Abstract: We attempt a new variant of the scheduling problem by investigating the scalability of the schedule length with the required number of processors, by performing scheduling partially at compile time and partially at run time. Assuming an infinite number of processors, the compile-time schedule is found using a new concept, the threshold of a task, which quantifies a trade-off between the schedule length and the degree of parallelism. The schedule is found to minimize either the schedule length or the number of required processors, and it satisfies a feasibility condition, which guarantees that the schedule delay of a task from its earliest start time is below the threshold, and an optimality condition, which uses a merit function to decide the best task-processor match for a set of tasks competing for a given processor. At run time, the tasks are merged, producing a schedule for a smaller number of available processors. This allows the program to be scaled down to the processors actually available at run time. The usefulness of this scheduling heuristic has been demonstrated by incorporating the scheduler in the compiler backend for targeting Sisal (Streams and Iterations in a Single Assignment Language) on the iPSC/860.
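A minimal sketch of the compile-time side, under simplifying assumptions: earliest start times are the classic ASAP longest-path values, and a single global threshold stands in for the paper's per-task thresholds. The task graph and costs below are hypothetical.

```python
def earliest_start_times(tasks, succs, cost):
    """ASAP schedule on unboundedly many processors: the earliest start time
    of a task is the longest-path distance from the entry tasks."""
    est = {t: 0 for t in tasks}
    for t in tasks:                      # tasks listed in topological order
        for s in succs.get(t, ()):
            est[s] = max(est[s], est[t] + cost[t])
    return est

def feasible(start, est, tasks, threshold):
    """Simplified feasibility condition: the delay of each task beyond its
    earliest start time must not exceed its threshold."""
    return all(start[t] - est[t] <= threshold for t in tasks)

# Hypothetical diamond-shaped task graph: a -> {b, c} -> d.
tasks = ["a", "b", "c", "d"]
succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
est = earliest_start_times(tasks, succs, cost)
print(est)                               # {'a': 0, 'b': 2, 'c': 2, 'd': 5}
print(feasible({"a": 0, "b": 2, "c": 3, "d": 5}, est, tasks, threshold=2))
```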

57 citations


Patent
W. Patrick Hays
21 Apr 1995
TL;DR: In this patent, a single-instruction, multiple-data (SIMD) architecture is adopted to exploit the high degree of parallelism inherent in many video signal processing algorithms.
Abstract: Single-instruction, multiple-data is a new class of integrated video signal processors especially suited for real-time processing of two-dimensional images. The single-instruction, multiple-data architecture is adopted to exploit the high degree of parallelism inherent in many video signal processing algorithms. Features have been added to the architecture to support conditional execution and sequencing, capabilities that are inherently limited in traditional single-instruction, multiple-data machines. A separate transfer engine offloads transaction processing from the execution core, allowing input/output and compute resources to be balanced, a critical factor in optimizing performance for video processing. These features, coupled with a scalable architecture, allow a unified programming model and application-driven performance.

32 citations


Journal ArticleDOI
TL;DR: Why many MPPs use parallel I/O subsystems, what architecture is best for such a subsystem, and how to implement the subsystem are examined.
Abstract: Applications on MPPs often require a high aggregate bandwidth of low-latency I/O to secondary storage. This requirement can be met by internal parallel I/O subsystems that comprise dedicated I/O nodes, each with processor, memory, and disks. Massively parallel processors (MPPs), encompassing from tens to thousands of processors, are emerging as a major architecture for high-performance computers. Most major computer vendors offer computers with some degree of parallelism, and many smaller vendors specialize in producing MPPs. These machines are targeted for both grand-challenge problems and general-purpose computing. Like any computer, MPP architectural design must balance computation, memory bandwidth and capacity, communication capabilities, and I/O. In the past, most design research focused on the basic compute and communications hardware and software. This led to unbalanced computers that had relatively poor I/O performance. Recently, researchers have focused on designing hardware and software for I/O subsystems in MPPs. Consequently, most current MPPs have an architecture based on an internal parallel I/O subsystem (the "Architectures with parallel I/O" sidebar describes some examples). In these computers, this subsystem encompasses a collection of I/O nodes, each managing and providing I/O access to a set of disks. The I/O nodes connect to other nodes in the system by the same switching network that connects the compute nodes. In this article we'll examine why many MPPs use parallel I/O subsystems, what architecture is best for such a subsystem, and how to implement the subsystem. We'll also discuss how parallel file systems and their user interfaces can exploit the parallel I/O to provide enhanced services to applications. The systems discussed in this article are mostly tightly coupled distributed-memory MIMD (multiple-instruction, multiple-data) MPPs. In some cases, we also discuss shared-memory and SIMD (single-instruction, multiple-data) machines. We'll discuss three node types. Compute nodes are optimized to perform floating-point and numeric calculations, and have no local disk except perhaps for paging, booting, and operating-system software. I/O nodes contain the system's secondary storage, and provide the parallel file-system services. Gateway nodes provide connectivity to external data servers and mass-storage systems. In some cases, individual nodes can serve as more than one type. For example, the same nodes often handle I/O and gateway functions. The "Terminology" sidebar defines some other terms used in this article.
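The core mechanism that lets a parallel file system exploit many I/O nodes is declustering: logical file blocks are striped across the nodes so that a large sequential access engages all of them at once. A minimal round-robin striping map, with assumed function and parameter names, might look like this:

```python
def stripe_map(block, n_io_nodes, stripe_unit=4):
    """Map a logical file block to (I/O node, local block on that node)
    under round-robin striping with the given stripe unit."""
    stripe = block // stripe_unit
    node = stripe % n_io_nodes
    local = (stripe // n_io_nodes) * stripe_unit + block % stripe_unit
    return node, local

# With 4 I/O nodes and a 4-block stripe unit, a 32-block sequential read
# touches every node, so aggregate bandwidth scales with the node count.
for b in range(0, 32, 4):
    node, local = stripe_map(b, 4)
    print(f"block {b:2d} -> node {node}, local block {local}")
```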

28 citations


Journal ArticleDOI
TL;DR: An architectural framework for parallel time-recursive computation is established and it is shown that the structure of the realization of a given linear operator is dictated by the decomposition of the latter with respect to proper basis functions.
Abstract: The time-recursive computation has been proven a particularly useful tool in real-time data compression, in transform domain adaptive filtering, and in spectrum analysis. Unlike the FFT-based ones, the time-recursive architectures require only local communication. Also, they are modular and regular, thus they are very appropriate for VLSI implementation and they allow a high degree of parallelism. In this two-part paper, we establish an architectural framework for parallel time-recursive computation. We consider a class of linear operators that consists of the discrete time, time invariant, compactly supported, but otherwise arbitrary kernel functions. We show that the structure of the realization of a given linear operator is dictated by the decomposition of the latter with respect to proper basis functions. An optimal way for carrying out this decomposition is demonstrated. The parametric forms of the basis functions are identified and their properties pertinent to the architecture design are studied. A library of architectural building modules capable of realizing these functions is developed. An analysis of the implementation complexity for the aforementioned modules is conducted. Based on this framework, the time-recursive architecture of a given linear operator can be derived in a systematic routine way.
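A familiar concrete instance of time-recursive computation (chosen here purely for illustration; it is not necessarily the authors' example) is the sliding DFT, which updates each frequency bin in O(1) work per sample using only the incoming and outgoing samples. The bins are mutually independent, which is exactly the kind of parallelism such architectures exploit.

```python
import cmath

def sliding_dft(x, N):
    """Time-recursive (sliding) DFT: each new output frame is obtained from
    the previous one with O(1) work per bin, and the N bins are fully
    independent, so they can be computed by N parallel modules."""
    S = [0j] * N                      # S[k] tracks bin k of the current window
    w = [cmath.exp(2j * cmath.pi * k / N) for k in range(N)]
    frames = []
    for n, xn in enumerate(x):
        x_old = x[n - N] if n >= N else 0.0
        S = [w[k] * (S[k] + xn - x_old) for k in range(N)]  # parallel over k
        frames.append(list(S))
    return frames

# 8-point window over a short test signal.
sig = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0]
out = sliding_dft(sig, N=8)
print([f"{abs(v):.2f}" for v in out[-1]])
```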

20 citations


Patent
Michael Keith, Eiichi Kowashi
18 Sep 1995
TL;DR: In this article, a single-instruction, multiple-data (SIMD) architecture is adopted to exploit the high degree of parallelism inherent in many video signal processing algorithms.
Abstract: Single-instruction, multiple-data is a new class of integrated video signal processors especially suited for real-time processing of two-dimensional images. The single-instruction, multiple-data architecture is adopted to exploit the high degree of parallelism inherent in many video signal processing algorithms. Features have been added to the architecture to support conditional execution and sequencing, capabilities that are inherently limited in traditional single-instruction, multiple-data machines. A separate transfer engine offloads transaction processing from the execution core, allowing input/output and compute resources to be balanced, a critical factor in optimizing performance for video processing. These features, coupled with a scalable architecture, allow a unified programming model and application-driven performance.

16 citations


Journal ArticleDOI
Chandrasekhar Narayanaswami
TL;DR: A parallel raster algorithm to draw Gouraud-shaded triangles is presented; it offers an increased level of parallelism compared to existing solutions and ensures that adjacent triangles that share an edge do not share any pixels.
Abstract: A parallel raster algorithm to draw Gouraud-shaded triangles is presented. At the heart of the algorithm is a new constrained parallel edge-traversal technique. This parallel traversal represents an increased level of parallelism compared to existing solutions. Next, traditional algorithms take different amounts of time to advance from one horizontal span to another for the left edge and the right edge of the triangle when the slope of one edge is more than one and that of the other is less than one. This causes one processor to wait for another. The parallel constrained edge-traversal technique removes this problem by jumping directly from one span to the next. It also ensures that adjacent triangles that share an edge do not share any pixels. Moreover, no cracks occur between adjacent polygons. Unlike some existing algorithms whose complexity depends on the size of the bounding box of the triangle, the complexity of our algorithm depends solely on the perimeter and area of the triangle. Due to the above features, the algorithm presented here exposes a greater degree of parallelism at considerably lower cost and achieves better processor utilization than existing algorithms for this problem [1-6]. The algorithm is well suited for hardware implementation.
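For context, the baseline operation being parallelized is per-span linear interpolation of vertex intensities. The sketch below shows only that baseline, with hypothetical names; the paper's actual contribution, the constrained parallel edge traversal that produces such spans without inter-processor waiting, is not reproduced here.

```python
def lerp(a, b, t):
    """Linear interpolation, the core arithmetic of Gouraud shading."""
    return a + (b - a) * t

def shade_span(y, x_left, x_right, i_left, i_right):
    """Interpolate intensity across one horizontal span. Distinct spans touch
    disjoint pixels, so they can be assigned to different processors."""
    width = max(x_right - x_left, 1)
    return [(x, y, round(lerp(i_left, i_right, (x - x_left) / width), 3))
            for x in range(x_left, x_right)]

# Two spans of a triangle, shaded independently (e.g., one per processor).
print(shade_span(0, 0, 4, 0.0, 1.0))
print(shade_span(1, 0, 3, 0.2, 0.9))
```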

14 citations


Journal ArticleDOI
TL;DR: This paper presents several parallel, multiwavefront algorithms, based on two approaches (identification and elimination), to verify association patterns specified in queries, thus introducing a higher degree of parallelism in query processing.

11 citations


Journal ArticleDOI
TL;DR: A highly scalable parallel algorithm based on a spatial decomposition of the general unstructured mesh is presented and two spatial decompositions are compared, the recursive inertia partitioning (RIP) algorithm and the Greedy algorithm.
Abstract: The planar generalized Yee (PGY) algorithm is an extension of the generalized Yee algorithm and the discrete surface integral (DSI) methods, which are based on explicit time-marching solutions of Maxwell's equations. Specifically, the PGY algorithm exploits the planar symmetries of printed microwave circuit devices, achieving great savings in both CPU time and memory. Since the PGY algorithm is an explicit method, it has a high degree of parallelism. To this end, a highly scalable parallel algorithm based on a spatial decomposition of the general unstructured mesh is presented. Two spatial decompositions are compared: the recursive inertia partitioning (RIP) algorithm and the Greedy algorithm. The Greedy algorithm provides optimal load balance, whereas the RIP algorithm more effectively minimizes shared boundary interface lengths. Through numerical examples, it is demonstrated that the Greedy algorithm provides superior speedups. It is also demonstrated that the parallel PGY algorithm is highly scalable.
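The parallelism of explicit time-marching is easiest to see in the standard 1-D Yee leapfrog update, shown below as a generic illustration (not the PGY unstructured-mesh scheme): within each half-step, every cell update reads only neighbor values from the previous half-step, so all cells can be updated simultaneously.

```python
import math

def fdtd_1d(steps=200, nx=200, courant=0.5):
    """Generic 1-D Yee leapfrog update in normalized units. Within each
    half-step every cell reads only previously computed neighbor values,
    so all cells can be updated concurrently."""
    E = [0.0] * nx
    H = [0.0] * (nx - 1)
    for n in range(steps):
        for i in range(nx - 1):              # all H updates are independent
            H[i] += courant * (E[i + 1] - E[i])
        for i in range(1, nx - 1):           # all interior E updates, too
            E[i] += courant * (H[i] - H[i - 1])
        E[nx // 4] += math.exp(-((n - 30.0) / 10.0) ** 2)   # soft source
    return E

print(f"max |E| = {max(abs(e) for e in fdtd_1d()):.3f}")
```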

10 citations


Journal ArticleDOI
TL;DR: A level of performance that hitherto required a dedicated integrated circuit is achieved with a programmable parallel processor architecture, yielding high performance for a wide range of applications.

9 citations


Journal Article
TL;DR: This paper discusses some new viewpoints for the construction of effective preconditioners, including re-ordering, series expansion, and domain decomposition techniques, as well as parallelization aspects. Re-ordering and approximations obtained by truncating certain series expansions will increase the parallelism, but usually with a deterioration in the convergence rate.

Journal Article
TL;DR: This paper presents a parallel simulated annealing algorithm for solving the problem of mapping irregular parallel programs onto homogeneous processor arrays with regular topology.
Abstract: This paper presents a parallel simulated annealing algorithm for solving the problem of mapping irregular parallel programs onto homogeneous processor arrays with regular topology. The algorithm constructs and uses joint transformations. These transformations guarantee a high degree of parallelism that is bounded below by |Np| / (deg(Gp) + 1), where |Np| is the number of task nodes in the mapped program graph Gp and deg(Gp) is the maximal degree of a node in Gp. The mapping algorithm provides good program mappings (in terms of program execution time and the number of processors used) in a reasonable number of steps.
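For orientation, a sequential simulated-annealing move loop for the mapping problem might look like the sketch below; the cost function (cut edges plus load imbalance) and all parameter values are illustrative assumptions, and the paper's joint transformations, which make such moves safe to apply in parallel, are not reproduced.

```python
import math, random

def anneal_mapping(n_tasks, n_procs, edges, T=10.0, alpha=0.999, steps=5000):
    """Map tasks onto processors, minimizing cut edges (communication) plus
    load imbalance, by single-task-move simulated annealing."""
    place = [random.randrange(n_procs) for _ in range(n_tasks)]

    def cost(pl):
        comm = sum(1 for u, v in edges if pl[u] != pl[v])
        loads = [pl.count(p) for p in range(n_procs)]
        return comm + (max(loads) - min(loads))

    cur = cost(place)
    for _ in range(steps):
        t, p = random.randrange(n_tasks), random.randrange(n_procs)
        old = place[t]
        place[t] = p
        new = cost(place)
        # Metropolis rule: always accept improvements, sometimes accept uphill.
        if new <= cur or random.random() < math.exp((cur - new) / T):
            cur = new
        else:
            place[t] = old             # reject: undo the move
        T = max(T * alpha, 1e-9)       # geometric cooling with a floor
    return place, cur

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3), (4, 5)]
print(anneal_mapping(n_tasks=6, n_procs=2, edges=edges))
```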

Journal ArticleDOI
TL;DR: A finite element model is developed and used to simulate three-dimensional compressible fluid flow on a massively parallel computer and a high degree of parallelism has been achieved utilizing a MasPar MP-2 SIMD computer.
Abstract: A finite element model is developed and used to simulate three-dimensional compressible fluid flow on a massively parallel computer. The algorithm is based on a Petrov-Galerkin weighting of the convective terms in the governing equations. The discretized time-dependent equations are solved explicitly using a second-order Runge-Kutta scheme. A high degree of parallelism has been achieved utilizing a MasPar MP-2 SIMD computer. An automated conversion program is used to translate the original Fortran 77 code into the Fortran 90 needed for parallelization. This conversion program and the use of compiler directives allow the maintenance of one version of the code for use on either vector or parallel machines.
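The SIMD-friendliness of the scheme comes from the explicit second-order Runge-Kutta update: every unknown advances using only already-known values. A minimal midpoint-rule sketch on a toy periodic advection system (all names and the test problem are assumptions, not the authors' code):

```python
def rk2_step(f, u, t, dt):
    """One explicit second-order Runge-Kutta (midpoint) step. Every entry of
    u is advanced from already-known values, so on a SIMD machine all
    entries can be updated in lockstep."""
    k1 = f(t, u)
    u_mid = [ui + 0.5 * dt * ki for ui, ki in zip(u, k1)]
    k2 = f(t + 0.5 * dt, u_mid)
    return [ui + dt * ki for ui, ki in zip(u, k2)]

# Toy test problem: upwind-discretized advection on a periodic ring.
def rhs(t, u):
    return [-(u[i] - u[i - 1]) for i in range(len(u))]

u = [1.0 if i == 0 else 0.0 for i in range(8)]
for step in range(10):
    u = rk2_step(rhs, u, step * 0.5, 0.5)
print([f"{v:.3f}" for v in u])
```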

Journal ArticleDOI
TL;DR: In this paper, a method for processing cross-wire data at high speed in real time is described, where the measurements are linearly dependent on the effective cooling velocity of each wire, expressed uniquely in terms of a single variable such as output voltage.

Journal ArticleDOI
TL;DR: Experiments indicate that the numerical stability of the algorithm derived to solve tridiagonal linear systems with a high degree of parallelism is similar to Gaussian elimination with partial pivoting.
Abstract: Based on the parallel minimal norm method, an algorithm is derived to solve tridiagonal linear systems with a high degree of parallelism. No conditions need to be posed with respect to the system. Experiments indicate that the numerical stability of the algorithm is similar to Gaussian elimination with partial pivoting.
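The paper's minimal-norm algorithm is not reproduced here, but cyclic (odd-even) reduction, a standard parallel tridiagonal method, illustrates the kind of parallelism at stake: all eliminations within a level are independent, so the solve completes in O(log n) parallel steps rather than the Thomas algorithm's O(n) sequential sweep.

```python
import math

def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for n = 2**k - 1
    unknowns by odd-even cyclic reduction (a[0] and c[n-1] must be 0)."""
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    s = 1
    for _ in range(int(math.log2(n + 1)) - 1):
        # All centers i at this level are updated independently: parallel.
        for i in range(2 * s - 1, n, 2 * s):
            al = -a[i] / b[i - s]
            be = -c[i] / b[i + s] if i + s < n else 0.0
            d[i] += al * d[i - s] + (be * d[i + s] if i + s < n else 0.0)
            b[i] += al * c[i - s] + (be * a[i + s] if i + s < n else 0.0)
            a[i] = al * a[i - s]
            c[i] = be * c[i + s] if i + s < n else 0.0
        s *= 2
    x = [0.0] * n
    x[s - 1] = d[s - 1] / b[s - 1]        # single equation left in the middle
    while s > 1:
        s //= 2
        # All back-substitutions at this level are independent: parallel.
        for i in range(s - 1, n, 2 * s):
            xl = x[i - s] if i - s >= 0 else 0.0
            xr = x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i]
    return x

# -x[i-1] + 2x[i] - x[i+1] = 1 with 7 unknowns (a discrete Poisson problem).
n = 7
print(cyclic_reduction([0.0] + [-1.0] * (n - 1), [2.0] * n,
                       [-1.0] * (n - 1) + [0.0], [1.0] * n))
```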

Book ChapterDOI
01 Jan 1995
TL;DR: A parallel processing system is designed for plasma position and shape control in the JT-60 upgrade tokamak, with the optimum number of DSPs estimated according to the degree of parallelism in the control processing and its allowable delay.
Abstract: A parallel processing system is designed for plasma position and shape control in the JT-60 upgrade tokamak. Fast digital signal processors (DSPs) which have inter-processor communication ports are applied to the system. The optimum number of DSPs is estimated according to the degree of parallelism in the control processing and its allowable delay.

Proceedings ArticleDOI
01 Jul 1995
TL;DR: The model is used to describe closed stochastic queueing network simulations, and analysis of their execution results suggests that the model makes available a promising degree of parallelism.
Abstract: This paper presents an approach for speculative parallel execution of rendezvous-synchronized simulations. Rendezvous-synchronized simulation is based on the notions of processes and gates and on the rendezvous mechanism defined in the basic process algebra of LOTOS, a standard formal specification language for temporal ordering [2]. Time is introduced via a mechanism similar to the delay behaviour annotation provided by the TOPO toolset [4-6]. The algorithm allows speculative gate activations. This increases the available parallelism while ensuring correct execution of the computation. The model is used to describe closed stochastic queueing network simulations. Analysis of their execution results suggests that the model makes available a promising degree of parallelism.

Journal Article
TL;DR: The choice of the optimal number of parallel branches, according to the criterion of minimum total time cost for the automatic generation of a parallel program and its execution, is considered for various subclasses of computer systems with MIMD architecture.
Abstract: The choice of the optimal number of parallel branches, according to the criterion of minimum total time cost for the automatic generation of a parallel program and its execution, is considered for various subclasses of computer systems with MIMD architecture.

01 Jan 1995
TL;DR: An architectural framework for parallel time-recursive computation is established and it is shown that the structure of the realization of a given linear operator is dictated by the decomposition of the latter with respect to proper basis functions.
Abstract: The time-recursive computation has been proven a particularly useful tool in real-time data compression, in transform domain adaptive filtering, and in spectrum analysis. Unlike the FFT-based ones, the time-recursive architectures require only local communication. Also, they are modular and regular, thus they are very appropriate for VLSI implementation and they allow a high degree of parallelism. In this two-part paper, we establish an architectural framework for parallel time-recursive computation. We consider a class of linear operators that consists of the discrete time, time invariant, compactly supported, but otherwise arbitrary kernel functions. We show that the structure of the realization of a given linear operator is dictated by the decomposition of the latter with respect to proper basis functions. An optimal way for carrying out this decomposition is demonstrated. The parametric forms of the basis functions are identified and their properties pertinent to the architecture design are studied. A library of architectural building modules capable of realizing these functions is developed. An analysis of the implementation complexity for the aforementioned modules is conducted. Based on this framework, the time-recursive architecture of a given linear operator can be derived in a systematic routine way.

01 Jan 1995
TL;DR: In this paper, a finite element model is developed and used to simulate three-dimensional compressible fluid flow on a massively parallel computer, which is based on a Petrov-Galerkin weighting of the convective terms in the governing equations.
Abstract: A finite element model is developed and used to simulate three-dimensional compressible fluid flow on a massively parallel computer. The algorithm is based on a Petrov-Galerkin weighting of the convective terms in the governing equations. The discretized time-dependent equations are solved explicitly using a second-order Runge-Kutta scheme. A high degree of parallelism has been achieved utilizing a MasPar MP-2 SIMD computer. An automated conversion program is used to translate the original Fortran 77 code into the Fortran 90 needed for parallelization. This conversion program and the use of compiler directives allow the maintenance of one version of the code for use on either vector or parallel machines.

Proceedings ArticleDOI
19 Apr 1995
TL;DR: This paper presents a hierarchical parallel execution model for Prolog programs; the model is based on Or-parallelism/And-parallelism as coarse-grain parallelism and parallel unification as fine-grain parallelism.
Abstract: This paper presents a hierarchical parallel execution model for Prolog programs. The execution model is based on Or-parallelism/And-parallelism as coarse-grain parallelism and parallel unification as fine-grain parallelism. At the coarse-grain level we propose an extended And-Or tree, which can exploit a high degree of parallelism from Prolog programs. Exploiting the parallelism of Prolog programs is based on the binding-arrays method for Or-parallelism and the restricted And-parallelism (RAP) method for And-parallelism. At the fine-grain level, parallel unification is performed. In general, parallel unification consists of parallel argument matching and consistency checking. However, since the RAP method does not need consistency checking, consistency checking at the fine-grain level is also removed. Measurements of the degree of parallelism of this model are also presented in this paper.

Book ChapterDOI
04 Sep 1995
TL;DR: New bounds on the performance of work-greedy assignment schemes are presented, taking into account the degree of parallelism visible between the tasks and the inter-task communication delays.
Abstract: Given an irregular dependency graph consisting of interdependent tasks, the problem of finding an optimal assignment onto a number of parallel execution units is NP-complete. Assignment schemes therefore settle for heuristics that produce sub-optimal solutions. The most popular of these are work-greedy assignment schemes. This paper presents new bounds on the performance of work-greedy schemes, taking into account the degree of parallelism visible between the tasks and the inter-task communication delays.
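Work-greedy schemes are variants of list scheduling: never leave a processor idle while some task is ready. Below is a minimal sketch; the task graph, costs, and FIFO tie-breaking are illustrative, and Graham's classic (2 - 1/p) bound applies only in the absence of communication delays, which is precisely the setting the paper's bounds generalize.

```python
import heapq

def list_schedule(tasks, succs, cost, n_procs):
    """Work-greedy list scheduling: a free processor immediately grabs some
    ready task, so no processor idles while work is available."""
    preds = {t: [] for t in tasks}
    for t in tasks:
        for s in succs.get(t, ()):
            preds[s].append(t)
    indeg = {t: len(preds[t]) for t in tasks}
    ready = [t for t in tasks if indeg[t] == 0]
    procs = [(0, p) for p in range(n_procs)]   # (time this processor frees, id)
    heapq.heapify(procs)
    finish, schedule = {}, []
    while ready:
        t = ready.pop(0)                       # FIFO priority: illustrative
        free_at, p = heapq.heappop(procs)
        start = max([free_at] + [finish[u] for u in preds[t]])
        finish[t] = start + cost[t]
        schedule.append((t, p, start))
        heapq.heappush(procs, (finish[t], p))
        for s in succs.get(t, ()):
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return schedule, max(finish.values())

tasks = ["a", "b", "c", "d"]
succs = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
print(list_schedule(tasks, succs, cost, n_procs=2))
```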

Proceedings ArticleDOI
20 Nov 1995
TL;DR: The architecture of a new configurable parallel neurocomputer optimized for the high-speed simulation of neural networks is presented; it supports several accuracies in all typical neural network operations and offers good scalability.
Abstract: The paper presents the architecture of a new configurable parallel neurocomputer optimized for the high-speed simulation of neural networks. Its main system feature is the reconfigurability of a new arithmetical unit chip, which supports several accuracies in all typical neural network operations. If the required accuracy is decreased, the degree of parallelism inside the chip can be increased by a dynamical reconfiguration of the hardware resources. The system also offers good scalability: for the simulation of large neural networks, the system performance can easily be increased by using several arithmetical unit chips operating in parallel.

01 Jan 1995
TL;DR: This paper presents a new approach which can significantly increase parallelism by adding appropriate synchronizations to loops; the approach is demonstrated on the CRAY MPP, which has very fast synchronization mechanisms.
Abstract: Loops are the primary source of parallelism in parallel processing. Two iterations in a loop are flow dependent if the results computed at one iteration are used by the other. Otherwise, they are independent. Independent iterations in loops can be scheduled in any order or partitioned to any processors without explicit synchronizations. Dependent iterations are partitioned into sets (one or many) such that iterations in different sets are independent (e.g., the minimum distance method). These sets can be executed in parallel without explicit synchronizations. The degree of parallelism is the number of sets. This paper presents a new approach which can significantly increase the parallelism by adding appropriate synchronizations. The implementation feasibility and performance benefits of this approach are demonstrated on the CRAY MPP, which has very fast synchronization mechanisms.
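The baseline the paper improves on can be sketched as follows: with a single flow dependence of distance d, the minimum distance method splits the iteration space into d chains that need no synchronization among them. The loop body and values below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def chains(n_iters, d):
    """With a single flow dependence of distance d, iteration i depends only
    on i - d, so the iteration space splits into d chains; iterations in
    different chains are independent and the degree of parallelism is d."""
    return [list(range(r, n_iters, d)) for r in range(d)]

# Hypothetical loop body: a[i] = a[i - 3] + 1 (dependence distance d = 3).
d, n = 3, 12
a = list(range(n))

def run_chain(chain):
    for i in chain:              # sequential inside a chain (the dependence)
        if i - d >= 0:
            a[i] = a[i - d] + 1

with ThreadPoolExecutor(max_workers=d) as pool:   # chains need no sync
    list(pool.map(run_chain, chains(n, d)))
print(a)
```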

Book ChapterDOI
29 Aug 1995
TL;DR: Different classes of modern methods in scientific computing and their parallel implementation will be discussed.
Abstract: Most processes in the real world are local and contain a high degree of parallelism. A simple example is weather prediction. The weather at any location depends on the weather at earlier times in the neighborhood. Causality is, however, another important physical principle preventing parallelism. The weather at one instant must be known before the weather at later times can be computed. For problems of the size of weather prediction, the required recursiveness in the algorithm does not, in practice, prohibit a high degree of parallelism. The parallel computation at each time instant saturates the computer. The computational algorithms should capture the parallelism in these processes and map them efficiently onto the current architecture. Quite often the original physical process is approximated in such a way that the local dependence is lost. This happens, for example, when steady state is assumed. Furthermore, many modern computational methods are hierarchical and contain some global interconnection even if the underlying process is local. The overall efficiency depends on how well this connectivity is supported by the architecture. Different classes of modern methods in scientific computing and their parallel implementation will be discussed.