
Showing papers on "Parallel algorithm published in 1993"


01 May 1993
TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.
Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently—those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers: the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90%, and an 1840-node Intel Paragon performs up to 165 times faster than a single Cray C90 processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.

29,323 citations
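As a rough aid to intuition, here is a toy Python sketch of the spatial-decomposition strategy described in this abstract: each "processor" owns a slab of the domain, imports nearby ghost atoms, and computes short-range forces only for the atoms it owns. It is a 1D toy with invented reduced Lennard-Jones units, no periodic boundaries, and no real message passing; it is not the authors' implementation.

```python
import numpy as np

def lj_force_over_r(r):
    """Lennard-Jones force magnitude divided by r, in reduced units."""
    inv_r2 = 1.0 / (r * r)
    inv_r6 = inv_r2 ** 3
    return 24.0 * inv_r2 * inv_r6 * (2.0 * inv_r6 - 1.0)

def slab_decomposition_forces(x, box, nproc, cutoff):
    """Toy sketch: loop over slabs standing in for independent processes."""
    forces = np.zeros_like(x)
    slab = box / nproc
    for p in range(nproc):
        lo, hi = p * slab, (p + 1) * slab
        owned = np.where((x >= lo) & (x < hi))[0]
        # ghost atoms: owned elsewhere, but within the cutoff of this slab
        ghosts = np.where(((x >= lo - cutoff) & (x < lo)) |
                          ((x >= hi) & (x < hi + cutoff)))[0]
        local = np.concatenate([owned, ghosts])   # what a real code would receive
        for i in owned:                           # forces accumulate only on owned atoms
            for j in local:
                if i == j:
                    continue
                dx = x[i] - x[j]
                if abs(dx) < cutoff:
                    forces[i] += lj_force_over_r(abs(dx)) * dx
    return forces

rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 10.0, size=200)
print(slab_decomposition_forces(positions, box=10.0, nproc=4, cutoff=2.5)[:3])
```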


Proceedings ArticleDOI
01 Jul 1993
TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.
Abstract: A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.

1,515 citations
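A back-of-the-envelope Python sketch of how the four LogP parameters are used in analysis (L = latency, o = per-message overhead, g = gap, P = processors). The parameter values are invented, and the broadcast estimate is a simple binomial-tree bound, not the optimal LogP broadcast schedule from the paper.

```python
import math

def point_to_point(L, o):
    """One small message: send overhead + wire latency + receive overhead."""
    return 2 * o + L

def binomial_broadcast(L, o, g, P):
    """Rough estimate: each round doubles the number of informed processors,
    and a sender cannot issue messages faster than the gap g allows."""
    rounds = math.ceil(math.log2(P))
    per_round = max(point_to_point(L, o), g + o)
    return rounds * per_round

print(point_to_point(L=6, o=2))                  # 10 cycles for one message
print(binomial_broadcast(L=6, o=2, g=4, P=32))   # 5 rounds * 10 = 50 cycles
```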


Book
01 Nov 1993
TL;DR: High Performance Fortran is a set of extensions to Fortran expressing parallel execution at a relatively high level that brings the convenience of sequential Fortran a step closer to today's complex parallel machines.
Abstract: From the Publisher: High Performance Fortran (HPF) is a set of extensions to Fortran expressing parallel execution at a relatively high level. For the thousands of scientists, engineers, and others who wish to take advantage of the power of both vector and parallel supercomputers, five of the principal authors of HPF have teamed up here to write a tutorial for the language. There is an increasing need for a common parallel Fortran that can serve as a programming interface with the new parallel machines that are appearing on the market. While HPF does not solve all the problems of parallel programming, it does provide a portable, high-level expression for data-parallel algorithms that brings the convenience of sequential Fortran a step closer to today's complex parallel machines.

677 citations


Journal ArticleDOI
01 Dec 1993
TL;DR: This paper shows how simple and parallel techniques can be combined to achieve this goal and deal with complex real world scenes and shows that the algorithm relies on correlation followed by interpolation and performs very well on difficult images such as faces and cluttered ground level scenes.
Abstract: To compute reliable dense depth maps, a stereo algorithm must preserve depth discontinuities and avoid gross errors. In this paper, we show how simple and parallel techniques can be combined to achieve this goal and deal with complex real world scenes. Our algorithm relies on correlation followed by interpolation. During the correlation phase the two images play a symmetric role and we use a validity criterion for the matches that eliminates gross errors: at places where the images cannot be correlated reliably, due to lack of texture or occlusions for example, the algorithm does not produce wrong matches but a very sparse disparity map, as opposed to a dense one when the correlation is successful. To generate a dense depth map, the information is then propagated across the featureless areas, but not across discontinuities, by an interpolation scheme that takes image grey levels into account to preserve image features. We show that our algorithm performs very well on difficult images such as faces and cluttered ground level scenes. Because all the algorithms described here are parallel and very regular they could be implemented in hardware and lead to extremely fast stereo systems.

483 citations
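The symmetric validity criterion amounts to a left-right consistency test. Below is a hedged Python sketch of that idea using plain SSD window matching; it is a generic stand-in, not the authors' correlation measure, and the interpolation step is not shown.

```python
import numpy as np

def disparity_map(ref, other, max_d, w, sign):
    """Window-based SSD matching; sign=-1 when `ref` is the left image
    (matches lie to the left in the right image), sign=+1 the other way."""
    h, wid = ref.shape
    disp = np.full((h, wid), -1, dtype=int)
    for y in range(w, h - w):
        for x in range(w, wid - w):
            patch = ref[y - w:y + w + 1, x - w:x + w + 1].astype(float)
            best, best_cost = -1, np.inf
            for d in range(max_d + 1):
                xx = x + sign * d
                if xx - w < 0 or xx + w >= wid:
                    break
                cand = other[y - w:y + w + 1, xx - w:xx + w + 1].astype(float)
                cost = np.sum((patch - cand) ** 2)
                if cost < best_cost:
                    best, best_cost = d, cost
            disp[y, x] = best
    return disp

def cross_check(d_left, d_right):
    """Keep a match only if the right-to-left search lands back on the same pixel."""
    h, w = d_left.shape
    valid = np.full((h, w), -1, dtype=int)
    for y in range(h):
        for x in range(w):
            d = d_left[y, x]
            if d >= 0 and x - d >= 0 and d_right[y, x - d] == d:
                valid[y, x] = d
    return valid

left = np.random.default_rng(3).integers(0, 255, size=(32, 48))
right = np.roll(left, -2, axis=1)                  # synthetic shift of 2 pixels
dl = disparity_map(left, right, max_d=4, w=2, sign=-1)
dr = disparity_map(right, left, max_d=4, w=2, sign=+1)
print(np.count_nonzero(cross_check(dl, dr) == 2))  # most validated pixels report d = 2
```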


Proceedings ArticleDOI
01 Dec 1993
TL;DR: The authors report on an efficient adaptive N-body method which computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number.
Abstract: The authors report on an efficient adaptive N-body method which they have recently designed and implemented. The algorithm computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number. The accuracy of the force calculations is analytically bounded, and can be adjusted via a user-defined parameter from a few percent relative accuracy down to machine arithmetic accuracy. Instead of using pointers to indicate the topology of the tree, the authors identify each possible cell with a key. The mapping of keys into memory locations is achieved via a hash table. This allows the program to access data in an efficient manner across multiple processors. Performance of the parallel program is measured on the 512-processor Intel Touchstone Delta system. Comments on a number of wide-ranging applications which can benefit from application of this type of algorithm are included.

457 citations
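A small Python sketch of the central idea of identifying cells by keys rather than pointers: interleave the bits of the integer cell coordinates under a leading placeholder bit (a Morton-style key) and use an ordinary hash table to map keys to cell data. The key format here is a simplified assumption, not the paper's exact encoding.

```python
def cell_key(ix, iy, iz, level):
    """Interleave coordinate bits (Morton order) under a leading placeholder bit,
    so cells at different tree depths get distinct keys."""
    key = 1
    for b in reversed(range(level)):
        key = (key << 3) | (((ix >> b) & 1) << 2) | (((iy >> b) & 1) << 1) | ((iz >> b) & 1)
    return key

# An ordinary dict plays the role of the hash table mapping keys to cell data.
cells = {}
cells[cell_key(3, 5, 1, level=3)] = {"mass": 1.0, "center_of_mass": (0.4, 0.7, 0.2)}
print(cells)   # the key encodes the cell's path from the root of the octree
```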


Journal ArticleDOI
TL;DR: This paper describes an insertion algorithm for the Vehicle Routing and Scheduling Problem with Time Windows that builds routes in parallel and uses a generalized regret measure over all unrouted customers to select the next candidate for insertion.

436 citations
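A hedged Python sketch of a regret-based selection rule of the kind the TL;DR describes; the precise regret measure, the time-window and capacity feasibility checks, and the parallel route construction of the paper are not reproduced here. `dist` is any callable giving travel cost between two customers.

```python
def cheapest_insertion_cost(route, customer, dist):
    """Cheapest extra distance for inserting `customer` between consecutive stops."""
    best = float("inf")
    for i in range(len(route) - 1):
        extra = (dist(route[i], customer) + dist(customer, route[i + 1])
                 - dist(route[i], route[i + 1]))
        best = min(best, extra)
    return best

def select_by_regret(unrouted, routes, dist):
    """Pick the unrouted customer whose regret (what is lost by not placing it
    in its best route now), summed over the remaining routes, is largest."""
    best_customer, best_regret = None, -1.0
    for c in unrouted:
        costs = sorted(cheapest_insertion_cost(r, c, dist) for r in routes)
        regret = sum(cost - costs[0] for cost in costs[1:])
        if regret > best_regret:
            best_customer, best_regret = c, regret
    return best_customer

# tiny usage example on points in the plane; customer 0 is the depot
points = {0: (0, 0), 1: (2, 1), 2: (-1, 3), 3: (4, 4)}
d = lambda a, b: ((points[a][0] - points[b][0]) ** 2 + (points[a][1] - points[b][1]) ** 2) ** 0.5
routes = [[0, 1, 0], [0, 2, 0]]           # two open routes, depot to depot
print(select_by_regret({3}, routes, d))   # -> 3, the only unrouted customer
```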


Book
01 Sep 1993
TL;DR: These thorough but introductory presentations demonstrate the most important algorithmic techniques and their use in exposing the hidden parallelism within problems.
Abstract: From the Publisher: This landmark collaboration will enhance the knowledge and abilities of anyone interested in parallel algorithms and in developing programs for parallel computers. Many problems currently solved with sequential algorithms are themselves highly parallelizable when designers use the powerful parallel techniques now available. These thorough but introductory presentations demonstrate the most important algorithmic techniques and their use in exposing the hidden parallelism within problems. Beginning with familiar sequential algorithms, the authors provide a careful description of the fundamental problem, its solution, and analysis, complete with examples and exercises. Each of the 22 chapters then synthesizes a more sophisticated parallel algorithm using the simpler sequential and parallel techniques used to introduce the problem. The PRAM shared-memory model of computing provides a unifying framework. This model has been used extensively for designing parallel algorithms and can be efficiently simulated on many of the parallel architectures now in use. Applying the methods in this book will offer designers a substantial advantage when solving problems for parallel computation.

352 citations


Journal ArticleDOI
TL;DR: Isoefficiency analysis helps to determine the best algorithm/architecture combination for a particular problem without explicitly analyzing all possible combinations under all possible conditions.
Abstract: Isoefficiency analysis helps us determine the best algorithm/architecture combination for a particular problem without explicitly analyzing all possible combinations under all possible conditions.

329 citations
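A numeric Python sketch of how an isoefficiency function is used: fix a target efficiency E and ask how the problem size W must grow with the processor count p. The overhead term T_o(W, p) = p log2 p below is purely illustrative and is not one of the algorithm/architecture combinations analyzed by the authors.

```python
import math

def required_problem_size(p, E, overhead=lambda W, p: p * math.log2(p)):
    """Solve W = E/(1-E) * T_o(W, p) by fixed-point iteration."""
    K = E / (1.0 - E)
    W = 1.0
    for _ in range(100):
        W = K * overhead(W, p)
    return W

for p in (4, 16, 64, 256):
    # with this overhead term, W must grow roughly as p log p to hold E = 0.8
    print(p, round(required_problem_size(p, E=0.8)))
```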


Journal ArticleDOI
TL;DR: It is proved that every nonlinear clustering of a coarse grain DAG can be transformed into a linear clustering whose parallel time is less than or equal to that of the nonlinear clustering.
Abstract: The authors consider the impact of the granularity on scheduling task graphs. Scheduling consists of two parts: the processor assignment of tasks, also called clustering, and the ordering of tasks for execution in each processor. The authors introduce two types of clusterings: nonlinear and linear clusterings. A clustering is nonlinear if two parallel tasks are mapped in the same cluster; otherwise it is linear. Linear clustering fully exploits the natural parallelism of a given directed acyclic task graph (DAG), while nonlinear clustering sequentializes independent tasks to reduce parallelism. The authors also introduce a new quantification of the granularity of a DAG and define a coarse grain DAG as one whose granularity is greater than one. It is proved that every nonlinear clustering of a coarse grain DAG can be transformed into a linear clustering whose parallel time is less than or equal to that of the nonlinear one. This result is used to prove the optimality of some important linear clusterings used in parallel numerical computing.

302 citations
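A short Python sketch of the definition at the heart of the result: a clustering is linear exactly when no two tasks placed in the same cluster are independent (unordered by the DAG). The task graph and the two example clusterings below are invented for illustration.

```python
from itertools import combinations

def reachable(dag, src):
    """Set of nodes reachable from src in the task DAG (adjacency-list dict)."""
    seen, stack = set(), [src]
    while stack:
        u = stack.pop()
        for v in dag.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_linear(dag, clustering):
    """clustering maps task -> cluster id; linear means no cluster holds two parallel tasks."""
    reach = {u: reachable(dag, u) for u in dag}
    by_cluster = {}
    for task, c in clustering.items():
        by_cluster.setdefault(c, []).append(task)
    for tasks in by_cluster.values():
        for a, b in combinations(tasks, 2):
            if b not in reach[a] and a not in reach[b]:
                return False        # two independent tasks share a cluster
    return True

dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(is_linear(dag, {"a": 0, "b": 0, "d": 0, "c": 1}))   # True: a->b->d is a chain
print(is_linear(dag, {"a": 0, "b": 0, "c": 0, "d": 1}))   # False: b and c are parallel
```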


Journal ArticleDOI
TL;DR: Algorithms that are efficient for solving a variety of problems involving graphs and digitized images are introduced that are asymptotically superior to those previously obtained for the mesh, the mesh with multiple broadcasting, the mesh with multiple buses, the mesh-of-trees, and the pyramid computer.
Abstract: The mesh with reconfigurable bus is presented as a model of computation. The reconfigurable mesh captures salient features from a variety of sources, including the CAAPP, CHiP, polymorphic-torus network, and bus automaton. It consists of an array of processors interconnected by a reconfigurable bus system that can be used to dynamically obtain various interconnection patterns between the processors. A variety of fundamental data-movement operations for the reconfigurable mesh are introduced. Based on these operations, algorithms that are efficient for solving a variety of problems involving graphs and digitized images are also introduced. The algorithms are asymptotically superior to those previously obtained for the aforementioned reconfigurable architectures, as well as to those previously obtained for the mesh, the mesh with multiple broadcasting, the mesh with multiple buses, the mesh-of-trees, and the pyramid computer. The power of reconfigurability is illustrated by solving some problems, such as the exclusive OR, more efficiently on the reconfigurable mesh than is possible on the parallel random-access machine (PRAM).

261 citations


ReportDOI
01 May 1993
TL;DR: In this article, three parallel algorithms for classical molecular dynamics are presented, which can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors.
Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a subset of atoms; the second assigns each a subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently -- those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 10,000,000 atoms on three parallel supercomputers: the nCUBE 2, Intel iPSC/860, and Intel Delta. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and the Intel Delta performs about 30 times faster than a single Y-MP processor and 12 times faster than a single C90 processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.

Proceedings ArticleDOI
03 Nov 1993
TL;DR: New techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory are given, and these algorithms are the first known optimal algorithms for a wide range of two-level and hierarchical multilevel memory models, including parallel models.
Abstract: In this paper we give new techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory. We use these techniques to develop optimal and practical algorithms for a number of important large-scale problems. We discuss our algorithms primarily in the context of single processor/single disk machines, a domain in which they are not only the first known optimal results but also of tremendous practical value. Our methods also produce the first known optimal algorithms for a wide range of two-level and hierarchical multilevel memory models, including parallel models. The algorithms are optimal both in terms of I/O cost and internal computation.

Journal ArticleDOI
TL;DR: Universal randomized methods for parallelizing sequential backtrack search and branch-and-bound computation are presented and demonstrate the effectiveness of randomization in distributed parallel computation.
Abstract: Universal randomized methods for parallelizing sequential backtrack search and branch-and-bound computation are presented. These methods execute on message-passing multiprocessor systems, and require no global data structures or complex communication protocols. For backtrack search, it is shown that, uniformly on all instances, the method described in this paper is likely to yield a speed-up within a small constant factor from optimal, when all solutions to the problem instance are required. For branch-and-bound computation, it is shown that, uniformly on all instances, the execution time of this method is unlikely to exceed a certain inherent lower bound by more than a constant factor. These randomized methods demonstrate the effectiveness of randomization in distributed parallel computation.

Journal ArticleDOI
TL;DR: A transitive-closure-based test generation algorithm is presented in which dependences derived from the transitive closure are used to reduce ternary relations to binary relations that, in turn, dynamically update the transitive closure.
Abstract: A transitive-closure-based test generation algorithm is presented. A test is obtained by determining signal values that satisfy a Boolean equation derived from the neural network model of the circuit incorporating necessary conditions for fault activation and path sensitization. The algorithm is a sequence of two main steps that are repeatedly executed: transitive closure computation and decision-making. A key feature of the algorithm is that dependences derived from the transitive closure are used to reduce ternary relations to binary relations that in turn dynamically update the transitive closure. The signals are either determined from the transitive closure or are enumerated until the Boolean equation is satisfied. Experimental results on the ISCAS 1985 and the combinational parts of ISCAS 1989 benchmark circuits are presented to demonstrate efficient test generation and redundancy identification. Results on four state-of-the-art production VLSI circuits are also presented.

Journal ArticleDOI
TL;DR: It is still an open issue to decide which of the various architectures (shared-memory, shared-disk, or shared-nothing) is best for database management under various conditions.
Abstract: Parallel database systems attempt to exploit recent multiprocessor computer architectures in order to build high-performance and high-availability database servers at a much lower price than equivalent mainframe computers. Although there are commercial SQL-based products, a number of open problems hamper the full exploitation of the capabilities of parallel systems. These problems touch on issues ranging from those of parallel processing to distributed database management. Furthermore, it is still an open issue to decide which of the various architectures (shared-memory, shared-disk, or shared-nothing) is best for database management under various conditions. Finally, there are new issues raised by the introduction of higher functionality such as knowledge-based or object-oriented capabilities within a parallel database system.

01 Jan 1993
TL;DR: In this paper, the authors describe an efficient approach to partitioning unstructured meshes that occur naturally in the finite element and finite difference methods, making use of the underlying geometric structure of a given mesh and finding a provably good partition in randomized O(n) time.
Abstract: This paper describes an efficient approach to partitioning unstructured meshes that occur naturally in the finite element and finite difference methods. The approach makes use of the underlying geometric structure of a given mesh and finds a provably good partition in randomized O(n) time. It applies to meshes in both two and three dimensions. The new method has applications in efficient sequential and parallel algorithms for large-scale problems in scientific computing. This is an overview paper written with emphasis on the algorithmic aspects of the approach. Many detailed proofs can be found in companion papers.
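For flavor only, here is a much simpler geometric partitioner in Python: recursive coordinate bisection of the vertex coordinates. This is explicitly not the authors' sphere-separator method (which projects points onto a sphere and cuts with a random great circle); it only shows the shape of a partitioner that uses geometry rather than graph connectivity.

```python
import numpy as np

def coordinate_bisection(points, ids, depth):
    """Recursively split vertex ids into 2**depth parts by alternating coordinates."""
    if depth == 0 or len(ids) <= 1:
        return [ids]
    axis = depth % points.shape[1]
    order = ids[np.argsort(points[ids, axis])]
    mid = len(order) // 2
    return (coordinate_bisection(points, order[:mid], depth - 1) +
            coordinate_bisection(points, order[mid:], depth - 1))

pts = np.random.default_rng(2).uniform(size=(1000, 2))   # stand-in mesh vertices
parts = coordinate_bisection(pts, np.arange(len(pts)), depth=3)
print([len(p) for p in parts])   # 8 nearly equal parts
```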

Proceedings ArticleDOI
01 Jul 1993
TL;DR: This paper presents a bottom-up clustering algorithm based on recursive collapsing of small cliques in a graph that leads to a natural parallel implementation in which multiple processors are used to identify clusters simultaneously.
Abstract: In this paper, we present a bottom-up clustering algorithm based on recursive collapsing of small cliques in a graph. The sizes of the small cliques are derived using random graph theory. This clustering algorithm leads to a natural parallel implementation in which multiple processors are used to identify clusters simultaneously. We also present a cluster-based partitioning method in which our clustering algorithm is used as a preprocessing step to both the bisection algorithm by Fiduccia and Mattheyses and a ratio-cut algorithm by Wei and Cheng. Our results show that cluster-based partitioning obtains cut sizes up to 49.6% smaller than the bisection algorithm, and obtains ratio cut sizes up to 66.8% smaller than the ratio-cut algorithm. Moreover, we show that cluster-based partitioning produces much more stable results than direct partitioning.

Journal ArticleDOI
TL;DR: A realistic method is described that scales all relevant parameters under considerations imposed by the application domain, which leads to different conclusions about the effectiveness and design of large multiprocessors than the naive practice of scaling only the data set size.
Abstract: Models for the constraints under which an application should be scaled, including constant problem-size scaling, memory-constrained scaling, and time-constrained scaling, are reviewed. A realistic method is described that scales all relevant parameters under considerations imposed by the application domain. This method leads to different conclusions about the effectiveness and design of large multiprocessors than the naive practice of scaling only the data set size. The primary example application is a simulation of galaxies using the Barnes-Hut hierarchical N-body method.
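A hedged numeric sketch of memory-constrained versus time-constrained scaling, under generic cost assumptions (time per step ~ N log N, memory ~ N) that roughly fit a hierarchical N-body code; the constants and the cost model are assumptions for illustration, not the paper's measurements.

```python
import math

def memory_constrained_N(N1, p):
    """Memory per processor held fixed: N grows linearly with p."""
    return N1 * p

def time_constrained_N(N1, p):
    """Parallel time per step held fixed: solve N log2 N = p * N1 log2 N1."""
    target = p * N1 * math.log2(N1)
    N = target
    for _ in range(50):                 # simple fixed-point refinement
        N = target / math.log2(N)
    return N

N1 = 10_000
for p in (1, 16, 256):
    # time-constrained N grows noticeably more slowly than memory-constrained N
    print(p, memory_constrained_N(N1, p), round(time_constrained_N(N1, p)))
```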

Journal ArticleDOI
TL;DR: The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric and show that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh.
Abstract: The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm-architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh despite the fact that large-scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures such as SIMD hypercube and mesh architectures and shared memory architecture.

Proceedings ArticleDOI
01 Jul 1993
TL;DR: Initial benchmark results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.
Abstract: This paper gives an overview of the implementation of NESL, a portable nested data-parallel language. This language and its implementation are the first to fully support nested data structures as well as nested data-parallel function calls. These features allow the concise description of parallel algorithms on irregular data, such as sparse matrices and graphs. In addition, they maintain the advantages of data-parallel languages: a simple programming model and portability. The current NESL implementation is based on an intermediate language called VCODE and a library of vector routines called CVL. It runs on the Connection Machine CM-2, the Cray Y-MP C90, and serial machines. We compare initial benchmark results of NESL with those of machine-specific code on these machines for three algorithms: least-squares line-fitting, median finding, and a sparse-matrix vector product. These results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.

Proceedings ArticleDOI
01 Nov 1993
TL;DR: This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5, and networked workstations.
Abstract: This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5 and networked workstations. This algorithm distributes both the data and the computations to individual processing units to achieve fast, high-quality rendering of high-resolution data. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolumes concurrently. No communication between processing units is needed during this local ray-tracing process. A subimage is generated by each processing unit and the final image is obtained by compositing subimages in the proper order, which can be determined a priori. Test results on the CM-5 and a group of networked workstations demonstrate the practicality of our rendering algorithm and compositing method.
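A minimal Python sketch of compositing subimages with the "over" operator in a predetermined front-to-back order. The ordering here is simply list order, standing in for the view-dependent subvolume order the paper derives, and the parallel (tree-structured) compositing is not shown.

```python
import numpy as np

def over(front, back):
    """Front-to-back 'over' for premultiplied RGBA arrays of shape (H, W, 4) in [0, 1]."""
    alpha_front = front[..., 3:4]
    return front + (1.0 - alpha_front) * back

def composite(subimages_front_to_back):
    """Fold the partial images in the given visibility order."""
    result = subimages_front_to_back[0]
    for img in subimages_front_to_back[1:]:
        result = over(result, img)
    return result

h, w = 4, 4
rng = np.random.default_rng(1)
parts = [rng.uniform(0.0, 0.5, size=(h, w, 4)) for _ in range(3)]   # fake subimages
print(composite(parts).shape)
```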

01 Apr 1993
TL;DR: NESL is intended to be used as a portable interface for programming a variety of parallel and vector supercomputers, and as a basis for teaching parallel algorithms, and several examples of algorithms coded in the language are described.
Abstract: This report describes NESL, a strongly-typed, applicative, data-parallel language. NESL is intended to be used as a portable interface for programming a variety of parallel and vector supercomputers, and as a basis for teaching parallel algorithms. Parallelism is supplied through a simple set of data-parallel constructs based on sequences (ordered sets), including a mechanism for applying any function over the elements of a sequence in parallel and a rich set of parallel functions that manipulate sequences. NESL fully supports nested sequences and nested parallelism -- the ability to take a parallel function and apply it over multiple instances in parallel. Nested parallelism is important for implementing algorithms with complex and dynamically changing data structures, such as required in many graph and sparse matrix algorithms. NESL also provides a mechanism for calculating the asymptotic running time for a program on various parallel machine models, including the parallel random access machine (PRAM). This is useful for estimating running times of algorithms on actual machines and, when teaching algorithms, for supplying a close correspondence between the code and the theoretical complexity. This report defines NESL and describes several examples of algorithms coded in the language. The examples include algorithms for median finding, sorting, string searching, finding prime numbers, and finding a planar convex hull. NESL currently compiles to an intermediate language called Vcode, which runs on the Cray Y-MP, Connection Machine CM-2, and Encore Multimax. For many algorithms, the current implementation gives performance close to optimized machine-specific code for these machines. Note: This report is an updated version of CMU-CS-92-103, which described version 2.4 of the language. The most significant changes in version 2.6 are that it supports polymorphic types, has an ML-like syntax instead of a lisp-like syntax, and includes support for I/O.
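A hedged Python analogue of the nested data-parallel style described above: a sparse matrix as a sequence of rows, each row a sequence of (column, value) pairs, with the matrix-vector product written as an apply-to-each nested inside another. In NESL both levels would execute in parallel; the comprehensions below are sequential stand-ins for that notation, not NESL code.

```python
def sparse_matvec(rows, x):
    """Nested 'apply-to-each': outer over rows, inner over a row's nonzeros."""
    return [sum(value * x[col] for col, value in row) for row in rows]

rows = [[(0, 2.0), (2, 1.0)],            # row 0 has nonzeros in columns 0 and 2
        [(1, 3.0)],
        [(0, 1.0), (1, 1.0), (2, 1.0)]]
print(sparse_matvec(rows, [1.0, 2.0, 3.0]))   # [5.0, 6.0, 6.0]
```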

Journal ArticleDOI
TL;DR: A generalized mapping strategy that uses a combination of graph theory, mathematical programming, and heuristics to guide the mapping of a parallel algorithm and the architecture is proposed.
Abstract: A generalized mapping strategy that uses a combination of graph theory, mathematical programming, and heuristics is proposed. The authors use the knowledge from the given algorithm and the architecture to guide the mapping. The approach begins with a graphical representation of the parallel algorithm (problem graph) and the parallel computer (host graph). Using these representations, the authors generate a new graphical representation (extended host graph) on which the problem graph is mapped. An accurate characterization of the communication overhead is used in the objective functions to evaluate the optimality of the mapping. An efficient mapping scheme is developed which uses two levels of optimization procedures. The objective functions include minimizing the communication overhead and minimizing the total execution time which includes both computation and communication times. The mapping scheme is tested by simulation and further confirmed by mapping a real world application onto actual distributed environments.

Book
15 Mar 1993
TL;DR: This text provides one of the broadest presentations of parallel processing available, including the structure of parallel processors and parallel algorithms, with extensive coverage of array and multiprocessor architectures.
Abstract: From the Publisher: This text provides one of the broadest presentations of parallel processing available, including the structure of parallel processors and parallel algorithms. The emphasis is on mapping algorithms to highly parallel computers, with extensive coverage of array and multiprocessor architectures. Early chapters provide insightful coverage on the analysis of parallel algorithms and program transformations, effectively integrating a variety of material previously scattered throughout the literature. Theory and practice are well balanced across diverse topics in this concise presentation. For exceptional clarity and comprehension, the author presents complex material in geometric graphs as well as algebraic notation. Each chapter includes well-chosen examples, tables summarizing related key concepts and definitions, and a broad range of worked exercises. Features: an overview of common hardware and theoretical models, including algorithm characteristics and impediments to fast performance; analysis of data dependencies and inherent parallelism through program examples, building from simple to complex; graphic and explanatory coverage of program transformations; an easy-to-follow presentation of parallel processor structures and interconnection networks, including parallelizing and restructuring compilers; parallel synchronization methods and types of parallel operating systems; detailed descriptions of hypercube systems; and specialized chapters on dataflow and on AI architectures.

Journal ArticleDOI
TL;DR: A full configuration interaction (FCI) algorithm is presented and discussed, an integral driven formalism based on the explicit construction of tables which realize the correspondence between the FCI vector x and the vector Hx, H being the Hamiltonian matrix of the system.
Abstract: A full configuration interaction (FCI) algorithm is presented and discussed. It is an integral driven formalism based on the explicit construction of tables which realize the correspondence between the FCI vector x and the vector Hx, H being the Hamiltonian matrix of the system. In this way no decomposition of the identity is needed, and in the simplest implementation only the two vectors x and Hx need to be kept on disk. The main test has been done on the cyclic polyene C18H18 in the Pariser–Parr–Pople approximation, where the size of the FCI vector can be reduced to about 73 million components. Running on a CRAY Y‐MP with 4 CPU and 32 MW of core memory, we obtained an elapsed CPU time per iteration of about 300 s and a total elapsed time of 1000 s, which correspond to about 4 and 14 s per million determinants, respectively. The parallel CPU speed‐up obtained by running with the 4 CPU is greater than 3, without any substantial increasing of the memory or disk requirements.

02 Jan 1993
TL;DR: In this paper, the fairness of parallel performance metrics is studied and the theoretical and experimental results show that the most commonly used performance metric, parallel speedup, is unfair in that it favors slow processors and poorly coded programs.
Abstract: Due to programming difficulty, parallel algorithms are commonly compared with different levels of programming effort. They are also compared on different architectures. In this paper, the "fairness" of parallel performance metrics is studied. Theoretical and experimental results show that the most commonly used performance metric, parallel speedup, is "unfair", in that it favors slow processors and poorly coded programs. Two new performance metrics are introduced. The first one, sizeup, provides a "fair" performance measurement. The second one is a generalization of speedup -- the generalized speedup. The relation between sizeup, speedup, and generalized speedup is studied. A real application has been implemented on an nCUBE 2 multicomputer. The experimental results match the analytical results closely.
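A toy Python illustration of the difference between the two metrics: speedup compares times for a fixed problem, while sizeup compares how much work can be finished in a fixed time. The cost model below (a fixed serial portion plus perfectly parallel work) is invented for illustration and is not the application measured in the report.

```python
def run_time(work, p, serial=100.0):
    """Invented cost model: a fixed serial part plus perfectly parallel work."""
    return serial + work / p

def speedup(work, p):
    return run_time(work, 1) / run_time(work, p)

def sizeup(p, budget):
    """Work solvable on p processors within `budget`, relative to 1 processor."""
    def max_work(procs):
        lo, hi = 1.0, 1e9
        for _ in range(200):                 # bisection on the cost model
            mid = 0.5 * (lo + hi)
            if run_time(mid, procs) <= budget:
                lo = mid
            else:
                hi = mid
        return lo
    return max_work(p) / max_work(1)

budget = run_time(10_000, 1)                 # time the serial run takes
for p in (2, 8, 32):
    # under this model, speedup saturates as the serial part dominates,
    # while sizeup keeps growing with p
    print(p, round(speedup(10_000, p), 2), round(sizeup(p, budget), 2))
```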

Proceedings ArticleDOI
27 Apr 1993
TL;DR: It is shown how the total least squares recursive algorithm for the real data FIR (finite impulse response) adaptive filtering problem can be applied to reconstruct a high-resolution filtered image from undersampled, noisy multiframes, when the interframe displacements are not accurately known.
Abstract: It is shown how the total least squares recursive algorithm for the real data FIR (finite impulse response) adaptive filtering problem can be applied to reconstruct a high-resolution filtered image from undersampled, noisy multiframes, when the interframe displacements are not accurately known. This is done in the wavenumber domain after transforming the complex data problem to an equivalent real data problem, to which the algorithm developed by C.E. Davila (Proc. ICASSP 1991 p.1853-6 of 1991) applies. The procedure developed also applies when the multiframes are degraded by linear shift-invariant blurs. All the advantages of implementation via massively parallel computational architecture apply. The performance of the algorithm is verified by computer simulations.

Journal ArticleDOI
TL;DR: A unified approach to the solution of the partitioning problem is presented based on the following concepts: Algorithms are represented by programs, and the concept of stepwise refinement of programs is used to solve the partitioning problem by applying a sequence of provably correct program transformations.

Journal ArticleDOI
TL;DR: Experimental results for many synthetic and practical problems run on various parallel machines that validate the theoretical analysis are presented, and it is shown that the average speedup obtained is linear when the distribution of solutions is uniform and superlinear when the distribution of solutions is nonuniform.
Abstract: Analytical models and experimental results concerning the average case behavior of parallel backtracking are presented. Two types of backtrack search algorithms are considered: simple backtracking, which does not use heuristics to order and prune search, and heuristic backtracking, which does. Analytical models are used to compare the average number of nodes visited in sequential and parallel search for each case. For simple backtracking, it is shown that the average speedup obtained is linear when the distribution of solutions is uniform and superlinear when the distribution of solutions is nonuniform. For heuristic backtracking, the average speedup obtained is at least linear, and the speedup obtained on a subset of instances is superlinear. Experimental results for many synthetic and practical problems run on various parallel machines that validate the theoretical analysis are presented.
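A toy Python simulation in the spirit of the simple-backtracking analysis, with an invented tree shape and solution placement; it only shows how a nonuniform solution distribution can push the average speedup above the processor count, and does not reproduce the paper's models.

```python
import random

def sequential_steps(leaves):
    """Leaves examined left to right until a solution is found."""
    for i, is_solution in enumerate(leaves, 1):
        if is_solution:
            return i
    return len(leaves)

def parallel_steps(leaves, p):
    """p processors sweep contiguous slices of leaves in lockstep."""
    n = len(leaves)
    chunk = (n + p - 1) // p
    step = 0
    while True:
        for q in range(p):
            idx = q * chunk + step
            if idx < min((q + 1) * chunk, n) and leaves[idx]:
                return step + 1
        step += 1

random.seed(0)
n, p, trials = 10_000, 16, 200
speedups = []
for _ in range(trials):
    leaves = [False] * n
    leaves[random.randint(n // 2, n - 1)] = True   # nonuniform: solution in the right half
    speedups.append(sequential_steps(leaves) / parallel_steps(leaves, p))
print(sum(speedups) / trials)   # average speedup well above p = 16 in this toy setting
```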

Journal ArticleDOI
TL;DR: An efficient and robust image reconstruction algorithm for static impedance imaging using Hachtel's augmented matrix method was developed and it was shown that the parallel computation could reduce the computation time from hours to minutes.
Abstract: An efficient and robust image reconstruction algorithm for static impedance imaging using Hachtel's augmented matrix method was developed. This improved Newton-Raphson method produced more accurate images by reducing the undesirable effects of the ill-conditioned Hessian matrix. It is demonstrated that the electrical impedance tomography (EIT) system could produce two-dimensional static images from a physical phantom with 7% spatial resolution at the center and 5% at the periphery. Static EIT image reconstruction requires a large amount of computation. In order to overcome the limitations on reducing the computation time by algorithmic approaches, the improved Newton-Raphson algorithm was implemented on a parallel computer system. It is shown that the parallel computation could reduce the computation time from hours to minutes.