
Showing papers on "Parallel processing (DSP implementation)" published in 1991


Journal ArticleDOI
01 Sep 1991
TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers that mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.
Abstract: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers. These consist of five "parallel kernel" benchmarks and three "simulated application" benchmarks. Together they mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications. The principal distinguishing feature of these benchmarks is their "pencil and paper" specification: all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional benchmarking approaches on highly parallel systems are avoided.

2,246 citations


Journal ArticleDOI
TL;DR: The design, application, and evaluation of parallel processing to the high-speed volumetric ultrasound imaging system, which uses pulse-echo phased array principles to steer a 2-D array transducer of 289 elements in a pyramidal scan format is described.
Abstract: For pt.I see ibid., vol.38, no.2, p.100-8 (1991). The authors describe the design, application, and evaluation of parallel processing to the high-speed volumetric ultrasound imaging system. The scanner produces images analogous to an optical camera or the human eye and supplies more information than conventional sonograms. Potential medical applications include improved anatomic visualization, tumor localization, and better assessment of cardiac function. The system uses pulse-echo phased array principles to steer a 2-D array transducer of 289 elements in a pyramidal scan format. Parallel processing in the receive mode produces 4992 scan lines at a rate of approximately 8 frames/s. Echo data for the scanned volume is presented online as projection images with depth perspective, stereoscopic pairs, or multiple tomographic images. The authors also describe the techniques developed for the online display of volumetric images on a conventional CRT oscilloscope and show preliminary volumetric images for each display mode.

433 citations


Journal ArticleDOI
TL;DR: The authors present lazy task creation, a method that combines parallel tasks dynamically at runtime in Mul-T, a parallel implementation of Scheme, and favor it over the simpler load-based inlining method.
Abstract: When a parallel algorithm is written naturally, the resulting program often produces tasks of a finer grain than an implementation can exploit efficiently. Two solutions to the granularity problem that combine parallel tasks dynamically at runtime are discussed. The simpler load-based inlining method, in which tasks are combined based on dynamic load level, is rejected in favor of the safer and more robust lazy task creation method, in which tasks are created only retroactively as processing resources become available. The strategies grew out of work on Mul-T, an efficient parallel implementation of Scheme, but could be used with other languages as well. Mul-T implementations of lazy task creation are described for two contrasting machines, and performance statistics that show the method's effectiveness are presented. Lazy task creation is shown to allow efficient execution of naturally expressed algorithms of a substantially finer grain than possible with previous parallel Lisp systems.

300 citations


Journal ArticleDOI
TL;DR: Analysis of the olfactory system using a combination of physiological measurements and computational approaches might elucidate the principles by which odors are discriminated.

206 citations



Journal ArticleDOI
TL;DR: An approach to connectionist natural language processing is proposed, which is based on hierarchically organized modular parallel distributed processing (PDP) networks and a central lexicon of distributed input/output representations.

159 citations


01 Jul 1991
TL;DR: This thesis presents a feed-forward algorithm, called splatting, that directly renders rectilinear volume meshes, a naturally parallel algorithm that adheres well to the requirements imposed by signal processing theory.
Abstract: Volume rendering is the generation of images from discrete samples of volume data. The volume data is sampled in at least three dimensions and comes in three basic classes: the rectilinear mesh (for example, a stack of computed tomography scans); the curvilinear mesh (for example, computational fluid dynamic data sets of the flow of air over an airplane wing); and the unstructured mesh (for example, a collection of ozone density readings at multiple elevations from a set of collection stations in the United States). Previous methods coerced the volumetric data into line and surface primitives that were viewed on conventional computer graphics displays. This coercion process has two fundamental flaws: viewers are never sure whether they are viewing a feature of the data or an artifact of the coercion process; and the insertion of a geometric modeling procedure into the middle of the display pipeline hampers interactive viewing. New direct rendering approaches that operate on the original data are replacing coercion approaches. These new methods, which avoid the artifacts introduced by conventional graphics primitives, fall into two basic categories: feed-backward methods that attempt to map the image plane onto the data, and feed-forward methods that attempt to map each volume element onto the image plane. This thesis presents a feed-forward algorithm, called splatting, that directly renders rectilinear volume meshes. The method achieves interactive speed through parallel execution, successive refinement, table-driven shading, and table-driven filtering. The method achieves high image quality by paying careful attention to signal processing principles during the process of reconstructing a continuous volume from the sampled input. This thesis' major contribution to computer graphics is the splatting algorithm. It is a naturally parallel algorithm that adheres well to the requirements imposed by signal processing theory. The algorithm has uncommon features. First, it can render volumes as either clouds or surfaces by changing the shading functions. Second, it can smoothly trade rendering time for image quality at several stages of the rendering pipeline. In addition, this thesis presents a theoretical framework for volume rendering.

140 citations


Proceedings ArticleDOI
01 Dec 1991
TL;DR: A hardware/software design is presented that logs the order of memory references so that execution can be replayed under hardware and software control, and that represents several orders of magnitude improvement in both performance and log size over purely software-based methods proposed previously.
Abstract: Shared-memory parallel programs can be highly non-deterministic due to the unpredictable order in which shared references are satisfied. However, deterministic execution is extremely important for debugging and can also be used for fault-tolerance and other replay-based algorithms. We present a hardware/software design that allows the order of memory references to be logged from the traffic between memory and the CPUs. This log can then be used along with hardware and software control to replay execution. Simulation of several parallel programs shows that our device records no more than 1.17 MB/second for an application exhibiting fine-grained sharing behavior on a 16-way multiprocessor consisting of 12 MIP CPUs. In addition, no probe effect or performance degradation is introduced. This represents several orders of magnitude improvement in both performance and log size over purely software-based methods proposed previously.

115 citations


Patent
Takashi Kan
07 May 1991
TL;DR: A SIMD type parallel data processing unit (50) and a MIMD type parallel data processing unit (51) are connected to each other by a common bus (41) and a memory (42), and a system controller (43) is provided to allow each of the parallel data processing units to perform the processing that suits it.
Abstract: There are SIMD type parallel data processing systems having a single instruction stream and multiple data streams and MIMD type parallel data processing systems having multiple instruction and data streams in the parallel data processing field for performing high-speed data processing. They have both merits and demerits and each have their suitable application fields. Because of this, it is extremely difficult to cover a wide range of application fields with either one of the systems. Then, a SIMD type parallel processing unit (50) and a MIMD type parallel data processing unit (51) are connected to each other by a common bus (41) and a memory (42), and a system controller (43) is provided to allow each of the parallel data processing units to perform its suitable processings, thus making it possible to apply the optimum parallel processing system to a wide range of application fields. That is, simple processings of a large volume of data are allocated to the SIMD type parallel data processing unit, while complex processings of a small volume of data are allocated to the MIMD type parallel data processing unit, whereby processings which have been difficult for a conventional computer to accomplish within an effective time, such as large-scale and complex processings of images, can be performed within a practical time at a high speed.

113 citations


Journal ArticleDOI
TL;DR: In this article, a parallel simulated annealing algorithm that is problem-independent, maintains the serial decision sequence, and obtains speedup which can exceed $\log_2 P$ on P processors is discussed.
Abstract: A parallel simulated annealing algorithm that is problem-independent, maintains the serial decision sequence, and obtains speedup which can exceed $\log_2 P$ on P processors is discussed. The algorithm achieves parallelism by using the concurrency technique of speculative computation. Implementation of the parallel algorithm on a hypercube multiprocessor and application to a task assignment problem are described. The simulated annealing solutions are shown to be, on average, 28% better than the solutions produced by a random task assignment algorithm and 2% better than the solutions produced by a heuristic.

112 citations


Journal ArticleDOI
TL;DR: The authors show that for SPECT imaging on 64×64 image grids, a single-instruction, multiple-data (SIMD) distributed array processor containing $64^2$ processors performs the expectation-maximization (EM) algorithm with Good's smoothing at a rate of 1 iteration per 1.5 s, which promises fully Bayesian emission tomography reconstructions, including regularization, in clinical computation times on the order of 1 min/slice.
Abstract: Extending the work of A.W. McCarthy et al. (1988) and M.I. Miller and B. Roysam (1991), the authors demonstrate that a fully parallel implementation of the maximum-likelihood method for single-photon emission computed tomography (SPECT) can be accomplished in clinical time frames on massively parallel systolic array processors. The authors show that for SPECT imaging on 64×64 image grids, with 96 view angles, the single-instruction, multiple-data (SIMD) distributed array processor containing $64^2$ processors performs the expectation-maximization (EM) algorithm with Good's smoothing at a rate of 1 iteration per 1.5 s. This promises fully Bayesian reconstructions for emission tomography, including regularization, in clinical computation times on the order of 1 min/slice. The most important result of the implementations is that the scaling rules for computation times are roughly linear in the number of processors.
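
The EM iteration named in this abstract can be illustrated with a minimal sketch. It assumes the standard ML-EM update for emission tomography, x_j <- (x_j / s_j) * sum_i a_ij * y_i / (A x)_i, and leaves out Good's smoothing and the SIMD mapping entirely; the function name `mlem_step` and the tiny list-of-lists system matrix are hypothetical, not taken from the paper.

```python
def mlem_step(system, counts, image):
    """One ML-EM update: x_j <- x_j / s_j * sum_i a_ij * y_i / (A x)_i.

    system[i][j] is the (assumed) weight of pixel j in detector bin i,
    counts[i] is the measured count in bin i, image[j] the current estimate.
    """
    n_bins, n_pix = len(system), len(image)
    # Forward projection: expected counts in each detector bin.
    expected = [sum(system[i][j] * image[j] for j in range(n_pix))
                for i in range(n_bins)]
    # Ratio of measured to expected counts (bins with no expected counts give 0).
    ratio = [counts[i] / expected[i] if expected[i] > 0 else 0.0
             for i in range(n_bins)]
    updated = []
    for j in range(n_pix):  # each pixel update is independent of the others
        sensitivity = sum(system[i][j] for i in range(n_bins))
        back = sum(system[i][j] * ratio[i] for i in range(n_bins))
        updated.append(image[j] * back / sensitivity if sensitivity > 0 else 0.0)
    return updated
```

Because every pixel update in the final loop depends only on the shared ratio vector, the loop body is the kind of work that could plausibly be assigned one pixel per processor on an array machine.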

Journal ArticleDOI
TL;DR: Experimentation aimed at determining the potential benefit of mixed-mode SIMD/MIMD parallel architectures is reported, based on timing measurements made on the PASM system prototype at Purdue utilizing carefully coded synthetic variations of a well-known algorithm.

Journal ArticleDOI
TL;DR: The software design of MARS and its implementation as a practical system for large-scale information management are described.
Abstract: The Medical ARchival System (MARS) is an information retrieval system utilizing distributed parallel processing. It features a modular design, machine independence, and a Boolean query interface, based in a UNIX environment. Developed at the University of Pittsburgh in response to the information needs of a large academic health center, MARS integrates textual data from a wide variety of sources to create a single, comprehensive medical records information system. It currently contains 850,000 medical reports, 2,500,000 medical references, and 500,000,000 indexed words. This paper describes the software design of MARS and its implementation as a practical system for large-scale information management.

Patent
02 Aug 1991
TL;DR: In this paper, a system consisting of a plurality of processing boards having a substantially similar architecture is presented, which includes several frame grabber/frame storage processing boards, each of which digitizes the analog video signals from the video cameras and stores the digital data in a solid state buffer memory.
Abstract: A system which retrofits to an existing surveillance system and cooperates with sensors, video cameras and video monitors of the existing surveillance system. The system comprises a plurality of processing boards having a substantially similar architecture. Several frame grabber/frame storage processing boards are provided, each of which digitizes the analog video signals from the video cameras and stores the digital data in a solid state buffer memory. Several display boards are provided to display the digitized video data on display monitors. A controller board controls the exchange of video data and command or control messages over a video link and between processing boards, respectively. Additional expansion boards may be added to support additional buffering and system options. Each processing board is built around a parallel processing computer chip.

Journal ArticleDOI
TL;DR: Three examples of star-coupled structures are introduced, one of which exhibits optical self-routing, and the complexity of the communication subsystem is reduced since intermediate buffering and routing of packets are eliminated.
Abstract: A multiple-instruction multiple-data (MIMD) distributed memory parallel computer system environment is considered. Media access control protocols that maintain good performance with high capacity optical channels are investigated. Three examples of star-coupled structures are introduced, one of which exhibits optical self-routing. Self-routing single-step optically interconnected communication structures can be designed through the incorporation of agile laser diode sources and wavelength tunable optical filters in a wavelength-division multiple-access environment. Intermediary latencies typical of MIMD distributed memory systems are eliminated. The degree and diameter of the resulting structures are dramatically reduced, and the complexity of the communication subsystem is reduced since intermediate buffering and routing of packets are eliminated.

Journal ArticleDOI
TL;DR: The syntax and semantics of the DINO language are described, examples of DINO programs are given, a critique of the DINO language features is presented, and the performance of code generated by the DINO compiler is discussed.

Journal ArticleDOI
TL;DR: It is shown that high performance efficiencies are attainable for multigrid on moderately parallel machines but not on massively parallel computers, as indicated by an example of poor efficiency on 65,536 processors, and that parallel machines open the possibility of finding really new approaches to solving standard problems.
Abstract: Multigrid methods have been established as being among the most efficient techniques for solving complex elliptic equations. We sketch the multigrid idea, emphasizing that a multigrid solution is generally obtainable in a time directly proportional to the number of unknown variables on serial computers. Despite this, even the most powerful serial computers are not adequate for solving the very large systems generated, for instance, by discretization of fluid flow in three dimensions. A breakthrough can be achieved here only by highly parallel supercomputers. On the other hand, parallel computers are having a profound impact on computational science. Recently, highly parallel machines have taken the lead as the fastest supercomputers, a trend that is likely to accelerate in the future. We describe some of these new computers, and issues involved in using them. We describe standard parallel multigrid algorithms and discuss the question of how to implement them efficiently on parallel machines. The natural approach is to use grid partitioning. One intrinsic feature of a parallel machine is the need to perform interprocessor communication. It is important to ensure that time spent on such communication is maintained at a small fraction of computation time. We analyze standard parallel multigrid algorithms in two and three dimensions from this point of view, indicating that high performance efficiencies are attainable under suitable conditions on moderately parallel machines. We also demonstrate that such performance is not attainable for multigrid on massively parallel computers, as indicated by an example of poor efficiency on 65,536 processors. The fundamental difficulty is the inability to keep 65,536 processors busy when operating on very coarse grids. This example indicates that the straightforward parallelization of multigrid (and other) algorithms may not always be optimal. However, parallel machines open the possibility of finding really new approaches to solving standard problems. In particular, we present an intrinsically parallel variant of standard multigrid. This “PSMG” (parallel superconvergent multigrid) method allows all processors to be used at all times, even when processing on the coarsest grid levels. The sequential version of this method is not a sensible algorithm.
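
The coarse-grid difficulty described in this abstract can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not the paper's analysis: it assumes a square two-dimensional processor array with one grid point per processor on the finest level and standard coarsening that halves the grid in each dimension; the function name `busy_fractions` is hypothetical.

```python
def busy_fractions(processors_per_side, levels):
    """Fraction of a square processor array that has work on each grid level.

    Assumes one grid point per processor on the finest level and a grid
    that halves in each dimension from one level to the next.
    """
    total = processors_per_side ** 2
    fractions = []
    side = processors_per_side
    for _ in range(levels):
        points = side * side
        fractions.append(min(points, total) / total)
        side = max(side // 2, 1)  # coarsen, never below a single point
    return fractions

# 65,536 processors arranged as 256 x 256, first four grid levels.
print(busy_fractions(256, 4))
```

Under these assumptions only a quarter of the processors have a grid point after one coarsening step, and the busy fraction keeps dropping by a factor of four per level.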

01 Aug 1991
TL;DR: This is a description of and a user's manual for upshot, an X-based graphics tool for viewing log files produced by parallel programs.
Abstract: This is a description of and a user's manual for upshot, an X-based graphics tool for viewing log files produced by parallel programs.

Journal ArticleDOI
TL;DR: In this paper, a diagnostic for distinguishing between serial and parallel processing in visual search is proposed, which is based on testing for subadditive effects of a within-trial visual quality manipulation on target-absent trials.
Abstract: The authors propose a diagnostic for distinguishing between serial and parallel processing in visual search; it is based on testing for subadditive effects of a within-trial visual quality manipulation on target-absent trials. It was evaluated in 2 experiments wherein parallel and serial processing might be expected on the basis of previous work and was then applied to a more uncertain situation in a third experiment. The diagnostic indicates parallel processing of stimuli that differ from each other on a featural basis (Xs and Os) and canonical letters that differ in line arrangement (Ts and Ls) but serial processing when Ts and Ls are randomly rotated. These results form a coherent pattern that is understandable in terms of the literature on visual search, and thus they suggest that the diagnostic may be a useful addition to the methodology used to distinguish between serial and parallel processes.

Patent
Fumio Nagasaka
05 Apr 1991
TL;DR: The rasterize processing for obtaining printing picture element information from a source file described in a page-description language is distributed across a plurality of information processing units (6a, 6b, 6c) loosely connected via a network.
Abstract: The rasterize processing for obtaining printing picture element information from a source file described in a page-description language is distributed across a plurality of information processing units (6a, 6b, 6c) loosely connected via a network (7). In the information processing unit (6a) which generates a printing request, a client process (210) converts a source file (19) into an intermediate code file (10) and further divides the intermediate code file into a plurality of partial files that can be rasterized independently. A part of these plural partial files is given to a rasterizer (212) of the information processing unit (6a) which generates the printing request, so as to be rasterized into picture element information. The remaining part of the plural partial files is distributed to the other information processing units (6b, 6c) via the network. In each of these other information processing units (6b, 6c), the distributed partial file is received by a server process (211) and transmitted to the rasterizer (212) to form partial picture element information. The partial picture element information formed by these other information processing units (6b, 6c) is returned to the information processing unit (6a) which generates the printing request. In this information processing unit (6a), the client process (210) combines the picture element information returned from the other information processing units (6b, 6c) with the picture element information formed by the rasterizer (212) of its own unit, to form the entire picture element information. The entire picture element information is transmitted to a printing unit (21).

Patent
06 Dec 1991
TL;DR: In this article, a massively parallel processor includes an array of processor elements (20), or PEs, and a multi-stage router interconnection network (30), which is used both for I/O communications and for simultaneous PE to PE communications.
Abstract: A massively parallel processor includes an array of processor elements (20), or PEs, and a multi-stage router interconnection network (30), which is used both for I/O communications and for simultaneous PE to PE communications. The I/O system (10) for the massively parallel processor is based on a globally shared addressable I/O RAM buffer memory (50) that has parallel address and data buses (52) to the I/O devices (80, 82) and other parallel address and data buses (42) which are coupled to a router I/O element array (40). The router I/O element array is in turn coupled to the bit-serial router ports (e.g. S2 ^_O ^_XO) of the second stage (430) of the router interconnection network. The router I/O array provides the corner turn conversion between the massive array of bit-serial router lines (32) and the relatively few parallel buses (52) to the I/O devices.

Book
02 Jan 1991
TL;DR: "Advances in Languages and Compilers for Parallel Processing" discusses languages and language extensions, presents two innovative environments for parallel programming, describes techniques for debugging parallel programs, and takes up the important issue of data organization and management during parallel processing.
Abstract: These twenty-three contributions represent some of the best research on software for parallel computers being done in universities and industry today. "Advances in Languages and Compilers for Parallel Processing" discusses languages and language extensions, presents two innovative environments for parallel programming, describes techniques for debugging parallel programs, and takes up the important issue of data organization and management during parallel processing. New compiler techniques for parallelizing loops are covered as are new results in code scheduling and new approaches to dependency analysis and representation. The book concludes with an interesting insight into the measurement of parallelism implicit in ordinary programs and methods for dealing with programming and compiling for distributed and shared memory multiprocessors.

Patent
Fumihiko Saitoh
31 Jul 1991
TL;DR: In this article, a character recognition system and method using the generalized Hough transform (GHT) are disclosed, in which a template table which stores edge point parameters to be used for the GHT is compressed so as to include only predetermined parameters, and is then divided into a plurality of template tables which are respectively loaded in the memories of a plurality of subprocessors operating in parallel under the control of a main processor.
Abstract: A character recognition system and method using the generalized Hough transform are disclosed. A template table which stores edge point parameters to be used for the generalized Hough transform is compressed so as to include only predetermined parameters, and is then divided into a plurality of template tables which are respectively loaded in the memories of a plurality of subprocessors operating in parallel under the control of a main processor. In performing recognition processing, these subprocessors operate in parallel according to their related partial template tables. Character recognition using the generalized Hough transform provides a high rate of character recognition. Also, parallel processing using the compressed template tables and partial template tables helps shorten table search time and computation time, thereby increasing processing efficiency.

Journal ArticleDOI
TL;DR: The main thrust is to explore the match between the algorithms, their implementation, and the machine architectures, and to present various considerations together with the results.
Abstract: Both the very dishonest Newton (VDHN) and the successive over-relaxed (SOR) Newton algorithms have been implemented on the iPSC/2 and Alliant FX/8 computers for power system dynamic simulation using complex generator and nonlinear load models. The main thrust is to explore the match between the algorithms, their implementation, and the machine architectures. For example, the less parallel but sequentially faster VDHN runs faster on the hypercube (iPSC/2) whereas the more parallel SOR-Newton requires data sharing more often because of the extra iterations and does better on the Alliant. The implementation on the hypercube requires significant manual programming to schedule the processors and their communication whereas the compiler in the Alliant recognizes parallel steps but only if the software is properly coded. The authors present these various considerations together with the results.

Journal ArticleDOI
TL;DR: Efficient parallel simulations are given for a variety of queueing networks having a global first come first served structure, and the problem of simulating the arrival and departure times for the first N jobs to a single G/G/1 queue is solved in time proportional to N/P + log P using P processors.
Abstract: New methods are presented for parallel simulation of discrete event systems that, when applicable, can usefully employ a number of processors much larger than the number of objects in the system being simulated. Abandoning the distributed event list approach, the simulation problem is posed using recurrence relations. We bring three algorithmic ideas to bear on parallel simulation: parallel prefix computation, parallel merging, and iterative folding. Efficient parallel simulations are given for (in turn) the G/G/1 queue, a variety of queueing networks having a global first come first served structure (e.g., a series of queues with finite buffers), acyclic networks of queues, and networks of queues with feedbacks and cycles. In particular, the problem of simulating the arrival and departure times for the first N jobs to a single G/G/1 queue is solved in time proportional to N/P + log P using P processors.
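
The recurrence-relation view in this abstract can be illustrated for a single first come first served queue. The sketch below is not the authors' algorithm. It assumes the standard departure recurrence D_n = max(A_n, D_{n-1}) + S_n (A: arrival times, S: service times, D: departure times) and rewrites it as a prefix sum followed by a prefix maximum, D_n = P_n + max over k <= n of (A_k - P_{k-1}) with P_n = S_1 + ... + S_n. Both steps are scans, which is the kind of operation a parallel prefix routine evaluates; here `itertools.accumulate` stands in for that routine, and the function names are hypothetical.

```python
from itertools import accumulate

def departures_sequential(arrivals, services):
    """Reference: D_n = max(A_n, D_{n-1}) + S_n, evaluated one job at a time."""
    out, prev = [], float("-inf")  # -inf models an initially empty queue
    for a, s in zip(arrivals, services):
        prev = max(a, prev) + s
        out.append(prev)
    return out

def departures_by_scans(arrivals, services):
    """Same departures from two scans: D_n = P_n + max_{k<=n}(A_k - P_{k-1})."""
    prefix = list(accumulate(services))      # P_n = S_1 + ... + S_n
    shifted = [0] + prefix[:-1]              # P_{n-1}, with P_0 = 0
    slack = [a - p for a, p in zip(arrivals, shifted)]
    running_max = accumulate(slack, max)     # prefix maximum of A_k - P_{k-1}
    return [p + m for p, m in zip(prefix, running_max)]

arrivals = [0, 1, 2, 10]
services = [3, 3, 3, 1]
print(departures_by_scans(arrivals, services))
```

The two functions agree on any input; the second form matters because neither scan has the job-to-job dependency that forces the first form to run serially.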

Proceedings ArticleDOI
01 Dec 1991
TL;DR: Results show that prefetching can be implemented efficiently even for the more complex parallel file access patterns, and the ability of these policies across a range of architectural parameters is tested.
Abstract: Improvements in the processing speed of multiprocessors are outpacing improvements in the speed of disk hardware. Parallel disk I/O subsystems have been proposed as one way to close the gap between processor and disk speeds. In a previous paper (see IEEE Trans. on Parallel and Distributed Systems, vol.1, no.2, p.218-30, 1990) the authors showed that prefetching and caching have the potential to deliver the performance benefits of parallel file systems to parallel applications. They describe experiments with practical prefetching policies, and show that prefetching can be implemented efficiently even for the more complex parallel file access patterns. They also test the ability of these policies across a range of architectural parameters.


Patent
29 Apr 1991
TL;DR: In this article, a parallel processing computer system for clustering data points in continuous feature space by adaptively separating classes of patterns is presented, which is based upon the gaps between successive data values within single features.
Abstract: A parallel processing computer system for clustering data points in continuous feature space by adaptively separating classes of patterns. The preferred embodiment for this massively parallel system includes preferably one computer processor per feature and requires a single a priori assumption of central tendency in the distributions defining the pattern classes. It advantageously exploits the presence of noise inherent in the data gathering to not only classify data points into clusters, but also measure the certainty of the classification for each data point, thereby identifying outliers and spurious data points. The system taught by the present invention is based upon the gaps between successive data values within single features. This single feature discrimination aspect is achieved by applying a minimax comparison involving gap lengths and locations of the largest and smallest gaps. Clustering may be performed in near-real-time on huge data spaces having unlimited numbers of features.


Patent
31 May 1991
TL;DR: In this article, a vector signal processor for concurrent, parallel processing of complex vectors is described. The principal processing units are an execution unit, a data movement unit, a control/register unit, a vector buffer unit, an instruction fetch unit, and a bus interface unit.
Abstract: Multiple special purpose processing units are provided in a vector signal processor for concurrent, parallel processing, particularly of complex vectors. The principal processing units are an execution unit, a data movement unit, a control/register unit, a vector buffer unit, an instruction fetch unit, and a bus interface unit.