
Showing papers on "Parallel processing (DSP implementation)" published in 1988


Journal ArticleDOI
TL;DR: In this article, a class of information-processing systems called cellular neural networks (CNNs) is proposed: a massive aggregate of regularly spaced circuit clones, called cells, which communicate with each other directly only through their nearest neighbors.
Abstract: A novel class of information-processing systems called cellular neural networks is proposed. Like neural networks, they are large-scale nonlinear analog circuits that process signals in real time. Like cellular automata, they consist of a massive aggregate of regularly spaced circuit clones, called cells, which communicate with each other directly only through their nearest neighbors. Each cell is made of a linear capacitor, a nonlinear voltage-controlled current source, and a few resistive linear circuit elements. Cellular neural networks share the best features of both worlds: their continuous-time feature allows real-time signal processing, and their local interconnection feature makes them particularly adapted for VLSI implementation. Cellular neural networks are uniquely suited for high-speed parallel signal processing.
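The cell dynamics described above can be sketched numerically. The following is a minimal sketch, assuming hypothetical 3x3 feedback (A) and control (B) templates rather than any template from the paper; with center-only self-feedback greater than 1, each cell's state saturates and the network settles into a binary output pattern:

```python
import numpy as np

def cnn_output(x):
    # Piecewise-linear output: y = 0.5 * (|x + 1| - |x - 1|), saturating in [-1, 1]
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def conv3(img, kernel):
    # 3x3 neighborhood sum with zero padding: each cell sees only its nearest neighbors
    p = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    m, n = img.shape
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * p[di:di + m, dj:dj + n]
    return out

def cnn_run(u, A, B, bias=0.0, dt=0.05, steps=400):
    # Euler integration of the cell state equation
    #   dx/dt = -x + (A * y) + (B * u) + bias
    # where * denotes the local 3x3 template operation.
    x = np.zeros_like(u, dtype=float)
    for _ in range(steps):
        x = x + dt * (-x + conv3(cnn_output(x), A) + conv3(u, B) + bias)
    return cnn_output(x)

# Hypothetical templates: center self-feedback of 2 makes each cell bistable,
# so the network settles to the sign of its input (a thresholding operation).
A = np.array([[0, 0, 0], [0, 2, 0], [0, 0, 0]], dtype=float)
B = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
u = np.array([[0.5, -0.3], [-0.8, 0.2]])
y = cnn_run(u, A, B)
```

With nonzero off-center entries in A or B, neighboring cells interact, which is how image-processing templates (edge detection, noise removal) are built on this architecture.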

4,583 citations


Journal ArticleDOI
TL;DR: Examples of cellular neural networks which can be designed to recognize the key features of Chinese characters are presented and their applications to such areas as image processing and pattern recognition are demonstrated.
Abstract: The theory of a novel class of information-processing systems, called cellular neural networks, which are capable of high-speed parallel signal processing, was presented in a previous paper (see ibid., vol.35, no.10, p.1257-72, 1988). A dynamic route approach for analyzing the local dynamics of this class of neural circuits is used to steer the system trajectories into various stable equilibrium configurations which map onto binary patterns to be recognized. Some applications of cellular neural networks to such areas as image processing and pattern recognition are demonstrated, albeit with only a crude circuit. In particular, examples of cellular neural networks which can be designed to recognize the key features of Chinese characters are presented.

2,332 citations


Journal ArticleDOI
TL;DR: A status report on the architecture and programming of a family of concurrent computers that are organized as ensembles of small programmable computers called nodes, connected by a message-passing network, each with its own private memory is provided in this article.
Abstract: A status report is provided on the architecture and programming of a family of concurrent computers that are organized as ensembles of small programmable computers called nodes, connected by a message-passing network, each with its own private memory. The architecture of the multicomputer is described and contrasted with that of the shared-memory multiprocessor, and the concept of grain size (which depends on the size of the individual memories) is explained. Medium-grain and fine-grain multicomputers, with nodes containing megabytes and tens of kilobytes of memory, respectively, are examined, and their programming is discussed.

532 citations


Journal ArticleDOI
TL;DR: Hierarchical network structures are developed that have the property that the optimal global estimate based on all the available information can be reconstructed from estimates computed by local processor nodes solely on the basis of their own local information and transmitted to a central processor.
Abstract: Various multisensor network scenarios with signal processing tasks that are amenable to multiprocessor implementation are described. The natural origins of such multitasking are emphasized, and novel parallel structures for state estimation using the Kalman filter are proposed that extend existing results in several directions. In particular, hierarchical network structures are developed that have the property that the optimal global estimate based on all the available information can be reconstructed from estimates computed by local processor nodes solely on the basis of their own local information and transmitted to a central processor. The algorithms potentially yield an approximately linear speedup rate, are reasonably failure-resistant, and are optimized with respect to communication bandwidth and memory requirements at the various processors.

482 citations


Journal ArticleDOI
TL;DR: The scaled-problem paradigm better reveals the capabilities of large ensembles, and permits detection of subtle hardware-induced load imbalances that may become increasingly important as parallel processors increase in node count.
Abstract: We have developed highly efficient parallel solutions for three practical, full-scale scientific problems: wave mechanics, fluid dynamics, and structural analysis. Several algorithmic techniques are used to keep communication and serial overhead small as both problem size and number of processors are varied. A new parameter, operation efficiency, is introduced that quantifies the tradeoff between communication and redundant computation. A 1024-processor MIMD ensemble is measured to be 502 to 637 times as fast as a single processor when problem size for the ensemble is fixed, and 1009 to 1020 times as fast as a single processor when problem size per processor is fixed. The latter measure, denoted scaled speedup, is developed and contrasted with the traditional measure of parallel speedup. The scaled-problem paradigm better reveals the capabilities of large ensembles, and permits detection of subtle hardware-induced load imbalances (such as error correction and data-dependent MFLOPS rates) that may become increasingly important as parallel processors increase in node count. Sustained performance for the applications is 70 to 130 MFLOPS, validating the massively parallel ensemble approach as a practical alternative to more conventional processing methods. The techniques presented appear extensible to even higher levels of parallelism than the 1024-processor level explored here.
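The contrast between fixed-size and scaled speedup can be illustrated with the standard Amdahl and Gustafson formulas; the serial fraction of 0.1% below is an assumption chosen for illustration, not a figure from the paper:

```python
def fixed_size_speedup(p, serial_frac):
    # Amdahl's law: the problem stays the same size, so the serial
    # part bounds the speedup as p grows.
    return 1.0 / (serial_frac + (1.0 - serial_frac) / p)

def scaled_speedup(p, serial_frac):
    # Scaled (Gustafson) speedup: the parallel part of the workload
    # grows with p while the serial part stays fixed.
    return serial_frac + (1.0 - serial_frac) * p

p = 1024
s = 0.001  # assumed serial fraction of 0.1%
fixed = fixed_size_speedup(p, s)   # roughly 506x
scaled = scaled_speedup(p, s)      # roughly 1023x
```

With this assumed serial fraction, the two formulas land in the same ranges as the measured 502-637x (fixed size) and 1009-1020x (scaled) results quoted above, which is why the two measures paint such different pictures of a 1024-node ensemble.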

433 citations


Journal ArticleDOI
01 Jun 1988
TL;DR: A parallel algorithm for the rasterization of polygons is presented that is particularly well suited for 3D Z-buffered graphics implementations; the value of each edge function can be interpolated with hardware similar to that required to interpolate color and Z pixel values.
Abstract: A parallel algorithm for the rasterization of polygons is presented that is particularly well suited for 3D Z-buffered graphics implementations. The algorithm represents each edge of a polygon by a linear edge function that has a value greater than zero on one side of the edge and less than zero on the opposite side. The value of the function can be interpolated with hardware similar to that required to interpolate color and Z pixel values. In addition, the edge functions of adjacent pixels may be easily computed in parallel. The coefficients of the edge function can be computed from floating-point endpoints in such a way that sub-pixel precision of the endpoints can be retained in an elegant way.
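The edge-function test can be sketched in a few lines; the function and variable names here are illustrative, not taken from the paper:

```python
def edge(ax, ay, bx, by, px, py):
    # Linear edge function for the directed edge (ax,ay) -> (bx,by):
    # positive on one side, negative on the other, zero on the edge.
    # Stepping one pixel in x or y changes the value by a constant,
    # so adjacent pixels can be evaluated incrementally and in parallel.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def covers(v0, v1, v2, p):
    # A pixel is covered when all three edge functions are non-negative
    # (counter-clockwise vertex order assumed).
    return (edge(*v0, *v1, *p) >= 0 and
            edge(*v1, *v2, *p) >= 0 and
            edge(*v2, *v0, *p) >= 0)

# Counter-clockwise triangle with floating-point vertices; sub-pixel
# precision of the endpoints carries directly into the coefficients.
v0, v1, v2 = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)
```

Because each pixel's test is independent, a block of pixels can evaluate all three edge functions at once, which is the parallelism the paper exploits.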

259 citations


Patent
04 Nov 1988
TL;DR: The crossbar switch as discussed by the authors connects coarse-grain processing elements to a plurality of memory modules, providing a parallel processing system free of memory conflicts over a wide range of arithmetic computations (i.e. scalar, vector and matrix).
Abstract: A crossbar switch which connects N (N=2^k; k=0, 1, 2, 3) coarse grain processing elements (rated at 20 million floating point operations per second) to a plurality of memories provides for a parallel processing system free of memory conflicts over a wide range of arithmetic computations (i.e. scalar, vector and matrix). The configuration of the crossbar switch, i.e., the connection between each processing element unit and each parallel memory module, may be changed dynamically on a cycle-by-cycle basis in accordance with the requirements of the algorithm under execution. Although there are certain crossbar usage rules which must be obeyed, the data is mapped over parallel memory such that the processing element units can access and operate on input streams of data in a highly parallel fashion with an effective memory transfer rate and computational throughput power comparable in performance to present-day supercomputers. The crossbar switch is comprised of two basic sections: a multiplexer and a control section. The multiplexer provides the actual switching of signal paths, i.e. connects each processing element unit to a particular parallel memory on each clock cycle (100 nsec). The control section determines which connections are made on each clock cycle in accordance with the algorithm under execution. Selectable pipelined delay in the control section provides for optimal data transfer efficiency between the processors and memory modules over a wide range of array processing algorithms. The crossbar switch also provides for graceful system degradation in computational throughput power without the need to download a new program.

176 citations


11 Jul 1988
TL;DR: The application of the connectionist framework to problems of cognitive development is considered, and a network that learns to anticipate which side of a balance beam will go down is illustrated, based on the number of weights on each side of the fulcrum and their distance from the fulcrum.
Abstract: This paper provides a brief overview of the connectionist or parallel distributed processing framework for modeling cognitive processes, and considers the application of the connectionist framework to problems of cognitive development. Several aspects of cognitive development might result from the process of learning as it occurs in multi-layer networks. This learning process has the characteristic that it reduces the discrepancy between expected and observed events. As it does this, representations develop on hidden units which dramatically change both the way in which the network represents the environment from which it learns and the expectations that the network generates about environmental events. The learning process exhibits relatively abrupt transitions corresponding to stage shifts in cognitive development. These points are illustrated using a network that learns to anticipate which side of a balance beam will go down, based on the number of weights on each side of the fulcrum and their distance from the fulcrum on each side of the beam. The network is trained in an environment in which weight more frequently governs which side will go down. It recapitulates the states of development seen in children, as well as the stage transitions, as it learns to represent weight and distance information. Keywords: Parallel processing; Data processing.

173 citations


Journal ArticleDOI
TL;DR: A bulk arrival M^x/M/c queuing system is used to model a centralized parallel processing system with job splitting, and an expression for the mean job response-time is obtained.
Abstract: A bulk arrival M^x/M/c queuing system is used to model a centralized parallel processing system with job splitting. In such a system, jobs wait in a central queue, which is accessible by all the processors, and are split into independent tasks that can be executed on separate processors. The job response-time consists of three components: queuing delay, service time, and synchronization delay. An expression for the mean job response-time is obtained for this centralized parallel-processing system. Centralized and distributed parallel-processing systems (with and without job-splitting) are considered and their performances compared. Furthermore, the effects of parallelism and overheads due to job-splitting are investigated.
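The synchronization-delay component alone can be illustrated in isolation: a job split into c exponential tasks finishes only when its slowest task does, so its mean parallel service time is H_c/mu rather than 1/mu. This Monte Carlo sketch (an illustration of that one component, not the paper's M^x/M/c analysis) checks the simulation against the closed form:

```python
import random

def mean_parallel_completion(c, mu, trials=200_000, seed=1):
    # Mean completion time of a job split into c independent tasks,
    # each exponential with rate mu; the job ends when the slowest
    # task does, which is the source of synchronization delay.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(mu) for _ in range(c))
    return total / trials

def harmonic_max(c, mu):
    # Closed form: E[max of c iid Exp(mu)] = (1 + 1/2 + ... + 1/c) / mu
    return sum(1.0 / k for k in range(1, c + 1)) / mu
```

For c = 4 and mu = 1 the mean is H_4/mu, about 2.08 times a single task's mean service time: the price of waiting for the slowest of four tasks, even before any queuing delay.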

152 citations


Patent
16 Feb 1988
TL;DR: A parallel associative memory as mentioned in this paper provides a way of recognizing or identifying observed data patterns, where each memory store a plurality of recognition patterns, and the recall pattern is contemporaneously compared to the recognition patterns stored in the memories and an exact or best match recognition pattern is selected.
Abstract: A parallel associative memory provides a way of recognizing or identifying observed data patterns. Each of a plurality of memories stores a plurality of recognition patterns. In response to receipt of a recall pattern to be identified, the recall pattern is contemporaneously compared to the recognition patterns stored in the memories and an exact or best match recognition pattern is selected. In a preferred embodiment, the memories may store multiple data bases each of which includes patterns having different lengths and different radii of attraction. The comparison process is controlled by masks which specify respective portions of the patterns which may include the radii of attraction, bits which must identically match, bits which are ignored, bits which are compared in a bit-wise fashion, and bytes which are compared by multiplication. A correlation is computed and selectively adjusted by the respective radii of attraction. A specified number of the patterns having the best correlation are identified, subject to selected threshold conditions, and sorted according to their respective correlations. The parallel nature of the memory lends itself to a hierarchical organization for increased storage capacity and to parallel processing which increases the speed of the identification or recognition process and thereby allows a broad range of applications. These applications include fast retrieval of exact or inexact data, diagnosis, image processing and speech recognition.
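The masked best-match comparison at the heart of such a memory can be sketched as follows; this is a sequential stand-in for what the patent performs contemporaneously across memories, and the Hamming-distance metric and names are illustrative assumptions:

```python
def best_match(recall, patterns, mask=None):
    # Compare the recall pattern against every stored recognition pattern
    # and return (index, distance) of the closest one. The mask marks the
    # bit positions that participate; masked-out positions are ignored.
    if mask is None:
        mask = [1] * len(recall)
    best_i, best_d = None, None
    for i, pattern in enumerate(patterns):
        d = sum(1 for m, a, b in zip(mask, recall, pattern) if m and a != b)
        if best_d is None or d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

stored = ["1010", "1111", "0001"]
```

A radius of attraction maps naturally onto a threshold on the returned distance: accept the best match only if its distance lies within that pattern's radius, otherwise report no recognition.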

150 citations


Journal ArticleDOI
TL;DR: This work extensively study the relationship between four shared memory models of parallel computation that allow simultaneous read/write access, and proves nontrivial separations and simulation results among them.
Abstract: Shared memory models of parallel computation (e.g., parallel RAMs) that allow simultaneous read/write access are very natural and already widely used for parallel algorithm design. The various models differ from each other in the mechanism by which they resolve write conflicts. To understand the effect of these communication primitives on the power of parallelism, we extensively study the relationship between four such models that appear in the literature, and prove nontrivial separations and simulation results among them.
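The write-conflict mechanisms that distinguish these models can be made concrete. This sketch uses the standard textbook policy names (COMMON, ARBITRARY, PRIORITY), which are assumptions of this illustration rather than terms quoted from the paper, to resolve a set of simultaneous writes to one shared cell:

```python
def crcw_write(writes, policy):
    # writes: list of (processor_id, value) pairs issued in the same step.
    if policy == "common":
        # COMMON: concurrent writers must all write the same value.
        values = {v for _, v in writes}
        if len(values) != 1:
            raise ValueError("COMMON CRCW: conflicting values in one step")
        return values.pop()
    if policy == "arbitrary":
        # ARBITRARY: some single writer succeeds; which one is unspecified
        # (here, whichever happens to be listed first).
        return writes[0][1]
    if policy == "priority":
        # PRIORITY: the lowest-numbered processor wins.
        return min(writes)[1]
    raise ValueError("unknown policy: " + policy)

writes = [(3, "a"), (1, "b"), (2, "c")]
```

An EREW model would reject this step outright, since even concurrent access to a cell is disallowed there; separations in the paper's sense ask how costly it is for a weaker model to simulate a stronger one.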

Journal ArticleDOI
TL;DR: This work advocates the separation of the design of programs for massively parallel machines into two steps which can be verified in a formal way: the construction of a program with implicit parallelism (Γ-program) and its translation into a network of processes.

Book
01 May 1988
TL;DR: The Connection Information Distributor (CID) as mentioned in this paper is an extension of the interactive activation model of word recognition that allows simultaneous processing of several patterns in separate programmable networks.
Abstract: This paper introduces a mechanism called CID, the Connection Information Distributor. CID extends connectionism by providing a way to program networks of simple processing elements on line, in response to processing demands. Without CID, simultaneous processing of several patterns has only been possible by prewiring multiple copies of the network needed to process one pattern at a time. With CID, programmable processing structures can be loaded with connection information stored centrally, as needed. To illustrate some of the characteristics of the scheme, a CID version of the interactive activation model of word recognition is described. The model has a single permanent representation of the connection information required for word perception, but it allows several words to be processed simultaneously in separate programmable networks. Multiword processing is not perfect, however. The model produces the same kinds of intrusion errors that human subjects make in processing brief presentations of word-pairs, such as SAND LANE (SAND is often misreported as LAND or SANE). The resource requirements of the mechanism, in terms of nodes and connections, are found to be quite moderate, primarily because networks that are programmed in response to task demands can be much smaller than networks that have knowledge of large numbers of patterns built in.


Journal ArticleDOI
TL;DR: The author presents ASP architecture, which offers cost-effective support of a wide range of numerical and nonnumerical computing applications, using state-of-the-art microelectronic technology to achieve processor packing densities that are more usually associated with memory components.
Abstract: The author presents ASP architecture, which offers cost-effective support of a wide range of numerical and nonnumerical computing applications, using state-of-the-art microelectronic technology to achieve processor packing densities that are more usually associated with memory components. ASP is designed to benefit from the inevitable VLSI-to-ULSI-to-WSI (very large, ultra large, and wafer-scale integration) technological trend, with a fully integrated, simply scalable, and defect/fault-tolerant processor interconnection strategy. The author discusses the architectural philosophy, structural organization, operational principles, and VLSI/ULSI/WSI implementation of ASP and indicates its cost-performance potential. ASP microcomputers have the potential to achieve cost-performance targets in the range of 100 to 1000 MOPS (million operations per second) per $1000. This gives ASPs an advantage of two to three orders of magnitude over current parallel computer architectures.

Journal ArticleDOI
R. C. Covington, Sridhar Madala, V. Mehta, J. R. Jump, J. B. Sinclair
01 May 1988

Journal ArticleDOI
01 May 1988
TL;DR: The amount of sharing in user programs and in the operating system, comparing the characteristics of user and system reference patterns, sharing related to process migration, and the temporal, spatial, and processor locality of shared blocks are addressed.
Abstract: Shared-memory multiprocessors have received wide attention in recent times as a means of achieving high performance cost-effectively. Their viability requires a thorough understanding of the memory access patterns of parallel processing applications and operating systems. This paper reports on the memory reference behavior of several parallel applications running under the MACH operating system on a shared-memory multiprocessor. The data used for this study is derived from multiprocessor address traces obtained from an extended ATUM address tracing scheme implemented on a 4-CPU DEC VAX 8350. The applications include parallel OPS5, logic simulation, and a VLSI wire routing program. Among the important issues addressed in this paper are the amount of sharing in user programs and in the operating system, comparing the characteristics of user and system reference patterns, sharing related to process migration, and the temporal, spatial, and processor locality of shared blocks. We also analyze the impact of shared references on cache coherence in shared-memory multiprocessors.

Journal ArticleDOI
01 Jun 1988
TL;DR: This paper presents and analyzes the computational and parallel complexity of the Livermore Loops and addresses the concern that their computations must be understood thoroughly, so that efficient implementations may be written.
Abstract: This paper presents and analyzes the computational and parallel complexity of the Livermore Loops. The Loops represent the type of computational kernels typically found in large-scale scientific computing and have been used to benchmark computer systems since the mid-60s. On parallel systems, a process's computational structure can greatly affect its efficiency. If the Loops are to be used to benchmark such systems, their computations must be understood thoroughly, so that efficient implementations may be written. This paper addresses that concern.

Patent
20 Jan 1988
TL;DR: An image processing apparatus, as mentioned in this paper, comprises image output means for outputting image information for each line, a plurality of image processing means for divisionally receiving the image information of one line as an input and parallel-processing it, and converting means for converting the result of the processing of the image processing means from parallel into series.
Abstract: An image processing apparatus comprises image output means for outputting image information for each one line, a plurality of image processing means for divisionally receiving the image information of one line as an input and parallel-processing the image information, and converting means for converting the result of the processing of the image processing means from parallel into series.

Journal ArticleDOI
TL;DR: The transitive closure of a database relation is considered as a paradigm to study parallel recursive query processing and two new parallel algorithms for evaluating the transitive closure of a relation in a parallel data server are proposed.
Abstract: Parallelism is a promising approach to high performance data management. In a highly parallel data server with declustered data placement, an important issue is to exploit parallelism in processing complex queries such as recursive queries. In this paper, we consider the transitive closure of a database relation as a paradigm to study parallel recursive query processing. And we propose two new parallel algorithms for evaluating the transitive closure of a relation in a parallel data server. Performance comparisons based on an analytical model indicate the superior response time of the parallel algorithms over their centralized version. With one hundred nodes, performance gain is between one and two orders of magnitude. One parallel algorithm provides superior response time while the other exhibits better response time/total time trade-off.
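The sequential kernel that such parallel algorithms distribute can be sketched as semi-naive iteration, in which each round joins only the newly derived pairs with the base relation. This is a sequential sketch of that kernel, not either of the paper's two parallel algorithms:

```python
def transitive_closure(edges):
    # Semi-naive evaluation: 'delta' holds the pairs derived in the last
    # round; only those are joined with the base relation, avoiding
    # rediscovery of already-known pairs. A parallel data server would
    # hash-partition the tuples across nodes and run each join round
    # on all nodes in parallel.
    closure = set(edges)
    delta = set(edges)
    while delta:
        new = {(a, d) for (a, b) in delta for (c, d) in edges if b == c}
        new -= closure
        closure |= new
        delta = new
    return closure
```

On the chain 1→2→3→4 this derives (1,3) and (2,4) in the first round and (1,4) in the second, then stops; the number of rounds is bounded by the longest path in the relation.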

Patent
28 Apr 1988
TL;DR: In this article, a fault tolerant processing system which includes a plurality of at least (3f+1) fault containment regions is presented, where the operations of the network elements are synchronized and the system can be arranged to reconfigure the groups of processors so as to form different pluralities of redundant processing sites.
Abstract: A fault tolerant processing system which includes a plurality of at least (3f+1) fault containment regions each including a plurality of processors and a network element connected to each of the processors and to the network elements of the other regions. Groups of processors are used to form redundant processing sites, the members of each group being included in different fault containment regions. The operations of the network elements are synchronized and the system can be arranged to re-configure the groups of processors so as to form different pluralities of redundant processing sites.

Journal ArticleDOI
01 Aug 1988
TL;DR: A recently proposed criterion, the degree of autonomy of each processor, is applied to further classify fine-grain SIMD (single-instruction, multiple-data-stream) massively parallel computers.
Abstract: Options are examined that drive the design of a vision-oriented computer, beginning with the analysis of the basic vision computation and communication requirements. The classical taxonomy is briefly reviewed for parallel computers, based on the instruction and data stream. A recently proposed criterion, the degree of autonomy of each processor, is applied to further classify fine-grain SIMD (single-instruction, multiple-data-stream) massively parallel computers. Three types of processor autonomy, namely, operational autonomy, addressing autonomy, and connection autonomy, are identified. For each type, the basic definition is given and some examples shown. The concept of connection autonomy, which is believed to be the key point in the development of massively parallel architectures for vision, is presented. Two examples are shown of parallel computers featuring different types of connection autonomy, the Connection Machine and the Polymorphic-Torus, and their cost and benefits are compared.

Patent
28 Jun 1988
TL;DR: In this paper, an object-oriented system comprises concept and instance objects allocated to a plurality of processors to form a network, each of the processors comprises a transmission-reception portion for transmitting and receiving messages, an object control portion for managing concept and instances, an inheritance retrieval portion for recording and retrieving information inherited from upper concept objects, a message pattern retrieval component for retrieving message patterns, message procedure storage portion for storing procedures corresponding to the message patterns.
Abstract: An object-oriented system comprises concept and instance objects allocated to a plurality of processors to form a network. Each of the processors comprises a transmission-reception portion for transmitting and receiving messages, an object control portion for managing concept and instance objects, an inheritance retrieval portion for recording and retrieving information inherited from upper concept objects, a message pattern retrieval portion for retrieving message patterns, a message procedure storage portion for storing procedures corresponding to the message patterns, and an instance object storage portion for storing instance variables that hold internal states of the instance objects.

Journal ArticleDOI
TL;DR: A model for predicting multiprocessor performance on iterative algorithms is developed and illustrates the significant impact on performance of decomposing an algorithm into parallel processes.
Abstract: A model for predicting multiprocessor performance on iterative algorithms is developed. Each iteration consists of some amount of access to global data and some amount of local processing. The iterations may be synchronous or asynchronous, and the processors may or may not incur waiting time, depending on the relationship between the access time and processing time. The effect on performance of the speed of the processor, memory, and the interconnection network is studied. The model also illustrates the significant impact on performance of decomposing an algorithm into parallel processes. The model's predictions are calibrated with experimental measurements.

Patent
Bob Chao-Chu Liang
10 Oct 1988
TL;DR: In this paper, a pipeline and parallel processing system for generating surface patches for both wireframe and solid/shaded models in a raster graphics display is presented.
Abstract: A Pipeline and Parallel Processing system for generating Surface Patches for both Wireframe and Solid/Shaded Models in a Raster Graphics Display. The inputs to a Transformation Processor are the parameters for the Rational Bezier Surfaces: a 2-dimensional array of control points, and weights. The outputs are the coordinates of the corner points and the normals (to the surface) of the patches, which make up the surface. The system consists of three Pipeline stages: 1. A front-end processor fetches the data from memory and feeds the Transformation Processor; 2. four Floating Point Processors in Parallel for tessellating the surfaces into small patches; and 3. one Floating Point Processor for generating normals at the vertices of the small patches. The output is sent to the rest of the Graphics System for clipping, mapping, and shading.


Journal ArticleDOI
01 Jan 1988
TL;DR: It appears that as long as PRAMs cannot achieve the desired cost and performance goals, programmers must contend with carefully designing algorithms for specific architectures.
Abstract: Some of the problems encountered in mapping a parallel algorithm are examined, emphasizing mappings of vision algorithms onto mesh, hypercube, mesh-of-trees, pyramid, and parallel random-access machines (PRAMs) having many simple processors, each with a small amount of memory. Approaches that have been suggested include simulating the ideal architectures, and using general data movement operations. Each of these is shown to occasionally produce unacceptably inefficient implementations. It appears that as long as PRAMs cannot achieve the desired cost and performance goals, programmers must contend with carefully designing algorithms for specific architectures.

Proceedings ArticleDOI
27 Jun 1988
TL;DR: The authors address issues central to the design and operation of a Byzantine resilient parallel computer by treating connectivity as a resource which is shared among many processing elements, allowing flexibility in their configuration and reducing complexity.
Abstract: The authors address issues central to the design and operation of a Byzantine resilient parallel computer. Interprocessor connectivity requirements are met by treating connectivity as a resource which is shared among many processing elements, allowing flexibility in their configuration and reducing complexity. Reliability analysis results are presented which demonstrate the reduced failure probability of such a system. Redundant groups are synchronized solely by message transmissions and receptions, which also provide input data consistency and output voting. Performance analysis results are presented which quantify the temporal overhead involved in executing such fault tolerance-specific operations.

Patent
14 Jun 1988
TL;DR: In this paper, an array processor is presented, consisting of multiplexers, plural processing elements connected through the multiplexers in the form of a ring, and a control unit for controlling the multiplexers and the processing elements.
Abstract: An array processor comprising multiplexers, plural processing elements connected through the multiplexers in the form of a ring and a control unit for controlling the multiplexers and the processing elements. Each of the processing elements is connected to an input vector data bus via the multiplexer and directly to an I/O data bus, so that two types of input vector data are inputted to the processing element simultaneously. Flags indicating a position of respective vector data are added to each one of the input vector data series composed of a combination of plural types of input vector data series. The processing element judges its own processing status to control a selection of the input vector data bus or the transfer path, data transfer between the processing elements, or data input/output to/from the I/O bus, so that the overall array processor executes autonomous control of all the combinations of the vector data of the two types of input vector data series. The array processor realizes parallel processing of pattern matching computation based upon dynamic time warping with a high efficiency and thus realizes a high-efficiency utilization of hardware resources including processing elements and network.

Proceedings ArticleDOI
24 Oct 1988
TL;DR: The authors present the first sub-linear-time deterministic parallel algorithms for bipartite matching and several related problems, including maximal node-disjoint paths, depth-first search, and flows in zero-one networks.
Abstract: The authors present the first sub-linear-time deterministic parallel algorithms for bipartite matching and several related problems, including maximal node-disjoint paths, depth-first search, and flows in zero-one networks. The results are based on a better understanding of the combinatorial structure of the above problems, which leads to new algorithmic techniques. In particular, it is shown how to use maximal matching to extend, in parallel, a current set of node-disjoint paths and how to take advantage of the parallelism that arises when a large number of nodes are active during an execution of a push/relabel network flow algorithm. It is also shown how to apply the techniques to design parallel algorithms for the weighted versions of the above problems.