
Showing papers on "Degree of parallelism published in 1994"


Journal ArticleDOI
TL;DR: A new algorithm for passively estimating the ranges and bearings of multiple narrow-band sources using a uniform linear sensor array is presented, which reduces the global 2D search over range and bearing to 2(m−1) independent 1D searches.
Abstract: A new algorithm for passively estimating the ranges and bearings of multiple narrow-band sources using a uniform linear sensor array is presented. The algorithm is computationally efficient and converges globally. It minimizes the MUSIC cost function subject to geometrical constraints imposed by the curvature of the received wavefronts. The estimation problem is reduced to one of solving a set of two coupled 2D polynomial equations. The proposed algorithm solves this nonlinear problem using a modification of the path-following (or homotopy) method. For an array having m sensors, the algorithm reduces the global 2D search over range and bearing to 2(m−1) independent 1D searches. This imparts a high degree of parallelism that can be exploited to obtain source location estimates very efficiently.

126 citations


Journal ArticleDOI
TL;DR: A general-purpose fuzzy processor is presented, the core of which is based on an analog-numerical approach combining the inherent advantages of analog and digital implementations, above all as regards noise margins.
Abstract: In this paper we present a design for a general-purpose fuzzy processor, the core of which is based on an analog-numerical approach combining the inherent advantages of analog and digital implementations, above all as regards noise margins. The architectural model proposed was chosen in such a way as to obtain a processor capable of working with a considerable degree of parallelism. The internal structure of the processor is organized as a cascade of pipeline stages which perform parallel execution of the processes into which each inference can be decomposed. A particular feature of the project is the definition of a 'fuzzy-gate', which executes elementary fuzzy computations, on which construction of the whole core of the processor is based. Designed using CMOS technology, the core can be integrated into a single chip and can easily be extended. The obtainable performance, on the order of 50 mega fuzzy rules per second, is considerable.
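
The "fuzzy-gate" above is a hardware unit executing elementary fuzzy computations. As a purely illustrative software sketch (not the paper's design), the following Python fragment shows the kind of min/max operations a Mamdani-style rule evaluation performs over discretized membership functions; the membership shapes, the rule, and the inputs are all hypothetical.

```python
import numpy as np

# Illustrative only: the elementary min/max operations of a Mamdani-style
# fuzzy rule evaluation over discretized membership functions. The membership
# functions and the rule are hypothetical, not the paper's hardware design.

u = np.linspace(0.0, 1.0, 101)           # discretized universe of discourse

def tri(u, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return np.clip(np.minimum((u - a) / (b - a), (c - u) / (c - b)), 0.0, 1.0)

low    = tri(u, 0.0, 0.2, 0.4)
high   = tri(u, 0.6, 0.8, 1.0)
medium = tri(u, 0.3, 0.5, 0.7)

def fire_rule(x, y):
    """Hypothetical rule: IF x is LOW AND y is HIGH THEN z is MEDIUM."""
    w = min(np.interp(x, u, low), np.interp(y, u, high))   # fuzzy AND = min
    return np.minimum(w, medium)                           # implication = clip

# Aggregating several rules would be an element-wise max over their outputs;
# a single rule is shown here, followed by centroid defuzzification.
out = fire_rule(x=0.2, y=0.8)
z = np.sum(u * out) / np.sum(out)
print(f"defuzzified output: {z:.3f}")
```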

63 citations


Patent
Kevin Harney
20 Jul 1994
TL;DR: In this paper, a three-level prioritization scheme is used to handle the input/output data stream to improve the throughput of the processor, including provisions for distinguishing between same-priority events occurring at different times, and ensuring that in such cases the requested operations occur in the same temporal order as the respective requests.
Abstract: Features which support conditional execution and sequencing are employed in concert with a centralized-control, single-instruction, multiple data integrated video signal processor, thus adapting efficiently to the high degree of parallelism inherent in this type of video signal processing system. A three-level prioritization scheme is used to handle the input/output data stream to improve the throughput of the processor, including provisions for distinguishing between same-priority events occurring at different times, and ensuring that in such cases the requested operations occur in the same temporal order as the respective requests.
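
The ordering property described above (higher-priority requests granted first, equal-priority requests granted in arrival order) can be modeled in a few lines. The sketch below is a hypothetical software analogue; the three priority levels, the request names, and the queue itself are illustrative, not the patent's hardware scheme.

```python
import heapq
import itertools

# Hypothetical software analogue of the ordering property described above:
# requests are granted highest-priority first, and requests of equal priority
# are granted in the temporal order in which they were made.

class RequestQueue:
    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()      # monotonically increasing "time"

    def request(self, priority, operation):
        # lower number = higher priority; the arrival counter breaks ties so
        # same-priority events keep their temporal order
        heapq.heappush(self._heap, (priority, next(self._arrival), operation))

    def grant(self):
        _, _, operation = heapq.heappop(self._heap)
        return operation

q = RequestQueue()
q.request(2, "host transfer")      # level 2 (lowest of the three levels)
q.request(0, "video input A")      # level 0 (highest)
q.request(1, "status readout")
q.request(0, "video input B")      # same priority as A, requested later

# grants: video input A, video input B, status readout, host transfer
print([q.grant() for _ in range(4)])
```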

49 citations


Book ChapterDOI
01 Jan 1994
TL;DR: It is argued that multi-coloring can be combined with multiple-step relaxation preconditioners to achieve a good level of parallelism while keeping the rates of convergence to good levels.
Abstract: The degree of parallelism in the preconditioned Krylov subspace method using standard preconditioners is limited and can lead to poor performance on massively parallel computers. In this paper we examine this problem and consider a number of alternatives based both on multi-coloring ideas and polynomial preconditioning. The emphasis is on methods that deal specifically with general unstructured sparse matrices such as those arising from finite element methods on unstructured grids. It is argued that multi-coloring can be combined with multiple-step relaxation preconditioners to achieve a good level of parallelism while keeping the rates of convergence to good levels. We also exploit the idea of multi-coloring and independent set orderings to introduce a multi-elimination incomplete LU factorization named ILUM, which is related to multifrontal elimination. The main goal of the paper is to discuss some of the prevailing ideas and to compare them on a few test problems.
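
As a minimal illustration of the multi-coloring idea the chapter builds on (not of the ILUM factorization itself), the sketch below greedily colors the adjacency graph of a sparse matrix so that coupled unknowns never share a color; all unknowns of one color can then be relaxed simultaneously. The test matrix and the relaxation sweep are generic assumptions.

```python
import numpy as np
import scipy.sparse as sp

# Greedy multi-coloring sketch: no two coupled unknowns share a color, so all
# unknowns of one color can be relaxed in parallel. Generic illustration only.

def greedy_coloring(A):
    A = A.tocsr()
    n = A.shape[0]
    colors = -np.ones(n, dtype=int)
    for i in range(n):
        neighbors = A.indices[A.indptr[i]:A.indptr[i + 1]]
        used = {colors[j] for j in neighbors if j != i and colors[j] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[i] = c
    return colors

def multicolor_relaxation_sweep(A, x, b, colors):
    """One Gauss-Seidel-like sweep; unknowns within a color update in parallel."""
    A = A.tocsr()
    d = A.diagonal()
    for c in range(colors.max() + 1):
        rows = np.where(colors == c)[0]
        r = b[rows] - A[rows, :] @ x     # residual using the other colors' values
        x[rows] += r / d[rows]
    return x

# 1D Laplacian test problem: a two-coloring (red-black) suffices
n = 8
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
colors = greedy_coloring(A)
x = multicolor_relaxation_sweep(A, np.zeros(n), np.ones(n), colors)
print(colors, x)
```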

42 citations


Proceedings ArticleDOI
10 Apr 1994
TL;DR: A multi-assignment language derived from the UNITY formalism is proposed to implement, with a high degree of parallelism, the controllers of the ArMen FPGA-multiprocessor.
Abstract: Embedding an FPGA circular array into MIMD architectures allows one to synthesize fine-grain circuits for global computation support. These circuits operate concurrently with the distributed applications. They provide specific speed-up or additional services, such as communication protocols or global controllers. This article describes an architectural model for such controllers, with practical examples implemented on the ArMen FPGA-multiprocessor. A multi-assignment language derived from the UNITY formalism is proposed to implement the controllers with a high degree of parallelism. Their hardware synthesis principles are given.

39 citations


Proceedings ArticleDOI
23 Sep 1994
TL;DR: Two methods for synthesis of VHDL specifications containing concurrent processes are presented to preserve simulation/synthesis correspondence during high-level synthesis and to produce hardware that operates with a high degree of parallelism.
Abstract: This paper presents two methods for synthesis of VHDL specifications containing concurrent processes. Our main objective is to preserve simulation/synthesis correspondence during high-level synthesis and to produce hardware that operates with a high degree of parallelism. The first method supports an unrestricted use of signals and wait statements and synthesizes synchronous hardware with global control of process synchronization for signal update. The second method allows hardware synthesis without the strict synchronization imposed by the VHDL simulation cycle. Experimental results have shown that the proposed methods are efficient for a wide spectrum of digital systems.

38 citations


Proceedings ArticleDOI
30 Nov 1994
TL;DR: The techniques presented in this paper, used in combination with prior work on reducing the height of data dependences, provide a comprehensive approach to accelerating loops with conditional exits.
Abstract: The performance of applications executing on processors with instruction level parallelism is often limited by control and data dependences. Performance bottlenecks caused by dependences can frequently be eliminated through transformations which reduce the height of critical paths through the program. While height reduction techniques are not always helpful, their utility can be demonstrated in a broad range of important situations. This paper focuses on the height reduction of control recurrences within loops with data-dependent exits. Loops with exits are transformed so as to alleviate performance bottlenecks resulting from control dependences. A compilation approach to effect these transformations is described. The techniques presented in this paper, used in combination with prior work on reducing the height of data dependences, provide a comprehensive approach to accelerating loops with conditional exits. In many cases, loops with conditional exits provide a degree of parallelism traditionally associated with vectorization. Multiple iterations of a loop can be retired in a single cycle on a processor with adequate instruction level parallelism, with no cost in code redundancy. In more difficult cases, height reduction requires redundant computation or may not be feasible.
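
The idea of retiring several iterations of a loop with a conditional exit at once can be illustrated at source level. The sketch below uses NumPy as a stand-in for the compiler transformation: the exit conditions of a block of iterations are evaluated together, and the first satisfied one is then located. The search loop is a hypothetical example, not taken from the paper.

```python
import numpy as np

# Conceptual illustration (in NumPy, standing in for a compiler transformation)
# of evaluating the exit conditions of several loop iterations at once.

data = np.random.default_rng(0).integers(0, 1000, size=10_000)
KEY = data[7321]          # guarantee the exit is eventually taken

# Original loop: one exit test per iteration (a control recurrence of height n)
def find_sequential(data, key):
    for i in range(len(data)):
        if data[i] == key:      # loop-exit condition
            return i
    return -1

# Height-reduced form: test a block of iterations together, then locate the
# first exit inside the block; the block's tests are independent of each other.
def find_blocked(data, key, block=8):
    for start in range(0, len(data), block):
        hits = data[start:start + block] == key    # "wide" exit test
        if hits.any():
            return start + int(np.argmax(hits))    # first iteration that exits
    return -1

assert find_sequential(data, KEY) == find_blocked(data, KEY)
```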

37 citations


Journal ArticleDOI
06 Jun 1994
TL;DR: It is shown that guarded repair can improve system performance and dependability significantly and a time-dependent optimality of dependable, parallel configurations can be determined from the results.
Abstract: Imperfect coverage and nonnegligible reconfiguration delay are known to have a deleterious effect on the dependability and the performance of a multiprocessor system. In particular, increasing the number of processor elements does not always increase dependability. An obvious reason for this is that the total failure rate increases, generally, linearly with the number of components in the system. It is also a well-known fact that the performance gain due to parallelism mostly turns out to be sublinear with the number of processors. It is therefore important to optimize the degree of parallelism in system design. A related issue is that by deferring repair, it is sometimes possible to improve system dependability. In this case decisions have to be made dynamically as to when to repair and when not to repair. Most of the current research deals with static optimization of the number of processors. No systematic approach for dynamic control of dependable systems has been proposed so far. Dynamic, i.e. transient, decision of whether or not to repair is the optimization problem considered in this paper. We propose extended Markov reward models (EMRM) to capture such questions. EMRM are a marriage between performability modeling techniques and Markov decision theory. A numerical solution procedure is developed to provide optimal solution trajectories for this problem. EMRM are a general framework for the dynamic optimization of reconfigurable, dependable systems. The optimization is applied on the basis of several performance and dependability measures. In particular, we explore availability, capacity-oriented availability, performance-oriented unavailability, and performability measures. Furthermore, off-line and on-line repair strategies are compared. We show that guarded repair can improve system performance and dependability significantly. The control strategies and reward functions differ considerably in each case. Each scenario turns out to be of interest in its own right. A time-dependent optimality of dependable, parallel configurations can be determined from our results.
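
To make the repair-versus-defer question concrete, the sketch below sets up a toy, discrete-time Markov decision problem for a two-processor system and solves it by backward induction. It is only a caricature of the paper's extended Markov reward models, and every rate, cost, and reward in it is a made-up assumption; its one faithful feature is that the optimal action depends on the remaining horizon, i.e. the optimum is time-dependent.

```python
from math import comb
import numpy as np

# Toy discrete-time repair-vs-defer decision for a 2-processor system (NOT the
# paper's EMRM formulation). Per step, each working processor fails with
# probability P_FAIL and an ongoing repair completes with probability P_REPAIR;
# the per-step reward is the number of working processors (capacity) minus a
# cost while repairing. All numbers are hypothetical.
P_FAIL, P_REPAIR, REPAIR_COST, HORIZON = 0.05, 0.3, 1.4, 40

def next_state_dist(k, repairing):
    """Distribution over the number of working processors after one step."""
    dist = {}
    for fails in range(k + 1):
        p = comb(k, fails) * P_FAIL**fails * (1 - P_FAIL)**(k - fails)
        nxt = k - fails
        if repairing and k < 2:                  # a down processor may return
            dist[nxt + 1] = dist.get(nxt + 1, 0.0) + p * P_REPAIR
            dist[nxt] = dist.get(nxt, 0.0) + p * (1 - P_REPAIR)
        else:
            dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

V = np.zeros(3)                                  # terminal values, states 0..2
policy = []                                      # policy[t]: t+1 steps remain
for _ in range(HORIZON):                         # backward induction
    newV, decision = np.zeros(3), {}
    for k in range(3):
        q = {}
        for repairing in (False, True):
            if repairing and k == 2:
                continue                         # nothing to repair
            reward = k - (REPAIR_COST if repairing else 0.0)
            q[repairing] = reward + sum(p * V[n] for n, p in
                                        next_state_dist(k, repairing).items())
        decision[k] = max(q, key=q.get)
        newV[k] = q[decision[k]]
    V = newV
    policy.append(decision)

print("one processor down, many steps remaining -> repair?", policy[-1][1])
print("one processor down, one step remaining   -> repair?", policy[0][1])
```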

31 citations


Journal ArticleDOI
TL;DR: This paper presents a parallel algorithm for solving the region growing problem based on the split-and-merge approach, and uses it to test and compare various parallel architectures and programming models.

31 citations


Proceedings ArticleDOI
23 May 1994
TL;DR: New scalable interaction paradigms and their embodiment in a time- and space-efficient debugger with scalable performance are presented, making it easier to debug and understand message-passing programs.
Abstract: Developers of message-passing codes on massively parallel systems have to contend with difficulties that data-parallel programmers do not face, not the least of which is debuggers that do not scale with the degree of parallelism. In this paper, we present new scalable interaction paradigms and their embodiment in a time- and space-efficient debugger with scalable performance. The debugger offers scalable expression, execution, and interpretation of all debugging operations, making it easier to debug and understand message-passing programs.

18 citations


01 Nov 1994
TL;DR: A framework for bandwidth reduction and tridiagonalization algorithms for symmetric banded matrices is developed, which leads to algorithms that require fewer floating-point operations, allow for space-time tradeoffs, enable the use of block orthogonal transformations, and increase the degree of parallelism inherent in the algorithm.
Abstract: This paper develops a framework for bandwidth reduction and tridiagonalization algorithms for symmetric banded matrices. The algorithm family includes the algorithms by Rutishauser and Schwarz, which underlie the EISPACK and LAPACK implementations, and the algorithm recently proposed by Lang. The framework leads to algorithms that require fewer floating-point operations, allow for space-time tradeoffs, enable the use of block orthogonal transformations, and increase the degree of parallelism inherent in the algorithm.

Journal ArticleDOI
TL;DR: A simple rule of thumb is given for choosing the degree of parallelism that maximizes the throughput of the hybrid hash algorithm, and, in the case of Grace join, asymptotic conditions on the amount of skew under which a limit on parallelism exists are established.

Journal ArticleDOI
01 Sep 1994-Lethaia
TL;DR: The high degree of parallelism, combined with abundant symplesiomorphic characters, led to erroneous phylogenetic inferences when non-biotic data were excluded from analysis.
Abstract: Webb, G.E. 1994 10 15: Parallelism, non-biotic data and phylogeny reconstruction in paleobiology. Many systematists equate parallelism and convergence. However, whereas convergence is relatively uncommon and easily recognized using divergent characters, parallelism is common but more difficult to recognize because divergent characters are less abundant. Cladists, in particular, equate homeomorphy with convergence and reject parallelism as a distinct concept. Unfortunately, cladistic parsimony analysis may not resolve most parallelism. Therefore, criteria for the a priori recognition and objective evaluation of parallelism are very significant. Non-biotic data (e.g., stratigraphic and geographic distribution) provide independent criteria for the construction of hypotheses of parallelism in cases where taxa (1) were geographically isolated during homeomorphic character-state transformations, (2) occurred with endemic faunas, and (3) evolved in similar environmental conditions as suggested by paleoecological data. Australian lithostrotionoid corals were long considered congeneric with European taxa. However, because of their geographic isolation, occurrence with endemic rugose corals and occurrence in similar depositional environments as European forms, they are now considered a homeomorphic clade, resulting from an extended sequence of parallel character-state transformations. The high degree of parallelism, combined with abundant symplesiomorphic characters, led to erroneous phylogenetic inferences when non-biotic data were excluded from analysis. Cladistics, homeomorphy, lithostrotionoid corals, parallelism, phylogeny.

Proceedings ArticleDOI
03 Nov 1994
TL;DR: This work gives a simple specification of a scheduler and presents three delay-insensitive implementations, one of which contains a high degree of parallelism and is simpler than previously proposed implementations.
Abstract: The committee problem involves the specification and design of a scheduler for committee meetings. It is a general resource allocation problem that combines both synchronization and mutual exclusion. We give a simple specification of a scheduler and present three delay-insensitive implementations. Our last implementation contains a high degree of parallelism and is simpler than previously proposed implementations.

Proceedings ArticleDOI
18 May 1994
TL;DR: This paper illustrates a formal technique for describing the timing properties and resource constraints of pipelined superscalar processor instructions at high level using ACSR, and describes how to derive the temporal behavior of an assembly program using the ACSR laws.
Abstract: This paper illustrates a formal technique for describing the timing properties and resource constraints of pipelined superscalar processor instructions at a high level. Superscalar processors can issue and execute multiple instructions simultaneously. The degree of parallelism depends on the multiplicity of hardware functional units as well as data dependencies among instructions. Thus, the timing properties of a superscalar program are difficult to analyze and predict. We describe how to model the instruction-level architecture of a superscalar processor using ACSR and how to derive the temporal behavior of an assembly program using the ACSR laws. The salient aspect of ACSR is that the notions of time, resources and priorities are supported directly in the algebra. Our approach is to model superscalar processor registers as ACSR resources, instructions as ACSR processes, and use ACSR priorities to achieve maximum possible instruction-level parallelism.

Journal ArticleDOI
TL;DR: A parallel algorithm, called PARALLEX, which uses a conflict resolving method, has been developed for the switchbox routing problem in a parallel processing environment; the speed-ups for 7- and 19-net problems were 4.7 and 10, respectively.
Abstract: A parallel algorithm, called PARALLEX, which uses a conflict resolving method, has been developed for the switchbox routing problem in a parallel processing environment. PARALLEX can achieve a very high degree of parallelism by generating as many processes as nets. Each process is assigned to route a net, which bears the same identification number as the process. If conflicts are found for the current route of a net, then that process classifies the set(s) of conflict segments into groups that are identified by the various types of conflict(s) within each group. Each process with conflicts finds partial solutions by resolving every conflict of a group in the path-finding procedure and merges them with the solutions from other processes, which may or may not have conflicts, to make a conflict-free switchbox. The speed-ups for 7- and 19-net problems were 4.7 and 10, respectively.

15 Dec 1994
TL;DR: A systematic parameter-based method, called the General Parameter Method (GPM), to design optimal, lower-dimensional processor arrays for uniform dependence algorithms has been developed, and it can be found that the system yield improves with the area of the coprocessor when chip yield decreases as the inverse square of the clock frequency.
Abstract: With the continuing growth of VLSI technology, special-purpose parallel processors have become a promising approach in the quest for high performance. Fine-grained processor arrays have become popular as they are suitable for solving problems with a high degree of parallelism, and can be inexpensively built using custom designs or commercially available field programmable gate arrays (FPGAs). Such specialised designs are often required in portable computing and communication systems with real-time constraints, as software-controlled processors often fail to provide the necessary throughput. This thesis addresses many issues in designing such application-specific systems built with fine-grained processor arrays for regular recursive uniform dependence algorithms. A uniform dependence algorithm consists of a set of indexed computations and a set of uniform dependence vectors which are independent of the indices of computations. Many important applications in signal/image processing, communications, and scientific computing can be formulated as uniform dependence algorithms. The first part of this thesis addresses the problem of designing algorithm-specific processor arrays. A systematic parameter-based method, called the General Parameter Method (GPM), to design optimal, lower-dimensional processor arrays for uniform dependence algorithms has been developed. The GPM can be used to derive optimal arrays for any user-specified objective expressed in terms of the parameters. The proposed approach employs an efficient search technique to explore the design space and arrive at the optimal designs. The GPM can be used to find optimal designs in the dependence-based methods using the equivalence between the parameter-based and dependence-based methods. The GPM has also been extended to derive optimal two-level pipelined algorithm-specific processor arrays. Such two-level pipelined arrays can be clocked at higher rates than can nonpipelined designs for real-time applications. The second part of this thesis presents a parallel VLSI architecture for a general-purpose coprocessor for uniform dependence algorithms. The architecture consists of a linear array of processors and a linear chain of buffer memories organized as FIFO queues to store the buffered data. Such an architecture is advantageous from the point of view of scalability and wafer-level integration. A distinguishing feature is the assumption of a limited-bandwidth interface to external memory modules for accessing the data. Such an assumption allows the coprocessor to be integrated easily into existing systems. Efficient techniques to partition the dependence graph into blocks, sequence the blocks through the buffer memory to reduce the number of data accesses to main memory, and map the blocks using GPM have been developed. An important result obtained is the square-root relationship between clock-rate reduction and area of the coprocessor under fixed main-memory bandwidth. From the square-root relationship, it can be found that the system yield improves with the area of the coprocessor when chip yield decreases as the inverse square of the clock frequency. Results on matrix-product and transitive-closure applications indicate that the coprocessor can be used to deliver higher speedup or lower clock rate than a reference one-processor design. Thus, the coprocessor can be used as a general-purpose back-end accelerator for loop-based matrix algorithms.
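
Stated symbolically (a hedged reading of the two claims above, with A_c the coprocessor area, f the required clock frequency, and Y the chip yield): the fixed main-memory bandwidth gives f ∝ 1/√A_c, and combining this with Y ∝ 1/f² gives Y ∝ A_c from the frequency dependence alone, which is the direction of improvement stated above; a full account would also weigh the usual decrease of yield with die area.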

Proceedings ArticleDOI
01 May 1994
TL;DR: A novel algorithm, designated as Fast Invariant Imbedding algorithm, is developed, which offers a massive degree of parallelism with simple communication and synchronization requirements and two massively parallel, algorithmically specialized, architectures for low-cost and optimal implementation of this algorithm.
Abstract: Massively parallel algorithms and architectures for real-time wavefront control of a dense adaptive optic system (SELENE) are presented. We have already shown that the computation of a near optimal control algorithm for SELENE can be reduced to the solution of a discrete Poisson equation on a regular domain. Although this represents an optimal computation, due to the large size of the system and the high sampling rate requirement, the implementation of this control algorithm poses a computationally challenging problem since it demands a sustained computational throughput of the order of 10 GFlops. We develop a novel algorithm, designated as Fast Invariant Imbedding algorithm, which offers a massive degree of parallelism with simple communication and synchronization requirements. We also discuss two massively parallel, algorithmically specialized, architectures for low-cost and optimal implementation of the Fast Invariant Imbedding algorithm.
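
Since the abstract reduces the control computation to a discrete Poisson equation on a regular domain, a generic sketch of that computational core is shown below. It uses a plain Jacobi sweep (not the Fast Invariant Imbedding algorithm), chosen only because every grid point updates independently and thus exposes the massive parallelism mentioned above; the grid size and right-hand side are hypothetical.

```python
import numpy as np

# Generic illustration of the computational core named above: solving a
# discrete Poisson equation on a regular grid with a plain Jacobi iteration.
# Every interior point in a sweep updates independently of the others.

def jacobi_poisson(f, h, iters=500):
    """Solve -laplacian(u) = f on a square grid with u = 0 on the boundary."""
    u = np.zeros_like(f)
    for _ in range(iters):
        unew = u.copy()
        unew[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                   u[1:-1, :-2] + u[1:-1, 2:] +
                                   h * h * f[1:-1, 1:-1])
        u = unew
    return u

n, h = 33, 1.0 / 32
f = np.ones((n, n))              # hypothetical right-hand side (actuator demands)
u = jacobi_poisson(f, h)
print(u[n // 2, n // 2])         # centre value of the computed correction
```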

Journal ArticleDOI
TL;DR: This paper describes the development of a context-sensitive compiler for pattern-matching languages using the searching power of massively parallel associative computers; the compilation of production rules into equivalent procedural rules is completely data parallel.
Abstract: The searching power of massively parallel associative computers is an underused and underinvestigated capability that can be used to facilitate software development. This paper describes the development of a context-sensitive compiler for pattern-matching languages using that searching power. The described compiler was implemented on the STARAN parallel computer, and the compiled OPS5 programs were also executed on the STARAN, obtaining an estimated throughput of 6000 rules per second. The described compilation of production rules into equivalent procedural rules is completely data parallel, with the degree of parallelism depending on the number of tokens in the program being compiled. During any one step of the context-sensitive analysis, the entire program is processed in constant time.

Proceedings ArticleDOI
G. Privat, K. Goser
26 Sep 1994
TL;DR: Fuzzy relaxation labelling is presented as a feasibility example for such a streamlined hardware implementation of a cellular automata network that combines the flexibility of numeric processing with the explicitness and transparency of logic rules.
Abstract: Expressing with fuzzy logic the local transition functions of a cellular automata network combines the flexibility of numeric processing with the explicitness and transparency of logic rules, within the framework of an emergent-cooperative computational model. Implementing fuzzy processing elements as the nodes of such nets places heavy emphasis on fine granularity, and is feasible only in analog form if the aim is to achieve the degree of parallelism matched to potential applications in image processing. Fuzzy relaxation labelling is presented as a feasibility example for such a streamlined hardware implementation. This computational model is potentially applicable in a wide range of applications, drawing maximal benefit from advanced VLSI technologies.

Patent
04 Nov 1994
TL;DR: In this article, the authors propose to increase the degree of parallelism of arithmetic processing operation in the VLIW system to improve the bit use efficiency of instructions and the use of hardware resources of a processor.
Abstract: PURPOSE: To increase the degree of parallelism of arithmetic processing in a VLIW system, thereby improving the bit-use efficiency of instructions and the utilization of a processor's hardware resources. CONSTITUTION: An instruction 10 consists of plural operation fields op1 to op4 and control subfields F1 to F4 corresponding to the respective operation fields. When the instruction 10 contains a branch operation, the operation to be executed differs depending on whether the branch succeeds or fails. In this case, the operation results are adopted or rejected based on the contents of the control subfields F1 to F4 and a flag indicating the success or failure of the branch. Thus, the operations for branch success and for branch failure are included in one instruction 10 and processed in parallel, and the number of instructions is reduced.
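
A small software model of the behavior described above may help: operations for both the taken and the not-taken branch path travel in one wide instruction, and each operation field's result is adopted or rejected from its control subfield together with the branch-outcome flag. The field names and example operations below are hypothetical, not taken from the patent.

```python
# Software model of the behaviour described above: one VLIW-style instruction
# carries operations for both branch outcomes, and each result is adopted or
# rejected according to its control subfield plus the branch-outcome flag.

TAKEN, NOT_TAKEN, ALWAYS = "taken", "not_taken", "always"

def execute_wide_instruction(ops, branch_taken, regs):
    # 1) execute every operation field in parallel (order does not matter here)
    results = [(dest, fn(regs), cond) for dest, fn, cond in ops]
    # 2) adopt or reject each result based on its control subfield and the flag
    for dest, value, cond in results:
        if cond == ALWAYS or (cond == TAKEN) == branch_taken:
            regs[dest] = value
    return regs

regs = {"r1": 10, "r2": 3, "r3": 0, "r4": 0}
instruction = [
    ("r3", lambda r: r["r1"] + r["r2"], TAKEN),      # op1: used if branch succeeds
    ("r3", lambda r: r["r1"] - r["r2"], NOT_TAKEN),  # op2: used if branch fails
    ("r4", lambda r: r["r1"] * 2,        ALWAYS),    # op3: unconditional
]

print(execute_wide_instruction(instruction, branch_taken=True,  regs=dict(regs)))
print(execute_wide_instruction(instruction, branch_taken=False, regs=dict(regs)))
```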

Book ChapterDOI
04 Jul 1994
TL;DR: A parallel simulated annealing algorithm that constructs and uses joint transformations is presented for mapping irregular parallel programs onto homogeneous processor arrays with regular topology.
Abstract: This paper presents a parallel simulated annealing algorithm for solving the problem of mapping irregular parallel programs onto homogeneous processor arrays with regular topology. The algorithm constructs and uses joint transformations. These transformations guarantee a high degree of parallelism that is bounded below by ⌈|Np| / (deg(Gp) + 1)⌉, where |Np| is the number of task nodes in the mapped program graph Gp and deg(Gp) is the maximal degree of a node in Gp. The mapping algorithm provides good program mappings (in terms of program execution time and the number of processors used) in a reasonable number of steps.

Proceedings ArticleDOI
26 Oct 1994
TL;DR: This paper presents efficient mappings of large sparse neural networks on a distributed-memory MIMD multicomputer with high performance vector units and shows that vectorization can nevertheless more than quadruple the performance on the authors' modeled supercomputer.
Abstract: This paper presents efficient mappings of large sparse neural networks on a distributed-memory MIMD multicomputer with high performance vector units. We develop parallel vector code for an idealized network and analyze its performance. Our algorithms combine high performance with a reasonable memory requirement. Due to the high cost of scatter/gather operations, generating high performance parallel vector code requires careful attention to details of the representation. We show that vectorization can nevertheless more than quadruple the performance on our modeled supercomputer. Pushing several patterns at a time through the network (batch mode) exposes an extra degree of parallelism which allows us to improve the performance by an additional factor of 4. Vectorization and batch updating therefore yield an order of magnitude performance improvement.
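
The batch-mode observation can be illustrated directly: pushing several patterns through a layer at once turns a matrix-vector product into a matrix-matrix product, which is what exposes the extra degree of parallelism. The sketch below uses a dense layer for brevity (the paper's networks are sparse), and the sizes are arbitrary assumptions.

```python
import numpy as np

# Illustration of the batch-mode point made above: propagating several input
# patterns at once turns a matrix-vector product into a matrix-matrix product.
# A dense layer stands in for the paper's sparse networks; sizes are arbitrary.

rng = np.random.default_rng(1)
n_in, n_out, batch = 2048, 512, 64
W = rng.standard_normal((n_out, n_in))
patterns = rng.standard_normal((n_in, batch))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# one pattern at a time: 'batch' separate matrix-vector products
one_by_one = np.stack([sigmoid(W @ patterns[:, j]) for j in range(batch)], axis=1)

# batch mode: a single matrix-matrix product over all patterns
batched = sigmoid(W @ patterns)

assert np.allclose(one_by_one, batched)
```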

Journal ArticleDOI
01 Dec 1994
TL;DR: This paper presents an alternative course, designed as an elective in computer architecture for upper level undergraduate or graduate students, that presents a side-by-side comparison of von Neumann and data flow architectures.
Abstract: Most computer architecture courses are geared toward the classical von Neumann style of computer architectures, mentioning only in passing other models such as data flow computation. This is unfortunate, due to the high degree of parallelism possible using data flow. We present an alternative course, designed as an elective in computer architecture for upper level undergraduate or graduate students, that presents a side-by-side comparison of von Neumann and data flow architectures. Our teaching environment is based on Simple Arithmetic SISAL (SAS), a subset of the applicative programming language SISAL, which we designed for both teaching about and research into data flow architectures. SAS runs in a highly integrated environment, allowing students to implement their program on a von Neumann architecture, then observe its execution through a data flow simulator. The environment runs on a standard IBM-style personal computer, providing a cost-effective platform for presenting the course.

01 Jan 1994
TL;DR: This paper presents Snyder's XYZ levels, a model for concurrent evaluation that allows a much higher degree of parallelism to be achieved, and states that optimizing compiler technology is particularly applicable at this level.
Abstract: ion Levels Talk about Snyder’s XYZ levels. Aside from terminology, we are thinking of the same ideas. The highest level is composition. Snyder discusses phase composition at the Z, or problem, level. In our scheme, a single process creates specifies a computation by brokering the services of existing objects. These objects are concurrent/parallel in nature. The next level is concurrent evaluation. The objects brokered by the composition level cause the creation of itinerant actors. Itinerant actors are used to do one of the following: • process coordination • object coordination • macro-dataflow (with virtual pattern matching) • distributed and shared data structures (all objects) Snyder speaks of a Y level, wherein a phase composes process units to achieve a parallel computation. In our model, we use the term process somewhat differently. Our processes are lightweight at this level and partially ordered as in dataflow. This allows a much higher degree of parallelism to be achieved, since we do not rely on operating system mechanisms to manage lightweight processes (in other words, we do not use threads either). Snyder speaks of the X level, wherein a process composes sequential program units in a single address space. We are consistent in this interpretation at our lowest level. While we do not address this in detail in this paper, there is nothing to stop you from exploiting parallelism at this level. This is the level where we believe optimizing compiler technology is particularly applicable.

Proceedings ArticleDOI
19 Apr 1994
TL;DR: This paper demonstrates that practical, time-adaptive singular value decomposition can be implemented on a parallel processor array using Cordic arithmetic and asynchronous communication, such that any degree of parallelism, from single-processor implementation up to full-size array implementation, is supported by a 'universal' processing unit.
Abstract: Implementing Jacobi algorithms in parallel processor arrays is a non-trivial task, in particular when the algorithms are parameterized with respect to size and the architectures are parameterized with respect to space-time trade-offs. The objective of this paper is to demonstrate that practical, time-adaptive singular value decomposition can be implemented on a parallel processor array using Cordic arithmetic and asynchronous communication, such that any degree of parallelism, from single-processor implementation up to full-size array implementation, is supported by a 'universal' processing unit. This result is the product of judicious application of transformations in the combined algorithm and architecture space.
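
For reference, a plain one-sided Jacobi SVD is sketched below in ordinary floating point; column pairs that involve disjoint columns could be rotated concurrently, which is the structure that makes Jacobi methods attractive for processor arrays. This is a generic serial sketch, not the CORDIC-based, time-adaptive array implementation the paper describes.

```python
import numpy as np

# Plain one-sided Jacobi SVD sketch (Hestenes rotations). A reference serial
# version; pairs of disjoint columns could be processed concurrently.

def jacobi_svd(A, sweeps=30, tol=1e-12):
    A = A.astype(float).copy()
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = A[:, p] @ A[:, p]
                beta = A[:, q] @ A[:, q]
                gamma = A[:, p] @ A[:, q]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue                       # columns already orthogonal
                converged = False
                zeta = (beta - alpha) / (2.0 * gamma)
                t = (1.0 if zeta >= 0 else -1.0) / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                rot = np.array([[c, s], [-s, c]])
                A[:, [p, q]] = A[:, [p, q]] @ rot  # orthogonalize the pair
                V[:, [p, q]] = V[:, [p, q]] @ rot  # accumulate right vectors
        if converged:
            break
    sigma = np.linalg.norm(A, axis=0)
    order = np.argsort(sigma)[::-1]
    U = A[:, order] / sigma[order]
    return U, sigma[order], V[:, order]

M = np.random.default_rng(2).standard_normal((6, 4))
U, s, V = jacobi_svd(M)
print(np.allclose(U * s @ V.T, M))                        # reconstructs M
print(np.allclose(s, np.linalg.svd(M, compute_uv=False)))  # matches LAPACK
```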

Proceedings ArticleDOI
02 Oct 1994
TL;DR: Presents an adaptive nonlinear estimation technique (polynomial model-based) that has guaranteed stability and makes parsimonious use of coefficients, thereby achieving optimal, or close to optimal, performance with reduced computational complexity when compared to the adaptive Volterra filters.
Abstract: Presents an adaptive nonlinear estimation technique (polynomial model-based) that has guaranteed stability and makes parsimonious use of coefficients, thereby achieving optimal, or close to optimal, performance with reduced computational complexity when compared to the adaptive Volterra filters. Additionally, the suggested structure exhibits a high degree of parallelism, which makes it suitable for VLSI implementation.

Proceedings ArticleDOI
A. Genco
26 Jan 1994
TL;DR: The study tries to improve and combine different approaches that are able to speed up applications of the Simulated Annealing model and investigates separately two main aspects concerning the degree of parallelism an implementation can effectively exploit at the initial and final periods of an execution.
Abstract: The study described in this paper tries to improve and combine different approaches that are able to speed up applications of the Simulated Annealing model. It investigates separately two main aspects concerning the degree of parallelism an implementation can effectively exploit at the initial and final periods of an execution. As case studies, it deals with two implementations: the job shop scheduling problem and the portfolio selection problem. The paper reports the results of a large number of experiments, carried out by means of a transputer network and a hypercube system. They give useful suggestions about selecting the most suitable values of the intervention parameters to achieve superlinear speedups.

Journal ArticleDOI
TL;DR: Three major improvements are made to the design of a classical data flow based logic simulation accelerator that can be utilized by a data flow architecture to reduce the enormous simulation times.
Abstract: The high degree of parallelism in the simulation of digital VLSI systems can be utilized by a data flow architecture to reduce the enormous simulation times. The existing logic simulation accelerators based on the data flow principle use a static data flow architecture along with a timing wheel mechanism to implement the event driven simulation algorithm. The drawback in this approach is that the timing wheel becomes a bottleneck to high simulation throughput. Other shortcomings of the existing architecture are the high communication overhead in the arbitration and distribution networks, and reduced pipelining due to a static data flow architecture. To overcome these, three major improvements are made to the design of a classical data flow based logic simulation accelerator. These include:

Proceedings ArticleDOI
02 Oct 1994
TL;DR: In this paper, a new parallel extended local feature extraction method that can be implemented on a distributed-memory machine is proposed, and an efficient algorithm capable of exploiting a high degree of parallelism is developed.
Abstract: Feature extraction is the most important phase in object recognition because accuracy of the system relies on how well the features are extracted. In this paper, a new parallel extended local feature extraction method is proposed that can be implemented on a distributed-memory machine. In order to reduce the complexity in the extended local feature extraction, an efficient algorithm is developed which is capable of exploiting a high degree of parallelism. Our parallel algorithm is implemented and tested on an Intel iPSC/2 hypercube computer. Some resulting figures and execution times for various numbers of nodes and object features are presented.