
Showing papers on "Degree of parallelism published in 1994"


Journal ArticleDOI
TL;DR: A new algorithm for passively estimating the ranges and bearings of multiple narrow-band sources using a uniform linear sensor array is presented, which reduces the global 2D search over range and bearing to 2(m−1) independent 1D searches.
Abstract: A new algorithm for passively estimating the ranges and bearings of multiple narrow-band sources using a uniform linear sensor array is presented. The algorithm is computationally efficient and converges globally. It minimizes the MUSIC cost function subject to geometrical constraints imposed by the curvature of the received wavefronts. The estimation problem is reduced to one of solving a set of two coupled 2D polynomial equations. The proposed algorithm solves this nonlinear problem using a modification of the path-following (or homotopy) method. For an array having m sensors, the algorithm reduces the global 2D search over range and bearing to 2(m−1) independent 1D searches. This imparts a high degree of parallelism that can be exploited to obtain source location estimates very efficiently.

126 citations


Journal ArticleDOI
TL;DR: A general-purpose fuzzy processor is presented, the core of which is based on an analog-numerical approach combining the inherent advantages of analog and digital implementations, above all as regards noise margins.
Abstract: In this paper we present a design for a general-purpose fuzzy processor, the core of which is based on an analog-numerical approach combining the inherent advantages of analog and digital implementations, above all as regards noise margins. The architectural model proposed was chosen in such a way as to obtain a processor capable of working with a considerable degree of parallelism. The internal structure of the processor is organized as a cascade of pipeline stages which perform parallel execution of the processes into which each inference can be decomposed. A particular feature of the project is the definition of a 'fuzzy-gate', which executes elementary fuzzy computations, on which construction of the whole core of the processor is based. Designed using CMOS technology, the core can be integrated into a single chip and can easily be extended. The obtainable performance, on the order of 50 mega fuzzy rules per second, is considerable.
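
The "fuzzy-gate" above is a hardware unit executing elementary fuzzy computations. As a purely illustrative software sketch (not the paper's design), the following Python fragment shows the kind of min/max operations a Mamdani-style rule evaluation performs over discretized membership functions; the membership shapes, the rule, and the inputs are all hypothetical.

```python
import numpy as np

# Illustrative only: the elementary min/max operations of a Mamdani-style
# fuzzy rule evaluation over discretized membership functions. The membership
# functions and the rule are hypothetical, not the paper's hardware design.

u = np.linspace(0.0, 1.0, 101)           # discretized universe of discourse

def tri(u, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return np.clip(np.minimum((u - a) / (b - a), (c - u) / (c - b)), 0.0, 1.0)

low    = tri(u, 0.0, 0.2, 0.4)
high   = tri(u, 0.6, 0.8, 1.0)
medium = tri(u, 0.3, 0.5, 0.7)

def fire_rule(x, y):
    """Hypothetical rule: IF x is LOW AND y is HIGH THEN z is MEDIUM."""
    w = min(np.interp(x, u, low), np.interp(y, u, high))   # fuzzy AND = min
    return np.minimum(w, medium)                           # implication = clip

# Aggregating several rules would be an element-wise max over their outputs;
# a single rule is shown here, followed by centroid defuzzification.
out = fire_rule(x=0.2, y=0.8)
z = np.sum(u * out) / np.sum(out)
print(f"defuzzified output: {z:.3f}")
```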

63 citations


Patent
Kevin Harney
20 Jul 1994
TL;DR: In this paper, a three-level prioritization scheme is used to handle the input/output data stream to improve the throughput of the processor, including provisions for distinguishing between same-priority events occurring at different times, and ensuring that in such cases the requested operations occur in the same temporal order as the respective requests.
Abstract: Features which support conditional execution and sequencing are employed in concert with a centralized-control, single-instruction, multiple data integrated video signal processor, thus adapting efficiently to the high degree of parallelism inherent in this type of video signal processing system. A three-level prioritization scheme is used to handle the input/output data stream to improve the throughput of the processor, including provisions for distinguishing between same-priority events occurring at different times, and ensuring that in such cases the requested operations occur in the same temporal order as the respective requests.
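
The ordering property described above (higher-priority requests granted first, equal-priority requests granted in arrival order) can be modeled in a few lines. The sketch below is a hypothetical software analogue; the three priority levels, the request names, and the queue itself are illustrative, not the patent's hardware scheme.

```python
import heapq
import itertools

# Hypothetical software analogue of the ordering property described above:
# requests are granted highest-priority first, and requests of equal priority
# are granted in the temporal order in which they were made.

class RequestQueue:
    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()      # monotonically increasing "time"

    def request(self, priority, operation):
        # lower number = higher priority; the arrival counter breaks ties so
        # same-priority events keep their temporal order
        heapq.heappush(self._heap, (priority, next(self._arrival), operation))

    def grant(self):
        _, _, operation = heapq.heappop(self._heap)
        return operation

q = RequestQueue()
q.request(2, "host transfer")      # level 2 (lowest of the three levels)
q.request(0, "video input A")      # level 0 (highest)
q.request(1, "status readout")
q.request(0, "video input B")      # same priority as A, requested later

# grants: video input A, video input B, status readout, host transfer
print([q.grant() for _ in range(4)])
```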

49 citations


Book ChapterDOI
01 Jan 1994
TL;DR: It is argued that multi-coloring can be combined with multiple-step relaxation preconditioners to achieve a good level of parallelism while keeping the rates of convergence to good levels.
Abstract: The degree of parallelism in the preconditioned Krylov subspace method using standard preconditioners is limited and can lead to poor performance on massively parallel computers. In this paper we examine this problem and consider a number of alternatives based both on multi-coloring ideas and polynomial preconditioning. The emphasis is on methods that deal specifically with general unstructured sparse matrices such as those arising from finite element methods on unstructured grids. It is argued that multi-coloring can be combined with multiple-step relaxation preconditioners to achieve a good level of parallelism while keeping the rates of convergence to good levels. We also exploit the idea of multi-coloring and independent set orderings to introduce a multi-elimination incomplete LU factorization named ILUM, which is related to multifrontal elimination. The main goal of the paper is to discuss some of the prevailing ideas and to compare them on a few test problems.
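
As a minimal illustration of the multi-coloring idea the chapter builds on (not of the ILUM factorization itself), the sketch below greedily colors the adjacency graph of a sparse matrix so that coupled unknowns never share a color; all unknowns of one color can then be relaxed simultaneously. The test matrix and the relaxation sweep are generic assumptions.

```python
import numpy as np
import scipy.sparse as sp

# Greedy multi-coloring sketch: no two coupled unknowns share a color, so all
# unknowns of one color can be relaxed in parallel. Generic illustration only.

def greedy_coloring(A):
    A = A.tocsr()
    n = A.shape[0]
    colors = -np.ones(n, dtype=int)
    for i in range(n):
        neighbors = A.indices[A.indptr[i]:A.indptr[i + 1]]
        used = {colors[j] for j in neighbors if j != i and colors[j] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[i] = c
    return colors

def multicolor_relaxation_sweep(A, x, b, colors):
    """One Gauss-Seidel-like sweep; unknowns within a color update in parallel."""
    A = A.tocsr()
    d = A.diagonal()
    for c in range(colors.max() + 1):
        rows = np.where(colors == c)[0]
        r = b[rows] - A[rows, :] @ x     # residual using the other colors' values
        x[rows] += r / d[rows]
    return x

# 1D Laplacian test problem: a two-coloring (red-black) suffices
n = 8
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
colors = greedy_coloring(A)
x = multicolor_relaxation_sweep(A, np.zeros(n), np.ones(n), colors)
print(colors, x)
```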

42 citations


Proceedings ArticleDOI
10 Apr 1994
TL;DR: A multi-assignment language derived from the UNITY formalism is proposed to implement, with a high degree of parallelism, the controllers of the ArMen FPGA-multiprocessor.
Abstract: Embedding an FPGA circular array into MIMD architectures allows one to synthesize fine-grain circuits for global computation support. These circuits operate concurrently with the distributed applications. They provide specific speed-up or additional services, such as communication protocols or global controllers. This article describes an architectural model for such controllers, with practical examples implemented on the ArMen FPGA-multiprocessor. A multi-assignment language derived from the UNITY formalism is proposed to implement the controllers with a high degree of parallelism. Their hardware synthesis principles are given.

39 citations


Proceedings ArticleDOI
23 Sep 1994
TL;DR: Two methods for synthesis of VHDL specifications containing concurrent processes are presented to preserve simulation/synthesis correspondence during high-level synthesis and to produce hardware that operates with a high degree of parallelism.
Abstract: This paper presents two methods for synthesis of VHDL specifications containing concurrent processes. Our main objective is to preserve simulation/synthesis correspondence during high-level synthesis and to produce hardware that operates with a high degree of parallelism. The first method supports an unrestricted use of signals and wait statements and synthesizes synchronous hardware with global control of process synchronization for signal update. The second method allows hardware synthesis without the strict synchronization imposed by the VHDL simulation cycle. Experimental results have shown that the proposed methods are efficient for a wide spectrum of digital systems.

38 citations


Proceedings ArticleDOI
30 Nov 1994
TL;DR: The techniques presented in this paper, used in combination with prior work on reducing the height of data dependences, provide a comprehensive approach to accelerating loops with conditional exits.
Abstract: The performance of applications executing on processors with instruction level parallelism is often limited by control and data dependences. Performance bottlenecks caused by dependences can frequently be eliminated through transformations which reduce the height of critical paths through the program. While height reduction techniques are not always helpful, their utility can be demonstrated in a broad range of important situations. This paper focuses on the height reduction of control recurrences within loops with data-dependent exits. Loops with exits are transformed so as to alleviate performance bottlenecks resulting from control dependences. A compilation approach to effect these transformations is described. The techniques presented in this paper, used in combination with prior work on reducing the height of data dependences, provide a comprehensive approach to accelerating loops with conditional exits. In many cases, loops with conditional exits provide a degree of parallelism traditionally associated with vectorization. Multiple iterations of a loop can be retired in a single cycle on a processor with adequate instruction level parallelism, with no cost in code redundancy. In more difficult cases, height reduction requires redundant computation or may not be feasible.
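
The idea of retiring several iterations of a loop with a conditional exit at once can be illustrated at source level. The sketch below uses NumPy as a stand-in for the compiler transformation: the exit conditions of a block of iterations are evaluated together, and the first satisfied one is then located. The search loop is a hypothetical example, not taken from the paper.

```python
import numpy as np

# Conceptual illustration (in NumPy, standing in for a compiler transformation)
# of evaluating the exit conditions of several loop iterations at once.

data = np.random.default_rng(0).integers(0, 1000, size=10_000)
KEY = data[7321]          # guarantee the exit is eventually taken

# Original loop: one exit test per iteration (a control recurrence of height n)
def find_sequential(data, key):
    for i in range(len(data)):
        if data[i] == key:      # loop-exit condition
            return i
    return -1

# Height-reduced form: test a block of iterations together, then locate the
# first exit inside the block; the block's tests are independent of each other.
def find_blocked(data, key, block=8):
    for start in range(0, len(data), block):
        hits = data[start:start + block] == key    # "wide" exit test
        if hits.any():
            return start + int(np.argmax(hits))    # first iteration that exits
    return -1

assert find_sequential(data, KEY) == find_blocked(data, KEY)
```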

37 citations


Journal ArticleDOI
06 Jun 1994
TL;DR: It is shown that guarded repair can improve system performance and dependability significantly and a time-dependent optimality of dependable, parallel configurations can be determined from the results.
Abstract: Imperfect coverage and nonnegligible reconfiguration delay are known to have a deleterious effect on the dependability and the performance of a multiprocessor system. In particular, increasing the number of processor elements does not always increase dependability. An obvious reason for this is that the total failure rate increases, generally, linearly with the number of components in the system. It is also a well-known fact that the performance gain due to parallelism mostly turns out to be sublinear with the number of processors. It is therefore important to optimize the degree of parallelism in system design. A related issue is that by deferring repair, it is sometimes possible to improve system dependability. In this case decisions have to be made dynamically as to when to repair and when not to repair. Most of the current research deals with static optimization of the number of processors. No systematic approach for dynamic control of dependable systems has been proposed so far. Dynamic, i.e. transient, decision of whether or not to repair is the optimization problem considered in this paper. We propose extended Markov reward models (EMRM) to capture such questions. EMRM are a marriage between performability modeling techniques and Markov decision theory. A numerical solution procedure is developed to provide optimal solution trajectories for this problem. EMRM are a general framework for the dynamic optimization of reconfigurable, dependable systems. The optimization is applied on the basis of several performance and dependability measures. In particular, we explore availability, capacity-oriented availability, performance-oriented unavailability, and performability measures. Furthermore, off-line and on-line repair strategies are compared. We show that guarded repair can improve system performance and dependability significantly. The control strategies and reward functions differ considerably in each case. Each scenario turns out to be of interest in its own right. A time-dependent optimality of dependable, parallel configurations can be determined from our results.
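
To make the repair-versus-defer question concrete, the sketch below sets up a toy, discrete-time Markov decision problem for a two-processor system and solves it by backward induction. It is only a caricature of the paper's extended Markov reward models, and every rate, cost, and reward in it is a made-up assumption; its one faithful feature is that the optimal action depends on the remaining horizon, i.e. the optimum is time-dependent.

```python
from math import comb
import numpy as np

# Toy discrete-time repair-vs-defer decision for a 2-processor system (NOT the
# paper's EMRM formulation). Per step, each working processor fails with
# probability P_FAIL and an ongoing repair completes with probability P_REPAIR;
# the per-step reward is the number of working processors (capacity) minus a
# cost while repairing. All numbers are hypothetical.
P_FAIL, P_REPAIR, REPAIR_COST, HORIZON = 0.05, 0.3, 1.4, 40

def next_state_dist(k, repairing):
    """Distribution over the number of working processors after one step."""
    dist = {}
    for fails in range(k + 1):
        p = comb(k, fails) * P_FAIL**fails * (1 - P_FAIL)**(k - fails)
        nxt = k - fails
        if repairing and k < 2:                  # a down processor may return
            dist[nxt + 1] = dist.get(nxt + 1, 0.0) + p * P_REPAIR
            dist[nxt] = dist.get(nxt, 0.0) + p * (1 - P_REPAIR)
        else:
            dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

V = np.zeros(3)                                  # terminal values, states 0..2
policy = []                                      # policy[t]: t+1 steps remain
for _ in range(HORIZON):                         # backward induction
    newV, decision = np.zeros(3), {}
    for k in range(3):
        q = {}
        for repairing in (False, True):
            if repairing and k == 2:
                continue                         # nothing to repair
            reward = k - (REPAIR_COST if repairing else 0.0)
            q[repairing] = reward + sum(p * V[n] for n, p in
                                        next_state_dist(k, repairing).items())
        decision[k] = max(q, key=q.get)
        newV[k] = q[decision[k]]
    V = newV
    policy.append(decision)

print("one processor down, many steps remaining -> repair?", policy[-1][1])
print("one processor down, one step remaining   -> repair?", policy[0][1])
```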

31 citations


Journal ArticleDOI
TL;DR: This paper presents a parallel algorithm for solving the region growing problem based on the split-and-merge approach, and uses it to test and compare various parallel architectures and programming models.

31 citations


Proceedings ArticleDOI
23 May 1994
TL;DR: New scalable interaction paradigms and their embodiment in a time- and space-efficient debugger with scalable performance are presented, making it easier to debug and understand message-passing programs.
Abstract: Developers of message-passing codes on massively parallel systems have to contend with difficulties that data-parallel programmers do not face, not the least of which is debuggers that do not scale with the degree of parallelism. In this paper, we present new scalable interaction paradigms and their embodiment in a time- and space-efficient debugger with scalable performance. The debugger offers scalable expression, execution, and interpretation of all debugging operations, making it easier to debug and understand message-passing programs.

18 citations


01 Nov 1994
TL;DR: A framework for bandwidth reduction and tridiagonalization algorithms for symmetric banded matrices is developed, which leads to algorithms that require fewer floating-point operations, allow for space-time tradeoffs, enable the use of block orthogonal transformations, and increase the degree of parallelism inherent in the algorithm.
Abstract: This paper develops a framework for bandwidth reduction and tridiagonalization algorithms for symmetric banded matrices. The algorithm family includes the algorithms by Rutishauser and Schwarz, which underlie the EISPACK and LAPACK implementations, and the algorithm recently proposed by Lang. The framework leads to algorithms that require fewer floating-point operations, allow for space-time tradeoffs, enable the use of block orthogonal transformations, and increase the degree of parallelism inherent in the algorithm.

Journal ArticleDOI
TL;DR: A simple rule of thumb is given for choosing the degree of parallelism that maximizes the throughput of the hybrid hash algorithm, and, in the case of Grace join, asymptotic conditions on the amount of skew under which a limit on parallelism exists are established.

Journal ArticleDOI
01 Sep 1994-Lethaia
TL;DR: The high degree of parallelism, combined with abundant symplesiomorphic characters, led to erroneous phylogenetic inferences when non-biotic data were excluded from analysis.
Abstract: Webb, G.E. 1994 10 15: Parallelism, non-biotic data and phylogeny reconstruction in paleobiology. Many systematists equate parallelism and convergence. However, whereas convergence is relatively uncommon and easily recognized using divergent characters, parallelism is common but more difficult to recognize because divergent characters are less abundant. Cladists, in particular, equate homeomorphy with convergence and reject parallelism as a distinct concept. Unfortunately, cladistic parsimony analysis may not resolve most parallelism. Therefore, criteria for the a priori recognition and objective evaluation of parallelism are very significant. Non-biotic data (e.g., stratigraphic and geographic distribution) provide independent criteria for the construction of hypotheses of parallelism in cases where taxa (1) were geographically isolated during homeomorphic character-state transformations, (2) occurred with endemic faunas, and (3) evolved in similar environmental conditions as suggested by paleoecological data. Australian lithostrotionoid corals were long considered congeneric with European taxa. However, because of their geographic isolation, occurrence with endemic rugose corals and occurrence in similar depositional environments as European forms, they are now considered a homeomorphic clade, resulting from an extended sequence of parallel character-state transformations. The high degree of parallelism, combined with abundant symplesiomorphic characters, led to erroneous phylogenetic inferences when non-biotic data were excluded from analysis. Cladistics, homeomorphy, lithostrotionoid corals, parallelism, phylogeny.

Proceedings ArticleDOI
03 Nov 1994
TL;DR: This work gives a simple specification of a scheduler and presents three delay-insensitive implementations, one of which contains a high degree of parallelism and is simpler than previously proposed implementations.
Abstract: The committee problem involves the specification and design of a scheduler for committee meetings. It is a general resource allocation problem that combines both synchronization and mutual exclusion. We give a simple specification of a scheduler and present three delay-insensitive implementations. Our last implementation contains a high degree of parallelism and is simpler than previously proposed implementations.

Proceedings ArticleDOI
18 May 1994
TL;DR: This paper illustrates a formal technique for describing the timing properties and resource constraints of pipelined superscalar processor instructions at high level using ACSR, and describes how to derive the temporal behavior of an assembly program using the ACSR laws.
Abstract: This paper illustrates a formal technique for describing the timing properties and resource constraints of pipelined superscalar processor instructions at a high level. Superscalar processors can issue and execute multiple instructions simultaneously. The degree of parallelism depends on the multiplicity of hardware functional units as well as data dependencies among instructions. Thus, the timing properties of a superscalar program are difficult to analyze and predict. We describe how to model the instruction-level architecture of a superscalar processor using ACSR and how to derive the temporal behavior of an assembly program using the ACSR laws. The salient aspect of ACSR is that the notions of time, resources and priorities are supported directly in the algebra. Our approach is to model superscalar processor registers as ACSR resources, instructions as ACSR processes, and use ACSR priorities to achieve maximum possible instruction-level parallelism.

Journal ArticleDOI
TL;DR: A parallel algorithm, called PARALLEX, which uses a conflict resolving method, has been developed for the switchbox routing problem in a parallel processing environment; the speed-ups for 7- and 19-net problems were 4.7 and 10, respectively.
Abstract: A parallel algorithm, called PARALLEX, which uses a conflict resolving method, has been developed for the switchbox routing problem in a parallel processing environment. PARALLEX can achieve a very high degree of parallelism by generating as many processes as nets. Each process is assigned to route a net, which bears the same identification number as the process. If conflicts are found for the current route of a net, then that process classifies the set(s) of conflict segments into groups that are identified by the various types of conflict(s) within each group. Each process with conflicts finds partial solutions by resolving every conflict of a group in the path-finding procedure and merges them with the solutions from other processes, which may or may not have conflicts, to make a conflict-free switchbox. The speed-ups for 7- and 19-net problems were 4.7 and 10, respectively.

15 Dec 1994
TL;DR: A systematic parameter-based method, called the General Parameter Method (GPM), to design optimal, lower-dimensional processor arrays for uniform dependence algorithms has been developed, and it can be found that the system yield improves with the area of the coprocessor when chip yield decreases as the inverse square of the clock frequency.
Abstract: With the continuing growth of VLSI technology, special-purpose parallel processors have become a promising approach in the quest for high performance. Fine-grained processor arrays have become popular as they are suitable for solving problems with a high degree of parallelism, and can be inexpensively built using custom designs or commercially available field programmable gate arrays (FPGAs). Such specialised designs are often required in portable computing and communication systems with real-time constraints, as software-controlled processors often fail to provide the necessary throughput. This thesis addresses many issues in designing such application-specific systems built with fine-grained processor arrays for regular recursive uniform dependence algorithms. A uniform dependence algorithm consists of a set of indexed computations and a set of uniform dependence vectors which are independent of the indices of computations. Many important applications in signal/image processing, communications, and scientific computing can be formulated as uniform dependence algorithms. The first part of this thesis addresses the problem of designing algorithm-specific processor arrays. A systematic parameter-based method, called the General Parameter Method (GPM), to design optimal, lower-dimensional processor arrays for uniform dependence algorithms has been developed. The GPM can be used to derive optimal arrays for any user-specified objective expressed in terms of the parameters. The proposed approach employs an efficient search technique to explore the design space and arrive at the optimal designs. The GPM can be used to find optimal designs in the dependence-based methods using the equivalence between the parameter-based and dependence-based methods. The GPM has also been extended to derive optimal two-level pipelined algorithm-specific processor arrays. Such two-level pipelined arrays can be clocked at higher rates than can nonpipelined designs for real-time applications. The second part of this thesis presents a parallel VLSI architecture for a general-purpose coprocessor for uniform dependence algorithms. The architecture consists of a linear array of processors and a linear chain of buffer memories organized as FIFO queues to store the buffered data. Such an architecture is advantageous from the point of view of scalability and wafer-level integration. A distinguishing feature is the assumption of a limited-bandwidth interface to external memory modules for accessing the data. Such an assumption allows the coprocessor to be integrated easily into existing systems. Efficient techniques to partition the dependence graph into blocks, sequence the blocks through the buffer memory to reduce the number of data accesses to main memory, and map the blocks using GPM have been developed. An important result obtained is the square-root relationship between clock-rate reduction and area of the coprocessor under fixed main-memory bandwidth. From the square-root relationship, it can be found that the system yield improves with the area of the coprocessor when chip yield decreases as the inverse square of the clock frequency. Results on matrix-product and transitive-closure applications indicate that the coprocessor can be used to deliver higher speedup or lower clock rate than a reference one-processor design. Thus, the coprocessor can be used as a general-purpose back-end accelerator for loop-based matrix algorithms.
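
Stated symbolically (a hedged reading of the two claims above, with A_c the coprocessor area, f the required clock frequency, and Y the chip yield): the fixed main-memory bandwidth gives f ∝ 1/√A_c, and combining this with Y ∝ 1/f² gives Y ∝ A_c from the frequency dependence alone, which is the direction of improvement stated above; a full account would also weigh the usual decrease of yield with die area.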

Proceedings ArticleDOI
01 May 1994
TL;DR: A novel algorithm, designated as Fast Invariant Imbedding algorithm, is developed, which offers a massive degree of parallelism with simple communication and synchronization requirements and two massively parallel, algorithmically specialized, architectures for low-cost and optimal implementation of this algorithm.
Abstract: Massively parallel algorithms and architectures for real-time wavefront control of a dense adaptive optic system (SELENE) are presented. We have already shown that the computation of a near optimal control algorithm for SELENE can be reduced to the solution of a discrete Poisson equation on a regular domain. Although this represents an optimal computation, due to the large size of the system and the high sampling rate requirement, the implementation of this control algorithm poses a computationally challenging problem since it demands a sustained computational throughput of the order of 10 GFlops. We develop a novel algorithm, designated as Fast Invariant Imbedding algorithm, which offers a massive degree of parallelism with simple communication and synchronization requirements. We also discuss two massively parallel, algorithmically specialized, architectures for low-cost and optimal implementation of the Fast Invariant Imbedding algorithm.
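
Since the abstract reduces the control computation to a discrete Poisson equation on a regular domain, a generic sketch of that computational core is shown below. It uses a plain Jacobi sweep (not the Fast Invariant Imbedding algorithm), chosen only because every grid point updates independently and thus exposes the massive parallelism mentioned above; the grid size and right-hand side are hypothetical.

```python
import numpy as np

# Generic illustration of the computational core named above: solving a
# discrete Poisson equation on a regular grid with a plain Jacobi iteration.
# Every interior point in a sweep updates independently of the others.

def jacobi_poisson(f, h, iters=500):
    """Solve -laplacian(u) = f on a square grid with u = 0 on the boundary."""
    u = np.zeros_like(f)
    for _ in range(iters):
        unew = u.copy()
        unew[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                   u[1:-1, :-2] + u[1:-1, 2:] +
                                   h * h * f[1:-1, 1:-1])
        u = unew
    return u

n, h = 33, 1.0 / 32
f = np.ones((n, n))              # hypothetical right-hand side (actuator demands)
u = jacobi_poisson(f, h)
print(u[n // 2, n // 2])         # centre value of the computed correction
```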

Journal ArticleDOI
TL;DR: This paper describes the development of a context-sensitive compiler for pattern-matching languages using the searching power of massively parallel associative computers; the compilation of production rules into equivalent procedural rules is completely data parallel.
Abstract: The searching power of massively parallel associative computers is an underused and underinvestigated capability that can be used to facilitate software development. This paper describes the development of a context-sensitive compiler for pattern-matching languages using that searching power. The described compiler was implemented on the STARAN parallel computer, and the compiled OPS5 programs were also executed on the STARAN, obtaining an estimated throughput of 6000 rules per second. The described compilation of production rules into equivalent procedural rules is completely data parallel, with the degree of parallelism depending on the number of tokens in the program being compiled. During any one step of the context-sensitive analysis, the entire program is processed in constant time.

Proceedings ArticleDOI
G. Privat, K. Goser
26 Sep 1994
TL;DR: Fuzzy relaxation labelling is presented as a feasibility example for such a streamlined hardware implementation of a cellular automata network that combines the flexibility of numeric processing with the explicitness and transparency of logic rules.
Abstract: Expressing with fuzzy logic the local transition functions of a cellular automata network combines the flexibility of numeric processing with the explicitness and transparency of logic rules, within the framework of an emergent-cooperative computational model. Implementing fuzzy processing elements as the nodes of such nets places heavy emphasis on fine granularity, and is feasible only in analog form if the aim is to achieve the degree of parallelism matched to potential applications in image processing. Fuzzy relaxation labelling is presented as a feasibility example for such a streamlined hardware implementation. This computational model is potentially applicable in a wide range of applications, drawing maximal benefit from advanced VLSI technologies.

Patent
04 Nov 1994
TL;DR: In this article, the authors propose to increase the degree of parallelism of arithmetic processing operation in the VLIW system to improve the bit use efficiency of instructions and the use of hardware resources of a processor.
Abstract: PURPOSE: To increase the degree of parallelism of arithmetic processing in a VLIW system, thereby improving the bit-use efficiency of instructions and the utilization of a processor's hardware resources. CONSTITUTION: An instruction 10 consists of plural operation fields op1 to op4 and control subfields F1 to F4 corresponding to the respective operation fields. When the instruction 10 contains a branch operation, the operation to be executed differs depending on whether the branch succeeds or fails. In this case, the operation results are adopted or rejected based on the contents of the control subfields F1 to F4 and a flag indicating the success or failure of the branch. Thus, the operations for branch success and for branch failure are included in one instruction 10 and processed in parallel, and the number of instructions is reduced.
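
A small software model of the behavior described above may help: operations for both the taken and the not-taken branch path travel in one wide instruction, and each operation field's result is adopted or rejected from its control subfield together with the branch-outcome flag. The field names and example operations below are hypothetical, not taken from the patent.

```python
# Software model of the behaviour described above: one VLIW-style instruction
# carries operations for both branch outcomes, and each result is adopted or
# rejected according to its control subfield plus the branch-outcome flag.

TAKEN, NOT_TAKEN, ALWAYS = "taken", "not_taken", "always"

def execute_wide_instruction(ops, branch_taken, regs):
    # 1) execute every operation field in parallel (order does not matter here)
    results = [(dest, fn(regs), cond) for dest, fn, cond in ops]
    # 2) adopt or reject each result based on its control subfield and the flag
    for dest, value, cond in results:
        if cond == ALWAYS or (cond == TAKEN) == branch_taken:
            regs[dest] = value
    return regs

regs = {"r1": 10, "r2": 3, "r3": 0, "r4": 0}
instruction = [
    ("r3", lambda r: r["r1"] + r["r2"], TAKEN),      # op1: used if branch succeeds
    ("r3", lambda r: r["r1"] - r["r2"], NOT_TAKEN),  # op2: used if branch fails
    ("r4", lambda r: r["r1"] * 2,        ALWAYS),    # op3: unconditional
]

print(execute_wide_instruction(instruction, branch_taken=True,  regs=dict(regs)))
print(execute_wide_instruction(instruction, branch_taken=False, regs=dict(regs)))
```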

Book ChapterDOI
04 Jul 1994
TL;DR: A parallel simulated annealing algorithm that constructs and uses joint transformations is presented for mapping irregular parallel programs onto homogeneous processor arrays with regular topology.
Abstract: This paper presents a parallel simulated annealing algorithm for solving the problem of mapping irregular parallel programs onto homogeneous processor arrays with regular topology. The algorithm constructs and uses joint transformations. These transformations guarantee a high degree of parallelism that is bounded below by ⌈|Np| / (deg(Gp) + 1)⌉, where |Np| is the number of task nodes in the mapped program graph Gp and deg(Gp) is the maximal degree of a node in Gp. The mapping algorithm provides good program mappings (in terms of program execution time and the number of processors used) in a reasonable number of steps.

Proceedings ArticleDOI
26 Oct 1994
TL;DR: This paper presents efficient mappings of large sparse neural networks on a distributed-memory MIMD multicomputer with high performance vector units and shows that vectorization can nevertheless more than quadruple the performance on the authors' modeled supercomputer.
Abstract: This paper presents efficient mappings of large sparse neural networks on a distributed-memory MIMD multicomputer with high performance vector units. We develop parallel vector code for an idealized network and analyze its performance. Our algorithms combine high performance with a reasonable memory requirement. Due to the high cost of scatter/gather operations, generating high performance parallel vector code requires careful attention to details of the representation. We show that vectorization can nevertheless more than quadruple the performance on our modeled supercomputer. Pushing several patterns at a time through the network (batch mode) exposes an extra degree of parallelism which allows us to improve the performance by an additional factor of 4. Vectorization and batch updating therefore yield an order of magnitude performance improvement.
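
The batch-mode observation can be illustrated directly: pushing several patterns through a layer at once turns a matrix-vector product into a matrix-matrix product, which is what exposes the extra degree of parallelism. The sketch below uses a dense layer for brevity (the paper's networks are sparse), and the sizes are arbitrary assumptions.

```python
import numpy as np

# Illustration of the batch-mode point made above: propagating several input
# patterns at once turns a matrix-vector product into a matrix-matrix product.
# A dense layer stands in for the paper's sparse networks; sizes are arbitrary.

rng = np.random.default_rng(1)
n_in, n_out, batch = 2048, 512, 64
W = rng.standard_normal((n_out, n_in))
patterns = rng.standard_normal((n_in, batch))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# one pattern at a time: 'batch' separate matrix-vector products
one_by_one = np.stack([sigmoid(W @ patterns[:, j]) for j in range(batch)], axis=1)

# batch mode: a single matrix-matrix product over all patterns
batched = sigmoid(W @ patterns)

assert np.allclose(one_by_one, batched)
```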

Journal ArticleDOI
01 Dec 1994
TL;DR: This paper presents an alternative course, designed as an elective in computer architecture for upper level undergraduate or graduate students, that presents a side-by-side comparison of von Neumann and data flow architectures.
Abstract: Most computer architecture courses are geared toward the classical von Neumann style of computer architectures, mentioning only in passing other models such as data flow computation. This is unfortunate, due to the high degree of parallelism possible using data flow. We present an alternative course, designed as an elective in computer architecture for upper level undergraduate or graduate students, that presents a side-by-side comparison of von Neumann and data flow architectures. Our teaching environment is based on Simple Arithmetic SISAL (SAS), a subset of the applicative programming language SISAL, which we designed for both teaching about and research into data flow architectures. SAS runs in a highly integrated environment, allowing students to implement their program on a von Neumann architecture, then observe its execution through a data flow simulator. The environment runs on a standard IBM-style personal computer, providing a cost-effective platform for presenting the course.

01 Jan 1994
TL;DR: This paper presents Snyder's XYZ levels, a model for concurrent evaluation that allows a much higher degree of parallelism to be achieved, and states that optimizing compiler technology is particularly applicable at this level.
Abstract: ion Levels Talk about Snyder’s XYZ levels. Aside from terminology, we are thinking of the same ideas. The highest level is composition. Snyder discusses phase composition at the Z, or problem, level. In our scheme, a single process creates specifies a computation by brokering the services of existing objects. These objects are concurrent/parallel in nature. The next level is concurrent evaluation. The objects brokered by the composition level cause the creation of itinerant actors. Itinerant actors are used to do one of the following: • process coordination • object coordination • macro-dataflow (with virtual pattern matching) • distributed and shared data structures (all objects) Snyder speaks of a Y level, wherein a phase composes process units to achieve a parallel computation. In our model, we use the term process somewhat differently. Our processes are lightweight at this level and partially ordered as in dataflow. This allows a much higher degree of parallelism to be achieved, since we do not rely on operating system mechanisms to manage lightweight processes (in other words, we do not use threads either). Snyder speaks of the X level, wherein a process composes sequential program units in a single address space. We are consistent in this interpretation at our lowest level. While we do not address this in detail in this paper, there is nothing to stop you from exploiting parallelism at this level. This is the level where we believe optimizing compiler technology is particularly applicable.

Proceedings ArticleDOI
19 Apr 1994
TL;DR: This paper demonstrates that practical, time-adaptive singular value decomposition can be implemented on a parallel processor array using Cordic arithmetic and asynchronous communication, such that any degree of parallelism, from single-processor implementation up to full-size array implementation, is supported by a 'universal' processing unit.
Abstract: Implementing Jacobi algorithms in parallel processor arrays is a non-trivial task, in particular when the algorithms are parameterized with respect to size and the architectures are parameterized with respect to space-time trade-offs. The objective of this paper is to demonstrate that practical, time-adaptive singular value decomposition can be implemented on a parallel processor array using Cordic arithmetic and asynchronous communication, such that any degree of parallelism, from single-processor implementation up to full-size array implementation, is supported by a 'universal' processing unit. This result is the product of judicious application of transformations in the combined algorithm and architecture space.
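
For reference, a plain one-sided Jacobi SVD is sketched below in ordinary floating point; column pairs that involve disjoint columns could be rotated concurrently, which is the structure that makes Jacobi methods attractive for processor arrays. This is a generic serial sketch, not the CORDIC-based, time-adaptive array implementation the paper describes.

```python
import numpy as np

# Plain one-sided Jacobi SVD sketch (Hestenes rotations). A reference serial
# version; pairs of disjoint columns could be processed concurrently.

def jacobi_svd(A, sweeps=30, tol=1e-12):
    A = A.astype(float).copy()
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(sweeps):
        converged = True
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = A[:, p] @ A[:, p]
                beta = A[:, q] @ A[:, q]
                gamma = A[:, p] @ A[:, q]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue                       # columns already orthogonal
                converged = False
                zeta = (beta - alpha) / (2.0 * gamma)
                t = (1.0 if zeta >= 0 else -1.0) / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                rot = np.array([[c, s], [-s, c]])
                A[:, [p, q]] = A[:, [p, q]] @ rot  # orthogonalize the pair
                V[:, [p, q]] = V[:, [p, q]] @ rot  # accumulate right vectors
        if converged:
            break
    sigma = np.linalg.norm(A, axis=0)
    order = np.argsort(sigma)[::-1]
    U = A[:, order] / sigma[order]
    return U, sigma[order], V[:, order]

M = np.random.default_rng(2).standard_normal((6, 4))
U, s, V = jacobi_svd(M)
print(np.allclose(U * s @ V.T, M))                        # reconstructs M
print(np.allclose(s, np.linalg.svd(M, compute_uv=False)))  # matches LAPACK
```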

Proceedings ArticleDOI
02 Oct 1994
TL;DR: Presents an adaptive nonlinear estimation technique (polynomial model-based) that has guaranteed stability and makes parsimonious use of coefficients, thereby achieving optimal, or close to optimal, performance with reduced computational complexity when compared to the adaptive Volterra filters.
Abstract: Presents an adaptive nonlinear estimation technique (polynomial model-based) that has guaranteed stability and makes parsimonious use of coefficients, thereby achieving optimal, or close to optimal, performance with reduced computational complexity when compared to the adaptive Volterra filters. Additionally, the suggested structure exhibits a high degree of parallelism, which makes it suitable for VLSI implementation.

Proceedings ArticleDOI
A. Genco
26 Jan 1994
TL;DR: The study tries to improve and combine different approaches that are able to speed up applications of the Simulated Annealing model and investigates separately two main aspects concerning the degree of parallelism an implementation can effectively exploit at the initial and final periods of an execution.
Abstract: The study described in this paper tries to improve and combine different approaches that are able to speed up applications of the Simulated Annealing model. It investigates separately two main aspects concerning the degree of parallelism an implementation can effectively exploit at the initial and final periods of an execution. As case studies, it deals with two implementations: the job shop scheduling problem and the portfolio selection problem. The paper reports the results of a large number of experiments, carried out by means of a transputer network and a hypercube system. They give useful suggestions about selecting the most suitable values of the intervention parameters to achieve superlinear speedups.

Journal ArticleDOI
TL;DR: Three major improvements are made to the design of a classical data flow based logic simulation accelerator that can be utilized by a data flow architecture to reduce the enormous simulation times.
Abstract: The high degree of parallelism in the simulation of digital VLSI systems can be utilized by a data flow architecture to reduce the enormous simulation times. The existing logic simulation accelerators based on the data flow principle use a static data flow architecture along with a timing wheel mechanism to implement the event driven simulation algorithm. The drawback in this approach is that the timing wheel becomes a bottleneck to high simulation throughput. Other shortcomings of the existing architecture are the high communication overhead in the arbitration and distribution networks, and reduced pipelining due to a static data flow architecture. To overcome these, three major improvements are made to the design of a classical data flow based logic simulation accelerator. These include:

Proceedings ArticleDOI
02 Oct 1994
TL;DR: In this paper, a new parallel extended local feature extraction method that can be implemented on a distributed-memory machine is proposed, and an efficient algorithm capable of exploiting a high degree of parallelism is developed.
Abstract: Feature extraction is the most important phase in object recognition because accuracy of the system relies on how well the features are extracted. In this paper, a new parallel extended local feature extraction method is proposed that can be implemented on a distributed-memory machine. In order to reduce the complexity in the extended local feature extraction, an efficient algorithm is developed which is capable of exploiting a high degree of parallelism. Our parallel algorithm is implemented and tested on an Intel iPSC/2 hypercube computer. Some resulting figures and execution times for various numbers of nodes and object features are presented.