Showing papers on "Degree of parallelism published in 1999"

PDF

Open Access

Proceedings Article•DOI•

Compiler-directed dynamic computation reuse: rationale and initial results

[...]

D.A. Conners¹, Wen-mei W. Hwu¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

16 Nov 1999

TL;DR: Initial results show that the compiler analysis can indeed identify large reuse regions and can improve the performance of a 6-issue microarchitecture by an average of 30% for a collection of SPEC and integer benchmarks.

...read moreread less

Abstract: Recent studies on value locality reveal that many instructions are frequently executed with a small variety of inputs. This paper proposes an approach that integrates architecture and compiler techniques to exploit value locality for large regions of code. The approach strives to eliminate redundant processor execution created by both instruction-level input repetition and recurrence of input data within high-level computations. In this approach, the compiler performs analysis to identify code regions whose computation can be reused during dynamic execution. The instruction set architecture provides a simple interface for the compiler to communicate the scope of each reuse region and its live-out register information to the hardware. During run time, the execution results of these reusable computation regions are recorded into hardware buffers for potential reuse. Each reuse can eliminate the execution of a large number of dynamic instructions. Furthermore, the actions needed to update the live-out registers can be performed at a higher degree of parallelism than the original code, breaking intrinsic dataflow dependence constraints. Initial results show that the compiler analysis can indeed identify large reuse regions. Overall, the approach can improve the performance of a 6-issue microarchitecture by an average of 30% for a collection of SPEC and integer benchmarks.

...read moreread less

82 citations

Journal Article•DOI•

One- and two-dimensional constant geometry fast cosine transform algorithms and architectures

[...]

J. Kwak, J. You

01 Jul 1999-IEEE Transactions on Signal Processing

TL;DR: This paper presents general radix one- and two-dimensional (1-D and 2-D) constant geometry fast cosine transform algorithms and architectures suitable for VLSI, owing to their regular structures.

...read moreread less

Abstract: This paper presents general radix one- and two-dimensional (1-D and 2-D) constant geometry fast cosine transform algorithms and architectures suitable for VLSI, owing to their regular structures. A constant geometry algorithm is obtained by shuffling the rows and columns of each decomposed DCT matrix that corresponds to a butterfly stage. The 1-D algorithm is derived, and then, it is extended to the 2-D case. Based on the derived algorithms, the architectures with a flexible degree of parallelism are discussed.

...read moreread less

25 citations

Patent•

Method and parallelizing geometric processing in a graphics rendering pipeline

[...]

Monty M. Denneau¹, Peter H. Hochschild¹, Henry S. Warren¹•Institutions (1)

IBM¹

10 Aug 1999

TL;DR: In this article, the geometric processing of an ordered sequence of graphics commands is distributed over a set of processors by the following steps: a first phase of processing is performed on the processors whereby, for each given subsequence Sj in the set of subsequences S0... SN−2, state vector Vj+1 is updated to represent state as if the graphics commands in Sj had been executed in sequential order.

...read moreread less

Abstract: The geometric processing of an ordered sequence of graphics commands is distributed over a set of processors by the following steps. The sequence of graphics commands is partitioned into an ordered set of N subsequences S0 . . . SN−1, and an ordered set of N state vectors V0 . . . VN−1 is associated with said ordered set of subsequences S0 . . . SN−1. A first phase of processing is performed on the set of processors whereby, for each given subsequence Sj in the set of subsequences S0 . . . SN−2, state vector Vj+1 is updated to represent state as if the graphics commands in subsequence Sj had been executed in sequential order. A second phase of the processing is performed whereby the components of each given state vector Vk in the set of state vectors V1 . . . VN−1 generated in the first phase is merged with corresponding components in the preceding state vectors V0 . . . Vk−1 such that the state vector Vk represents state as if the graphics commands in subsequences S0 . . . Sk−1 had been executed in sequential order. Finally, a third phase of processing is performed on the set of processors whereby, for each subsequence Sm in the set of subsequences S1 . . . SN−1, geometry operations for subsequence Sm are performed using the state vector Vm generated in the second phase. In addition, in the third phase, geometry operations for subsequence S0 are performed using the state vector V0. Advantageously, the present invention provides a mechanism that allows a large number of processors to work in parallel on the geometry operations of the three-dimensional rendering pipeline. Moreover, this high degree of parallelism is achieved with very little synchronization (one processor waiting from another) required, which results in increased performance over prior art graphics processing techniques.

...read moreread less

22 citations

Book Chapter•DOI•

Learning Rules from Distributed Data

[...]

Lawrence O. Hall¹, Nitesh V. Chawla¹, Kevin W. Bowyer¹, W. Philip Kegelmeyer²•Institutions (2)

University of South Florida¹, Sandia National Laboratories²

15 Aug 1999

TL;DR: It is shown that this approach compares to directly creating a rule set from the aggregate data and promises faster learning, and an approach has been developed to extend the degree of parallelism achieved before this problem takes over.

...read moreread less

Abstract: In this paper a concern about the accuracy (as a function of parallelism) of a certain class of distributed learning algorithms is raised, and one proposed improvement is illustrated.We focus on learning a single model from a set of disjoint data sets, which are distributed across a set of computers. The model is a set of rules. The distributed data sets may be disjoint for any of several reasons. In our approach, the first step is to construct a rule set (model) for each of the original disjoint data sets. Then rule sets are merged until an eventual final rule set is obtained which models the aggregate data. We show that this approach compares to directly creating a rule set from the aggregate data and promises faster learning. Accuracy can drop off as the degree of parallelism increases. However, an approach has been developed to extend the degree of parallelism achieved before this problem takes over.

...read moreread less

21 citations

Patent•

Method for computing the degree of parallelism in a multi-user environment

[...]

Juan Tellez¹, Benoit Dageville¹•Institutions (1)

Business International Corporation¹

22 Jul 1999

TL;DR: In this paper, a method and apparatus for computing degrees of parallelism for parallel operations in a computer system is provided, based on a set of factors, such as a target degree, a current workload and a requested degree.

...read moreread less

Abstract: A method and apparatus are provided for computing degrees of parallelism for parallel operations in a computer system. The degree of parallelism for a given parallel operation is computed based on a set of factors. The set of factors includes a target degree of parallelism that represents a desired total amount of parallelism in the computer system, a current workload and a requested degree of parallelism.

...read moreread less

20 citations

Book Chapter•DOI•

Parallel Execution for Speeding Up Inductive Logic Programming Systems

[...]

Hayato Ohwada¹, Fumio Mizoguchi¹•Institutions (1)

University of Tokyo¹

01 Dec 1999

TL;DR: This paper formalizes the ILP task as a generalized branch-and-bound search and proposes three methods of parallel executions for the optimal search and these methods are implemented in KL1, a parallel logic programming language, and are analyzed for execution speed and load balancing.

...read moreread less

Abstract: This paper describes a parallel algorithm and its implementation for a hypothesis space search in Inductive Logic Programming (ILP). A typical ILP system, Progol, regards induction as a search problem for finding a hypothesis, and an efficient search algorithm is used to find the optimal hypothesis. In this paper, we formalize the ILP task as a generalized branch-and-bound search and propose three methods of parallel executions for the optimal search. These methods are implemented in KL1, a parallel logic programming language, and are analyzed for execution speed and load balancing. An experiment on a benchmark test set was conducted using a shared memory parallel machine to evaluate the performance of the hypothesis search according to the number of processors. The result demonstrates that the statistics obtained coincide with the expected degree of parallelism.

...read moreread less

18 citations

Journal Article•DOI•

Automatically Partitioning Threads for Multithreaded Architectures

[...]

Xinan Tang¹, Guang R. Gao¹•Institutions (1)

University of Delaware¹

01 Aug 1999-Journal of Parallel and Distributed Computing

TL;DR: This paper proposes two heuristic thread-partitioning methods to solve a fundamental problem in compiling for multithreaded architectures, automatically partitioning a program into threads and shows that both partitioning algorithms are effective to generate efficient threaded code, and code generated by the compiler is comparable to hand-written code.

...read moreread less

18 citations

Proceedings Article•DOI•

Parallel string matching algorithms based on dataflow

[...]

Jin Hwan Park¹, K.M. George¹•Institutions (1)

Oklahoma State University–Stillwater¹

05 Jan 1999

TL;DR: Three parallel algorithms based on multiple input (and output) streams are presented and three different methods to design special purpose systolic array hardware for string matching are presented.

...read moreread less

Abstract: This paper presents efficient dataflow schemes for parallel string matching. Two subproblems known as the exact matching and the k-mismatches problems are covered. Three parallel algorithms based on multiple input (and output) streams are presented. Time complexities of these parallel algorithms are O((n/d)+/spl alpha/), O/spl les//spl alpha//spl les/m, where n and m represent lengths of reference and pattern strings (n/spl Gt/m) and d represents the number of streams used (the degree of parallelism). We can control the degree of parallelism by using variable number (d) of input (and output) streams. These performances are better than those found in the literature. These algorithms present three different methods to design special purpose systolic array hardware for string matching. With linear systolic array architecture, m PEs are needed for serial design and d*m PEs are needed for parallel design.

...read moreread less

18 citations

Journal Article•DOI•

Efficient parallel hardware algorithms for string matching

[...]

Jin Hwan Park¹, K. M. George¹•Institutions (1)

Oklahoma State University–Stillwater¹

01 Oct 1999-Microprocessors and Microsystems

TL;DR: Two subproblems known as the exact matching and the k-mismatches problems are covered and efficient parallel hardware algorithms for string matching are presented.

...read moreread less

18 citations

Book Chapter•DOI•

Optimal Degree of Parallelism and Integration in Design and Process Planning

[...]

Walter Eversheim¹, Iris Schulten¹•Institutions (1)

RWTH Aachen University¹

01 Jan 1999

TL;DR: Integration measures can be used with full effect on reducing time-to-market, trimming development and manufacturing cost and enhancing product quality, according to the developed evaluation methodology.

...read moreread less

Abstract: Efficiency in design and process planning is defined as measured contribution to goal fulfilment. In order to identify integration measures for design and process planning with highest effect on efficiency, an evaluation methodology has been developed. Examples of integration measures are early transmission and feedback of information, coordination and use of integrating methods as QFD or FMEA. The developed evaluation methodology supports the decision between alternative parallel and integrated procedures by estimating their efforts, benefits and risks. Applying the proposed evaluation methodology ensures an optimised degree of integration and parallelism in design and process planning. Thus integration measures can be used with full effect on reducing time-to-market, trimming development and manufacturing cost and enhancing product quality.

...read moreread less

11 citations

Journal Article•DOI•

Workload execution strategies and parallel speedup on clustered computers

[...]

Kenneth Hoganson¹•Institutions (1)

Kennesaw State University¹

01 Nov 1999-IEEE Transactions on Computers

TL;DR: A model of system performance for parallel processing on clustered multiprocessors is developed which unifies multiprogramming with speedup and scaled-speedup and heuristics are developed that relate cluster size to parallel fraction of a program and to process scaling factors.

...read moreread less

Abstract: A model of system performance for parallel processing on clustered multiprocessors is developed which unifies multiprogramming with speedup and scaled-speedup. The model is used to explore processor to process allocation alternatives for executing a workload consisting of multiple processes. Heuristics are developed that relate cluster size to parallel fraction of a program and to process scaling factors. The basic analytical model is made more sophisticated by incorporating considerations that affect the realizable speedup, including explicit process scaling, Degree of Parallelism (DOP) as a discrete function, and communication overhead. New developments incorporate nonuniform workload, interconnection network probability of acceptance of requests, nonuniform memory access, and multithreaded processes.

...read moreread less

Journal Article•DOI•

Efficient mapping reductions using iso-planes on the polytope model

[...]

Toomas P. Plaks

01 Jan 1999-Parallel Algorithms and Applications

TL;DR: The method presented here increases the available degree of parallelism and thus improves the time complexity of systolic computations.

...read moreread less

Abstract: This paper presents a new technique for mapping algorithms onto regular (systolic) arrays. The technique integrates the associativity and commutativity of computations into space-time transformations on the polytope model and involves three categories of transformations: ( 1) iso-planes - forming iso-planes of computations for algorithm representation in contrast to the conventional technique using the data dependence graph; ( 2) increase in dimensionality -mapping a low dimensional algorithm representation into a higher dimensional version with a higher degree of parallelism; and (3) pipestructures - generating and choosing a particular partial order of computations on iso-planes for moving data around the regular array. Three operations for generating pipestructures are introduced: permutation, rotation and reversal. The method presented here increases the available degree of parallelism and thus improves the time complexity of systolic computations. Examples for developing 2-D arrays for 1-D c...

...read moreread less

Book Chapter•DOI•

Parallel Programming with Interacting Processes

[...]

Peiyi Tang¹, Yoichi Muraoka²•Institutions (2)

University of Southern Queensland¹, Waseda University²

04 Aug 1999

TL;DR: It is argued that interacting processes (IP) with multiparty interactions are an ideal model for parallel programming and is a good candidate for the mainstream programming model for the both parallel and distributed computing in the future.

...read moreread less

Abstract: In this paper, we argue that interacting processes (IP) with multiparty interactions are an ideal model for parallel programming. The IP model with multiparty interactions was originally proposed by N. Francez and I. R. Forman [1] for distributed programming of reactive applications. We analyze the IP model and provide the new insights into it from the parallel programming perspective. We show through parallel program examples in IP that the suitability of the IP model for parallel programming lies in its programmability, high degree of parallelism and support for modular programming. We believe that IP is a good candidate for the mainstream programming model for the both parallel and distributed computing in the future.

...read moreread less

Journal Article•DOI•

A Robust, Parallel Homotopy Algorithm for the Symmetric Tridiagonal Eigenproblem

[...]

Michael H. Oettli

01 Jan 1999-SIAM Journal on Scientific Computing

TL;DR: A numerical comparison of a sequential implementation of the new homotopy approach for solving the symmetric eigenproblem shows that it is competitive with other well-established algorithms both in speed and accuracy.

...read moreread less

Abstract: The homotopy method has been used in the past for finding zeros of nonlinear functions. Early algorithms based on the homotopy approach for solving the symmetric eigenproblem suffer either from insufficient orthogonality between computed eigenvectors or from low efficiency. A new algorithm presented in this article overcomes these problems by using deflation of close eigenpairs and maintains a high degree of parallelism by applying the divide-and-conquer principle. A numerical comparison of a sequential implementation of the new algorithm shows that it is competitive with other well-established algorithms both in speed and accuracy. Algorithms based on the homotopy method are excellent candidates for multicomputers because of their natural parallelism. This article examines how the algorithm can be implemented on distributed-memory multicomputers. The experimental performance results obtained on an Intel Paragon multicomputer are very satisfying.

...read moreread less

On Flexible Allocation of Index and Temporary Data in Parallel Database Systems

[...]

Erhard Rahm, Holger Märtens, Thomas Stöhr

01 Jan 1999

TL;DR: The work discussed is performed within a project aiming at developing strategies to automatically determine optimal data allocation strategies in order to simplify system administration in high performance environments.

...read moreread less

Abstract: Data placement is a key factor for high performance database systems This is particularly true for parallel database systems where data allocation must support both I/O parallelism and processing parallelism within complex queries and between independent queries and transactions Determining an effective data placement is a complex administration problem depending on many parameters including system architecture, database and workload characteristics, hardware configuration, etc Research and tool support has so far concentrated on data placement for base tables, especially for Shared Nothing (SN), eg [MD97] On the other hand, to our knowledge, data placement issues for architectures where multiple DBMS instances share access to the same disks (Shared Disk, Shared Everything, specific hybrid architectures) have not yet been investigated in a systematic way Furthermore, little work has been published on effective disk allocation of index structures and temporary data (eg, intermediate query results) However, these allocation problems gain increasing importance, eg in order to effectively utilize parallel database systems for decision support / data warehousing environments In the next section we discuss the index allocation problem in more detail and introduce a classification of various approaches that are already supported to some degree in commercial DBMS While SN offers only few options, the other architectures provide a higher flexibility because index allocation can be independent from the base table allocation For certain indexsupported queries, this can allow for order-of-magnitude savings in I/O and communication cost We then turn to the disk allocation of intermediate query results for which the allocation parameters can be chosen dynamically at query run time For the case of parallel hash joins, we outline how to determine an optimal approach supporting a high degree of parallelism The work discussed is performed within a project aiming at developing strategies to automatically determine optimal data allocation strategies in order to simplify system administration in high performance environments

...read moreread less

Journal Article•DOI•

A programmable image processor for real-time image processing applications

[...]

Mohammed Yakoob Siyal¹, Mahmood Fathy²•Institutions (2)

Nanyang Technological University¹, Iran University of Science and Technology²

01 Jun 1999-Microprocessors and Microsystems

TL;DR: An architectural design and analysis of a programmable image processor, named Snake, designed with a high degree of parallelism to speed up a range of image processing operations is presented.

...read moreread less

The component architecture toolkit

[...]

Dennis Gannon, Juan E. Villacis

01 Jan 1999

TL;DR: The CAT contributes the following: a model capturing the principles used to build a scalable component system; an extensible framework design that can be retargeted to different Grid systems; a set of framework libraries for developing components and a runtime environment for deploying them; and aset of end user tools for locating and instantiating components.

...read moreread less

Abstract: Metacomputing is concerned with the exploitation of large-scale distributed computing over the Internet using high-performance networks and computing resources. Metacomputing is important to scientific computing for several reasons. First, many scientific problems exhibit some degree of parallelism, and hence may be solved more efficiently on a network of computers, or “metacomputer”, than on a single computer. Second, problems whose resource requirements exceed the capacity of a single computer can take advantage of the potentially unbounded capacity of a metacomputer. Third, problems which require access to local or specialized resources would benefit from the distributed aspect of a metacomputer. Metacomputing also extends the feasibility range of existing scientific problems. The Grid is a software infrastructure for implementing metacomputing. Grid systems such as Globus and Legion provide sophisticated service layers which allow users to access and manage distributed hardware and software resources. However, building distributed applications by programming directly to low-level Grid APIs is not easy. End users who wish to solve problems using the GrK tend to first think in terms of higher-level, problem-centric concepts, such as determining which software resources are applicable, and then designing and building an application using those resources. Low-level details such as process instantiation, machine or network characteristics, and so forth typically come at a later stage of the application building process. This dissertation presents the Component Architecture Toolkit (CAT), a system enabling end users to build component-based applications in a metacomputing environment. The CAT contributes the following: a model capturing the principles used to build a scalable component system; an extensible framework design that can be retargeted to different Grid systems; a set of framework libraries for developing components and a runtime environment for deploying them; and a set of end user tools for locating and instantiating components, as well as tools for building, controlling and analyzing the performance of component-based applications. Scientific computing issues addressed by the CAT design and implementation are discussed. The application of the CAT to a particular scientific problem domain, as well as the performance analysis of an example component-based application is also presented.

...read moreread less

Journal Article•DOI•

A Computation+Communication Load Balanced Loop Partitioning Method for Distributed Memory Systems

[...]

Santosh Pande¹, Tareq Bali¹•Institutions (1)

University of Cincinnati¹

01 Sep 1999-Journal of Parallel and Distributed Computing

TL;DR: This work describes a multiphase partitioning approach based on the above motivation for the general cases of DOALL loops to achieve a computation+communication load balanced partitioning through static data and iteration space distribution.

...read moreread less

Patent•

Functional memory device for image processing

[...]

Kuroda Atsushi

10 Dec 1999

TL;DR: In this paper, a data input/output circuit is provided for the processing of data for plural data units with respect to a row or column memory circuit, which can transfer data to an orthogonal memory for recording image data or the like.

...read moreread less

Abstract: PROBLEM TO BE SOLVED: To accelerate the input/output speed of data with the outside, to improve the degree of parallelism in the arithmetic processing of data and to reduce the number of instructions to be supplied from the outside. SOLUTION: A data input/output circuit 9 is provided for permitting the input/output of data for plural data units with respect to a row or column memory circuit 3 which can transfer data in the unit of a row or column, and to an orthogonal memory 1 for recording image data or the like. Inside an LSI, data are transferred in the unit of a row or column and the input/output with the outside is parallelly performed in plural data units. Further, the instruction applied from the outside is defined as a macro instruction, a translation memory 15 for macro instruction and nano instruction is provided inside the LSI, a lot of correspondent nano instructions are read out of that memory, and processing required for plural computing elements inside a row arithmetic circuit is parallelly performed. The number of plural computing elements in a row or column arithmetic circuit 5 is increased while reducing the scale for improving the degree of parallelism and while utilizing the macro/nano translation memory 15, on the other hand, improvement is dealt with by generating plural nano instructions inside.

...read moreread less

Book Chapter•DOI•

A Generalized Transaction Theory for Database and Non-database Tasks

[...]

Armin Fessler, Hans-Jörg Schek

31 Aug 1999

TL;DR: It is pointed out that recent progress in database transaction management theory in the field of composite stack schedules can improve the degree of parallelism in databases as well as in parallel programming.

...read moreread less

Abstract: In both database transaction management and parallel programming, parallel execution of operations is one of the most essential features. Although they look quite different, we will show that many important similarities exist. As a result of a more careful comparison we will be able to point out that recent progress in database transaction management theory in the field of composite stack schedules can improve the degree of parallelism in databases as well as in parallel programming. We will use an example from numerical algorithms and will demonstrate that in principle more parallelism can be achieved.

...read moreread less

Journal Article•

Optimum Degree of Parallelism-based Task Dependence Graph Scheduling Scheme

[...]

DU Jian-chen

01 Jan 1999-Journal of Software

TL;DR: Optimum degree of parallelism-based task dependence graph scheduling scheme fully utilizes the global information collected at compile-time, employs the techniques such as task merging in horizontal and vertical directions, and integration of centralized scheduling and layer-scheduling.

...read moreread less

Abstract: Optimum degree of parallelism-based task dependence graph scheduling scheme fully utilizes theglobal information collected at compile-time, employs the techniques such as task merging in horizontal andvertical directions, processors pre-allocation, combination of static and dynamic scheduling, and integration ofcentralized scheduling and layer-scheduling. It is a simple, practical and effective scheduling method whichaddresses the problem of how to both reduce the execution time of programs and economize on processorresources.

...read moreread less

Book Chapter•DOI•

Minimum Register Instruction Scheduling: A New Approach for Dynamic Instruction Issue Processors

[...]

Ramaswamy Govindarajan¹, Chihong Zhang², Guang R. Gao²•Institutions (2)

Indian Institute of Science¹, University of Delaware²

04 Aug 1999

TL;DR: An extended list-scheduling algorithm is proposed which uses the above number of required registers as a guide to derive a schedule for G that uses as few registers as possible and, based on such an algorithm, an integrated approach for register allocation and instruction scheduling for modern superscalar architectures can be developed.

...read moreread less

Abstract: Modern superscalar architectures with dynamic scheduling and register renaming capabilities have introduced subtle but important changes into the tradeoffs between compile-time register allocation and instruction scheduling. In particular, it is perhaps not wise to increase the degree of parallelism of the static instruction schedule at the expense of excessive register pressure which may result in additional spill code. To the contrary, it may even be beneficial to reduce the register pressure at the expense of constraining the degree of parallelism of the static instruction schedule. This leads to the following interesting problem: given a data dependence graph (DDG) G, can we derive a schedule S for G that uses the least number of registers ? In this paper, we present a heuristic approach to compute the near-optimal number of registers required for a DDG G (under all possible legal schedules). We propose an extended list-scheduling algorithm which uses the above number of required registers as a guide to derive a schedule for G that uses as few registers as possible. Based on such an algorithm, an integrated approach for register allocation and instruction scheduling for modern superscalar architectures can be developed.

...read moreread less

Journal Article•DOI•

Distributed Seismic Unix: a tool for seismic data processing

[...]

Alejandro E. Murillo¹, Jean Luc Bell¹•Institutions (1)

Colorado School of Mines¹

10 Apr 1999-Concurrency and Computation: Practice and Experience

TL;DR: DSU is designed to assist geophysicists in developing and executing sequences of Seismic Unix (SU) applications in clusters of workstations as well as on tightly coupled multiprocessor machines.

...read moreread less

Abstract: This paper describes a distributed system called Distributed Seismic Unix (DSU) DSU provides tools for creating and executing application sequences over several types of multiprocessor environments DSU is designed to assist geophysicists in developing and executing sequences of Seismic Unix (SU) applications in clusters of workstations as well as on tightly coupled multiprocessor machines SU is a large collection of subroutine libraries, graphics tools and fundamental seismic data processing applications that is freely available via the Internet from the Center for Wave Phenomena (CWP) of the Colorado School of Mines SU is currently used at more than 500 sites in 32 countries around the world DSU is built on top of three publicly available software packages: SU itself; TCL/TK, which provides the necessary tools to build the graphical user interface (GUI); and PVM (Parallel Virtual Machine), which supports process management and communication DSU handles tree-like graphs representing sequences of SU applications Nodes of a graph represent SU applications, while the arcs represent the way the data flow from the root node to the lead nodes of the tree In general the root node corresponds to an application that reads or creates synthetic seismic data, and the leaf nodes are associated with applications that write or display the processed seismic data; intermediate nodes are usually associated with typical seismic processing applications like filters, convolutions and signal processing Pipelining parallelism is obtained when executing single-branch tree sequences, while a higher degree of parallelism is obtained when executing sequences with several branches A major advantage of the DSU framework for distribution is that SU applications do not need to be modified for parallelism; only a few low-level system functions need to be modified Copyright © 1999 John Wiley & Sons, Ltd

...read moreread less

Journal Article•DOI•

Scalability of Sparse Cholesky Factorization

[...]

Thomas Rauber, Gudula Rünger, Carsten Scholtes

01 Mar 1999-International Journal of High Speed Computing

TL;DR: This article investigates shared-memory implementations of several variants of sparse Cholesky factorization algorithms in a task-oriented execution model with dynamic scheduling, and considers the degree of parallelism, the scalability, and the scheduling overhead of the different algorithms.

...read moreread less

Abstract: A variety of algorithms have been proposed for sparse Cholesky factorization, including left-looking, right-looking, and supernodal algorithms. This article investigates shared-memory implementations of several variants of these algorithms in a task-oriented execution model with dynamic scheduling. In particular, we consider the degree of parallelism, the scalability, and the scheduling overhead of the different algorithms. Our emphasis lies in the parallel implementation for relatively large numbers of processors. As execution platform, we use the SB-PRAM, a shared-memory machine with up to 2048 processors. This article can be considered as a case study in which we try to answer the question of which performance we can hope to get for a typical irregular application on an ideal machine on which the locality of memory accesses can be ignored but for which the overhead for the management of data structures still takes effect. The investigation shows that certain algorithms are the best choice for a small number of processors, while other algorithms are better for many processors.

...read moreread less

Proceedings Article•DOI•

A framework for generating task parallel programs

[...]

U. Fissgus¹, Thomas Rauber¹, Gudula Rünger•Institutions (1)

Wittenberg University¹

21 Feb 1999

TL;DR: The usefulness of the approach is demonstrated by examples from numerical analysis which offer the potential of a mixed task and data parallel execution but for which it is not a priori clear, how this potential should be used for an implementation on a specific parallel machine.

...read moreread less

Abstract: We consider the generation of mixed task and data parallel programs and discuss how a clear separation into a task and data parallel level can support the development of efficient programs. The program development starts with a specification of the maximum degree of task and data parallelism and proceeds by performing several derivation steps in which the degree of parallelism is adapted to a specific parallel machine. We show how the final message-passing programs are generated and how the interaction between the task and data parallel levels can be established. We demonstrate the usefulness of the approach by examples from numerical analysis which offer the potential of a mixed task and data parallel execution but for which it is not a priori clear, how this potential should be used for an implementation on a specific parallel machine.

...read moreread less

Book Chapter•DOI•

Optimising Skeletal-Stream Parallelism on a BSP Computer

[...]

Andrea Zavanella¹•Institutions (1)

University of Pisa¹

31 Aug 1999

TL;DR: The paper presents a methodology for efficient implementation of the P3L Pipe and Farm on a BSP computer and provides a set of analytical models to predict the constructors performance using the BSP cost model.

...read moreread less

Abstract: Stream parallelism allows parallel programs to exploit the potential of executing different parts of the computation on distinct input data items. Stream parallelism can also exploit the concurrent evaluation of the same function on different input items. These techniques are usually named "pipelining" and "farming out". The P3L language includes two stream parallel skeletons: the Pipe and the Farm constructors. The paper presents a methodology for efficient implementation of the P3L Pipe and Farm on a BSP computer. The methodology provides a set of analytical models to predict the constructors performance using the BSP cost model. Therefore a set of optimisation rules to decide the optimal degree of parallelism and the optimal size for input tasks (grain) are derived. A prototype has been validated on a Cluster of PC and on a Cray T3D computer.

...read moreread less

Book Chapter•DOI•

MPI-Based Parallel Omplementation of a Lithography Pattern Simulation Algorithm

[...]

H. Radhakrishna¹, S. Divakar¹, N. Magotra¹, S. R. J. Bruek¹, A. Waters¹ - Show less +1 more•Institutions (1)

University of New Mexico¹

12 Apr 1999

TL;DR: This paper presents the parallelization of a pattern simulation algorithm for Imaging Interferometric Lithography (IIL), a Very Large Scale Integration (VLSI) process technology for producing sub-micron features that uses Message Passing Interface (MPI) libraries.

...read moreread less

Abstract: This paper presents the parallelization of a pattern simulation algorithm for Imaging Interferometric Lithography (IIL), a Very Large Scale Integration (VLSI) process technology for producing sub-micron features. The approach uses Message Passing Interface (MPI) libraries [1]. We also discuss some modifications to the basic parallel implementation that will result in efficient memory utilization and reduced communications among the processors. The scalability of runtime with degree of parallelism is also demonstrated. The scalability of runtime with degree of parallelism is also demonstrated. The algorithm was tested on three different platforms: IBM SP-2 running AIX, SGI Onyx 2 running IRIX 6.4, and a LINUX cluster of Pentium-233 workstations. The paper presents the results of these tests and also provides a comparison with those obtained with Mathcad (on Windows 95) and serial C (on Unix) implementations.

...read moreread less

Patent•

Neural network for real-time channel equalisation

[...]

Gordon Shiells, Hamish Grant, Spracklen Timothy

26 Jun 1999

TL;DR: In this paper, a neural network capable of extracting digital data from a radio signal has a sufficiently high degree of parallelism to dynamically determine at least one suitable channel model presenting a selected propagation path of the radio signal.

...read moreread less

Abstract: A Neural Network capable of extracting digital data from a radio signal has a sufficiently high degree of parallelism to dynamically determine at least one suitable channel model presenting a selected propagation path of the radio signal. The degree of parallelism provided by the Neural Network is sufficient to obtain channel equalisation of the received signal by processing data inputted into the Neural Network as the data are received in real-time from the radio signal.

...read moreread less

Proceedings Article•DOI•

Introducing Java into the teaching of distributed systems

[...]

Lan Jin¹, B. Hatfield, Hui-Chuan Cheng•Institutions (1)

California State University, Fresno¹

10 Nov 1999

TL;DR: This paper presents an example of multithreaded, concurrent program in Java for simulating a banking problem and shows interesting results with a high degree of parallelism and the ACID (Atomic-Consistent-Isolate-Durable) properties of transactions maintained correctly.

...read moreread less

Abstract: Due to the successful development of the Internet, the course of distributed systems has received great attention by many universities and colleges in their computer science curricula. A comprehensive distributed system course involves many disciplines, such as computer architecture, operating systems, concurrent programming and distributed algorithms. It covers many interesting topics, such as client-server computing, remote procedure calling and distributed database and transaction management systems. These can serve as good topics for a programming or design project of the course. In particular, we have the management of banking accounts a relatively simple but typical problem for such a project. This project involves at least three basic topics of distributed systems: transaction processing, concurrency control, and multithreaded programming. The authors have introduced the Java language into the project, because literature and their experience showed that Java is a good concurrent language. Java's object-oriented paradigm, clean structure and strong support of threads and concurrency make the project even more interesting for students. This paper presents an example of multithreaded, concurrent program in Java for simulating a banking problem. Running the program on a 4-processor SPARCstation-20 symmetric multiprocessor has shown interesting results with a high degree of parallelism and the ACID (Atomic-Consistent-Isolate-Durable) properties of transactions maintained correctly.

...read moreread less

DOI•

Automatische Einstellung des Parallelitätsgrades von Programmen = [Automatic Tuning of the Degree of Parallelism of Programs]

[...]

Otilia Werner-Kytölä

01 Jan 1999