
Showing papers on "Degree of parallelism published in 1996"


Journal ArticleDOI
TL;DR: The CSDF paradigm is an extension of synchronous dataflow that still allows for static scheduling and, thus, a very efficient implementation of an application and it is indicated that CSDF is essential for modelling prescheduled components, like application-specific integrated circuits.
Abstract: We present cycle-static dataflow (CSDF), which is a new model for the specification and implementation of digital signal processing algorithms. The CSDF paradigm is an extension of synchronous dataflow that still allows for static scheduling and, thus, a very efficient implementation of an application. In comparison with synchronous dataflow, it is more versatile because it also supports algorithms with a cyclically changing, but predefined, behavior. Our examples show that this capability results in a higher degree of parallelism and, hence, a higher throughput, shorter delays, and less buffer memory. Moreover, they indicate that CSDF is essential for modelling prescheduled components, like application-specific integrated circuits. Besides introducing the CSDF paradigm, we also derive necessary and sufficient conditions for the schedulability of a CSDF graph. We present and compare two methods for checking the liveness of a graph. The first one checks the liveness of loops, and the second one constructs a single-processor schedule for one iteration of the graph. Once the schedulability is tested, a makespan optimal schedule on a multiprocessor can be constructed. We also introduce the heuristic scheduling method of our graphical rapid prototyping environment (GRAPE).
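To make the CSDF idea concrete, here is a minimal sketch (mine, not the paper's implementation) of a single edge between two actors whose token rates change cyclically per firing phase — the feature that distinguishes CSDF from plain synchronous dataflow, where each actor has one fixed rate:

```python
from itertools import cycle

def simulate_csdf(prod_rates, cons_rates, firings):
    """Simulate one edge of a two-actor cyclo-static dataflow graph.

    prod_rates / cons_rates: per-phase token rates that repeat cyclically
    (CSDF); a single-element list degenerates to ordinary SDF.
    Returns the buffer occupancy after each producer firing, once the
    consumer has fired as many times as the buffer allows.
    """
    buffer = 0
    prod = cycle(prod_rates)
    cons_phase = 0
    trace = []
    for _ in range(firings):
        buffer += next(prod)                      # producer fires one phase
        while buffer >= cons_rates[cons_phase]:   # consumer fires while enabled
            buffer -= cons_rates[cons_phase]
            cons_phase = (cons_phase + 1) % len(cons_rates)
        trace.append(buffer)
    return trace
```

With `prod_rates=[1, 2]` and `cons_rates=[3]`, one full producer cycle emits exactly the three tokens the consumer needs, so the buffer occupancy stays bounded — the balance property that makes static scheduling of a CSDF graph possible.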

509 citations


Proceedings ArticleDOI
06 Aug 1996
TL;DR: A high-level overview of Legion, its vision, objectives, a brief sketch of how some of those objectives will be met, and the current status of the project are presented.
Abstract: The coming of giga-bit networks makes possible the realization of a single nationwide virtual computer comprised of a variety of geographically distributed high-performance machines and workstations. To realize the potential that the physical infrastructure provides, software must be developed that is easy to use, supports a large degree of parallelism in the application code, and manages the complexity of the underlying physical system for the user. Legion is a metasystem project at the University of Virginia designed to provide users with a transparent interface to the available resources, both at the programming interface level as well as at the user level. Legion addresses issues such as parallelism, fault-tolerance, security, autonomy, heterogeneity, resource management and access transparency in a multi-language environment. In this paper, we present a high-level overview of Legion, its vision, objectives, a brief sketch of how some of those objectives will be met, and the current status of the project.

145 citations


Journal ArticleDOI
TL;DR: The ILUM factorization described in this paper can be viewed as a multifrontal version of a Gaussian elimination procedure with threshold dropping which has a high degree of potential parallelism.
Abstract: Standard preconditioning techniques based on incomplete LU (ILU) factorizations offer a limited degree of parallelism, in general. A few of the alternatives advocated so far consist of either using some form of polynomial preconditioning or applying the usual ILU factorization to a matrix obtained from a multicolor ordering. In this paper we present an incomplete factorization technique based on independent set orderings and multicoloring. We note that in order to improve robustness, it is necessary to allow the preconditioner to have an arbitrarily high accuracy, as is done with ILUs based on threshold techniques. The ILUM factorization described in this paper is in this category. It can be viewed as a multifrontal version of a Gaussian elimination procedure with threshold dropping which has a high degree of potential parallelism. The emphasis is on methods that deal specifically with general unstructured sparse matrices such as those arising from finite element methods on unstructured meshes.
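The parallelism in independent-set orderings comes from the fact that rows whose vertices share no edge in the matrix's sparsity graph can be eliminated simultaneously. A minimal greedy sketch (an illustration of the ordering idea, not the paper's ILUM algorithm):

```python
def greedy_independent_set(adj):
    """Greedily pick an independent set of vertices from a sparsity graph.

    adj: dict mapping each vertex to the set of its neighbours (the
    nonzero off-diagonal positions of a sparse matrix row). No two chosen
    vertices are adjacent, so their rows can be eliminated in parallel.
    """
    chosen, blocked = [], set()
    for v in sorted(adj):              # deterministic order for the sketch
        if v not in blocked:
            chosen.append(v)
            blocked.update(adj[v])     # neighbours can no longer be chosen
    return chosen
```

Repeating this on the reduced system (the blocked vertices) yields the level-by-level structure that an ILUM-style factorization exploits.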

145 citations


Proceedings ArticleDOI
17 Nov 1996
TL;DR: This paper presents parallel algorithms for data mining of association rules, and studies the degree of parallelism, synchronization, and data locality issues on the SGI Power Challenge shared-memory multi-processor.
Abstract: Data mining is an emerging research area, whose goal is to extract significant patterns or interesting rules from large databases. High-level inference from large volumes of routine business data can provide valuable information to businesses, such as customer buying patterns, shelving criterion in supermarkets and stock trends. Many algorithms have been proposed for data mining of association rules. However, research so far has mainly focused on sequential algorithms. In this paper we present parallel algorithms for data mining of association rules, and study the degree of parallelism, synchronization, and data locality issues on the SGI Power Challenge shared-memory multi-processor. We further present a set of optimizations for the sequential and parallel algorithms. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm, but we observe a need for parallel I/O techniques for further performance gains.
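A common way to parallelize association-rule support counting is to partition the transaction database across workers and merge their partial counts. The sketch below illustrates that generic pattern (it is my illustration, not the paper's specific algorithms), counting the support of all item pairs:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def count_pairs(transactions, n_workers=2):
    """Count the support of every item pair, splitting the transaction
    database across workers: each worker scans only its own partition,
    and the partial Counters are merged at the end."""
    def partial(chunk):
        c = Counter()
        for t in chunk:
            for pair in combinations(sorted(t), 2):
                c[pair] += 1
        return c

    size = (len(transactions) + n_workers - 1) // n_workers
    chunks = [transactions[i:i + size]
              for i in range(0, len(transactions), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        for c in ex.map(partial, chunks):
            total += c                 # merge partial counts
    return total
```

Because each partition is scanned independently, the only synchronization point is the final merge — the kind of trade-off between parallelism and data locality the paper studies.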

143 citations


Journal ArticleDOI
TL;DR: An innovative parallel dataflow scheme that requires no global communication except at the pixel level is developed, which combines the benefits of very high memory bandwidth, modularity, and scalability, a result which has not been achieved before.

48 citations


Proceedings ArticleDOI
Yamauchi, Nakaya, Kajihara
17 Apr 1996
TL;DR: A reconfigurable massively parallel computer system called SOP (Sea Of Processors) that can change its structure and achieves high performance by mapping the control flow and data flow of target algorithms directly onto the reconfigurable hardware is described.

Abstract: This paper describes a reconfigurable massively parallel computer system called SOP (Sea Of Processors) that can change its structure and achieves high performance by mapping the control flow and data flow of target algorithms directly onto the reconfigurable hardware. The SOP system consists of a huge number of programmable logic, memory, and switch elements. Each logic element is mainly used to map logic/arithmetic operations and control circuits. The SOP memory element can process global search, global sorting, heap-tree, and min/max operations quickly. The SOP compiler extracts a high degree of parallelism from application programs written in C by exploiting operation- and function-level parallelism using a control-data-flow based mapping technique.

44 citations


Proceedings ArticleDOI
01 Sep 1996
TL;DR: This paper shows how to reduce time and space thread overhead using control flow and register liveness information inferred after compilation and reduces the overall execution time of fine-grain threaded programs by 15-30%.
Abstract: Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and simplify program structure. Multitasking operating systems use threads to mask communication latency, either with hardware devices or users. Client-server applications typically use threads to simplify the complex control-flow that arises when multiple clients are used. Recently, the scientific computing community has started using threads to mask network communication latency in massively parallel architectures, allowing computation and communication to be overlapped. Lastly, some architectures implement threads in hardware, using those threads to tolerate memory latency. In general, it would be desirable if threaded programs could be written to expose the largest degree of parallelism possible, or to simplify the program design. However, threads incur time and space overheads, and programmers often compromise simple designs for performance. In this paper, we show how to reduce time and space thread overhead using control flow and register liveness information inferred after compilation. Our techniques work on binaries, are not specific to a particular compiler or thread library, and reduce the overall execution time of fine-grain threaded programs by ≈ 15-30%. We use execution-driven analysis and an instrumented operating system to show why the execution time is reduced and to indicate areas for future work.

41 citations


Journal ArticleDOI
01 Jun 1996
TL;DR: Two parallel versions of the Buchberger algorithm for computing Grobner bases are described, one for the general case and one for homogeneous ideals, which exploit coarse grain parallelism.
Abstract: We describe two parallel versions of the Buchberger algorithm for computing Grobner bases, one for the general case and one for homogeneous ideals, which exploit coarse grain parallelism. For the general case, to avoid the growth in number and complexity of the polynomials to reduce, the algorithm adheres strictly to the same strategies as the best sequential implementation. A suitable communication protocol has been designed to ensure proper synchronization of the various processes and to limit their idle time. We provide a detailed analysis of the maximum potential degree of parallelism that is achievable with such an architecture. The analysis corresponds to the results of our experimental implementation and also explains similar results obtained by other authors.

36 citations


31 Dec 1996
TL;DR: This paper discusses issues and reports on the experience of PSPARSLIB, an on-going project for building a library of parallel iterative sparse matrix solvers, which aims to find efficient ways to precondition the system.
Abstract: Solving sparse irregularly structured linear systems on parallel platforms poses several challenges. First, sparsity makes it difficult to exploit data locality, whether in a distributed or shared memory environment. A second, perhaps more serious challenge, is to find efficient ways to precondition the system. Preconditioning techniques which have a large degree of parallelism, such as multicolor SSOR, often have a slower rate of convergence than their sequential counterparts. Finally, a number of other computational kernels such as inner products could erode any gains obtained from parallel speed-ups, and this is especially true on workstation clusters where start-up times may be high. In this paper we discuss these issues and report on our experience with PSPARSLIB, an on-going project for building a library of parallel iterative sparse matrix solvers.

27 citations


03 Oct 1996
TL;DR: The development of the adaptive multiple independent runs (MIR) implementation, a flexible general-purpose annealing package suitable for large multi-computers that performs at least as well as the other parallel approaches, and requires no prior knowledge of the target problem, no parallelization of sequential code, and no painstaking fine-tuning of theAnnealing parameters.
Abstract: The purpose of this project was to develop an efficient multi-computer implementation of the simulated annealing approximation algorithm (SA). This algorithm has recently been proven successful in attacking large and complex multi-criteria combinatorial optimization problems. However, large problems still require significant amounts of time, so there is a need to develop implementations that will run efficiently on the massively parallel multi-computers that are appearing on the market. Previous parallel annealing implementations have been primarily aimed at shared-memory machines, since the most straightforward algorithms require that all processors have access to the data structures defining the model. Shared-memory machines appear to be limited in their degree of parallelism, so message-passing implementations must be developed to take full advantage of the increasing power of multi-computers. This dissertation will develop and evaluate new multi-computer annealing algorithms. The focus of the first part of this research is on the parallel move simulated annealing approach (PMSA). With this method each processor starts with an identical copy of the configuration being optimized. All processors then simultaneously make one or several moves each on their own copies of the configuration. After such a cycle, individual copies of the configuration are combined to produce one or more new configurations. Then the next cycle of moves begins. The analysis of PMSA indicates that results improve as we decrease the number of processors working on a configuration and thus increase the number of results from which to select at the conclusion. This leads to the development of the adaptive multiple independent runs (MIR) implementation. The MIR method simply distributes copies of independent sequential annealing runs over all available processors. 
The inverse power law convergence of sequential annealing allows us to further subdivide the runs on each processor into multiple short runs, improving the run-time and yielding more solutions from which to choose. During its initial phase, the MIR algorithm estimates the starting and stopping temperatures and the total run length. The use of multiple runs permits these estimates to be further improved, and suggests a scheme for determining a suitable cooling schedule. The MIR approach performs at least as well as the other parallel approaches that we have studied, and requires no prior knowledge of the target problem, no parallelization of sequential code, and no painstaking fine-tuning of the annealing parameters. The MIR approach thus provides a flexible general-purpose annealing package suitable for large multi-computers; furthermore, this approach should also provide an effective method for sequential machines.
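The MIR idea reduces to running independent, differently seeded annealings and keeping the best result. A minimal sketch of that structure (my illustration under simple assumptions — geometric cooling, a user-supplied cost and neighbour function — not the dissertation's adaptive package):

```python
import math
import random

def anneal(cost, neighbour, x0, steps, t0, seed):
    """One sequential simulated-annealing run with geometric cooling."""
    rng = random.Random(seed)
    x, t = x0, t0
    best, best_c = x, cost(x)
    for _ in range(steps):
        y = neighbour(x, rng)
        d = cost(y) - cost(x)
        if d <= 0 or rng.random() < math.exp(-d / t):
            x = y                       # accept downhill always, uphill sometimes
        if cost(x) < best_c:
            best, best_c = x, cost(x)
        t *= 0.95                       # geometric cooling schedule
    return best, best_c

def mir(cost, neighbour, x0, steps, t0, n_runs):
    """Multiple independent runs (MIR): launch n_runs independent
    annealings (sequentially here; on a multi-computer each run would
    occupy its own processor) and keep the best answer any run found."""
    runs = [anneal(cost, neighbour, x0, steps, t0, seed)
            for seed in range(n_runs)]
    return min(runs, key=lambda r: r[1])
```

Because the runs share nothing, this parallelizes with no communication at all — the property that makes MIR attractive on message-passing multi-computers.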

22 citations


Journal ArticleDOI
15 May 1996
TL;DR: It is shown that in a simple K-random interaction model the local times progress at a rate 1/(K + 1), and it is found that the asymptotic distribution of local times is described by a traveling wave solution with exponentially decaying tails.
Abstract: Lubachevsky [5] introduced a new parallel simulation technique intended for systems with limited interactions between their many components or sites. Each site has a local simulation time, and the states of the sites are updated asynchronously. This asynchronous updating appears to allow the simulation to achieve a high degree of parallelism, with very low overhead in processor synchronization. The key issue for this asynchronous updating technique is: how fast do the local times make progress in the large system limit? We show that in a simple K-random interaction model the local times progress at a rate 1/(K + 1). More importantly, we find that the asymptotic distribution of local times is described by a traveling wave solution with exponentially decaying tails. In terms of the parallel simulation, though the interactions are local, a very high degree of global synchronization results, and this synchronization is succinctly described by the traveling wave solution. Moreover, we report on experiments that suggest that the traveling wave solution is universal; i.e., it holds in realistic scenarios (out of reach of our analysis) where interactions among sites are not random.
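The update rule is easy to simulate: a site may advance its local clock only when its time is a minimum among itself and K randomly chosen partners. A small sketch (mine, for intuition only); for K = 2 the observed success rate should sit near the 1/(K + 1) = 1/3 predicted for the large-system limit:

```python
import random

def simulate_local_times(n_sites, k, attempts, seed=0):
    """K-random interaction model: on each attempt a random site checks
    K random partners and advances its local time (by an exponential
    increment) only if it holds the minimum. Returns the fraction of
    successful attempts, an estimate of the rate of progress."""
    rng = random.Random(seed)
    times = [0.0] * n_sites
    successes = 0
    for _ in range(attempts):
        i = rng.randrange(n_sites)
        partners = [rng.randrange(n_sites) for _ in range(k)]
        if all(times[i] <= times[j] for j in partners):
            times[i] += rng.expovariate(1.0)   # advance the local clock
            successes += 1
    return successes / attempts
```

The spread of the `times` array in such a simulation is what the paper's traveling-wave solution describes: despite purely local interactions, the local clocks stay tightly synchronized.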

Proceedings ArticleDOI
23 Oct 1996
TL;DR: Two facts that suggest the desirability of a hierarchical approach to cost-effective high-performance computing are empirically established and the cost-efficiency advantage of heterogeneous over homogeneous multiprocessor systems are supported.
Abstract: Two facts that suggest the desirability of a hierarchical approach to cost-effective high-performance computing are empirically established in this paper. The first fact is the temporal locality of programs with respect to the degree of parallelism. Two temporal (instruction and data) locality principles are identified and empirically established for a set of programs. The impact of this behavior is discussed with respect to the proposed heterogeneous multilevel architecture. The second fact that supports the hierarchical architecture is the cost-efficiency advantage of heterogeneous over homogeneous multiprocessor systems. An initial performance analysis is presented which quantifies this fact for the proposed heterogeneous hierarchical organization. The proposed multilevel processor configuration uses fast and costly resources sparingly to reduce sequential and low parallelism bottlenecks. The resulting organization tries to balance cost, speed and parallelism granularity.

Journal ArticleDOI
TL;DR: The techniques presented in this paper used in combination with prior work on reducing the height of data dependences provide a comprehensive approach to accelerating loops with conditional exits.
Abstract: The performance of applications executing on processors with instruction level parallelism is often limited by control and data dependences. Performance bottlenecks caused by dependences can frequently be eliminated through transformations which reduce the height of critical paths through the program. The utility of these techniques can be demonstrated in an increasingly broad range of important situations. This paper focuses on the height reduction of control recurrences within loops with data dependent exits. Loops with exits are transformed so as to alleviate performance bottlenecks resulting from control dependences. A compilation approach to effect these transformations is described. The techniques presented in this paper used in combination with prior work on reducing the height of data dependences provide a comprehensive approach to accelerating loops with conditional exits. In many cases, loops with conditional exits provide a degree of parallelism traditionally associated with vectorization. Multiple iterations of a loop can be retired in a single cycle on a processor with adequate instruction level parallelism with no cost in code redundancy. In more difficult cases, height reduction requires redundant computation or may not be feasible.

Proceedings ArticleDOI
02 Sep 1996
TL;DR: The proposed multithreaded architecture obtains a high degree of parallelism at the server side, allowing both the disk controller and the network card controller work in parallel, and achieves synchronized playback of the video stream at its precise rate at the client side.
Abstract: In this paper we present the design and implementation of a client/server based multimedia architecture for supporting video-on-demand applications. We describe in detail the software architecture of the implementation along with the adopted buffering mechanism. The proposed multithreaded architecture obtains, on one hand, a high degree of parallelism at the server side, allowing both the disk controller and the network card controller to work in parallel. On the other hand, at the client side, it achieves the synchronized playback of the video stream at its precise rate, decoupling this process from the reception of data through the network. Additionally, we have derived, under an engineering perspective, some services that a real-time operating system should offer to satisfy the requirements found in video-on-demand applications.

Proceedings ArticleDOI
12 Aug 1996
TL;DR: A compile time partitioning and scheduling approach based on the above motivation for DOALL loops where communication without data replication is inevitable and the load balancing phase attempts to reduce the number of processors without degrading the completion time.
Abstract: The loop partitioning problem on modern distributed memory systems is no longer fully communication bound, primarily due to a significantly lower ratio of communication/computation speeds. The useful parallelism may be exploited on these systems to an extent that the communication balances the parallelism and does not produce a very high overhead to nullify all the gains due to the parallelism. We describe a compile time partitioning and scheduling approach based on the above motivation for DOALL loops where communication without data replication is inevitable. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading away a lesser degree of parallelism. Next, the data distribution phase uses a new larger-partition-owns rule to achieve computation and communication load balance. The granularity adjustment phase attempts to further eliminate communication through merging partitions to reduce the completion time. Finally, the load balancing phase attempts to reduce the number of processors without degrading the completion time, and the mapping phase schedules the partitions on available processors. Relevant theory and algorithms are developed along with a performance evaluation on the Cray T3D.

Journal ArticleDOI
TL;DR: This paper deals with the application of a high-performance instrument which embodies a multiprocessor network and a special data acquisition board for the real-time implementation of a model-based measurement technique, based on a parallel solution of a measurement algorithm derived from a dynamic mathematical model of the system under analysis.
Abstract: This paper deals with the application of a high-performance instrument which embodies a multiprocessor network and a special data acquisition board. Specifically, it has been adopted for the real-time implementation of a model-based measurement technique, based on a parallel solution of a measurement algorithm derived from a dynamic mathematical model of the system under analysis. The research reported in this paper has been oriented toward a real application. We have tested the performance of our implementation, carrying out measurement of quantities which cannot be directly sensed on linear induction motors. The results obtained by using this instrument are discussed, showing the system performance in terms of execution speed. The effective errors of the acquisition system have been examined, in order to estimate the achievable accuracy, highlighting the influence of the ADC bit number on the measured quantities. Furthermore, the truncation error, which arises from the consideration of only a finite number of current harmonics, has been evaluated. Finally, the implementation of the measurement algorithm on a different number of processors has made it possible to evaluate its degree of parallelism by measuring the different speed-up factors.

Journal ArticleDOI
TL;DR: The goal is to quantify the floating point, memory, I/O, and communication requirements of highly parallel scientific applications that perform explicit communication and develop analytical models for the effects of changing the problem size and the degree of parallelism for several of the applications.
Abstract: This paper studies the behavior of scientific applications running on distributed memory parallel computers. Our goal is to quantify the floating point, memory, I/O, and communication requirements of highly parallel scientific applications that perform explicit communication. In addition to quantifying these requirements for fixed problem sizes and numbers of processors, we develop analytical models for the effects of changing the problem size and the degree of parallelism for several of the applications. The contribution of our paper is that it provides quantitative data about real parallel scientific applications in a manner that is largely independent of the specific machine on which the application was run. Such data, which are clearly very valuable to an architect who is designing a new parallel computer, were not previously available. For example, the majority of research papers in interconnection networks have used simulated communication loads consisting of fixed-size messages. Our data, which show that using such simulated loads is unrealistic, can be used to generate more realistic communication loads.

Book ChapterDOI
19 Aug 1996
TL;DR: Routines are inserted into a program which detect the amount of computational work without using problem-specific parameters and adapt the number of used CPUs at runtime under given speedup/efficiency constraints.
Abstract: In this paper we present a new method for achieving a higher cost-efficiency on parallel computers. We insert routines into a program which detect the amount of computational work without using problem-specific parameters and adapt the number of used CPUs at runtime under given speedup/efficiency constraints. Several user-tunable strategies for selecting the number of processors are presented and compared. The modularity of this approach and its application-independence permit a general use on parallel computers with a scalable degree of parallelism.

Journal ArticleDOI
TL;DR: The main objective is to preserve simulation/synthesis correspondence during synthesis and to produce hardware that operates with a high degree of parallelism.

Book ChapterDOI
01 Jan 1996
TL;DR: This chapter discusses parallelization of the sequential code SWAN (simulating waves nearshore) on distributed memory architectures using MPI for simulating wind-generated waves in coastal regions using the third-generation wave model SWAN.
Abstract: Publisher Summary This chapter discusses parallelization of the sequential code SWAN (simulating waves nearshore) on distributed memory architectures using MPI for simulating wind-generated waves in coastal regions. Efficient parallel algorithms are required to calculate spectra of random short-crested, wind-generated waves in coastal regions using the third-generation wave model SWAN. The propagation schemes used in SWAN are fully implicit, so that they can be utilized for computing waves in shallow water. Two strategies for parallelizing these schemes are presented: (1) the block Jacobi approximation with a high degree of parallelism, and (2) the block wavefront approach that is to a large extent parallelizable. Contrary to the first one, the latter has the same behavior as the sequential method with respect to convergence. Numerical experiments are run on a dedicated Beowulf cluster with a real-life application. They show that good speedups have been achieved with the block wavefront approach, as long as the computational domain is not divided into too thin slices. Concerning the block Jacobi method, a considerable decline in performance is observed, which is attributable to the numerical overhead arising from tripling the number of iterations.
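The block wavefront approach rests on a standard observation: in an implicit sweep where each grid cell depends on its west and south neighbours, all cells on the same anti-diagonal are mutually independent. A small sketch of that ordering (a generic illustration, not SWAN's actual propagation scheme):

```python
def wavefront_order(nx, ny):
    """Group the cells of an nx-by-ny grid by anti-diagonal. If cell
    (i, j) depends on (i-1, j) and (i, j-1), every cell of one front
    depends only on earlier fronts, so the cells within a front can be
    updated in parallel while the fronts are processed in sequence."""
    fronts = [[] for _ in range(nx + ny - 1)]
    for i in range(nx):
        for j in range(ny):
            fronts[i + j].append((i, j))
    return fronts
```

Unlike a block Jacobi splitting, this ordering preserves the sequential data dependences exactly, which is why the chapter reports that the wavefront variant converges like the sequential method.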

Proceedings ArticleDOI
15 Apr 1996
TL;DR: A highly parallel method for extracting inferences from text based on a marker-propagation algorithm that establishes semantic paths between knowledge base concepts.
Abstract: In this paper, we describe a highly parallel method for extracting inferences from text. The method is based on a marker-propagation algorithm that establishes semantic paths between knowledge base concepts. The paper presents the structure of the system, the marker-propagation algorithm, and results that show a large degree of parallelism.

Proceedings ArticleDOI
24 Jan 1996
TL;DR: An algorithm for the solution of banded linear systems is presented and discussed which combines stability with scalability by implementing divide and conquer for Gaussian elimination with partial pivoting and has low redundancy, high degree of parallelism and relatively low communication.
Abstract: An algorithm for the solution of banded linear systems is presented and discussed which combines stability with scalability. This is achieved by implementing divide and conquer for Gaussian elimination with partial pivoting. Earlier divide and conquer algorithms for Gaussian elimination have problems with instabilities and can even break down as they implement a more restricted form of pivoting. The key observation used for the implementation is the invariance of LU factorization with partial pivoting under permutations. Theoretical analysis shows that the algorithm has low redundancy, a high degree of parallelism and relatively low communication.
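For reference, the kernel the paper's divide-and-conquer scheme builds on is ordinary Gaussian elimination with partial pivoting. A dense, pure-Python sketch of that kernel (only to make the pivoting rule concrete; the paper's contribution is its stable divide-and-conquer organisation for banded systems, which is not reproduced here):

```python
def lu_partial_pivot(a):
    """Dense LU factorization with partial pivoting, in place (PA = LU).
    Afterwards the strict lower triangle of a holds L (unit diagonal
    implied) and the upper triangle holds U; the returned list records
    the row permutation. Assumes a is square and nonsingular."""
    n = len(a)
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))   # pivot search
        a[k], a[p] = a[p], a[k]                            # row swap
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            a[i][k] = a[i][k] / a[k][k]                    # multiplier (L entry)
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]               # update trailing block
    return perm
```

The "key observation" cited in the abstract — that this factorization is invariant under row permutations — is what lets independently factored blocks of a banded matrix be recombined without losing the stability of partial pivoting.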

Book ChapterDOI
26 Aug 1996
TL;DR: An exact algorithm for finding a computation mapping and data distributions that minimize, for a given degree of parallelism, the number of remote data accesses in a distributed memory parallel computer (DMPC).
Abstract: We describe an exact algorithm for finding a computation mapping and data distributions that minimize, for a given degree of parallelism, the number of remote data accesses in a distributed memory parallel computer (DMPC). This problem is shown to be NP-hard.

Proceedings ArticleDOI
12 Aug 1996
TL;DR: This study uses performance data obtained from an SGI multiprocessor to evaluate several processor scheduling strategies and examines gang scheduling, static space sharing, and a dynamic allocation scheme called loop-level process control with three new dynamic allocation heuristics.
Abstract: Small-scale shared-memory multiprocessors are commonly used in a workgroup environment where multiple applications, both parallel and sequential, are executed concurrently while sharing the processors and other system resources. To utilize the processors efficiently, an effective scheduling strategy is required. We use performance data obtained from an SGI multiprocessor to evaluate several processor scheduling strategies. We examine gang scheduling (coscheduling), static space sharing (space partitioning), and a dynamic allocation scheme called loop-level process control (LLPC) with three new dynamic allocation heuristics. We use regression analysis to quantify the measured data and thereby explore the relationship between the degree of parallelism of the application, the size of the system, the processor allocation strategy and the resulting performance. We also attempt to predict the performance of an application in a multiprogrammed environment. While the execution time predictions are relatively coarse, the models produce a reasonable rank-ordering of the scheduling strategies for each application. This study also shows that dynamically partitioning the system using LLPC or similar heuristics provides better performance for applications with a high degree of parallelism than either gang scheduling or static space sharing.

01 Jan 1996
TL;DR: It is shown that even though the scheduler may grant an irregular job substantial machine power, the job may not be able to harness that power efficiently unless it receives special run-time support in the form of load-balancing.
Abstract: The problem considered in this thesis is how to run a workload of multiple parallel jobs on a single parallel machine. Jobs are assumed to be data-parallel with large degrees of parallelism, and the machine is assumed to have an MIMD architecture. We identify a spectrum of scheduling policies between the two extremes of time-slicing, in which jobs take turns to use the whole machine, and space-slicing, in which jobs get disjoint subsets of processors for their own dedicated use. Each of these scheduling policies is evaluated using a metric suited for interactive execution: the minimum machine power being devoted to any job, averaged over time. The following result is demonstrated. If there is no advance knowledge of job characteristics (such as running time, I/O frequency and communication locality) the best scheduling policy is gang-scheduling with instruction-balance. This conclusion validates some of the current practices in commercial systems. The proof uses the notions of clairvoyant adversaries and competitive ratios, drawn from the field of on-line algorithms. The work is then extended to irregular jobs, i.e., jobs in which the degree of parallelism varies during execution. It is shown that even though the scheduler may grant an irregular job substantial machine power, the job may not be able to harness that power efficiently unless it receives special run-time support in the form of load-balancing. A unified analysis is then presented to compare and evaluate various load-balancing schemes. We conclude with some preliminary ideas on the interaction between scheduling and other functions of the operating systems, such as dynamic memory allocation, virtual memory and I/O.

Proceedings ArticleDOI
24 Jan 1996
TL;DR: The parallelizing algorithm solves the important problem of deciding the set of transformations to apply in order to maximize the degree of parallelism (the number of parallel loops within a loop nest) and presents a way of generating efficient transformed code that exploits coarse-grain parallelism on a MIMD system.
Abstract: The paper extends the framework of linear loop transformations by adding a new nonlinear step to the transformation process. The current framework of linear loop transformations cannot identify a significant fraction of parallelism. For this reason, we present a method that complements it with some basic transformations in order to extract the maximum loop parallelism in perfectly nested loops with tight recurrences in the dependence graph. The parallelizing algorithm solves the important problem of deciding the set of transformations to apply in order to maximize the degree of parallelism (the number of parallel loops within a loop nest) and presents a way of generating efficient transformed code that exploits coarse-grain parallelism on a MIMD system.
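The "degree of parallelism" being maximized can be made concrete with a textbook-style dependence check (this is the standard criterion, not the paper's algorithm): a loop can run in parallel when, for every dependence distance vector, either its entry for that loop is zero or the dependence is already carried by an outer loop.

```python
# Count the parallel loops of a nest given its dependence distance
# vectors.  Loop j (0 = outermost) is parallel iff every distance
# vector d has d[j] == 0 or is carried by some outer loop k < j
# (d[k] > 0).  A standard textbook check, not the paper's method.

def parallel_loops(distance_vectors, depth):
    parallel = []
    for j in range(depth):
        ok = True
        for d in distance_vectors:
            carried_outer = any(d[k] > 0 for k in range(j))
            if d[j] != 0 and not carried_outer:
                ok = False
                break
        if ok:
            parallel.append(j)
    return parallel

# Depth-2 nest with distance (1, 1): the outer loop carries the
# dependence, so the inner loop is parallel.
print(parallel_loops([(1, 1)], 2))   # -> [1]
# Distance (0, 1): the inner loop carries it; only the outer is parallel.
print(parallel_loops([(0, 1)], 2))   # -> [0]
```

Linear loop transformations reshape the distance vectors so that more loops pass this test, which is why choosing the transformation sequence determines the achievable degree of parallelism.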

Journal ArticleDOI
Lu Jian
TL;DR: A hierarchical object-oriented design methodology in which two kinds of parallelism, that is, internal parallelism and service parallelism can be exploited gradually and a kind of virtual atomicity is provided.
Abstract: After surveying the rely-guarantee approach and related approaches to extending VDM for the development of parallel programs, two main problems are identified. First, all exploration of parallelism is done during operation decomposition or afterwards, which restricts the degree of parallelism. Second, atomicity is fixed at a single level, and the development complexity cannot be controlled effectively because there is no natural means of placing the level of granularity under the flexible control of the designer. To solve these two problems, we introduce a new concept, data decomposition, based on the ideas of model splitting, modularisation and operation decomposition, and combine it with VDM to form a more general formal development method, DD-VDM, in which certain operation decompositions (operation splits) can be performed before data reification. A nested parallel object-oriented structure is then proposed. Combining these ideas into a unified framework, this paper presents a hierarchical object-oriented design methodology in which two kinds of parallelism, internal parallelism and service parallelism, can be exploited gradually, and a kind of virtual atomicity is provided.

Book ChapterDOI
23 Sep 1996
TL;DR: This work analyzes three important factors that influence the behavior of the parallel execution model: skew factor, degree of parallelism and degree of partitioning, and reports on experiments varying these three parameters with the DBS3 prototype on a 72-node KSR1 multiprocessor.
Abstract: The gains of parallel query execution can be limited because of high start-up time, interference between execution entities, and poor load balancing. In this paper, we present a solution which reduces these limitations in DBS3, a shared-memory parallel database system. This solution combines static data partitioning and dynamic processor allocation to adapt to the execution context. It makes DBS3 almost insensitive to data skew and allows decoupling the degree of parallelism from the degree of data partitioning. To address the problem of load balancing in the presence of data skew, we analyze three important factors that influence the behavior of our parallel execution model: skew factor, degree of parallelism and degree of partitioning. We report on experiments varying these three parameters with the DBS3 prototype on a 72-node KSR1 multiprocessor. The results demonstrate high performance gains, even with highly skewed data.
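Decoupling the degree of parallelism from the degree of partitioning can be sketched as follows: data is split statically into many partitions, but processors claim partitions dynamically from a shared queue, so a skewed (oversized) partition does not idle the other workers. This is a hypothetical sketch of the general idea, not DBS3's actual execution model.

```python
# Static partitioning + dynamic processor allocation: n partitions are
# consumed by p worker threads (p independent of n), so load balances
# itself even when one partition is much larger than the rest.
import queue
import threading

def run_operator(partitions, process, num_workers):
    """Apply `process` to each partition using num_workers threads."""
    work = queue.Queue()
    for p in partitions:
        work.put(p)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                part = work.get_nowait()
            except queue.Empty:
                return
            r = process(part)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# 8 partitions (one heavily skewed), degree of parallelism 3.
parts = [[1] * 100] + [[1] * 5 for _ in range(7)]
sizes = run_operator(parts, len, num_workers=3)
print(sorted(sizes))   # -> [5, 5, 5, 5, 5, 5, 5, 100]
```

With a static one-partition-per-processor assignment, the worker holding the skewed partition would dominate the response time; here the other workers simply drain the remaining partitions in parallel.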

ReportDOI
01 May 1996
TL;DR: The Newton-Krylov-Schwarz solution technique is increasingly being investigated for computational fluid dynamics (CFD) applications due to the advantages of full coupling of all variables and equations, rapid non-linear convergence, and moderate memory requirements.
Abstract: Parallel implementations of a Newton-Krylov-Schwarz algorithm are used to solve a model problem representing low Mach number compressible fluid flow over a backward-facing step. The Mach number is specifically selected to result in a numerically "stiff" matrix problem, based on an implicit finite volume discretization of the compressible 2D Navier-Stokes/energy equations using primitive variables. Newton's method is used to linearize the discrete system, and a preconditioned Krylov projection technique is used to solve the resulting linear system. Domain decomposition enables the development of a global preconditioner via the parallel construction of contributions derived from subdomains. Formation of the global preconditioner is based upon additive and multiplicative Schwarz algorithms, with and without subdomain overlap. The degree of parallelism of this technique is further enhanced with the use of a matrix-free approximation for the Jacobian used in the Krylov technique (in this case, GMRES(k)). Of paramount interest to this study is the implementation and optimization of these techniques on parallel shared-memory hardware, namely the Cray C90 and SGI Challenge architectures. These architectures were chosen as representative and commonly available to researchers interested in the solution of problems of this type. The Newton-Krylov-Schwarz solution technique is increasingly being investigated for computational fluid dynamics (CFD) applications due to the advantages of full coupling of all variables and equations, rapid non-linear convergence, and moderate memory requirements. A parallel version of this method that scales effectively on the above architectures would be extremely attractive to practitioners, resulting in efficient, cost-effective, parallel solutions exhibiting the benefits of the solution technique.
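The matrix-free approximation mentioned above rests on the observation that GMRES only ever needs Jacobian-vector products, which can be formed by a directional finite difference of the nonlinear residual, J(u)v ≈ (F(u + εv) − F(u))/ε, so the Jacobian is never assembled or stored. A minimal sketch with a small illustrative residual F, not the paper's Navier-Stokes discretization:

```python
# Matrix-free Jacobian-vector product via a first-order directional
# finite difference: J(u) v ~= (F(u + eps*v) - F(u)) / eps.
# F here is a toy residual chosen for illustration.

def jacvec(F, u, v, eps=1e-7):
    Fu = F(u)
    Fp = F([ui + eps * vi for ui, vi in zip(u, v)])
    return [(fp - fu) / eps for fp, fu in zip(Fp, Fu)]

# Example residual F(u) = (u0**2 - u1, u0 + u1).  Its Jacobian at
# u = (1, 2) is [[2, -1], [1, 1]], so J @ (1, 0) should be (2, 1).
F = lambda u: [u[0] ** 2 - u[1], u[0] + u[1]]
print(jacvec(F, [1.0, 2.0], [1.0, 0.0]))   # approx [2.0, 1.0]
```

Inside GMRES(k), each Krylov basis extension calls `jacvec` once, at the cost of one extra residual evaluation per iteration instead of a stored Jacobian, which is what reduces memory and exposes additional parallelism.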

Proceedings ArticleDOI
12 Aug 1996
TL;DR: The evaluation results, obtained through join queries on skewed data, show that the SDC-II works quite efficiently under various conditions, attaining a high degree of parallelism.
Abstract: This paper presents the implementation and performance evaluation of the SDC-II, the Super Database Computer II. The SDC-II is a highly parallel relational database server consisting of eight data processing modules interconnected by two networks, where each module contains up to seven processors connected by two busses and four disk drives. The SDC-II employs several key techniques to efficiently support join-intensive queries: a sophisticated parallel hash join algorithm named "the bucket spreading parallel hash join algorithm"; an efficient data passing mechanism in hardware and software; and an intelligent interconnection network that is able to generate a flat bucket distribution and to provide almost conflict-free routing. The evaluation results, obtained through join queries on skewed data, show that the SDC-II works quite efficiently under various conditions, attaining a high degree of parallelism.
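The core of any bucket-based parallel hash join can be sketched as follows: both relations are hashed on the join key into buckets, and matching bucket pairs are then joined independently with an in-memory hash table (the SDC-II's contribution is spreading the buckets across modules via the network to flatten skew, which this simplified sequential sketch does not model).

```python
# Simplified partitioned (bucket-based) hash join.  Tuples of R and S
# are hashed on the join key into matching buckets; each bucket pair is
# an independent unit of work that a parallel machine can assign to a
# different processing module.  Not the SDC-II's bucket spreading
# algorithm itself, just the underlying join structure.

def hash_join(R, S, key_r, key_s, num_buckets=4):
    r_buckets = [[] for _ in range(num_buckets)]
    s_buckets = [[] for _ in range(num_buckets)]
    for t in R:
        r_buckets[hash(key_r(t)) % num_buckets].append(t)
    for t in S:
        s_buckets[hash(key_s(t)) % num_buckets].append(t)
    out = []
    for rb, sb in zip(r_buckets, s_buckets):   # independent bucket pairs
        table = {}
        for t in rb:                           # build phase
            table.setdefault(key_r(t), []).append(t)
        for t in sb:                           # probe phase
            for m in table.get(key_s(t), []):
                out.append((m, t))
    return out

R = [(1, "a"), (2, "b"), (2, "c")]
S = [(2, "x"), (3, "y")]
print(sorted(hash_join(R, S, lambda t: t[0], lambda t: t[0])))
# -> [((2, 'b'), (2, 'x')), ((2, 'c'), (2, 'x'))]
```

With skewed data, one bucket can be much larger than the rest; the bucket spreading idea counters this by distributing each bucket's tuples across all modules during partitioning rather than binding a bucket to a single module up front.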