
Showing papers on "SPMD published in 2003"


Proceedings ArticleDOI
07 Jun 2003
TL;DR: Experimental results demonstrate that OpenMP provides competitive performance compared to MPI for a large set of experimental conditions; however, the price of this performance is a strong programming effort on data set adaptation and inter-thread communications.
Abstract: When using a shared memory multiprocessor, the programmer faces the selection of the portable programming model which will deliver the best performance. Even if he restricts his choice to the standard programming environments (MPI and OpenMP), he has a choice of a broad range of programming approaches. To help the programmer in his selection, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B) and two shared memory multiprocessors (IBM SP3 Night Hawk II, SGI Origin 3800). We also present a path from MPI to OpenMP SPMD guiding the programmers starting from an existing MPI code. We present the first SPMD OpenMP version of the NAS benchmark and compare it with other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared to MPI for a large set of experimental conditions. However, the price of this performance is a strong programming effort on data set adaptation and inter-thread communications. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences.
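The SPMD OpenMP style compared here against loop-level parallelism can be caricatured in a few lines: each thread derives its own block of the iteration space from its thread id, much as an MPI rank would, instead of letting the runtime split the loop. A minimal sketch, with plain Python standing in for an OpenMP parallel region and all helper names invented for illustration:

```python
# Hypothetical sketch of the SPMD OpenMP style: every thread runs the same
# code and computes its own block of the iteration space from its thread id.

def block_bounds(tid, nthreads, n):
    """Return the [lo, hi) iteration range owned by thread `tid`."""
    base, rem = divmod(n, nthreads)
    lo = tid * base + min(tid, rem)
    hi = lo + base + (1 if tid < rem else 0)
    return lo, hi

def spmd_sum(data, nthreads):
    partial = [0] * nthreads
    for tid in range(nthreads):          # stands in for a parallel region
        lo, hi = block_bounds(tid, nthreads, len(data))
        partial[tid] = sum(data[lo:hi])  # work on the privately owned block
    return sum(partial)                  # reduction across threads
```

The data-set adaptation the abstract mentions is exactly this kind of explicit ownership arithmetic, which is why the SPMD style costs more programming effort than loop-level directives.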

74 citations


Journal ArticleDOI
01 Feb 2003
TL;DR: The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel programming models on hybrid architectures, and also on whether programming paradigms can separate the optimization of communication and computation.
Abstract: Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler based parallelization, and other parallel programming models on hybrid architectures. The paper focuses on bandwidth and latency aspects, and also on whether programming paradigms can separate the optimization of communication and computation. Benchmark results are presented for hybrid and pure MPI communication. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes.

60 citations


Book ChapterDOI
02 Oct 2003
TL;DR: Preliminary experiments comparing CAF and MPI versions of several of the NAS parallel benchmarks on an Itanium 2 cluster with a Myrinet 2000 interconnect show that the CAF compiler delivers performance that is roughly equal to or, in many cases, better than that of programs parallelized using MPI, even though support for global optimization of communication has not yet been implemented in the compiler.
Abstract: Co-array Fortran (CAF) is an emerging model for scalable, global address space parallel programming that consists of a small set of extensions to the Fortran 90 programming language. Compared to MPI, the widely-used message-passing programming model, CAF’s global address space programming model simplifies the development of single-program-multiple-data parallel programs by shifting the burden for choreographing and optimizing communication from developers to compilers. This paper describes an open-source, portable, and retargetable CAF compiler under development at Rice University that is well-suited for today’s high-performance clusters. Our compiler translates CAF into Fortran 90 plus calls to one-sided communication primitives. Preliminary experiments comparing CAF and MPI versions of several of the NAS parallel benchmarks on an Itanium 2 cluster with a Myrinet 2000 interconnect show that our CAF compiler delivers performance that is roughly equal to or, in many cases, better than that of programs parallelized using MPI, even though support for global optimization of communication has not yet been implemented in our compiler.

54 citations


Patent
14 Nov 2003
TL;DR: In this paper, a system and method for efficiently executing single program multiple data (SPMD) programs in a microprocessor is described, where a micro SIMD unit is located within the microprocessor.
Abstract: A system and method is disclosed for efficiently executing single program multiple data (SPMD) programs in a microprocessor. A micro single instruction multiple data (SIMD) unit is located within the microprocessor. A job buffer that is coupled to the micro SIMD unit dynamically allocates tasks to the micro SIMD unit. The SPMD programs each comprise a plurality of input data streams having moderate diversification of control flows. The system executes each SPMD program once for each input data stream of the plurality of input data streams.
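The dispatch idea in the claim — the same program is queued once per input data stream, and a job buffer hands tasks to the execution unit — can be sketched as follows (an illustrative toy, not the patented design or its hardware interface):

```python
# Toy model of the job-buffer dispatch: one run of the same kernel per
# input stream, fed from a FIFO that stands in for the job buffer.
from collections import deque

def dispatch(kernel, streams):
    jobs = deque(streams)                       # the "job buffer"
    results = []
    while jobs:
        results.append(kernel(jobs.popleft()))  # one execution per stream
    return results
```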

47 citations


Journal ArticleDOI
TL;DR: This paper discusses and evaluates parallel implementations of a segmentation algorithm based on the Split-and-Merge approach and proposes and analyzes several strategies for the selection of region identifiers and their influence on execution time and load distribution.

31 citations


Journal ArticleDOI
01 Nov 2003
TL;DR: This work optimizes the sequential STEM-II (Sulphur Transport Eulerian Model 2) program, a large-scale pollution modeling application, to increase data locality, and then parallelizes it using OpenMP directives for shared memory systems and the MPI library for distributed memory machines.
Abstract: The aim of this work is to provide a high performance air quality simulation using the STEM-II (Sulphur Transport Eulerian Model 2) program, a large-scale pollution modeling application. First, we optimize the sequential program with the aim of increasing data locality. Then, we parallelize the program using OpenMP directives for shared memory systems, and the MPI library for distributed memory machines. Performance results are presented for an SGI O2000 multiprocessor, a Fujitsu AP3000 multicomputer and a cluster of PCs. Experimental results show that the parallel versions of the code achieve important reductions in the CPU time needed by each simulation. This will allow us to obtain results with adequate speed and reliability for the industrial environment where it is intended to be applied.

30 citations


Proceedings ArticleDOI
12 May 2003
TL;DR: Implementation of a "cluster-enabled" OpenMP compiler is presented, which converts programs written for OpenMP into parallel programs using the SCASH static library, moving all shared global variables into SCASH shared address space at runtime.
Abstract: OpenMP has attracted widespread interest because it is an easy-to-use parallel programming model for shared memory multiprocessor systems. Implementation of a "cluster-enabled" OpenMP compiler is presented. Compiled programs are linked to the page-based software distributed-shared-memory system, SCASH, which runs on PC clusters. This allows OpenMP programs to be run transparently in a distributed memory environment. The compiler converts programs written for OpenMP into parallel programs using the SCASH static library, moving all shared global variables into SCASH shared address space at runtime. As data mapping has a great impact on the performance of OpenMP programs compiled for software distributed-shared-memory, extensions to OpenMP directives are defined for specifying data mapping and loop scheduling behavior, allowing data to be allocated to the node where it is to be processed. Experimental results of benchmark programs on PC clusters using both Myrinet and fast Ethernet are reported.

24 citations


Journal ArticleDOI
TL;DR: This work aims at bringing single program multiple data (SPMD) programming into CORBA in a portable way, and shows that portable parallel CORBA objects can efficiently make use of high-performance networks.
Abstract: With the availability of Computational Grids, new kinds of applications are emerging. They raise the problem of how to program them on such computing systems. In this paper, we advocate a programming model based on a combination of parallel and distributed programming models. Compared to previous approaches, this work aims at bringing SPMD programming into CORBA in a portable way. For example, we want to interconnect two parallel codes by CORBA without modifying either CORBA or the parallel communication API. We show that such an approach does not entail any loss of performance compared to previous approaches that required modification to the CORBA standard. Moreover, using an ORB that is able to exploit high performance networks, we show that portable parallel CORBA objects can efficiently make use of such networks.

22 citations


Book ChapterDOI
26 Jun 2003
TL;DR: A tool that relieves users of writing SPMD style OpenMP by hand by automatically converting OpenMP programs into equivalent SPMD style OpenMP, showing how to modify array declarations and parallel loops and how to handle a variety of OpenMP constructs including REDUCTION, ORDERED clauses and synchronization.
Abstract: The scalability of an OpenMP program in a ccNUMA system with a large number of processors suffers from remote memory accesses, cache misses and false sharing. Good data locality is needed to overcome these problems whereas OpenMP offers limited capabilities to control it on ccNUMA architecture. A so-called SPMD style OpenMP program can achieve data locality by means of array privatization, and this approach has shown good performance in previous research. It is hard to write SPMD OpenMP code; therefore we are building a tool to relieve users from this task by automatically converting OpenMP programs into equivalent SPMD style OpenMP. We show the process of the translation by considering how to modify array declarations, parallel loops, and showing how to handle a variety of OpenMP constructs including REDUCTION, ORDERED clauses and synchronization. We are currently implementing these translations in an interactive tool based on the Open64 compiler.
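The array-privatization step such a tool performs can be illustrated with a toy example: a shared array updated inside a parallel loop is replaced by per-thread private slices that live in memory local to each ccNUMA node, and a REDUCTION clause becomes an explicit combine over the private copies. A hedged Python sketch (all names invented; the tool's real output is SPMD-style OpenMP code, not Python):

```python
# Sketch of array privatization: the shared array `a` of the original
# "for i: total += a[i]" loop is carved into per-thread private slices,
# each thread reduces over its own slice, then a final cross-thread
# reduction combines the partial results.

def privatized_reduction(n, nthreads):
    a = [float(i) for i in range(n)]              # originally shared array
    chunk = (n + nthreads - 1) // nthreads
    private = [a[t * chunk:(t + 1) * chunk]       # privatized slices
               for t in range(nthreads)]
    partials = [sum(p) for p in private]          # per-thread local work
    return sum(partials)                          # explicit REDUCTION
```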

19 citations


Proceedings ArticleDOI
20 Oct 2003
TL;DR: The results show DSM has similar performance to message passing for the embarrassingly parallel class; however, the performance of DSM is lower than PVM and MPI for the synchronous and loosely synchronous classes of problems.
Abstract: We compare the performance of the Treadmarks DSM system with two popular message passing systems (PVM and MPI). The comparison is done on 1, 2, 4, 8, 16, 24, and 32 nodes. Applications are chosen to represent three classes of problems: loosely synchronous, embarrassingly parallel, and synchronous. The results show DSM has similar performance to message passing for the embarrassingly parallel class. However, the performance of DSM is lower than PVM and MPI for the synchronous and loosely synchronous classes of problems. An analysis of the reasons is presented.

19 citations


Proceedings ArticleDOI
12 May 2003
TL;DR: This paper describes the implementation of the page management in Mome, a user-level distributed shared memory (DSM) that provides a shared segment space to parallel programs running on distributed memory computers or clusters.
Abstract: This paper describes the implementation of the page management in Mome, a user-level distributed shared memory (DSM). Mome provides a shared segment space to parallel programs running on distributed memory computers or clusters. Individual processes can request mappings between their local address space and Mome segments. The DSM handles the consistency of mapped memory regions at the page level. A node can freely select the consistency model which is applied to its own view of a page among two models: the classical strong consistency model and a simple and very basic weak model. Under the weak model, each process of the parallel application must send a consistency request to the DSM each time its view of the shared data needs to integrate modifications from other nodes. Mome targets the execution of programs from the high performance community using an SPMD computation model and the coupling of these simulation codes using an MIMD model.

Journal ArticleDOI
01 Jun 2003
TL;DR: Numerical results and elapsed-time measurements show the importance of using an appropriate load balancing algorithm and the reductions in elapsed time that can be achieved, and illustrate that the most suitable load balancing strategy may vary with the type of application and with the number of available processors.
Abstract: The central contribution of this work is SAMBA (Single Application, Multiple Load Balancing), a framework for the development of parallel SPMD (single program, multiple data) applications with load balancing. This framework models the structure and the characteristics common to different SPMD applications and supports their development. SAMBA also contains a library of load balancing algorithms. This environment allows the developer to focus on the specific problem at hand. Special emphasis is given to the identification of appropriate load balancing strategies for each application. Three different case studies were used to validate the functionality of the framework: matrix multiplication, numerical integration, and a genetic algorithm. These applications illustrate its ease of use and the relevance of load balancing. Their choice was oriented by the different load imbalance factors they present and by their different task creation mechanisms. The computational experiments reported for these case studies made possible the validation of SAMBA and the comparison, without additional reprogramming costs, of different load balancing strategies for each of them. The numerical results and the elapsed-time measurements show the importance of using an appropriate load balancing algorithm and the reductions in elapsed time that can be achieved. They also illustrate that the most suitable load balancing strategy may vary with the type of application and with the number of available processors. Besides the support to the development of SPMD applications, the load balancing facilities offered by SAMBA also play an important role in the development of efficient parallel implementations.
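Why the choice of strategy matters can be seen with a toy comparison (not SAMBA's API; the functions are invented for illustration): with uneven task costs, a static block assignment can leave one worker with most of the load, while a dynamic earliest-free-worker queue evens the finish times out.

```python
# Makespan (time until the last worker finishes) under two strategies.
from heapq import heappush, heappop

def static_makespan(costs, workers):
    """Static block assignment: contiguous chunks of tasks per worker."""
    chunk = (len(costs) + workers - 1) // workers
    return max(sum(costs[w * chunk:(w + 1) * chunk]) for w in range(workers))

def dynamic_makespan(costs, workers):
    """Dynamic assignment: each task goes to the earliest-free worker."""
    heap = [0.0] * workers            # current finish time of each worker
    for c in costs:
        heappush(heap, heappop(heap) + c)
    return max(heap)
```

For a task list with one expensive outlier, the dynamic queue's makespan is bounded by the outlier while the static split stacks cheap tasks on top of it — the kind of application-dependent difference the case studies measure.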

Book ChapterDOI
26 Jun 2003
TL;DR: How to interprocedurally detect whether the OpenMP program consistently schedules the parallel loops and where the strategy used to translate them differs from the straightforward approach that can otherwise be applied is explained.
Abstract: A so-called SPMD style OpenMP program can achieve scalability on ccNUMA systems by means of array privatization, and earlier research has shown good performance under this approach. Since it is hard to write SPMD OpenMP code, we showed a strategy for the automatic translation of many OpenMP constructs into SPMD style in our previous work. In this paper, we first explain how to interprocedurally detect whether the OpenMP program consistently schedules the parallel loops. If the parallel loops are consistently scheduled, we may carry out array privatization according to OpenMP semantics. We give two examples of code patterns that can be handled despite the fact that they are not consistent, and where the strategy used to translate them differs from the straightforward approach that can otherwise be applied.

Book ChapterDOI
02 Oct 2003
TL;DR: In this article, the authors present new algorithms to enforce sequential consistency for the special case of the Single Program Multiple Data (SPMD) model of parallelism, and present three polynomial-time methods that more accurately support programs with array accesses.
Abstract: The simplest semantics for parallel shared memory programs is sequential consistency in which memory operations appear to take place in the order specified by the program. But many compiler optimizations and hardware features explicitly reorder memory operations or make use of overlapping memory operations which may violate this constraint. To ensure sequential consistency while allowing for these optimizations, traditional data dependence analysis is augmented with a parallel analysis called cycle detection. In this paper, we present new algorithms to enforce sequential consistency for the special case of the Single Program Multiple Data (SPMD) model of parallelism. First, we present an algorithm for the basic cycle detection problem, which lowers the running time from O(n³) to O(n²). Next, we present three polynomial-time methods that more accurately support programs with array accesses. These results are a step toward making sequentially consistent shared memory programming a practical model across a wide range of languages and hardware platforms.
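The core subproblem — deciding whether a program-order edge lies on a cycle of program-order and conflict edges, in which case it must be preserved — reduces to cycle detection in a directed graph. A generic sketch of that reduction (a plain iterative DFS, not the paper's improved O(n²) algorithm):

```python
# Toy cycle detection over a graph of memory accesses: program-order edges
# within a thread plus conflict edges between threads. A back edge found
# during DFS means a cycle, so the edges on it must not be reordered.

def has_cycle(nodes, edges):
    """Detect a directed cycle with iterative DFS (colors: 0=new, 1=open, 2=done)."""
    adj = {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v)
    color = {u: 0 for u in nodes}
    for start in nodes:
        if color[start]:
            continue
        stack = [(start, iter(adj[start]))]
        color[start] = 1
        while stack:
            u, it = stack[-1]
            for v in it:
                if color[v] == 1:
                    return True          # back edge => cycle
                if color[v] == 0:
                    color[v] = 1
                    stack.append((v, iter(adj[v])))
                    break
            else:
                color[u] = 2
                stack.pop()
    return False
```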

Proceedings ArticleDOI
27 Oct 2003
TL;DR: A simple but effective approach for parallelization of cellular neural networks for image processing is developed using the SPMD model and is based on the structural data parallel approach.
Abstract: In this paper a simple but effective approach for parallelization of cellular neural networks for image processing is developed. Digital gray-scale images were used to evaluate the program. The approach uses the SPMD (single-program multiple-data) model and is based on the structural data parallel approach (Schikuta et al, 1996). The process of parallelizing the algorithm employs HPF to generate an MPI-based program and the performance behavior was analyzed on two different cluster architectures.

Journal ArticleDOI
TL;DR: This work discusses optimization strategies used and their degree of success to increase performance of an MPI-based unstructured finite element simulation code written in Fortran 90 and discusses performance results based on implementations using several modern massively parallel computing platforms.
Abstract: The message-passing interface (MPI) has become the standard in achieving effective results when using the message passing paradigm of parallelization. Codes written using MPI are extremely portable and are applicable to both clusters and massively parallel computing platforms. Since MPI uses the single program, multiple data (SPMD) approach to parallelism, good performance requires careful tuning of the serial code as well as careful data and control flow analysis to limit communication. We discuss optimization strategies used and their degree of success to increase performance of an MPI-based unstructured finite element simulation code written in Fortran 90. We discuss performance results based on implementations using several modern massively parallel computing platforms including the SGI Origin 3800, IBM Nighthawk 2 SMP, and Cray T3E-1200.

Proceedings ArticleDOI
22 Apr 2003
TL;DR: A small SPMD library (SPMDlib) is developed on top of MPI; a new version contains far fewer, generic routines that can be optimized for different network topologies, and extensions for Fortran 90/95 and C are discussed.
Abstract: Most image processing algorithms can be parallelized by splitting parallel loops and by using very few communication patterns. Code parallelization using MPI still involves much programming overhead. In order to reduce these overheads, we first developed a small SPMD library (SPMDlib) on top of MPI. The programmer can use the library routines themselves, because they are easy to learn and to apply, even without knowing MPI. However, in order to increase user friendliness, we also developed a small set of parallelization and communication directives/pragmas (SPMDdir), together with a parser that converts these into library calls. SPMDdir is used to develop a new version of SPMDlib. This new version contains far fewer, but generic, routines that can be optimized for different network topologies. Extensions for Fortran 90/95 and C are discussed, as well as communication optimizations.

Journal ArticleDOI
01 Sep 2003
TL;DR: Some optimizations to achieve large simulations are presented, such as communication overlapping, cache-friendly data management, and the use of a parallel sparse PCG solver for Poisson's equation.
Abstract: In this work, we simulate the interaction between intense laser radiation and a fully ionized plasma by solving a Vlasov-Maxwell system using the "Particle-In-Cell" (PIC) method. This method provides a very detailed description of the plasma dynamics, but at the expense of large computer resources. Our SPMD 3D PIC code, CALDER, which is fully relativistic, is based on a spatial domain decomposition. Each processor is assigned one subdomain and is in charge of updating its field values and particle coordinates. This paper presents some optimizations to achieve large simulations, such as communication overlapping, cache-friendly data management, and the use of a parallel sparse PCG solver for Poisson's equation. Finally, we present the benefits from these optimizations on the IBM SP3 and the physical results for a large case simulation obtained on the CEA/DIF Teraflops parallel computer.

Book ChapterDOI
26 Jun 2003
TL;DR: A prototype runtime system, providing support at the backend of the NANOS OpenMP compiler, that enables the execution of unmodified OpenMP Fortran programs on both SMPs and clusters of multiprocessors, either through the hybrid programming model (MPI+OpenMP) or directly on top of Software Distributed Shared Memory (SDSM).
Abstract: This paper presents a prototype runtime system, providing support at the backend of the NANOS OpenMP compiler, that enables the execution of unmodified OpenMP Fortran programs on both SMPs and clusters of multiprocessors, either through the hybrid programming model (MPI+OpenMP) or directly on top of Software Distributed Shared Memory (SDSM). The latter is feasible by adopting a share-everything approach for the code generated by the OpenMP compiler, which corresponds to the "default shared" philosophy of OpenMP. Specifically, the user-level thread stacks and the Fortran common blocks are allocated explicitly, though transparently to the programmer, in shared memory. The management of the internal runtime system structures and of the fork-join multilevel parallelism is based on explicit communication, exploiting however the shared-memory hardware of the available SMP nodes whenever this is possible. The modular design of the runtime system allows the integration of existing unmodified SDSM libraries, despite their design for SPMD execution.

Journal ArticleDOI
TL;DR: The results reveal that both shared and distributed memory parallel computation are very efficient with an almost perfect application speedup and may be applied to the most advanced powder simulations.

01 Jan 2003
TL;DR: This work introduces a new view of distributed computation, called the NavP view, under which a distributed program is composed of multiple sequential self-migrating threads called DSCs, which exhibit the properties of algorithmic integrity and parallel program composition orthogonality.
Abstract: We introduce a new view of distributed computation, called the NavP view, under which a distributed program is composed of multiple sequential self-migrating threads called DSCs. In contrast with those in the conventional SPMD style, programs developed in the NavP view exhibit the nice properties of algorithmic integrity and parallel program composition orthogonality, which make them clean and easy to develop and maintain. The NavP programs are also scalable. We use example code and performance data to demonstrate the advantages of using the NavP view for general purpose distributed parallel programming.

Proceedings ArticleDOI
22 Apr 2003
TL;DR: This work presents a simple case-study performance analysis using three programs from the SPLASH-2 suite, quantifies the overhead incurred by the programs when they are monitored with SBT, and concludes that the cost of the instrumentation is negligible.
Abstract: SBT is a portable library and tool for on-line debugging and performance monitoring of shared-memory parallel programs using the single-program-multiple-data (SPMD) model of parallelism. SPMD programs often use barriers to synchronize threads of execution and to delimit the start and end of different phases of computation. Through its useful barrier constructs, dynamic performance warnings, and integration with hardware event counter libraries, SBT helps programmers localize deadlocks and performance bottlenecks in their parallel programs. To demonstrate SBT's applicability and usefulness, we present a simple case-study performance analysis using three programs from the SPLASH-2 suite. In addition, we quantify the overhead incurred by the programs when they are monitored with SBT, and conclude that the cost of the instrumentation is negligible.
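The barrier-delimited-phases pattern that SBT instruments looks like this in miniature, with Python threads standing in for an SPMD runtime (all names are illustrative, not SBT's API):

```python
# Every thread logs phase 1, waits at the barrier, then logs phase 2.
# The barrier guarantees no thread enters phase 2 before all have
# finished phase 1 -- the phase boundary a barrier tool can monitor.
import threading

def run_phases(nthreads):
    barrier = threading.Barrier(nthreads)
    log, lock = [], threading.Lock()

    def worker(tid):
        with lock:
            log.append(("phase1", tid))
        barrier.wait()                 # all threads synchronize here
        with lock:
            log.append(("phase2", tid))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()
    # every phase1 record precedes every phase2 record
    return all(p == "phase1" for p, _ in log[:nthreads])
```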

Book ChapterDOI
TL;DR: The DASUD (Diffusion Algorithm Searching Unbalanced Domains) algorithm has been implemented in an SPMD parallel-image thinning application to balance the workload across processors as computation proceeds, and was found to be effective in reducing computation time.
Abstract: The DASUD (Diffusion Algorithm Searching Unbalanced Domains) algorithm has been implemented in an SPMD parallel-image thinning application to balance the workload in the processors as computation proceeds and was found to be effective in reducing computation time. The average performance gain is about 40% for a test image of size 2688x1440 on a cluster of 12 PCs in a PVM environment.
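A generic diffusion step — not DASUD itself, whose search over unbalanced domains is more refined — conveys the idea: each processor repeatedly pushes a fraction of its excess load to lighter neighbors on a ring, so the workload evens out as computation proceeds.

```python
# Toy diffusion load balancing on a ring of processors. Total load is
# conserved; repeated rounds shrink the imbalance between neighbors.

def diffuse(load, rounds, alpha=0.5):
    n = len(load)
    for _ in range(rounds):
        nxt = load[:]
        for i in range(n):
            for j in ((i - 1) % n, (i + 1) % n):     # ring neighbors
                if load[i] > load[j]:
                    # send a fraction of the excess to the lighter neighbor
                    delta = alpha * (load[i] - load[j]) / 2
                    nxt[i] -= delta
                    nxt[j] += delta
        load = nxt
    return load
```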

Proceedings ArticleDOI
Suchuan Dong1, D. Lucor1, V. Symeonidis1, J. Xu1, George Em Karniadakis1 
09 Jun 2003
TL;DR: Because a greatly reduced number of processes are involved in the communications at each level, these multilevel parallel paradigms reduce the network latency overhead and enable the applications to scale to a large number of processors more efficiently.
Abstract: Realistic simulations of flow past a flexible cylinder subject to vortex-induced vibrations require a large number of Fourier modes along the cylinder span and high resolutions in the streamwise and cross-flow directions. Parallel computations employing a single-level parallelism for this type of problems have clear performance limitations that prevent effective scaling to the large processor count on modern supercomputers. We present two multilevel parallel paradigms based on MPI/MPI and MPI/OpenMP for high-order CFD methods within the spectral element framework and compare their performance. In the MPI/MPI model, we employ MPI process groups/communicators to decompose the flow domain and MPI processes into different levels. In the MPI/OpenMP model, we employ multiple OpenMP threads to split the workload within the subdomain and take a coarse-grain approach that significantly reduces the OpenMP synchronizations. For identical configurations the MPI/MPI model is observed to be generally more efficient. However, for dynamic p-refinement the MPI/OpenMP approach is more effective. Because a greatly reduced number of processes are involved in the communications at each level, these multilevel parallel paradigms reduce the network latency overhead and enable the applications to scale to a large number of processors more efficiently.
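The first level of the MPI/MPI decomposition boils down to splitting a flat rank into a (group, local) pair — the same color/key arithmetic one would pass to MPI_Comm_split when building per-level communicators. A tiny sketch of just that arithmetic (plain Python; the communicator calls themselves belong to MPI and are not shown):

```python
# Map a flat rank onto the two-level decomposition: which group of ranks
# (e.g. one Fourier-mode group) it belongs to, and its rank inside that
# group's subdomain communicator.

def two_level(rank, ranks_per_group):
    group = rank // ranks_per_group   # color for the group communicator
    local = rank % ranks_per_group    # key/rank within the group
    return group, local
```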

Proceedings ArticleDOI
TL;DR: A highly parallel (SIMD within SPMD) tokenizer for the APL language, itself written in APL, that serves the didactic purpose of demonstrating that a large amount of parallelism exists in non-numeric computation.
Abstract: We describe a highly parallel (SIMD within SPMD) tokenizer for the APL language, itself written in APL. The tokenizer does not break any new ground in the world of parallel computation, but does serve the didactic purpose of demonstrating that a large amount of parallelism exists in non-numeric computation. We plan to release the APEX APL Compiler, including the tokenizer, under the GNU Public License.

Book ChapterDOI
17 Dec 2003
TL;DR: An O(m²) algorithm for computing an analogous delay set for SPMD programs used in practice, which must be structured so that all variables are initialized before their values are read.
Abstract: We present compiler analysis for single program multiple data (SPMD) programs that communicate through shared address space. The choice of memory consistency model is sequential consistency as defined by Lamport [9]. Previous research has shown that these programs require cycle detection to perform any kind of code re-ordering either at hardware or software. So far, the best known cycle detection algorithm for SPMD programs has been given by Krishnamurthy et al. [5, 6, 8]. Their algorithm computes a delay set that is composed of those memory access pairs that if re-ordered either by hardware or software may cause violation of sequential consistency. This delay set is computed in O(m³) time where m is the number of read/write accesses. In this paper, we present an O(m²) algorithm for computing an analogous delay set for SPMD programs that are used in practice. These programs must be structured with the property that all the variables are initialized before their value is read.


Journal ArticleDOI
TL;DR: The local block distance between two active elements with the same offset and destination (source) in a processor is investigated, and an algorithm for the sending phase and the receive-execute phase is developed.

01 Oct 2003
TL;DR: NASA Langley Research Center has developed a library that allows Intel NX message passing codes to be executed under the more popular and widely supported Parallel Virtual Machine (PVM) message passing library.
Abstract: NASA Langley Research Center has developed a library that allows Intel NX message passing codes to be executed under the more popular and widely supported Parallel Virtual Machine (PVM) message passing library. PVM was developed at Oak Ridge National Labs and has become the de facto standard for message passing. This library will allow the many programs that were developed on the Intel iPSC/860 or Intel Paragon in a Single Program Multiple Data (SPMD) design to be ported to the numerous architectures that PVM (version 3.2) supports. Also, the library adds global operations capability to PVM. A familiarity with Intel NX and PVM message passing is assumed.

Book ChapterDOI
Takanobu Ogawa1
01 Jan 2003
TL;DR: An adaptive Cartesian mesh flow solver, in which a flow field is discretised with recursively subdivided rectangular meshes, is parallelized; a 2N-tree data structure is utilised to organise anisotropically refined meshes.
Abstract: An adaptive Cartesian mesh flow solver in which a flow field is discretised with recursively subdivided rectangular meshes is parallelized. A 2N-tree data structure is utilised to organise anisotropically refined meshes. Parallelization is based on the SPMD paradigm. The domain decomposition technique is used, and the tree data structure is split so that the computational load is balanced. Parallel efficiency is examined on a PC cluster.