
Showing papers on "Parallel algorithm published in 1997"


Journal ArticleDOI
TL;DR: A parallel preconditioner is presented for the solution of general sparse linear systems of equations using a sparse approximate inverse computed explicitly and then applied as a preconditioner to an iterative method.
Abstract: A parallel preconditioner is presented for the solution of general sparse linear systems of equations. A sparse approximate inverse is computed explicitly and then applied as a preconditioner to an iterative method. The computation of the preconditioner is inherently parallel, and its application only requires a matrix-vector product. The sparsity pattern of the approximate inverse is not imposed a priori but captured automatically. This keeps the amount of work and the number of nonzero entries in the preconditioner to a minimum. Rigorous bounds on the clustering of the eigenvalues and the singular values are derived for the preconditioned system, and the proximity of the approximate to the true inverse is estimated. An extensive set of test problems from scientific and industrial applications provides convincing evidence of the effectiveness of this approach.
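As a concrete illustration of why this construction parallelizes, here is a minimal Python sketch of the per-column least-squares idea. It is not the paper's algorithm: the paper captures the sparsity pattern of the approximate inverse automatically, whereas this sketch assumes a fixed pattern (that of A^T) and a nonzero diagonal, and the helper name spai_fixed_pattern is hypothetical.

```python
# Hedged sketch: column-wise least-squares construction of a sparse
# approximate inverse M over a *fixed* pattern (assumption; the paper
# grows the pattern adaptively). Each column is an independent small
# least-squares problem, which is what makes the computation parallel.
import numpy as np
import scipy.sparse as sp

def spai_fixed_pattern(A):
    """Minimize ||A m_j - e_j||_2 per column j over a fixed pattern."""
    A = sp.csc_matrix(A)
    n = A.shape[0]
    cols = []
    for j in range(n):
        J = A.getrow(j).indices            # allowed nonzero rows of m_j
        I = np.unique(A[:, J].indices)     # rows of A touched by columns J
        Ahat = A[I, :][:, J].toarray()     # small dense subproblem
        e = np.zeros(len(I))
        e[np.where(I == j)[0]] = 1.0       # e_j restricted to rows I
        m, *_ = np.linalg.lstsq(Ahat, e, rcond=None)
        col = np.zeros(n)
        col[J] = m
        cols.append(col)
    return sp.csc_matrix(np.column_stack(cols))
```

Applying the resulting M inside an iterative method then costs only one extra matrix-vector product per iteration, as the abstract notes.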

635 citations


Journal ArticleDOI
17 Oct 1997-Science
TL;DR: The maximal clique problem has been solved by means of molecular biology techniques; the algorithm is highly parallel and has satisfactory fidelity, representing further evidence for the ability of DNA computing to solve NP-complete search problems.
Abstract: The maximal clique problem has been solved by means of molecular biology techniques. A pool of DNA molecules corresponding to the total ensemble of six-vertex cliques was built, followed by a series of selection processes. The algorithm is highly parallel and has satisfactory fidelity. This work represents further evidence for the ability of DNA computing to solve NP-complete search problems.

610 citations


Proceedings ArticleDOI
01 Jun 1997
TL;DR: The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.
Abstract: One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of occurrence of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with the pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing an intelligent candidate partitioning scheme and uses an efficient communication mechanism to move data among the processors. The Hybrid Distribution algorithm further improves upon the Intelligent Data Distribution algorithm by dynamically partitioning the candidate set to maintain good load balance. The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.
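The candidate-partitioning idea can be made concrete with a small, loudly simplified sketch: each worker owns a disjoint slice of the candidates and counts supports for its slice, so the candidate set occupies the aggregate memory only once. The round-robin split and all names here are illustrative, not the paper's exact IDD scheme.

```python
# Hedged sketch of candidate partitioning for parallel support counting.
from itertools import combinations
from multiprocessing import Pool

transactions = [frozenset(t) for t in
                [{1, 2, 3}, {1, 3}, {2, 3, 4}, {1, 2, 3, 4}]]
candidates = [frozenset(c) for c in combinations([1, 2, 3, 4], 2)]

def count_slice(cand_slice):
    # each worker counts support only for its own share of the candidates
    return {c: sum(c <= t for t in transactions) for c in cand_slice}

if __name__ == "__main__":
    P = 2
    slices = [candidates[i::P] for i in range(P)]   # round-robin partition
    with Pool(P) as pool:
        counts = {}
        for part in pool.map(count_slice, slices):
            counts.update(part)
    frequent = {c for c, n in counts.items() if n >= 2}   # min support = 2
```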

410 citations


Journal ArticleDOI
TL;DR: This paper describes new parallel association mining algorithms that use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets, presents results on the performance of the algorithms on various databases, and compares them against a well-known parallel algorithm.
Abstract: Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost. In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice: once in the set-up phase, and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and do not have to maintain or search complex hash structures. Our experimental testbed is a 32-processor DEC Alpha cluster interconnected by the Memory Channel network. We present results on the performance of our algorithms on various databases, and compare them against a well-known parallel algorithm. The best new algorithm outperforms it by an order of magnitude.
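The "simple intersection operations" on a vertical layout are easy to picture; below is a minimal sketch (toy data, illustrative names) of tidlist intersection, the core kernel these algorithms use instead of hash structures.

```python
# Hedged sketch: vertical database layout and support via intersection.
transactions = {0: {1, 2, 3}, 1: {1, 3}, 2: {2, 3, 4}, 3: {1, 2, 3, 4}}

# invert to the vertical layout: item -> set of transaction ids (tidlist)
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def support(itemset):
    """Support of an itemset is the size of the intersected tidlists."""
    return len(set.intersection(*(tidlists[i] for i in itemset)))

assert support({1, 3}) == 3   # contained in transactions 0, 1, and 3
```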

341 citations


Journal ArticleDOI
TL;DR: Time-efficient algorithms to solve the maze-routing problem on a reconfigurable mesh architecture and a fast algorithm to find the single shortest path (SSP) are presented.

275 citations


Journal ArticleDOI
TL;DR: The first algorithms to factor a wide class of sparse matrices that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures are presented.
Abstract: In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although in this paper we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
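To make the kernel being parallelized concrete, here is a minimal dense left-looking column-Cholesky in numpy. It shows only the dependency structure (column j waits on all columns k < j) that a parallel schedule must respect; the paper's contribution, the scalable sparse formulation, is not attempted here.

```python
# Hedged sketch: dense left-looking Cholesky (not the paper's sparse algorithm).
import numpy as np

def cholesky_columns(A):
    """Column j gathers updates from all previous columns, then is scaled."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        s = A[j:, j] - L[j:, :j] @ L[j, :j]   # updates from columns k < j
        L[j, j] = np.sqrt(s[0])
        L[j + 1:, j] = s[1:] / L[j, j]
    return L

A = np.array([[4., 2.], [2., 3.]])
assert np.allclose(cholesky_columns(A) @ cholesky_columns(A).T, A)
```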

239 citations


Journal ArticleDOI
TL;DR: The minimum-degree greedy algorithm is shown to achieve a performance ratio of (Δ+2)/3 for approximating independent sets in graphs with degree bounded by Δ, and the analysis yields a precise characterization of the size of the independent sets found by the algorithm as a function of the independence number.
Abstract: The minimum-degree greedy algorithm, or Greedy for short, is a simple and well-studied method for finding independent sets in graphs. We show that it achieves a performance ratio of (Δ+2)/3 for approximating independent sets in graphs with degree bounded by Δ. The analysis yields a precise characterization of the size of the independent sets found by the algorithm as a function of the independence number, as well as a generalization of Turán's bound. We also analyze the algorithm when run in combination with a known preprocessing technique, and obtain a $$(2\bar d + 3)/5$$ performance ratio on graphs with average degree $$\bar d$$, improving on the previous best $$(\bar d + 1)/2$$ of Hochbaum. Finally, we present an efficient parallel and distributed algorithm attaining the performance guarantees of Greedy.
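The sequential algorithm being analyzed is short enough to state in full; the sketch below is a straightforward rendering of Greedy (the paper's parallel and distributed versions attain the same guarantees).

```python
# Minimum-degree greedy independent set (Greedy), sequential version.
def greedy_mis(adj):
    """adj: dict vertex -> set of neighbors (undirected graph)."""
    adj = {v: set(ns) for v, ns in adj.items()}
    independent = set()
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))   # minimum-degree vertex
        independent.add(v)
        removed = adj[v] | {v}                    # delete v and its neighbors
        for u in removed:
            adj.pop(u, None)
        for ns in adj.values():
            ns -= removed
    return independent

# triangle with a pendant vertex: Greedy picks the degree-1 vertex first
assert 3 in greedy_mis({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}})
```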

234 citations


Book
26 Feb 1997
TL;DR: This book is written as a textbook for undergraduate and graduate students and provides a careful explanation of the subject as well as motivation for further research.
Abstract: This book is devoted to the investigation of a special topic in theoretical computer science - communication complexity as an abstract measure of the complexity of computing problems. Its main aim is to show how the theoretical study of communication complexity can be useful in the process of designing effective parallel algorithms. The author shows how to get important information about the parallel complexity (parallel time, the number of processors, the descriptional complexity of the topology of the parallel architecture) of specific computing problems from knowledge of their communication complexity. The book is written as a textbook for undergraduate and graduate students and provides a careful explanation of the subject as well as motivation for further research.

220 citations


Journal ArticleDOI
TL;DR: An approach to selecting the shape points and the outer layer used for erosion during each iteration of parallel thinning is introduced; the approach produces good skeletons for different types of corners.

201 citations


Journal ArticleDOI
TL;DR: The PSA algorithm proposed in the paper has shown significant improvements in solution quality for the largest of the test networks, and the conditions under which the parallel algorithm is most efficient are investigated.
Abstract: The simulated annealing optimization technique has been successfully applied to a number of electrical engineering problems, including transmission system expansion planning. The method is general in the sense that it does not assume any particular property of the problem being solved, such as linearity or convexity. Moreover, it has the ability to provide solutions arbitrarily close to an optimum (i.e. it is asymptotically convergent) as the cooling process slows down. The drawback of the approach is the computational burden: finding optimal solutions may be extremely expensive in some cases. This paper presents a parallel simulated annealing (PSA) algorithm for solving the long-term transmission network expansion planning problem. A strategy that does not affect the basic convergence properties of the sequential simulated annealing algorithm has been implemented and tested. The paper investigates the conditions under which the parallel algorithm is most efficient. The parallel implementations have been tested on three example networks: a small 6-bus network and two complex real-life networks. Excellent results are reported in the test section of the paper: in addition to reductions in computing times, the PSA algorithm proposed in the paper has shown significant improvements in solution quality for the largest of the test networks.
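One common PSA strategy, running several annealing chains in parallel and keeping the best result, is sketched below on a toy one-dimensional cost function. This is an illustration only: the paper's strategy is specifically chosen so as not to affect the sequential algorithm's convergence properties, and the cost function here merely stands in for an expansion-plan evaluation.

```python
# Hedged sketch: independent parallel simulated-annealing chains.
import math
import random
from multiprocessing import Pool

def anneal(seed, steps=10_000, t0=1.0, alpha=0.999):
    rng = random.Random(seed)
    cost = lambda x: (x - 3.0) ** 2      # toy stand-in for plan cost
    x = rng.uniform(-10, 10)
    t = t0
    for _ in range(steps):
        y = x + rng.gauss(0, 0.5)        # neighbor move
        if cost(y) < cost(x) or rng.random() < math.exp((cost(x) - cost(y)) / t):
            x = y                        # accept (Metropolis criterion)
        t *= alpha                       # geometric cooling schedule
    return cost(x), x

if __name__ == "__main__":
    with Pool(4) as pool:
        best_cost, best_x = min(pool.map(anneal, range(4)))
```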

164 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel GA-based algorithm designed to simultaneously meet the goals of high performance, scalability, and fast running time; it outperforms both heuristics while taking considerably less running time.

Journal ArticleDOI
TL;DR: An efficient method for computing the discrete cosine transform (DCT) is proposed, which is a generalization of the radix 2 DCT algorithm, and the recursive properties of the DCT for an even length input sequence are derived.
Abstract: An efficient method for computing the discrete cosine transform (DCT) is proposed. Based on direct decomposition of the DCT, the recursive properties of the DCT for an even-length input sequence are derived, generalizing the radix-2 DCT algorithm. From this recursive property, a new DCT algorithm for an even-length sequence is obtained. The proposed algorithm is highly structured and requires fewer computations than others. Its regular structure is suitable for fast parallel algorithms and VLSI implementation.
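The even-indexed half of such a decomposition is a standard identity and shows where the recursion comes from (the odd-indexed outputs satisfy a slightly more involved recurrence, which is the substance of the paper's derivation). With the unnormalized DCT-II

$$X_k = \sum_{n=0}^{N-1} x_n \cos\frac{\pi(2n+1)k}{2N}, \qquad k = 0, \ldots, N-1,$$

substituting k = 2r and folding n with N-1-n gives

$$X_{2r} = \sum_{n=0}^{N/2-1} \left(x_n + x_{N-1-n}\right) \cos\frac{\pi(2n+1)r}{2(N/2)},$$

i.e. the even-indexed outputs of an N-point DCT form an N/2-point DCT of the folded sequence, and recursing on the halves yields a radix-2-style algorithm.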

Proceedings ArticleDOI
19 Oct 1997
TL;DR: Four-round protocols whose error does not decrease under parallel repetition are presented, which exploit non-malleable encryption and can be based on any trapdoor permutation.
Abstract: Whether or not parallel repetition lowers the error has been a fundamental question in the theory of protocols, with applications in many different areas. It is well known that parallel repetition reduces the error at an exponential rate in interactive proofs and Arthur-Merlin games. It seems to have been taken for granted that the same is true in arguments, or other proofs where the soundness only holds with respect to computationally bounded parties. We show that this is not the case. Surprisingly, parallel repetition can actually fail in this setting. We present four-round protocols whose error does not decrease under parallel repetition. This holds for any (polynomial) number of repetitions. These protocols exploit non-malleable encryption and can be based on any trapdoor permutation. On the other hand we show that for three-round protocols the error does go down exponentially fast. The question of parallel error reduction is particularly important when the protocol is used in cryptographic settings like identification, and the error represents the probability that an intruder succeeds.

Journal ArticleDOI
TL;DR: This work gives a randomized parallel algorithm for computing single-source shortest paths in weighted digraphs and shows that the exact shortest-path problem can be efficiently reduced to solving a series of approximate shortest- path subproblems.

Journal ArticleDOI
TL;DR: This paper introduces two arctangent radices and shows that about 2/3 of the rotation directions can be derived in parallel without any error.
Abstract: Each coordinate rotation digital computer (CORDIC) iteration selects the rotation direction by analyzing the results of the previous iteration. In this paper, we introduce two arctangent radices and show that about 2/3 of the rotation directions can be derived in parallel without any error. Some architectures exploiting these strategies are proposed.
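For contrast, a minimal conventional CORDIC rotation is sketched below: the direction d_i must be chosen sequentially from the residual angle, which is exactly the serial dependency the arctangent-radix scheme relaxes for about 2/3 of the iterations. The floating-point Python here is only a model of the fixed-point hardware loop.

```python
# Hedged sketch: conventional CORDIC rotation mode (serial direction choice).
import math

def cordic_rotate(x, y, angle, iters=32):
    # constant scale factor accumulated by the micro-rotations
    K = math.prod(1 / math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(iters))
    z = angle
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0          # depends on the previous iteration
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)        # subtract the arctangent radix
    return x * K, y * K

cx, cy = cordic_rotate(1.0, 0.0, math.pi / 5)
assert abs(cx - math.cos(math.pi / 5)) < 1e-6
```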

Journal ArticleDOI
01 Apr 1997
TL;DR: The proposed two-dimensional multistate cellular automaton architecture achieves a high frequency of operation and is particularly suited for VLSI implementation due to its inherent parallelism, structural locality, regularity, and modularity.
Abstract: This paper presents a new parallel algorithm for collision-free path planning of a diamond-shaped robot among arbitrarily shaped obstacles, which are represented as a discrete image, and its implementation in VLSI. The proposed algorithm is based on a retraction of free space onto the Voronoi diagram, which is constructed through the time evolution of cellular automata, after an initial phase during which the boundaries of obstacles are identified and coded with respect to their orientation. The proposed algorithm is both space and time efficient, since it does not require the modeling of objects or distance and intersection calculations. Additionally, the proposed two-dimensional multistate cellular automaton architecture achieves a high frequency of operation and is particularly suited for VLSI implementation due to its inherent parallelism, structural locality, regularity, and modularity.

Journal ArticleDOI
TL;DR: Timings and segmentation results of the algorithm, built on top of the Message Passing Interface and tested on the Cray T3D, are presented to justify the superiority of the novel design over previous implementations.
Abstract: The parallel watershed transformation used in gray-scale image segmentation is reconsidered on the basis of the component labeling problem. The main idea is to break the sequentiality of the watershed transformation and to correctly delimit the extent of all connected components locally, on each processor, simultaneously. The internal fragmentation of the catchment basins into smaller subcomponents, due to domain decomposition, is finally solved by employing a global connected components operator. Therefore, in a pyramidal structure of master-slave processors, internal contours of adjacent subcomponents within the same component are hierarchically removed. Global final connected areas are efficiently obtained in log₂ N steps on a logical grid of N processors. Timings and segmentation results of the algorithm, built on top of the Message Passing Interface and tested on the Cray T3D, are presented to justify the superiority of the novel design over previous implementations.

Journal ArticleDOI
01 Jul 1997
TL;DR: A prototype suggests that the DEVS formalism can be combined with genetic algorithms running in parallel to serve as the basis of a very general, very fast class of simulation environments.
Abstract: DEVS-C++, a high-performance environment for modeling large-scale systems at high resolution, uses the DEVS (Discrete-EVent system Specification) formalism to represent both continuous and discrete processes. A prototype suggests that the DEVS formalism can be combined with genetic algorithms running in parallel to serve as the basis of a very general, very fast class of simulation environments.

Proceedings Article
25 Aug 1997
TL;DR: This work develops a general approach to the problem of scheduling distributed multi-dimensional resource units for all kinds of parallelism within and across queries and operators, and presents heuristic algorithms for various forms of the problem.
Abstract: Scheduling query execution plans is a particularly complex problem in hierarchical parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and communicates with remote sites by message-passing. We develop a general approach to the problem, capturing the full complexity of scheduling distributed multi-dimensional resource units for all kinds of parallelism within and across queries and operators. We present heuristic algorithms for various forms of the problem, some of which are provably near-optimal. Preliminary experimental results confirm the effectiveness of our approach.

Proceedings ArticleDOI
21 Jun 1997
TL;DR: This paper defines the LoPC model, which is inspired by the LogP model but accounts for contention for message processing resources in parallel algorithms on a multiprocessor or network of workstations, and derives the general form of the model for parallel applications that communicate via active messages.
Abstract: Parallel algorithm designers need computational models that take first-order system costs into account, but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention for message processing resources in parallel algorithms on a multiprocessor or network of workstations. LoPC takes the L, o and P parameters directly from the LogP model and uses them to predict the cost of contention, C. This paper defines the LoPC model and derives the general form of the model for parallel applications that communicate via active messages. Model modifications for systems that implement coherent shared memory abstractions are also discussed. We carry out the analysis for two important classes of applications that have irregular communication. In the case of parallel applications with homogeneous all-to-any communication, such as sparse matrix computations, the analysis yields a simple rule of thumb and insight into contention costs. In the case of parallel client-server algorithms, the LoPC analysis provides a simple and accurate calculation of the optimal allocation of nodes between clients and servers. The LoPC estimates for these applications are shown to be accurate when compared against event-driven simulation and against a sparse matrix computation on the MIT Alewife multiprocessor.
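The paper derives C from L, o and P; the following is a loudly hypothetical illustration, not the published formula. It prices a LogP-style request/reply and adds an M/M/1-style waiting term as a stand-in for contention, just to show the kind of quantity the model predicts.

```python
# Hypothetical illustration only: LogP-style round trip plus a queueing
# term standing in for LoPC's contention cost C. The rho expression and
# all constants are assumptions, not the published model.
def lopc_like_cost(L, o, requests_per_sec):
    service = o                          # server occupancy per message
    rho = requests_per_sec * service     # utilization of the serving node
    assert rho < 1, "server saturated"
    C = service * rho / (1 - rho)        # M/M/1 mean waiting time (illustrative)
    return 2 * L + 4 * o + C             # request + reply, each costing L + 2o
```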


Journal ArticleDOI
TL;DR: This paper describes four different parallel algorithms for implementing the spectral transform method on hypercube- and mesh-connected multicomputers with cut-through routing and reports on computational experiments that were conducted to evaluate their efficiency on parallel computers.
Abstract: The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, we describe these different parallel algorithms and report on computational experiments that we have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. We focus on hypercube- and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but we also indicate how the results extend to other parallel computer architectures. The results of this study are relevant not only to the spectral transform method but also to multidimensional fast Fourier transforms (FFTs) and other parallel transforms.
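The communication structure shared by these algorithms and by multidimensional FFTs is easy to see in the classic transpose formulation: perform local transforms along the axis you own, transpose (the single all-to-all communication step), then transform the other axis. In the sketch below the transpose is an in-memory numpy call standing in for interprocessor communication.

```python
# Hedged sketch: the "transpose" pattern behind parallel 2D transforms.
import numpy as np

def fft2_by_transpose(a):
    a = np.fft.fft(a, axis=1)   # local FFTs on the rows each process owns
    a = a.T                     # stand-in for the all-to-all transpose
    a = np.fft.fft(a, axis=1)   # local FFTs on the former columns
    return a.T

x = np.random.rand(8, 8)
assert np.allclose(fft2_by_transpose(x), np.fft.fft2(x))
```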

Proceedings ArticleDOI
19 Jan 1997
TL;DR: In this article, the authors demonstrate that DNA computers can simulate Boolean circuits with a small overhead, and they also show that for the class NC$^1$, the slowdown can be reduced to a constant, and that the inputs, the Boolean AND gates, and the OR gates can be encoded to DNA oligonucleotide sequences.
Abstract: We demonstrate that DNA computers can simulate Boolean circuits with a small overhead. Boolean circuits embody the notion of massively parallel signal processing and are frequently encountered in many parallel algorithms. Many important problems such as sorting, integer arithmetic, and matrix multiplication are known to be computable by small size Boolean circuits much faster than by ordinary sequential digital computers. This paper shows that DNA chemistry allows one to simulate large semi-unbounded fan-in Boolean circuits with a logarithmic slowdown in computation time. Also, for the class NC$^1$, the slowdown can be reduced to a constant. In this algorithm we have encoded the inputs, the Boolean AND gates, and the OR gates to DNA oligonucleotide sequences. We operate on the gates and the inputs by standard molecular techniques of sequence-specific annealing, ligation, separation by size, limited amplification, sequence-specific cleavage, and detection by size. Preliminary biochemical experiments on a small test circuit have produced encouraging results. Further confirmatory experiments are in progress.

Journal ArticleDOI
TL;DR: A realistic combinatorial optimization problem is used as an example to show how a genetic algorithm can be parallelized in an efficient way and it is shown that it is possible to obtain good solutions to the problem even with a very low communication load.

Proceedings ArticleDOI
18 Dec 1997
TL;DR: The technique is based on a comparison routine that determines the relative position of two points in the order induced by a space filling curve and could be used in conjunction with any parallel sorting algorithm to effect parallel domain decomposition.
Abstract: Partitioning techniques based on space filling curves have received much recent attention due to their low running time and good load balance characteristics. The basic idea underlying these methods is to order the multidimensional data according to a space filling curve and partition the resulting one dimensional order. However, space filling curves are defined for points that lie on a uniform grid of a particular resolution. It is typically assumed that the coordinates of the points are representable using a fixed number of bits, and the run times of the algorithms depend upon the number of bits used. We present a simple and efficient technique for ordering arbitrary and dynamic multidimensional data using space filling curves and its application to parallel domain decomposition and load balancing. Our technique is based on a comparison routine that determines the relative position of two points in the order induced by a space filling curve. The comparison routine could then be used in conjunction with any parallel sorting algorithm to effect parallel domain decomposition.
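For the fixed-precision integer case, the comparison routine can be written with the well-known most-significant-differing-bit trick for Z-order (Morton order); the sketch below is that restricted case, whereas the paper's routine handles arbitrary and dynamic data.

```python
# Hedged sketch: Z-order comparison without explicitly interleaving bits.
import functools

def less_msb(a, b):
    """True iff the highest set bit of a is strictly below that of b."""
    return a < b and a < (a ^ b)

def z_less(p, q):
    """Is point p before point q in Z-order? (tuples of non-negative ints)"""
    best_dim, best_xor = 0, 0
    for d in range(len(p)):
        x = p[d] ^ q[d]
        if less_msb(best_xor, x):       # dimension with the highest differing bit
            best_dim, best_xor = d, x
    return p[best_dim] < q[best_dim]

# usable with any sort, exactly as suggested for parallel domain decomposition
cmp = lambda a, b: -1 if z_less(a, b) else (1 if z_less(b, a) else 0)
pts = sorted([(3, 1), (0, 0), (2, 2)], key=functools.cmp_to_key(cmp))
```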

Journal ArticleDOI
TL;DR: This paper addresses the compile-time optimization of a form of nested-loop computation that is motivated by a computational physics application and a pruning search strategy for determination of an optimal form is developed.
Abstract: This paper addresses the compile-time optimization of a form of nested-loop computation that is motivated by a computational physics application. The computations involve multi-dimensional surface and volume integrals where the integrand is a product of a number of array terms. Besides the issue of optimal distribution of the arrays among the processors, there is also scope for reordering of the operations using the commutativity and associativity properties of addition and multiplication, and the application of the distributive law to significantly reduce the number of operations executed. A formalization of the operation minimization problem and proof of its NP-completeness is provided. A pruning search strategy for determination of an optimal form is developed. An analysis of the communication requirements and a polynomial-time algorithm for determination of optimal distribution of the arrays are also provided.
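A one-line instance of the operation reordering in question: by the distributive law, a double summation over a product of independent terms collapses from O(N^2) multiplications to two O(N) sums and one multiply.

```python
# Tiny worked example of the distributive-law rewrite (illustrative only).
import numpy as np

A = np.random.rand(1000)
B = np.random.rand(1000)

naive = sum(A[i] * B[j] for i in range(len(A)) for j in range(len(B)))
factored = A.sum() * B.sum()    # sum_i sum_j A_i * B_j = (sum_i A_i)(sum_j B_j)
assert np.isclose(naive, factored)
```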

Book ChapterDOI
07 Jul 1997
TL;DR: In this paper, the authors present deterministic parallel algorithms for the coarse-grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering).
Abstract: In this paper, we present deterministic parallel algorithms for the coarse grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering). The algorithms for Problems 1–7 require O(log p) communication rounds and linear sequential work per round. Our results for Problems 1 and 2 hold for arbitrary ratios \(\frac{n}{p}\), i.e. they are fully scalable, and for Problems 3–8 it is assumed that \(\frac{n}{p} \geqslant p^\epsilon\) for some \(\epsilon > 0\), which is true for all commercially available multiprocessors. We view the algorithms presented as an important step towards the final goal of O(1) communication rounds. Note that the number of communication rounds obtained in this paper is independent of n and grows only very slowly with respect to p. Hence, for most practical purposes, the number of communication rounds can be considered as constant. The result for Problem 1 is a considerable improvement over those previously reported. The algorithms for Problems 2–7 are the first practically relevant deterministic parallel algorithms for these problems to be used for commercially available coarse grained parallel machines.
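To fix ideas on Problem 1, here is a sequential emulation of pointer-jumping list ranking: in O(log n) synchronous rounds every node learns its distance to the tail, and each round is one fully parallel step (one communication round in a BSP-style setting). This classic textbook version is shown only to illustrate the problem; it is not the paper's O(log p)-round algorithm.

```python
# Hedged sketch: pointer-jumping list ranking, sequential emulation.
import math

def list_rank(succ):
    """succ[i] = successor of node i; the tail points to itself.
    Returns each node's distance to the tail."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    for _ in range(max(1, math.ceil(math.log2(n)) + 1)):
        # one synchronous round: read the old arrays, write new ones
        rank = [rank[i] + rank[succ[i]] for i in range(n)]
        succ = [succ[succ[i]] for i in range(n)]
    return rank

# list 2 -> 0 -> 1 (tail): distances to the tail are 1, 0, 2
assert list_rank([1, 1, 0]) == [1, 0, 2]
```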

Journal ArticleDOI
TL;DR: The proposed topology of dual-direction ring is shown to be well amenable to parallel implementation of the GA for the UC problem and speed-up and efficiency for each topology with different number of processor are compared to those of the sequential GA approach.
Abstract: Through a constraint handling technique, this paper proposes a parallel genetic algorithm (GA) approach to solving the thermal unit commitment (UC) problem. The developed algorithm is implemented on an eight-processor transputer network, the processors of which are arranged in master-slave and dual-direction ring structures, respectively. The proposed approach has been tested on a 38-unit thermal power system over a 24-hour period. Speed-up and efficiency for each topology with different numbers of processors are compared to those of the sequential GA approach. The proposed dual-direction ring topology is shown to be well suited to parallel implementation of the GA for the UC problem.

Journal ArticleDOI
TL;DR: The parallel approach is shown to consistently perform better than a sequential genetic algorithm when applied to these routing problems and is able to significantly reduce the occurrence of crosstalk.
Abstract: This paper presents a novel approach to solve the VLSI (very large scale integration) channel and switchbox routing problems. The approach is based on a parallel genetic algorithm (PGA) that runs on a distributed network of workstations. The algorithm optimizes both physical constraints (length of nets, number of vias) and crosstalk (delay due to coupled capacitance). The parallel approach is shown to consistently perform better than a sequential genetic algorithm when applied to these routing problems. An extensive investigation of the parameters of the algorithm yields routing results that are qualitatively better than or as good as the best published results. In addition, the algorithm is able to significantly reduce the occurrence of crosstalk.
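A common way to structure such a distributed PGA is the island model: independent populations evolve in worker processes and periodically exchange their best individuals. The sketch below is that generic pattern on a toy bit-string objective; the encoding and fitness merely stand in for the paper's net-length/vias/crosstalk objective, and the migration policy is an assumption.

```python
# Hedged sketch: island-model parallel genetic algorithm (illustrative).
import random
from multiprocessing import Pool

TARGET = [1, 0, 1, 1, 0, 1, 0, 1]        # toy stand-in for a routing encoding

def fitness(ind):
    return sum(a == b for a, b in zip(ind, TARGET))

def evolve_island(args):
    pop, generations = args
    rng = random.Random()
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: len(pop) // 2]    # truncation selection
        children = []
        for _ in range(len(pop) - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(TARGET))
            child = a[:cut] + b[cut:]     # one-point crossover
            if rng.random() < 0.1:        # point mutation
                i = rng.randrange(len(child))
                child[i] ^= 1
            children.append(child)
        pop = parents + children
    return pop

if __name__ == "__main__":
    islands = [[[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
               for _ in range(4)]
    with Pool(4) as pool:
        for _ in range(5):                # migrate between epochs
            islands = pool.map(evolve_island, [(p, 10) for p in islands])
            best = max((max(p, key=fitness) for p in islands), key=fitness)
            for p in islands:
                p[-1] = best[:]           # broadcast the best individual
```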

Proceedings ArticleDOI
15 Nov 1997
TL;DR: This work describes the implementation of a mark-sweep garbage collector for shared-memory machines, reports its performance, and observes that implementation details affect performance heavily.
Abstract: This work describes the implementation of a mark-sweep garbage collector (GC) for shared-memory machines and reports its performance. It is a simple "parallel" collector in which all processors cooperatively traverse objects in the global shared heap. The collector stops the application program during a collection and assumes a uniform access cost to all locations in the shared heap. The implementation is based on the Boehm-Demers-Weiser conservative GC (Boehm GC). Experiments have been done on an Ultra Enterprise 10000 (UltraSparc processor, 250 MHz, 64 processors). We wrote two applications, BH (an N-body problem solver) and CKY (a context-free grammar parser), in a parallel extension to C++. Through the experiments, we observe that load balancing is the key to achieving scalability. A naive collector without load redistribution hardly exhibits speed-up (at most fourfold speed-up on 64 processors). Performance can be improved by dynamic load balancing, which exchanges objects to be scanned between processors, but we still observe that straightforward implementation severely limits performance. First, large objects become a source of significant load imbalance, because the unit of load redistribution is a single object. Performance is improved by splitting a large object into small pieces before pushing it onto the mark stack. Next, processors spend a significant amount of time idle because of a serializing method for termination detection using a shared counter; this problem suddenly appeared on more than 32 processors. By implementing a non-serializing method for termination detection, the idle time is eliminated and performance is improved. With all of these careful implementation choices, we achieved average speed-ups of 28.0 in BH and 28.6 in CKY on 64 processors.
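Two of the implementation points measured here, an explicit mark stack and splitting large objects into fixed-size chunks so the unit of load redistribution stays small, can be sketched compactly. The single-threaded Python below uses an invented heap encoding and is only an illustration of those two ideas, not the Boehm-GC-based collector itself.

```python
# Hedged sketch: mark phase with an explicit stack and large-object splitting.
CHUNK = 4   # maximum references scanned per work unit (assumed constant)

def mark(heap, roots):
    """heap: dict obj_id -> list of referenced obj_ids (invented encoding)."""
    marked = set()
    stack = [(r, 0) for r in roots]          # work units: (object, offset)
    while stack:
        obj, off = stack.pop()
        if off == 0:
            if obj in marked:
                continue
            marked.add(obj)
        refs = heap[obj]
        if len(refs) - off > CHUNK:          # split: push a continuation
            stack.append((obj, off + CHUNK))
        for child in refs[off:off + CHUNK]:
            if child not in marked:
                stack.append((child, 0))
    return marked

heap = {0: [1, 2, 3, 4, 5], 1: [], 2: [0], 3: [], 4: [], 5: []}
assert mark(heap, [0]) == {0, 1, 2, 3, 4, 5}
```

In the parallel setting each work unit on the stack is small enough to hand to another processor, which is what restores load balance for large objects.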