
Showing papers on "Bitonic sorter published in 1987"


Book ChapterDOI
19 Feb 1987
TL;DR: An algorithm is presented sorting N = n_1·n_2·...·n_r, r ≥ 2, elements on an n_1 × n_2 × ... × n_r mesh-connected array of processors within 2(n_1 + ... + n_{r-1}) + n_r + O(n_1^(1-ε) + ... + n_r^(1-ε)), ε > 0, data interchange steps, which asymptotically matches the recently established lower bound for r-dimensional meshes.
Abstract: An algorithm is presented sorting N = n_1·n_2·...·n_r, r ≥ 2, elements on an n_1 × n_2 × ... × n_r mesh-connected array of processors within 2(n_1 + ... + n_{r-1}) + n_r + O(n_1^(1-ε) + ... + n_r^(1-ε)), ε > 0, data interchange steps. Hence this algorithm asymptotically matches the recently established lower bound for r-dimensional meshes. The asymptotically optimal bound of (2r/2^(1/r))·N^(1/r) interchange steps can only be attained on r-dimensional meshes with aspect ratio n_i : n_r = 1 : 2 for all i = 1, ..., r−1. Moreover, for meshes with wrap-around connections the slightly altered algorithm needs only 1.5(n_1 + ... + n_{r-1}) + n_r + O(n_1^(1-ε) + ... + n_r^(1-ε)) data interchange steps, which asymptotically is significantly smaller than the lower bound for sorting on meshes without wrap-arounds.
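The paper's multiway mesh algorithm is involved; as a minimal illustration of the mesh-connected sorting model it targets, the classic shearsort procedure for a 2-D mesh can be sketched as below. This is not the paper's algorithm (shearsort is asymptotically slower), just a small, runnable picture of row/column interchange phases on a mesh:

```python
import math

def shearsort(grid):
    """Sort a 2-D mesh into snake-like row-major order using shearsort.
    Illustrative only: mesh algorithms such as the paper's achieve far
    fewer interchange steps; shearsort needs O(sqrt(N) log N) steps."""
    rows = len(grid)
    phases = (math.ceil(math.log2(rows)) + 1) if rows > 1 else 1
    for _ in range(phases):
        # Row phase: even rows sort ascending, odd rows descending (snake).
        for i, row in enumerate(grid):
            row.sort(reverse=(i % 2 == 1))
        # Column phase: every column sorts ascending, top to bottom.
        for j in range(len(grid[0])):
            col = sorted(grid[i][j] for i in range(rows))
            for i in range(rows):
                grid[i][j] = col[i]
    return grid
```

Each `sort` here stands in for a sequence of nearest-neighbor compare-exchange steps along one mesh dimension.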

42 citations


Proceedings ArticleDOI
01 May 1987
TL;DR: The performance of sorting algorithms on local area networks (LANs) should be analyzed in a manner that is different from the ways that parallel and distributed sorting algorithms are usually analyzed, and an empirical approach is proposed which will provide more insight into the performance of the algorithms.
Abstract: We adapt several parallel sorting algorithms (block sorting algorithms) and distributed sorting algorithms for implementation on an Ethernet network with diskless Sun workstations. We argue that the performance of sorting algorithms on local area networks (LANs) should be analyzed in a manner that is different from the ways that parallel and distributed sorting algorithms are usually analyzed. Consequently, we propose an empirical approach which will provide more insight into the performance of the algorithms. We obtain data on communication time, local processing time, and response time (i.e. total running time) of each algorithm for various file sizes and different numbers of processors. Comparing the performance data with our theoretical analysis, we attempt to provide a rationale for the behaviour of the algorithms and project their future behaviour as file size, number of processors, or interprocessor communication facilities change.

6 citations


Journal ArticleDOI
TL;DR: It is shown that the omega network cannot realize some of the bit-permute mappings, such as the perfect shuffle and the bit reversal, without conflicts, but it can realize both of these mappings provided that data items are accessed from memories according to a specific skewed scheme.
Abstract: This paper presents a study of the best and worst mappings for the omega network proposed by D. H. Lawrie in 1975. We identify mappings that produce no conflicts in the network and mappings that produce a maximum number of conflicts. The analysis of mappings for some typical applications shows that an initial allocation of data to memory modules determines the contention within the network for all iterations of the algorithm. For the case of the FFT and the bitonic sort algorithm executed on a shared-memory architecture, we prove that if no conflicts are produced during the first iteration of the algorithm, then no conflicts are produced during any other iteration. Moreover, if a maximum number of conflicts is produced during the first iteration, then a maximum number of conflicts is produced during all other iterations of the algorithm. For the d-dimensional grid computations where communication is required with the 2d nearest neighbors, we prove that if the initial allocation produces no conflicts within the network, then communication with all the neighbors is conflict-free. Also, if the initial allocation produces a maximum number of conflicts, then communication with all the neighbors is maximum-conflict. We show that the omega network cannot realize some of the bit-permute mappings, such as the perfect shuffle and the bit reversal, without conflicts. The network can realize both of these mappings provided that data items are accessed from memories according to a specific skewed scheme.
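The conflict behaviour the paper studies can be sketched with the standard window-of-bits description of omega-network paths: the packet travelling from source s to destination d occupies, after stage i of an n-stage network, the wire labelled by the low n bits of (s << (i+1)) | (d >> (n-1-i)), and two packets needing the same wire collide. The helper names below (`omega_conflict_free`, `rotl`, `bit_reverse`) are hypothetical, not code from the paper:

```python
def omega_conflict_free(dest, n):
    """Check whether the mapping s -> dest[s] routes through an n-stage
    omega network (N = 2**n inputs) without link conflicts, using the
    window-of-bits characterization of omega-network paths."""
    N = 1 << n
    mask = N - 1
    for stage in range(n):
        # Wire occupied after this stage by the packet from s to dest[s].
        labels = {((s << (stage + 1)) | (dest[s] >> (n - 1 - stage))) & mask
                  for s in range(N)}
        if len(labels) < N:   # two packets need the same wire: conflict
            return False
    return True

def rotl(x, n):
    """Rotate an n-bit value left by one: the perfect shuffle permutation."""
    return ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)

def bit_reverse(x, n):
    """Reverse the n-bit representation of x: the bit-reversal permutation."""
    return int(format(x, f'0{n}b')[::-1], 2)
```

Running the check on the identity, the perfect shuffle, and the bit reversal reproduces the paper's qualitative claim: the identity routes conflict-free, while shuffle and bit reversal do not.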

5 citations


ReportDOI
01 Feb 1987
TL;DR: It is shown that copies present in divide-and-conquer algorithms like bitonic sort and quicksort can be removed, and the effectiveness of these optimizations is evaluated, showing that in many cases they approach the efficiency of an imperative language.
Abstract: Copy elimination is an important optimization for implementing functional languages. Though it is related to the problem of copy propagation, which has been considered in many compilers, and also to storage compaction, the term is used in a more general context where structured values can be updated and the computation tree can be reordered. Because of these two additional possibilities, copy elimination is a hard problem, being undecidable in general. We propose an optimization approach based on abstract interpretation which uses fixpoint iteration for computing address expressions. These address expressions supply the final target for a computation, eliminating the need to copy values through intermediate results. Our work is in the context of a single assignment language called SAL. Our implementation has an operational model for computing address expressions by using reduction rules. Using this, we show that copies present in divide-and-conquer algorithms like bitonic sort and quicksort can be removed. We evaluate the effectiveness of these optimizations, showing that in many cases we can approach the efficiency of an imperative language. We also present data on optimising several small but challenging benchmarks.
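The kind of copying the report targets can be illustrated in miniature (a sketch, not SAL or the report's optimizer): a functional-style quicksort that allocates fresh lists on every call, next to the copy-eliminated counterpart that partitions in place so results land directly at their final addresses:

```python
def quicksort_copying(xs):
    """Functional-style quicksort: every call allocates new lists,
    the kind of intermediate copying a copy-elimination pass removes."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    return (quicksort_copying([x for x in rest if x < pivot])
            + [pivot]
            + quicksort_copying([x for x in rest if x >= pivot]))

def quicksort_in_place(xs, lo=0, hi=None):
    """The copy-eliminated counterpart: Lomuto partition in place,
    so every element is written directly to its final address."""
    if hi is None:
        hi = len(xs) - 1
    if lo < hi:
        pivot, i = xs[hi], lo
        for j in range(lo, hi):
            if xs[j] < pivot:
                xs[i], xs[j] = xs[j], xs[i]
                i += 1
        xs[i], xs[hi] = xs[hi], xs[i]
        quicksort_in_place(xs, lo, i - 1)
        quicksort_in_place(xs, i + 1, hi)
    return xs
```

Both compute the same result; the second avoids the intermediate allocations, which is the efficiency gap the report's address-expression analysis aims to close automatically.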

3 citations


Dissertation
01 Jan 1987
TL;DR: This thesis provides empirical results for selected parallel sorting algorithms (block sorting algorithms) and distributed sorting algorithms which have been adapted for implementation on an Ethernet network with diskless Sun workstations, in order to provide more insight into the performance of the algorithms.
Abstract: This thesis provides empirical results for selected parallel sorting algorithms (block sorting algorithms) and distributed sorting algorithms which have been adapted for implementation on an Ethernet network with diskless Sun workstations. Most work concerning the performance of parallel and distributed sorting algorithms has been theoretical and assumes simplified models. Hence, we adopt an empirical approach which provides more insight into the performance of the algorithms. Our cost model considers both local processing costs and communication costs to be important factors when evaluating the performance of the sorting algorithms in the LAN environment. We obtain our experimental results on communication time, local processing time and response time of each algorithm for various file sizes and different numbers of processors. These results are analyzed and compared to our theoretical model. In cases where the experimental results do not agree with the theoretical results, the discrepancies are explained. We also attempt to project the behaviour of the algorithms as the number of processors or interprocessor communication facilities changes.

3 citations


Journal ArticleDOI
U. Kleine1
TL;DR: In this letter a novel sorter architecture for two-dimensional rank order filters is presented and a parallel sorting network is described, based on Batcher's odd-even merge algorithm.
Abstract: In this letter a novel sorter architecture for two-dimensional rank-order filters is presented. Rank-order filters are widely used in image-processing applications to smooth noisy images without perturbing edge structures. The main element of such filters is a sorter. In the letter a parallel sorting network is described, based on Batcher's odd-even merge algorithm. The required chip area of the sorter network is proportional to N(log_2 N)[(log_2 N) + 1], where N is the number of pixels to be sorted. An example of a 25-pixel sorter network is given.
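Batcher's odd-even merge network that the letter builds on can be sketched in software (comparator generation only; the letter's contribution is the VLSI architecture, not this code). The O(N log² N) comparator count is what makes the chip area proportional to N(log_2 N)[(log_2 N) + 1]:

```python
def oddeven_merge_sort(n):
    """Comparator list (i, j), i < j, of Batcher's odd-even merge
    sorting network for n = 2**k inputs. Indices lo..hi are inclusive."""
    comparators = []

    def merge(lo, hi, r):
        # Merge the two sorted subsequences of stride r within lo..hi.
        step = r * 2
        if step < hi - lo:
            merge(lo, hi, step)          # merge even subsequence
            merge(lo + r, hi, step)      # merge odd subsequence
            for i in range(lo + r, hi - r, step):
                comparators.append((i, i + r))
        else:
            comparators.append((lo, lo + r))

    def sort(lo, hi):
        if hi - lo >= 1:
            mid = lo + (hi - lo) // 2
            sort(lo, mid)
            sort(mid + 1, hi)
            merge(lo, hi, 1)

    sort(0, n - 1)
    return comparators

def apply_network(comparators, data):
    """Run the data through the network; each comparator is oblivious,
    so in hardware all comparators of a stage can fire in parallel."""
    data = list(data)
    for i, j in comparators:
        if data[i] > data[j]:
            data[i], data[j] = data[j], data[i]
    return data
```

For n = 8 the generator emits 19 comparators, the classic count for Batcher's 8-input network.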

2 citations


Book ChapterDOI
01 Jan 1987
TL;DR: A realistic comparison of the practical feasibility of sorting algorithms for VLSI is obtained; for each algorithm the method takes into account the maximal problem size that is realizable on a single chip under the restrictions imposed by the available technology.
Abstract: A method for comparing the asymptotic performance of different sorting algorithms for VLSI is proposed. For each algorithm it takes into account the maximal problem size that is realizable on a single chip under the restrictions imposed by the available technology. This sorting chip is used to perform a sort-split operation on blocks of data in an external merge algorithm for sorting arbitrarily large sets of data. The performance of the merge algorithm is determined by the execution time and period of the sorting chip used. Thus a realistic comparison of the practical feasibility of sorting algorithms for VLSI is obtained.
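The sort-split step of such an external merge scheme can be sketched as follows (a minimal software stand-in for the sorting chip; `sort_split` is a hypothetical name, not from the chapter): two sorted blocks go in, and the chip returns the lower half and the upper half of their merge, each again a sorted block.

```python
def sort_split(block_a, block_b):
    """Sort-split on two sorted blocks: merge them (this line stands in
    for the on-chip sorter) and split the result at the block boundary
    into a low block and a high block."""
    merged = sorted(block_a + block_b)
    k = len(block_a)
    return merged[:k], merged[k:]
```

Repeatedly applying `sort_split` to neighbouring blocks lets an external merge sort arbitrarily large data sets while the chip only ever holds two blocks at a time.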

2 citations


Book ChapterDOI
01 Dec 1987
TL;DR: In this paper, a unifying mathematical proof is provided which replaces a mechanical, case-by-case certification of the optimal parallelization of sorting networks. Parallelization of sequential program traces by means of semantics-preserving transformation is discussed in the literature in the context of a method for the synthesis of systolic architectures, where the issue of optimal parallelization is important.
Abstract: This paper provides a unifying mathematical proof which replaces a mechanical, case-by-case certification of the optimal parallelization of sorting networks. Parallelization of sequential program traces by means of semantics-preserving transformation is discussed in the literature in the context of a method for the synthesis of systolic architectures. The issue of optimal parallelization is important in systolic design. The mathematical proof provides a better insight into the fundamental aspects of the transformation.

2 citations