
Showing papers on "Bitonic sorter" published in 2011


Proceedings ArticleDOI
27 Feb 2011
TL;DR: This paper analyses different hardware sorting architectures in order to implement a highly scalable sorter for solving huge problems at high performance up to the GB range in linear time complexity and demonstrates how partial run-time reconfiguration can be used for saving almost half the FPGA resources or alternatively for improving the speed.
Abstract: This paper analyses different hardware sorting architectures in order to implement a highly scalable sorter for solving huge problems at high performance up to the GB range in linear time complexity. It will be proven that a combination of a FIFO-based merge sorter and a tree-based merge sorter results in the best performance at low cost. Moreover, we will demonstrate how partial run-time reconfiguration can be used for saving almost half the FPGA resources or alternatively for improving the speed. Experiments show a sustainable sorting throughput of 2 GB/s for problems fitting into the on-chip FPGA memory and 1 GB/s when using external memory. These values surpass the best published results on large problem sorting implementations on FPGAs, GPUs, and the Cell processor.
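As a rough software analogy for the merge-sorter building blocks named in this abstract, the sketch below merges two sorted FIFOs and then arranges such two-way mergers in a tree over several pre-sorted runs. The names `fifo_merge` and `merge_tree` are hypothetical, and nothing here models the paper's actual FPGA architecture, its pipelining, or its run-time reconfiguration.

```python
from collections import deque


def fifo_merge(left, right):
    """Merge two sorted FIFOs (deques) into one sorted FIFO.

    Only the head of each queue is inspected, which mirrors how a
    hardware merger compares the front elements of two FIFOs.
    """
    out = deque()
    while left and right:
        if left[0] <= right[0]:
            out.append(left.popleft())
        else:
            out.append(right.popleft())
    out.extend(left)   # at most one of the two still holds elements
    out.extend(right)
    return out


def merge_tree(runs):
    """Merge several sorted runs with a binary tree of two-way mergers."""
    level = [deque(r) for r in runs]
    while len(level) > 1:
        nxt = [fifo_merge(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd run out passes to the next level
            nxt.append(level[-1])
        level = nxt
    return list(level[0]) if level else []
```

In this serial form, each level of the tree touches every element once, so k runs holding n elements in total take about n·log2(k) comparisons.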

180 citations


Proceedings ArticleDOI
Davide Pasetto1, Albert Akhriev1
22 Oct 2011
TL;DR: Several general-purpose methods, with particular interest in sorting of database records and huge arrays, are evaluated and a brief analysis is provided.
Abstract: In this paper we examine the performance of parallel sorting algorithms on modern multi-core hardware. Several general-purpose methods, with particular interest in sorting of database records and huge arrays, are evaluated and a brief analysis is provided.

35 citations


Journal ArticleDOI
TL;DR: This work assigned compare/exchange operations to threads in a way that decreases low‐performance global‐memory access and makes efficient use of high‐performance shared memory, which greatly increases the performance of this in‐place, comparison‐based sorting algorithm.
Abstract: State-of-the-art graphics processors provide high processing power, and the high programmability of GPUs offered by frameworks like CUDA (Compute Unified Device Architecture) increases their usability as high-performance co-processors for general-purpose computing. Sorting is well investigated in Computer Science in general, but (because of this new field of application for GPUs) there is a demand for high-performance parallel sorting algorithms that fit the characteristics of the modern GPU architecture. We present a high-performance in-place implementation of Batcher's bitonic sorting networks for CUDA-enabled GPUs. To this end, we assigned compare/exchange operations to threads in a way that decreases low-performance global-memory access and makes efficient use of high-performance shared memory. This greatly increases the performance of this in-place, comparison-based sorting algorithm. Our implementation outperforms all other algorithms in our tests when sorting 64-bit keys. It is the fastest comparison-based GPU sorting algorithm for 32-bit keys, being only outperformed by (non-comparison-based) radix sort when sorting sequences larger than 2^23. Copyright © 2011 John Wiley & Sons, Ltd.
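For readers unfamiliar with Batcher's bitonic sorting network, here is a minimal serial sketch of its in-place compare/exchange pattern. It assumes a power-of-two input length, and the function name `bitonic_sort` is illustrative. The paper's thread assignment and shared-memory scheme are not reproduced; the inner `for` loop only marks where a GPU would run one thread per index.

```python
def bitonic_sort(a):
    """Sort list `a` in place with a bitonic network; len(a) must be 2**m."""
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # compare/exchange distance in this step
            for i in range(n):         # independent per i: one thread each
                partner = i ^ j
                if partner > i:        # handle each pair exactly once
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Every step of the two outer loops is a fixed set of n/2 disjoint compare/exchange operations, independent of the data, which is what makes the network attractive for SIMD-style hardware.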

34 citations


Proceedings ArticleDOI
19 Dec 2011
TL;DR: This paper presents an analysis of parallel and sequential bitonic, odd-even and rank-sort algorithms on different GPU and CPU architectures, written to exploit the task parallelism model available on multi-core GPUs using the OpenCL specification.
Abstract: Sorting algorithms have been studied extensively for the past three decades. Their uses are found in many applications including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on the sorting algorithm it uses. Lately, the use of graphics cards for general-purpose computing has brought renewed attention to sorting algorithms. In this paper we extend our previous work on parallel sorting algorithms on GPUs and present an analysis of parallel and sequential bitonic, odd-even and rank-sort algorithms on different GPU and CPU architectures. Their performance for various queue sizes is measured with respect to sorting time and rate, and the speed-up of bitonic sort over odd-even sorting algorithms is shown on different GPUs and CPUs. The algorithms have been written to exploit the task parallelism model available on multi-core GPUs using the OpenCL specification. Our findings report a minimum 19x speed-up of bitonic sort over the odd-even sorting technique for small queue sizes on a CPU and a maximum 2300x speed-up for very large queue sizes on the Nvidia Quadro 6000 GPU architecture.
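Of the three algorithms named here, rank sort is the simplest to illustrate. The sketch below is the textbook serial formulation, not the authors' OpenCL code. Each element's final position is the count of elements that must precede it, so every iteration of the outer loop is independent and could be given to its own work item.

```python
def rank_sort(a):
    """Return a sorted copy of `a` by computing each element's rank."""
    n = len(a)
    out = [None] * n
    for i in range(n):  # iterations are independent of one another
        # Rank = number of smaller elements, with ties broken by index
        # so that equal keys land in distinct slots.
        rank = sum(
            1 for j in range(n)
            if a[j] < a[i] or (a[j] == a[i] and j < i)
        )
        out[rank] = a[i]
    return out
```

The total work is quadratic in the input size, which is why this approach only pays off when the n rank computations really run in parallel.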

21 citations


Book ChapterDOI
18 Sep 2011
TL;DR: This work develops, analyzes, and tests several algorithms that can split millions of processes into groups based on arbitrary, user-defined data, and finds that bitonic sort and the authors' new hash-based algorithm best suit the task.
Abstract: In the quest to build exascale supercomputers, designers are increasing the number of hierarchical levels that exist among system components. Software developed for these systems must account for the various hierarchies to achieve maximum efficiency. The first step in this work is to identify groups of processes that share common resources. We develop, analyze, and test several algorithms that can split millions of processes into groups based on arbitrary, user-defined data. We find that bitonic sort and our new hash-based algorithm best suit the task.
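The grouping task itself can be pictured with a small serial sketch: given one user-defined key per process rank, produce the list of ranks sharing each key. Both functions below are hypothetical illustrations of the two general strategies the abstract mentions (hashing and sorting). They run on a single machine and say nothing about the authors' distributed algorithms.

```python
from collections import defaultdict


def split_by_hash(keys):
    """Group ranks by key using a hash table (one pass over the input)."""
    groups = defaultdict(list)
    for rank, key in enumerate(keys):
        groups[key].append(rank)
    return dict(groups)


def split_by_sort(keys):
    """Group ranks by key by sorting (key, rank) pairs first."""
    order = sorted(range(len(keys)), key=lambda r: (keys[r], r))
    groups = {}
    for rank in order:
        # After sorting, members of a group are adjacent and in rank order.
        groups.setdefault(keys[rank], []).append(rank)
    return groups
```

For example, with `keys = ["nodeA", "nodeB", "nodeA"]`, both functions place ranks 0 and 2 in one group and rank 1 in another.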

13 citations


Proceedings ArticleDOI
16 May 2011
TL;DR: This paper presents a way to adapt the well-known Bitonic sorting method to dynamically reconfigurable systems such that this drawback is overcome.
Abstract: Sorting is one of the most investigated tasks computers are used for. Up to now, not much research has been put into increasing the flexibility and performance of sorting applications by applying reconfigurable computer systems. There are parallel sorting algorithms (sorting circuits) which are highly suitable for VLSI hardware realization and which outperform sequential sorting methods applied on traditional software processors by far. But usually they require a large area that increases with the number of keys to be sorted. This drawback concerns ASIC and statically reconfigurable systems. In this paper, we present a way to adapt the well-known Bitonic sorting method to dynamically reconfigurable systems such that this drawback is overcome. We present a detailed description of the design and actual implementation, and we present experimental results of our approach to show its benefits in performance and the trade-offs of our approach.

7 citations


Proceedings ArticleDOI
24 Apr 2011
TL;DR: A novel high-speed parallel sorting scheme based on a field programmable gate array (FPGA) is proposed, together with a technique that keeps the clock rate constant regardless of the length of the list to be sorted.
Abstract: Efficient data sorting is important for searching and optimization algorithms in time-critical fields such as image and multimedia data processing and radar detection. To accelerate the data sorting used in practical radar detection algorithms such as OS-CFAR, a novel high-speed parallel sorting scheme based on a field programmable gate array (FPGA) is proposed in this paper. It also provides a technique that keeps the clock rate constant regardless of the length of the list to be sorted. The paper presents new results in: 1) parallel sorting algorithms; 2) FPGA-based parallel architectures; and 3) the technique of sorting the most recently entered data items into the memory while discarding the oldest items. Results obtained show a reduction in the clock rate. FPGA implementation results are presented and discussed.
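The third contribution, keeping a window sorted while the newest item enters and the oldest leaves, can be sketched in software as follows. The class name `SlidingSortedWindow` and its methods are hypothetical. The sketch uses binary search on a Python list rather than parallel comparators, so it shows only the data flow, not the FPGA design.

```python
from bisect import bisect_left, insort
from collections import deque


class SlidingSortedWindow:
    """Fixed-size window that stays sorted as samples stream through."""

    def __init__(self, size):
        self.size = size
        self.arrival = deque()  # samples in arrival order (oldest first)
        self.ordered = []       # the same samples in ascending order

    def push(self, sample):
        """Insert the newest sample, discarding the oldest when full."""
        if len(self.arrival) == self.size:
            oldest = self.arrival.popleft()
            # bisect_left finds the first occurrence of the oldest value.
            del self.ordered[bisect_left(self.ordered, oldest)]
        self.arrival.append(sample)
        insort(self.ordered, sample)

    def kth(self, k):
        """Return the k-th smallest sample currently in the window."""
        return self.ordered[k]
```

An order-statistic detector such as OS-CFAR would read one fixed rank from the window after each push, which is what `kth` stands in for.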

7 citations


Journal ArticleDOI
N. Jin1, Xiaoping Jin1, Y. G. Ying1, Wang Shubin1, Xizhong Lou1 
TL;DR: The complexity analysis and simulation results show that the K cycles sorting dynamic K-best detection achieves the best trade-off between throughput and required memory, and the architecture of the Batcher's merge sorting dynamic K-best detection is more beneficial to parallel processing and multiple-processor structures.
Abstract: The breadth-first searching algorithms, typically represented by the K-best algorithm, are widely studied for multiple-symbol differential detection in multiple-input multiple-output systems due to their fixed complexity and latency, which are very attractive for hardware implementation. However, a large K value is needed to achieve near maximum likelihood performance, which results in large complexity. In this study, a dynamic K-best detection with reduced average K value is proposed. It reduces the complexity of path expanding, path updating and comparing and swapping (C&S) operations by 24.24, 25 and 43.46%, respectively, with little performance degradation. After that, two low-complexity sorting architectures, Batcher's merge sort and K cycles sort, are presented and applied to the proposed dynamic K-best detection. The complexity analysis and simulation results show that, compared with the traditional Bubble sorting dynamic K-best detection, the K cycles sorting and the Batcher's merge sorting dynamic K-best detections can further save C&S operations by 59.5 and 11.2%, respectively, at a negligible performance cost. Moreover, the K cycles sorting dynamic K-best detection achieves the best trade-off between throughput and required memory, and the architecture of the Batcher's merge sorting dynamic K-best detection is more beneficial to parallel processing and multiple-processor structures.

3 citations


Proceedings ArticleDOI
12 Feb 2011
TL;DR: The results show that reducing inter-network communication improves performance, and the proposed scheme is sufficiently general to be independent of the hardware and its interconnection network.
Abstract: This paper presents a bitonic sort scheme in a shared memory mesh-connected SIMD array processor. In addition, it uses the two types of comparators of sorting networks in the mesh-connected parallel computer. This scheme uses variable multiple pivots and non-pivots. A parity strategy has been implemented to minimize the number of accesses in the mesh-connected interconnection network by introducing the concept of global and local memory. The proposed scheme is sufficiently general to be independent of the hardware and its interconnection network. The results show that reducing inter-network communication improves performance.

2 citations


Proceedings ArticleDOI
Rajat Kumar Pal1
10 Jun 2011
TL;DR: A complete graph structure based comparison sorting algorithm, CompleteGraphSort, is proposed that takes time Θ(n^2) in the worst case, where n is the number of records in the given list to be sorted.
Abstract: Sorting is a well-known problem that arises in many computational applications. Sorting means arranging a set of records (or a list of keys) in some (increasing or decreasing) order. In this solution report, a complete graph structure based comparison sorting algorithm, CompleteGraphSort, is proposed that takes time Θ(n^2) in the worst case, where n is the number of records in the given list to be sorted.

2 citations


Journal ArticleDOI
TL;DR: It is established that the minimal depth of a Bitonic sorter of n keys is 2⌈log(n)⌉ − ⌊log(n)⌋.
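A quick way to get a feel for this bound is to evaluate it for a few values of n. The sketch below reads the stated depth as 2⌈log2 n⌉ − ⌊log2 n⌋; both the base-2 logarithm and the placement of the ceiling and floor are assumptions about the formula as it appears in this listing. The function name `min_bitonic_depth` is illustrative.

```python
def min_bitonic_depth(n):
    """Evaluate 2*ceil(log2 n) - floor(log2 n) for an integer n >= 1.

    Assumption: the logarithm is base 2 and the first term is rounded up
    while the second is rounded down.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    ceil_log = (n - 1).bit_length()   # ceil(log2 n) using integer math
    floor_log = n.bit_length() - 1    # floor(log2 n) using integer math
    return 2 * ceil_log - floor_log
```

Under this reading, a power of two such as n = 8 gives 2·3 − 3 = 3, which is just log2 n. A value such as n = 5 gives 2·3 − 2 = 4, one level more than ⌈log2 n⌉.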

Proceedings Article
01 Jan 2011

DOI
01 Jan 2011
TL;DR: This paper presents a hybrid sorting algorithm that splits an array so that it can be sorted concurrently on the CPU and GPU, choosing the most effective split based on hardware performance and thereby reducing the overall sorting time.
Abstract: Data sorting is an important preprocessing step for making use of huge data sets, but sorting itself takes a lot of time. In this paper, we present a hybrid sorting algorithm that splits an array so that it can be sorted concurrently on the CPU and GPU. To do this, we determine the most effective split of the array based on hardware performance, and then reduce the overall sorting time by sorting concurrently on the CPU and GPU. As the experimental results show, hybrid sorting reduced sorting time by about eight percent compared with sorting on the GPU alone.
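The split-sort-merge structure described here can be outlined as below. This is only a sketch under stated assumptions. The function `hybrid_sort` and its `gpu_share` parameter are hypothetical, both workers default to Python's built-in `sorted` as stand-ins (no GPU is involved), and the 0.8 default split is an arbitrary placeholder rather than a value from the paper.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def hybrid_sort(data, gpu_share=0.8, gpu_sort=sorted, cpu_sort=sorted):
    """Split `data`, sort both parts concurrently, then merge them.

    gpu_share is the fraction handed to the faster device; in a real
    system it would be tuned from measured hardware throughput.
    """
    cut = int(len(data) * gpu_share)
    with ThreadPoolExecutor(max_workers=2) as pool:
        gpu_part = pool.submit(gpu_sort, data[:cut])   # stand-in for GPU
        cpu_part = pool.submit(cpu_sort, data[cut:])   # CPU share
        # A final two-way merge combines the independently sorted parts.
        return list(merge(gpu_part.result(), cpu_part.result()))
```

The benefit depends on choosing the split so both devices finish at about the same time, since the slower part determines when the merge can complete.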