Showing papers on "Bitonic sorter" published in 2011

Proceedings Article

FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on fpgas for large problem sorting

Dirk Koch¹, Jim Torresen¹•Institutions (1)

University of Oslo¹

27 Feb 2011

TL;DR: This paper analyses different hardware sorting architectures in order to implement a highly scalable sorter for solving huge problems at high performance up to the GB range in linear time complexity, and demonstrates how partial run-time reconfiguration can be used to save almost half the FPGA resources or, alternatively, to improve the speed.

Abstract: This paper analyses different hardware sorting architectures in order to implement a highly scalable sorter for solving huge problems at high performance up to the GB range in linear time complexity. It will be proven that a combination of a FIFO-based merge sorter and a tree-based merge sorter results in the best performance at low cost. Moreover, we will demonstrate how partial run-time reconfiguration can be used to save almost half the FPGA resources or, alternatively, to improve the speed. Experiments show a sustainable sorting throughput of 2 GB/s for problems fitting into the on-chip FPGA memory and 1 GB/s when using external memory. These values surpass the best published results on large problem sorting implementations on FPGAs, GPUs, and the Cell processor.
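
As a software analogy for the merge-based architectures this abstract compares, a tree of two-way merges can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' hardware design: `merge_two` and `merge_tree` are hypothetical names, and the sorted input runs stand in for the output of an earlier sorting stage.

```python
def merge_two(a, b):
    """Merge two sorted lists into one sorted list (one merger node)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


def merge_tree(runs):
    """Merge sorted runs pairwise, level by level, like a tree of mergers."""
    runs = [list(r) for r in runs]
    if not runs:
        return []
    while len(runs) > 1:
        nxt = [merge_two(runs[k], runs[k + 1])
               for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:  # an odd run out passes through to the next level
            nxt.append(runs[-1])
        runs = nxt
    return runs[0]
```

Each level of the tree touches every key once, so the number of levels grows with the logarithm of the number of runs.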

180 citations

Proceedings Article

A comparative study of parallel sort algorithms

Davide Pasetto¹, Albert Akhriev¹•Institutions (1)

IBM¹

22 Oct 2011

TL;DR: Several general-purpose methods, with particular interest in sorting of database records and huge arrays, are evaluated and a brief analysis is provided.

Abstract: In this paper we examine the performance of parallel sorting algorithms on modern multi-core hardware. Several general-purpose methods, with particular interest in sorting of database records and huge arrays, are evaluated and a brief analysis is provided.

35 citations

Journal Article

Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort

Hagen Peters¹, Ole Schulz-Hildebrandt¹, Norbert Luttenberger¹•Institutions (1)

University of Kiel¹

01 May 2011-Concurrency and Computation: Practice and Experience

TL;DR: This work assigned compare/exchange operations to threads in a way that decreases low‐performance global‐memory access and makes efficient use of high‐performance shared memory, which greatly increases the performance of this in‐place, comparison‐based sorting algorithm.

Abstract: State-of-the-art graphics processors provide high processing power and furthermore, the high programmability of GPUs offered by frameworks like CUDA (Compute Unified Device Architecture) increases their usability as high-performance co-processors for general-purpose computing. Sorting is well investigated in Computer Science in general, but (because of this new field of application for GPUs) there is a demand for high-performance parallel sorting algorithms that fit with the characteristics of the modern GPU-architecture. We present a high-performance in-place implementation of Batcher's bitonic sorting networks for CUDA-enabled GPUs. Therefore, we assigned compare/exchange operations to threads in a way that decreases low-performance global-memory access and makes efficient use of high-performance shared memory. This greatly increases the performance of this in-place, comparison-based sorting algorithm. Our implementation outperforms all other algorithms in our tests when sorting 64-bit keys. It is the fastest comparison-based GPU sorting algorithm for 32-bit keys, being only outperformed by (non-comparison-based) radix sort when sorting sequences larger than 2^23. Copyright © 2011 John Wiley & Sons, Ltd.
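
The compare/exchange pattern of a bitonic sorting network can be illustrated with a short sequential sketch. This is a plain Python version of the textbook network for power-of-two lengths, not the authors' CUDA implementation; it does not reproduce their assignment of operations to threads or their use of shared memory, and the name `bitonic_sort` is hypothetical.

```python
def bitonic_sort(keys):
    """Sort a list in place with a bitonic network; len(keys) must be a power of two."""
    n = len(keys)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between the two compared positions
            # All iterations of this loop are independent of one another;
            # on a GPU each index i could be handled by its own thread.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (keys[i] > keys[partner]) == ascending:
                        keys[i], keys[partner] = keys[partner], keys[i]
            j //= 2
        k *= 2
    return keys
```

The sequence of comparisons depends only on the input length, never on the key values, which is what makes the network a good fit for data-parallel hardware.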

34 citations

Proceedings Article

Analysis of Fast Parallel Sorting Algorithms for GPU Architectures

Fiaz Gul Khan, Omar Usman Khan, Bartolomeo Montrucchio, Paolo Giaccone

19 Dec 2011

TL;DR: This paper presents an analysis of parallel and sequential bitonic, odd-even and rank-sort algorithms on different GPU and CPU architectures, written to exploit the task-parallelism model available on multi-core GPUs using the OpenCL specification.

Abstract: Sorting algorithms have been studied extensively for the past three decades. They are used in many applications, including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on its use of a sorting algorithm. Lately, the use of graphics cards for general-purpose computing has led to sorting algorithms being revisited. In this paper we extend our previous work on parallel sorting algorithms on GPUs, and present an analysis of parallel and sequential bitonic, odd-even and rank-sort algorithms on different GPU and CPU architectures. Their performance for various queue sizes is measured with respect to sorting time and rate, and the speed-up of bitonic sort over odd-even sorting algorithms is shown on different GPUs and CPUs. The algorithms have been written to exploit the task-parallelism model available on multi-core GPUs using the OpenCL specification. Our findings report a minimum 19x speed-up of bitonic sort over the odd-even sorting technique for small queue sizes on CPU, and a maximum 2300x speed-up for very large queue sizes on the Nvidia Quadro 6000 GPU architecture.
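
One way to see why bitonic sort can pull ahead as the queue grows is to count parallel compare-exchange phases. The sketch below assumes that "odd-even" refers to odd-even transposition sort, which the abstract does not state, and the function names are hypothetical. Phase counts are only a rough indicator and do not by themselves account for the measured speed-ups.

```python
def odd_even_transposition_sort(keys):
    """Sort in place using n alternating even/odd compare-exchange phases."""
    n = len(keys)
    for phase in range(n):
        start = phase % 2  # even phases pair (0,1),(2,3)...; odd phases pair (1,2),(3,4)...
        # The pairs within one phase are disjoint, so they could run in parallel.
        for i in range(start, n - 1, 2):
            if keys[i] > keys[i + 1]:
                keys[i], keys[i + 1] = keys[i + 1], keys[i]
    return keys


def parallel_phases(n):
    """Return (odd-even transposition phases, bitonic phases) for n = 2**m keys."""
    m = n.bit_length() - 1
    return n, m * (m + 1) // 2
```

For 1024 keys this gives 1024 phases for odd-even transposition against 55 for the bitonic network.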

21 citations

Book Chapter

Exascale algorithms for generalized MPI_comm_split

Adam Moody¹, Dong H. Ahn¹, Bronis R. de Supinski¹•Institutions (1)

Lawrence Livermore National Laboratory¹

18 Sep 2011

TL;DR: This work develops, analyzes, and tests several algorithms that can split millions of processes into groups based on arbitrary, user-defined data, and finds that bitonic sort and the authors' new hash-based algorithm best suit the task.

Abstract: In the quest to build exascale supercomputers, designers are increasing the number of hierarchical levels that exist among system components. Software developed for these systems must account for the various hierarchies to achieve maximum efficiency. The first step in this work is to identify groups of processes that share common resources. We develop, analyze, and test several algorithms that can split millions of processes into groups based on arbitrary, user-defined data. We find that bitonic sort and our new hash-based algorithm best suit the task.
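
The grouping that `MPI_Comm_split` performs can be modelled serially to show where sorting enters. In the sketch below, each old rank supplies a `(color, key)` pair; ranks with the same color land in the same group, ordered by key with the old rank breaking ties. The function `comm_split` is a hypothetical stand-in that uses Python's built-in sort on one machine; it is not the distributed bitonic or hash-based algorithm from the chapter, and it ignores `MPI_UNDEFINED`.

```python
from itertools import groupby


def comm_split(colors_keys):
    """Map each old rank to (color, new_rank) given one (color, key) pair per rank."""
    # Sort by color, then key, then old rank (the tie-breaker).
    order = sorted((color, key, rank)
                   for rank, (color, key) in enumerate(colors_keys))
    result = {}
    for color, members in groupby(order, key=lambda t: t[0]):
        for new_rank, (_, _, old_rank) in enumerate(members):
            result[old_rank] = (color, new_rank)
    return result
```

For example, `comm_split([(1, 0), (0, 5), (1, 0), (0, 2)])` places old ranks 3 and 1 in group 0 (in that order) and old ranks 0 and 2 in group 1.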

13 citations

Proceedings Article

Bitonic Sorting on Dynamically Reconfigurable Architectures

Josef Angermeier¹, E. Sibirko¹, Rolf Wanka¹, Jürgen Teich¹•Institutions (1)

University of Erlangen-Nuremberg¹

16 May 2011

TL;DR: This paper presents a way to adapt the well-known Bitonic sorting method to dynamically reconfigurable systems such that the large area requirement of sorting circuits is overcome.

Abstract: Sorting is one of the most investigated tasks computers are used for. Up to now, not much research has been put into increasing the flexibility and performance of sorting applications by applying reconfigurable computer systems. There are parallel sorting algorithms (sorting circuits) which are highly suitable for VLSI hardware realization and which outperform sequential sorting methods applied on traditional software processors by far. But usually they require a large area that increases with the number of keys to be sorted. This drawback concerns ASIC and statically reconfigurable systems. In this paper, we present a way to adapt the well-known Bitonic sorting method to dynamically reconfigurable systems such that this drawback is overcome. We present a detailed description of the design and actual implementation, and we present experimental results of our approach to show its benefits in performance and its trade-offs.

7 citations

Proceedings Article

A novel high-speed parallel sorting algorithm based on FPGA

Faisal Alquaied¹, Abdullah I. Almudaifer¹, Mohammed A. AlShaya²•Institutions (2)

King Abdulaziz City for Science and Technology¹, King Saud University²

24 Apr 2011

TL;DR: A novel high-speed parallel sorting scheme based on field programmable gate array (FPGA) is proposed and a technique that will make the clock rate constant regardless of the length of the list that will be sorted is provided.

Abstract: Efficient data sorting is important for searching and optimization algorithms in time-critical fields such as image and multimedia data processing and radar detection. To accelerate the data sorting applied in practical radar detection algorithms such as OS-CFAR, a novel high-speed parallel sorting scheme based on a field programmable gate array (FPGA) is proposed in this paper. It also provides a technique that keeps the clock rate constant regardless of the length of the list to be sorted. The paper presents new results in: 1) parallel sorting algorithms; 2) FPGA-based parallel architectures; and 3) a technique for sorting the most recently entered data items in memory while discarding the oldest items. Results obtained show a reduction in the clock rate. FPGA implementation results are presented and discussed.
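
The third point, keeping the most recent items sorted while the oldest is discarded, has a simple software counterpart. The sketch below is an assumption-laden illustration in Python, not the paper's FPGA architecture: `SlidingSortedWindow` is a hypothetical class, and `size` is assumed to be at least 1.

```python
from bisect import bisect_left, insort
from collections import deque


class SlidingSortedWindow:
    """Keep the most recent `size` samples available in sorted order."""

    def __init__(self, size):
        self.size = size
        self.arrival = deque()  # samples in arrival order
        self.ordered = []       # the same samples, kept sorted

    def push(self, x):
        """Add a sample, evicting the oldest one once the window is full."""
        if len(self.arrival) == self.size:
            oldest = self.arrival.popleft()
            del self.ordered[bisect_left(self.ordered, oldest)]
        self.arrival.append(x)
        insort(self.ordered, x)
        return self.ordered
```

An ordered-statistic detector such as OS-CFAR would then read its threshold sample directly from a fixed position in `ordered`.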

7 citations

Journal Article

Research on low-complexity breadth-first detection for multiple-symbol differential unitary space–time modulation systems

N. Jin¹, Xiaoping Jin¹, Y. G. Ying¹, Wang Shubin¹, Xizhong Lou¹ (+1 more)•Institutions (1)

China Jiliang University¹

15 Sep 2011-IET Communications

TL;DR: The complexity analysis and simulation results show that the K cycles sorting dynamic K-best detection achieves the best trade-off between throughput and required memory, and the architecture of the Batcher's merge sorting dynamic K-best detection is better suited to parallel processing and multiple-processor structures.

Abstract: The breadth-first searching algorithms, typically represented by the K-best algorithm, are widely studied for multiple-symbol differential detection in multiple-input multiple-output systems because of their fixed complexity and latency, which are very attractive for hardware implementation. However, a large K value is needed to achieve near maximum likelihood performance, which results in large complexity. In this study, a dynamic K-best detection with reduced average K value is proposed. It reduces the complexity of path expanding, path updating and comparing and swapping (C&S) operations by 24.24, 25 and 43.46%, respectively, with less performance degradation. After that, two low-complexity sorting architectures, Batcher's merge sort and K cycles sort, are presented and applied to the proposed dynamic K-best detection. The complexity analysis and simulation results show that, compared with the traditional bubble sorting dynamic K-best detection, the K cycles sorting and the Batcher's merge sorting dynamic K-best detections can further save C&S operations by 59.5 and 11.2%, respectively, at a negligible performance cost. Moreover, the K cycles sorting dynamic K-best detection achieves the best trade-off between throughput and required memory, and the architecture of the Batcher's merge sorting dynamic K-best detection is better suited to parallel processing and multiple-processor structures.

3 citations

Proceedings Article

Bitonic sort in shared SIMD array processor

Anukul Chandra Panda¹, Pankaj Kumar Sa¹, Banshidhar Majhi²•Institutions (2)

National Institute of Technology, Rourkela¹, King Khalid University²

12 Feb 2011

TL;DR: The results show that reducing inter-network communication improves performance, and the proposed scheme is sufficiently general that it is independent of the hardware and its interconnection network.

Abstract: This paper presents a bitonic sort scheme for a shared-memory, mesh-connected SIMD array processor. It uses the two types of comparators found in sorting networks on the mesh-connected parallel computer. The scheme uses variable multiple pivots and non-pivots. A parity strategy is implemented to minimize the number of accesses over the mesh-connected interconnection network by introducing the concept of global and local memory. The proposed scheme is sufficiently general that it is independent of the hardware and its interconnection network. The results show that reducing inter-network communication improves performance.

2 citations

Proceedings Article

CompleteGraphSort: A complete graph structure based sorting algorithm

Rajat Kumar Pal¹•Institutions (1)

Assam University¹

10 Jun 2011

TL;DR: A complete graph structure based comparison sorting algorithm, CompleteGraphSort, has been proposed that takes Θ(n²) time in the worst case, where n is the number of records in the list to be sorted.

Abstract: Sorting is a well-known problem that arises in many computational applications. Sorting means arranging a set of records (or a list of keys) in some (increasing or decreasing) order. In this solution report, a complete graph structure based comparison sorting algorithm, CompleteGraphSort, is proposed that takes Θ(n²) time in the worst case, where n is the number of records in the list to be sorted.

2 citations

Journal Article

Bitonic sorters of minimal depth

Tamir Levi¹, Ami Litman¹•Institutions (1)

Technion – Israel Institute of Technology¹

01 May 2011-Theoretical Computer Science

TL;DR: It is established that the minimal depth of a Bitonic sorter of n keys is 2⌈log(n)⌉ − ⌊log(n)⌋.
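
Reading the stated depth as 2⌈log(n)⌉ − ⌊log(n)⌋ with base-2 logarithms (both the reading and the base are assumptions, since the listing does not spell them out), the value can be evaluated exactly with integer arithmetic. The function name below is hypothetical.

```python
def min_bitonic_sorter_depth(n):
    """Evaluate 2*ceil(log2(n)) - floor(log2(n)) without floating point."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    ceil_log = (n - 1).bit_length()   # ceil(log2(n)) for n >= 1
    floor_log = n.bit_length() - 1    # floor(log2(n)) for n >= 1
    return 2 * ceil_log - floor_log
```

When n is a power of two the two roundings coincide and the expression reduces to log(n); for other values of n it is larger, for example 4 for n = 5 against 3 for n = 8.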

Proceedings Article

Bitonic Sorting, Adaptive.

Gabriel Zachmann¹•Institutions (1)

Clausthal University of Technology¹

01 Jan 2011

Designing Hybrid Sorting Algorithm for PC with GPU

Oh-Young Kwon

01 Jan 2011

TL;DR: This paper presents a hybrid sorting algorithm that splits the array so that it is sorted concurrently on the CPU and the GPU, determines the most effective split based on hardware performance, and thereby reduces the overall sorting time.

Abstract: Data sorting is an important preprocessing step for making use of huge data sets, but sorting itself takes a lot of time. In this paper, we present a hybrid sorting algorithm that splits the array so that it is sorted concurrently on the CPU and the GPU. To do this, we determine the most effective split of the array based on hardware performance, and then reduce the overall sorting time by sorting concurrently on the CPU and GPU. The experimental results show that hybrid sorting improves sorting time by about eight percent compared with sorting on the GPU alone.
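
The split, sort concurrently, then merge structure described here can be sketched in a few lines. This is only a structural illustration in Python, not the paper's implementation: a second thread stands in for the GPU, `hybrid_sort` is a hypothetical name, and `device_share` is a hypothetical tuning knob for the fraction of the array given to the second worker. Python threads will not show a real speed-up for this workload; the point is the shape of the algorithm.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def hybrid_sort(data, device_share=0.5):
    """Sort two slices of `data` concurrently, then merge the sorted halves."""
    split = int(len(data) * (1.0 - device_share))
    with ThreadPoolExecutor(max_workers=2) as pool:
        host_part = pool.submit(sorted, data[:split])     # "CPU" slice
        device_part = pool.submit(sorted, data[split:])   # "GPU" slice (stand-in)
        return list(merge(host_part.result(), device_part.result()))
```

In a real system `device_share` would be chosen from measured CPU and GPU sorting rates so that both workers finish at about the same time.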
