scispace - formally typeset

Showing papers on "Bitonic sorter published in 2015"


Book
Richard Cole1
06 Sep 2015
TL;DR: This paper provides a general method that trims a factor of O(log n) time for many applications of this technique.
Abstract: Megiddo introduced a technique for using a parallel algorithm for one problem to construct an efficient serial algorithm for a second problem. We give a general method that trims a factor of O(log n) time (or more) for many applications of this technique.

301 citations


Proceedings ArticleDOI
22 Feb 2015
TL;DR: This paper proposes a streaming permutation network (SPN) by "folding" the classic Clos network and proves that the SPN is programmable to realize all the interconnection patterns in the bitonic sorting network.
Abstract: Parallel sorting networks are widely employed in hardware implementations for sorting due to their high data parallelism and low control overhead. In this paper, we propose an energy- and memory-efficient mapping methodology for implementing the bitonic sorting network on FPGA. Using this methodology, the proposed sorting architecture can be built for a given data parallelism while supporting continuous data streams. We propose a streaming permutation network (SPN) by "folding" the classic Clos network. We prove that the SPN is programmable to realize all the interconnection patterns in the bitonic sorting network. A low-cost design for sorting with minimal resource usage is obtained by reusing one SPN. We also demonstrate a high-throughput design by trading off area for performance. With a data parallelism of p (2 ≤ p ≤ N/log2 N), the high-throughput design sorts an N-key sequence with latency O(N/p), throughput (# of keys sorted per cycle) O(p), and uses O(N) memory. This achieves optimal memory efficiency (defined as the ratio of throughput to the amount of on-chip memory used by the design) of O(p/N). Another noteworthy feature of the high-throughput design is that only single-port memory, rather than dual-port memory, is required for processing continuous data streams. This results in a 50% reduction in memory consumption. Post place-and-route results show that our architecture demonstrates 1.3x ∼ 1.6x improvement in energy efficiency and 1.5x ∼ 5.3x better memory efficiency compared with state-of-the-art designs.
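The stage/substage schedule of the bitonic network that such architectures map to hardware can be sketched in a few lines of Python. This is the generic textbook formulation, not the paper's SPN-based FPGA design, and the function name is ours:

```python
def bitonic_sort_stages(data):
    """Sort in place using the bitonic network's stage/substage schedule.

    For n = 2**m keys the network has m*(m+1)//2 comparator stages; each
    stage applies the same compare-exchange pattern to n//2 disjoint pairs,
    which is what makes it attractive for fixed hardware datapaths.
    """
    n = len(data)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # substage: compare-exchange at distance j
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # direction alternates per k-block: ascending when the
                    # k-bit of i is 0, descending otherwise
                    up = (i & k) == 0
                    if (data[i] > data[partner]) == up:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data
```

Each substage touches every wire exactly once, so the interconnection between substages is a fixed permutation — the patterns the paper's SPN is programmed to realize.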

72 citations


Book ChapterDOI
TL;DR: The goal of this study is to compare the sorting times of network sorting algorithms using a Maxeler dataflow computer with the sorting times of optimal sequential and parallel sorting algorithms using a control flow computer.
Abstract: The primary contribution of this study is the implementation and evaluation of network sorting algorithms on a Maxeler dataflow computer. Sorting is extensively used in numerous applications. We discuss sequential, parallel, and network sorting algorithms. The major part of this study is dedicated to the properties, construction, and testing of sorting networks. We introduce and compare principal network sorting algorithms with predominant sequential and parallel sorting algorithms. We implement network sorting algorithms in an entry model of the Maxeler dataflow supercomputing system. The goal of our study is to compare the sorting times of network sorting algorithms using a Maxeler dataflow computer with the sorting times of optimal sequential and parallel sorting algorithms using a control flow computer. In different testing scenarios, we demonstrate that high sorting speedups can be achieved with network sorting using a Maxeler dataflow computer. We sorted arrays of 128 values. Using different testing parameters, we achieved speedups that ranged from approximately 10 to more than 200. Sorting networks that execute parallel sorting using the dataflow computational paradigm offer a possible solution for expanding volumes of data. By converting to more advanced Maxeler systems and researching new ideas and solutions, we aim to sort large arrays and achieve large speedups.

26 citations


Proceedings ArticleDOI
01 Dec 2015
TL;DR: This work proposes a merge sort based hybrid design where the final few stages in the merge sort network are replaced with “folded” bitonic merge networks, and presents a theoretical analysis to quantify latency, memory and throughput of the proposed design.
Abstract: Sorting is a key kernel in numerous big data applications including database operations, graph and text analytics. Due to low control overhead, parallel bitonic sorting networks are usually employed in hardware implementations to accelerate sorting. Although a typical implementation of a merge sort network can lead to low latency and small memory usage, it suffers from low throughput due to the lack of parallelism in the final stage. We analyze a pipelined merge sort network, showing its theoretical limits in terms of latency, memory, and throughput. To increase the throughput, we propose a merge-sort-based hybrid design where the final few stages in the merge sort network are replaced with "folded" bitonic merge networks. In these "folded" networks, all the interconnection patterns are realized by streaming permutation networks (SPN). We present a theoretical analysis to quantify the latency, memory, and throughput of our proposed design. Performance evaluations are performed by experiments on a Xilinx Virtex-7 FPGA with post place-and-route results. We demonstrate that our implementation achieves a throughput close to 10 GBps, outperforming the state-of-the-art implementation of sorting on the same hardware by 1.2x, while preserving lower latency and higher memory efficiency.

23 citations


Proceedings ArticleDOI
26 Oct 2015
TL;DR: Sorting in the homomorphic domain always exhibits worst-case complexity independent of the nature of the input, and combining different sorting algorithms to sort encrypted data gives no performance gain compared to applying the sorting algorithms individually.
Abstract: In this paper, we show implementation results of various algorithms that sort data encrypted with the Fully Homomorphic Encryption scheme over the integers. We analyze the complexities of sorting algorithms over encrypted data by considering Bubble Sort, Insertion Sort, Bitonic Sort and Odd-Even Merge Sort. Our complexity analysis together with implementation results shows that Odd-Even Merge Sort has better performance than the other sorting techniques. We observe that sorting in the homomorphic domain always exhibits worst-case complexity, independent of the nature of the input. In addition, we show that combining different sorting algorithms to sort encrypted data does not give any performance gain when compared to the application of the sorting algorithms individually.
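Odd-Even Merge Sort suits encrypted data because its comparator sequence is fixed in advance, independent of the values being sorted — over FHE ciphertexts each compare-exchange becomes a homomorphic min/max at a predetermined position. A minimal sketch of the standard comparator generation (the textbook recursion, not the paper's FHE implementation; names are ours):

```python
def oddeven_merge_sort_comparators(n):
    """Comparator list (i, j) for Batcher's odd-even merge sort; n must be
    a power of two. The list depends only on n, never on the data -- the
    data-obliviousness that makes the network usable on ciphertexts."""
    comps = []
    def merge(lo, length, r):
        step = r * 2
        if step < length:
            merge(lo, length, step)
            merge(lo + r, length, step)
            for i in range(lo + r, lo + length - r, step):
                comps.append((i, i + r))
        else:
            comps.append((lo, lo + r))
    def sort(lo, length):
        if length > 1:
            mid = length // 2
            sort(lo, mid)
            sort(lo + mid, mid)
            merge(lo, length, 1)
    sort(0, n)
    return comps

def apply_network(comps, data):
    """Run the fixed comparator sequence on plaintext data (for checking)."""
    data = list(data)
    for i, j in comps:
        if data[i] > data[j]:
            data[i], data[j] = data[j], data[i]
    return data
```

Note that `apply_network` branches only for this plaintext check; in the homomorphic setting the same positions are visited unconditionally, which is exactly why the worst-case cost is also the only case.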

18 citations


Posted Content
TL;DR: The bitonic sort algorithm is described in detail and implemented on the CUDA architecture, and two effective optimizations of implementation details, based on the characteristics of the GPU, greatly improve efficiency.
Abstract: This paper describes the bitonic sort algorithm in detail and implements it on the CUDA architecture. We apply two effective optimizations of implementation details, tailored to the characteristics of the GPU, which greatly improve efficiency. Finally, we measure the speedup of the optimized bitonic sort algorithm on the GPU over the quicksort algorithm on the CPU. Although quicksort is not well suited to parallel implementation, it is more efficient than other sorting algorithms on the CPU to some extent; hence, to assess speedup and performance, we compare bitonic sort on the GPU with quicksort on the CPU. For sequences of 32-bit random integers, the experimental results show that our implementation achieves a speedup of nearly 20 times. When the array size is about 2^16, the speedup ratio reaches 30.
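Bitonic sort maps well to CUDA because, within one (k, j) step, every compare-exchange touches a disjoint pair of elements, so each pair can be handled by its own thread with no synchronization inside the step. A sketch of the XOR-partner formulation commonly used in such kernels (illustrative naming, not the paper's code):

```python
def bitonic_step_pairs(n, k, j):
    """Compare-exchange pairs for one (k, j) step of bitonic sort on n keys.

    'k' is the merge block size, 'j' the comparison distance; the sort
    direction follows the k-bit of the index. The pairs are pairwise
    disjoint, which is what lets a GPU kernel process them in parallel.
    """
    pairs = []
    for i in range(n):
        partner = i ^ j
        if partner > i:
            ascending = (i & k) == 0
            pairs.append((i, partner, ascending))
    return pairs
```

A kernel launch per (k, j) step then assigns one thread to each tuple; the host loops k over 2, 4, ..., n and j over k/2 down to 1.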

11 citations


Book ChapterDOI
08 Sep 2015
TL;DR: This study focuses on two particular steps of permutation index construction: the selection of the top-k nearest pivot points and sorting these pivots according to their respective distances.
Abstract: Permutation-based indexing is one of the most popular techniques for the approximate nearest-neighbor search problem in high-dimensional spaces. Due to the exponential increase of multimedia data, the time required to index this data has become a serious constraint of current techniques. One of the possible steps towards faster index construction is the utilization of massively parallel platforms such as the GPGPU architectures. In this paper, we have focused on two particular steps of permutation index construction – the selection of top-k nearest pivot points and sorting these pivots according to their respective distances. Even though these steps are integrated into a more complex algorithm, we address them selectively since they may be employed individually for different indexing techniques or query processing algorithms in multimedia databases. We also provide a discussion of alternative approaches that we have tested but which have proved less efficient on present hardware.
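The two steps singled out above can be expressed compactly in plain Python; a heap-based selection stands in here for the paper's massively parallel GPGPU kernels, and the function and parameter names are our own:

```python
import heapq

def topk_pivots(distances, k):
    """Select the k nearest pivots and return their ids ordered by distance.

    'distances' maps pivot id -> distance from the indexed object; the
    returned prefix of pivot ids is the object's permutation-index entry.
    """
    nearest = heapq.nsmallest(k, distances.items(), key=lambda kv: kv[1])
    return [pivot for pivot, _ in nearest]  # nsmallest is already sorted
```

On a GPU these two steps are typically fused and batched over many objects at once, which is where the parallel speedup comes from.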

7 citations


Book ChapterDOI
TL;DR: In this article, an application of the theory of sorting networks to facilitate the synthesis of optimized general-purpose sorting libraries is presented. Counting comparisons and swaps alone predicts no real advantage of this approach, but significant speed-ups are obtained when taking advantage of instruction-level parallelism and non-branching conditional assignment instructions.
Abstract: This paper shows an application of the theory of sorting networks to facilitate the synthesis of optimized general purpose sorting libraries. Standard sorting libraries are often based on combinations of the classic Quicksort algorithm with insertion sort applied as the base case for small fixed numbers of inputs. Unrolling the code for the base case by ignoring loop conditions eliminates branching and results in code which is equivalent to a sorting network. This enables the application of further program transformations based on sorting network optimizations, and eventually the synthesis of code from sorting networks. We show that if considering the number of comparisons and swaps then theory predicts no real advantage of this approach. However, significant speed-ups are obtained when taking advantage of instruction level parallelism and non-branching conditional assignment instructions, both of which are common in modern CPU architectures. We provide empirical evidence that using code synthesized from efficient sorting networks as the base case for Quicksort libraries results in significant real-world speed-ups.
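The unrolled base case the paper describes amounts to straight-line min/max code with no data-dependent branches. A sketch for four inputs using the standard 5-comparator network (illustrative, not taken from the paper's synthesized library):

```python
def sort4_network(a, b, c, d):
    """Sort four keys with a 5-comparator network written as straight-line
    min/max pairs -- the branch-free shape a compiler can lower to
    conditional-move instructions for an unrolled Quicksort base case."""
    a, b = min(a, b), max(a, b)   # comparator (0, 1)
    c, d = min(c, d), max(c, d)   # comparator (2, 3)
    a, c = min(a, c), max(a, c)   # comparator (0, 2)
    b, d = min(b, d), max(b, d)   # comparator (1, 3)
    b, c = min(b, c), max(b, c)   # comparator (1, 2)
    return a, b, c, d
```

In a Quicksort library, the recursion would dispatch to such a network once a partition falls below a small fixed size, replacing the usual insertion-sort base case.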

7 citations


Posted Content
TL;DR: This master's thesis studied, implemented, and compared sequential and parallel sorting algorithms, showing that radix sort is the fastest sequential sorting algorithm, whereas radix sort and merge sort are the fastest parallel algorithms (depending on the input distribution).

Abstract: In our study we implemented and compared seven sequential and parallel sorting algorithms: bitonic sort, multistep bitonic sort, adaptive bitonic sort, merge sort, quicksort, radix sort and sample sort. Sequential algorithms were implemented on a central processing unit using C++, whereas parallel algorithms were implemented on a graphics processing unit using the CUDA platform. We chose these algorithms because, to the best of our knowledge, their sequential and parallel implementations had not yet been compared all together in the same execution environment. We improved the above-mentioned implementations and adapted them to sort input sequences of arbitrary length. We compared the algorithms on six different input distributions, which consisted of 32-bit numbers, 32-bit key-value pairs, 64-bit numbers and 64-bit key-value pairs. In this report we give a short description of the seven sorting algorithms and all the results obtained by our tests.

6 citations


Book ChapterDOI
17 Aug 2015
TL;DR: This paper introduces and makes use of spiking neural P systems with anti-spikes and rules on synapses to sort integers, discussing two sorting methods: bead sort and bitonic sort.

Abstract: This paper introduces and makes use of spiking neural P systems with anti-spikes and rules on synapses to sort integers. We discuss two sorting methods, bead sort and bitonic sort, for sorting integers.
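Bead (gravity) sort, one of the two algorithms the paper realizes with spiking neural P systems, is easy to state in conventional code. A sketch for non-negative integers — the column/row counting mirrors beads falling under gravity, not the spike-based encoding itself:

```python
def bead_sort(values):
    """Bead sort for non-negative integers: value v is a row of v beads,
    beads fall down their columns, and reading the row sums bottom-up
    yields the values in descending order."""
    if not values:
        return []
    m = max(values)
    # columns[c] = beads that settle in column c: one per value greater than c
    columns = [sum(1 for v in values if v > c) for c in range(m)]
    # row r (counted from the bottom) has a bead in column c iff columns[c] > r
    rows = [sum(1 for count in columns if count > r) for r in range(len(values))]
    return rows[::-1]  # reverse to get ascending order
```

Both the P-system realization and this sketch share bead sort's limitation to (small) non-negative integers, since the work is proportional to the sum of the values.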

4 citations


Journal ArticleDOI
TL;DR: A parallel bucket sort algorithm is implemented for the many-core architecture of graphics processing units (GPUs); it is competitive with state-of-the-art GPU sorting algorithms and superior to most of them for long sorting keys.

Abstract: We found an interesting relation between convex optimization and the sorting problem. We present a parallel algorithm to compute multiple order statistics of the data by minimizing a number of related convex functions. The computed order statistics serve as splitters that group the data into buckets suitable for parallel bitonic sorting. This led us to a parallel bucket sort algorithm, which we implemented for the many-core architecture of graphics processing units (GPUs). The proposed sorting method is competitive with state-of-the-art GPU sorting algorithms and superior to most of them for long sorting keys.

Journal ArticleDOI
TL;DR: Optimized Bitonic Sort (OBS) is proposed to improve complexity and sorting time, outperforming the best existing method by 35%-54%.

Posted Content
TL;DR: The complexity analysis together with implementation results shows that Odd-Even Merge Sort has better performance than the other sorting techniques, and that sorting in the homomorphic domain always exhibits worst-case complexity independent of the nature of the input.

Abstract: In this paper, we show implementation results of various algorithms that sort data encrypted with the Fully Homomorphic Encryption scheme over the integers. We analyze the complexities of sorting algorithms over encrypted data by considering Bubble Sort, Insertion Sort, Bitonic Sort and Odd-Even Merge Sort. Our complexity analysis together with implementation results shows that Odd-Even Merge Sort has better performance than the other sorting techniques. We observe that sorting in the homomorphic domain always exhibits worst-case complexity, independent of the nature of the input. In addition, we show that combining different sorting algorithms to sort encrypted data does not give any performance gain when compared to the application of the sorting algorithms individually.

Dissertation
02 Jul 2015
TL;DR: In this paper, the authors studied, implemented, and compared sequential and parallel sorting algorithms, showing that radix sort is the fastest sequential sorting algorithm, whereas radix sort and merge sort are the fastest parallel algorithms (depending on the input distribution).

Abstract: In this master's thesis we studied, implemented and compared sequential and parallel sorting algorithms. We implemented seven algorithms: bitonic sort, multistep bitonic sort, adaptive bitonic sort, merge sort, quicksort, radix sort and sample sort. Sequential algorithms were implemented on a central processing unit using C++, whereas parallel algorithms were implemented on a graphics processing unit using the CUDA architecture. We improved the above-mentioned implementations and adapted them to sort input sequences of arbitrary length. We compared the algorithms on six different input distributions, which consist of 32-bit numbers, 32-bit key-value pairs, 64-bit numbers and 64-bit key-value pairs. The results show that radix sort is the fastest sequential sorting algorithm, whereas radix sort and merge sort are the fastest parallel algorithms (depending on the input distribution). With parallel implementations we achieved speedups of up to 157-times in comparison to sequential implementations.

Proceedings ArticleDOI
01 Sep 2015
TL;DR: After several iterations of improvement, the final implementation of Van Voorhis's network is more than ten percent faster than the existing code for Batcher's network.
Abstract: Sorting is a basic operation in data processing. A common direction in research is to design fast methods that sort millions of numbers. The focus of this article is to sort 16 numbers. Let a hextuple be an unordered tuple of 16 numbers. Although the data may consist of thousands to millions of hextuples, the task is to sort the numbers in each hextuple. GPUs have become powerful coprocessors to CPUs. Sorting networks, originally meant for hardware implementation, are suitable for sorting many hextuples on GPUs. Batcher's sorting network for 16 numbers has ten parallel steps, whereas Van Voorhis's network has nine. Software implementations of the former are well-known and publicly available, whereas the latter seems to remain on paper. The main results in this article are implementations of Van Voorhis's network. After several iterations of improvement, the final implementation of Van Voorhis's network is more than ten percent faster than the existing code for Batcher's network. Insights gained in implementing Van Voorhis's network lead to an improved implementation of Batcher's network. The last result is useful for sorting more than 16 numbers.
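The two baseline facts here — Batcher's 16-input network uses 63 comparators in ten parallel steps — are easy to check with a comparator generator and a greedy layering. A sketch using standard constructions (unrelated to the article's GPU code; names are ours):

```python
def batcher_comparators(n):
    """Comparators (i, j) of Batcher's odd-even merge sort for n = 2**m inputs."""
    comps = []
    def merge(lo, length, r):
        step = r * 2
        if step < length:
            merge(lo, length, step)
            merge(lo + r, length, step)
            comps.extend((i, i + r) for i in range(lo + r, lo + length - r, step))
        else:
            comps.append((lo, lo + r))
    def sort(lo, length):
        if length > 1:
            sort(lo, length // 2)
            sort(lo + length // 2, length // 2)
            merge(lo, length, 1)
    sort(0, n)
    return comps

def parallel_depth(comps):
    """Number of parallel steps: schedule each comparator as early as
    possible, one layer after the later of its two wires was last used."""
    busy = {}
    depth = 0
    for i, j in comps:
        layer = max(busy.get(i, 0), busy.get(j, 0)) + 1
        busy[i] = busy[j] = layer
        depth = max(depth, layer)
    return depth
```

Van Voorhis's nine-step network is an ad hoc construction rather than a recursion like the above, which is presumably why, as the article notes, it long remained on paper.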

Dissertation
01 Jan 2015
TL;DR: It is shown that the problems generally scale well on the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck; the scalability of the proposed framework is evaluated on a highly precise cycle-accurate simulator.

Abstract: Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well on the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores), with further improvements from compiler optimisations (x14 for bitonic with the same configuration).

Patent
28 Oct 2015
TL;DR: In this paper, a method and apparatus enable an efficient hardware design capable of simultaneously sorting multiple data inputs for high throughput at reduced complexity, using the Insertion Sort Algorithm (ISA).

Abstract: Sorting algorithms are generally used at different steps in data processing. In many situations, the efficiency of the sorting algorithm used determines the throughput/execution speed of the application. Methods for implementing high-speed sorting in hardware are often based on Batcher's odd/even sort or bitonic sort algorithms. These algorithms are computation intensive and involve a high number of logic gates and high power consumption. The higher the number of logic gates, the more silicon area may be required, which may lead to higher cost. Insertion sort is a relatively simpler sorting algorithm and may require fewer logic gates to implement. However, the throughput achieved using the insertion sort algorithm is much lower than that achieved using high-speed sorting algorithms. A method and apparatus enable an efficient hardware design capable of simultaneously sorting multiple data inputs for high throughput at reduced complexity.
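The premise here — that a hardware insertion sorter is a simple linear cell array accepting one key per cycle — can be modeled in a few lines. This is a toy software model of the generic single-input cell array, not the patented multi-input design:

```python
class InsertionSorterArray:
    """Toy model of a hardware insertion-sort cell array.

    Each pushed key is broadcast to all cells; every cell compares and
    decides simultaneously whether to keep, load, or shift its value, so
    the array accepts one key per 'cycle'. When full, it retains the
    smallest keys and the largest falls off the end.
    """
    def __init__(self, size):
        self.cells = [None] * size

    def push(self, key):
        # insertion position = number of stored keys <= key; in hardware
        # all of these comparisons happen in parallel in one cycle
        pos = sum(1 for v in self.cells if v is not None and v <= key)
        if pos == len(self.cells):   # larger than every stored key: discarded
            return
        self.cells[pos + 1:] = self.cells[pos:-1]   # shift the tail right
        self.cells[pos] = key

    def contents(self):
        return [v for v in self.cells if v is not None]
```

The per-cycle cost is one comparator and one register per cell, which is why the gate count is so much lower than a Batcher network's, at the price of one key of throughput per cycle.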


Journal ArticleDOI
TL;DR: Time estimation of the bitonic sort algorithm is carried out in both the sequential and the parallel domain, continuing efforts to improve upon the performance of the initial Batcher's bitonic sorting algorithm.

Abstract: Batcher's bitonic sorting algorithm is one of the best parallel sorting algorithms for sorting random numbers on modern parallel machines. Its load-balancing property makes it unique among parallel sorting algorithms, and it contributes to various scientific and engineering applications. Research on the bitonic sorting algorithm has been reported by various researchers in order to improve upon the performance of the initial Batcher's bitonic sorting algorithm. In this paper, time estimation of the bitonic sort algorithm is carried out in both the sequential and the parallel domain.

Journal ArticleDOI
31 Mar 2015
TL;DR: A modified K-best detector algorithm is proposed that employs a parallel and distributed sorting strategy combined with a bitonic sorter, achieving a near-ML detection solution targeting the 3GPP-LTE standard.

Abstract: This paper presents a VLSI implementation of a reduced-complexity and reconfigurable MIMO (Multiple-Input Multiple-Output) signal detector targeting the 3GPP-LTE standard. In recent wireless communication systems, MIMO technology is considered the key technique for LTE to meet its performance targets. Maximum Likelihood (ML) detection is the optimal detection algorithm for MIMO systems, but FPGA implementation of the ML detector becomes infeasible as its complexity grows exponentially with the number of antennas. Therefore, we propose a modified K-best detector algorithm that employs a parallel and distributed sorting strategy combined with a bitonic sorter and achieves a near-ML detection solution. The design was implemented targeting a Xilinx Spartan-6 device; the resource utilization results are presented, along with a performance comparison against the literature. The total estimated on-chip power is 213 mW.

01 Jan 2015
TL;DR: A new sorting algorithm (BIT Sorting) is proposed and compared with existing algorithms in terms of complexity; implementation results are presented graphically to compare the efficiency of the proposed algorithm with standard methods.

Abstract: In the field of computer science, sorting algorithms have many applications. Sorting is an operation that arranges the elements of a data structure in some logical order, either ascending or descending. Many sorting algorithms with different complexities exist. In this study, I propose a new sorting algorithm (BIT Sorting) and compare it with existing algorithms in terms of complexity. Results obtained after implementation are presented in graphical form, with the objective of comparing the efficiency of the proposed algorithm with standard methods.

Proceedings ArticleDOI
20 Aug 2015
TL;DR: The proposed implementation on the novel heterogeneous DSP architecture ePUMA can rival the sorting performance of high-performance commercial CPUs and GPUs, with two orders of magnitude higher energy efficiency, which would allow high-performance sorting on low-power devices.
Abstract: This paper presents the novel heterogeneous DSP architecture ePUMA and demonstrates its features through an implementation of sorting of larger data sets. We derive a sorting algorithm with fixed-size merging tasks suitable for distributed memory architectures, which allows very simple scheduling and predictable data-independent sorting time. The implementation on ePUMA utilizes the architecture's specialized compute cores and control cores, and local memory parallelism, to separate and overlap sorting with data access and control for close to stall-free sorting. Penalty-free unaligned and out-of-order local memory access is used in combination with proposed application-specific sorting instructions to derive highly efficient local sorting and merging kernels used by the system-level algorithm. Our evaluation shows that the proposed implementation can rival the sorting performance of high-performance commercial CPUs and GPUs, with two orders of magnitude higher energy efficiency, which would allow high-performance sorting on low-power devices.

Journal ArticleDOI
TL;DR: Various researchers have worked on the bitonic sorting algorithm in order to improve upon the performance of the original Batcher's bitonic sort; this paper reviews their contributions.

Abstract: Batcher's bitonic sorting algorithm is a parallel sorting algorithm used for sorting numbers on modern parallel machines. Among the various parallel sorting algorithms, such as radix sort and bitonic sort, it is one of the most efficient because of its load-balancing property, and it is widely used in various scientific and engineering applications. Various researchers have worked on the bitonic sorting algorithm in order to improve upon the performance of the original Batcher's bitonic sorting algorithm. In this paper, we review the contributions made by these researchers.

Book ChapterDOI
25 Sep 2015
TL;DR: This paper introduces a non-dominated set construction algorithm based on Two Dimensional Sequence (TSNS), which performs better than NSGA-II in terms of the quality of solutions and the speed of convergence.
Abstract: The complexity of multi-objective evolutionary algorithms based on non-dominated principles mainly depends on finding the non-dominated fronts. In order to reduce complexity and improve construction efficiency, this paper introduces a non-dominated set construction algorithm based on Two Dimensional Sequence (TSNS). As the non-dominated set approaches the Pareto-optimal front, one dimension is always maintained in ascending order while the other is maintained in descending order. To verify the effectiveness of the proposed algorithm, we integrate it into GA, DE, and PSO, and test and compare it on classical benchmark functions. The experimental results indicate that the proposed algorithm performs better than NSGA-II in terms of the quality of solutions and the speed of convergence.