Showing papers on "Counting sort" published in 2010


Proceedings ArticleDOI
06 Jun 2010
TL;DR: This paper presents a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures - the latest CPU and GPU architectures, and proposes novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results.
Abstract: Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures - the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD and bandwidth. Our measured performance results are successfully predicted by our models. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth-per-core constraints, and show that for large data sizes of 2^40 (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.
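
A minimal scalar sketch of the non-comparison approach at issue, an LSD radix sort over 32-bit keys in four 8-bit-digit passes (illustrative only; the paper's implementation adds SIMD and architecture-specific buffering):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal LSD radix sort for 32-bit keys: four passes over 8-bit digits.
// Each pass is a counting sort, so total work is O(4 * n) -- the
// bandwidth-bound behavior the paper's analytical model captures.
void radix_sort_u32(std::vector<uint32_t>& keys) {
    std::vector<uint32_t> buf(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (uint32_t k : keys) ++count[(k >> shift) & 0xFF];
        size_t offset = 0;                       // exclusive prefix sum
        for (size_t& c : count) { size_t t = c; c = offset; offset += t; }
        for (uint32_t k : keys) buf[count[(k >> shift) & 0xFF]++] = k;
        keys.swap(buf);                          // even pass count: result in keys
    }
}
```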

250 citations


Proceedings ArticleDOI
11 Sep 2010
TL;DR: This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) on GPGPU stream processors, built on a parallel scan stream primitive that has been generalized in two ways: with local interfaces for producer/consumer operations (visiting logic), and with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
Abstract: This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting methods exhibit speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
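
The scan-centric formulation can be illustrated sequentially: one radix pass amounts to a multi-scan, an exclusive prefix scan per digit value, that turns digit occurrences into scatter offsets. A hedged sketch of that idea (the function name and scalar structure are ours; the actual primitive fuses and parallelizes these scans on the GPU):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One radix pass expressed as a "multi-scan": for each of the R digit
// values, an exclusive prefix scan over that digit's occurrence flags
// yields each element's scatter offset. Shown sequentially here; the
// GPU primitive runs the R scans concurrently.
std::vector<uint32_t> scan_based_pass(const std::vector<uint32_t>& in,
                                      int shift) {
    constexpr int R = 16;                       // 4-bit digits
    std::vector<size_t> running(R, 0);          // per-digit running counts
    std::vector<size_t> offset(in.size());
    for (size_t i = 0; i < in.size(); ++i) {    // the R concurrent scans
        int d = (in[i] >> shift) & (R - 1);
        offset[i] = running[d]++;               // exclusive scan of flags
    }
    std::vector<size_t> base(R, 0);             // digit base = scan of totals
    for (int d = 1; d < R; ++d) base[d] = base[d - 1] + running[d - 1];
    std::vector<uint32_t> out(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        int d = (in[i] >> shift) & (R - 1);
        out[base[d] + offset[i]] = in[i];       // stable scatter
    }
    return out;
}
```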

191 citations


Proceedings ArticleDOI
19 Apr 2010
TL;DR: In this paper, the authors present a sample sort algorithm for manycore GPUs, which is robust to different distributions and entropy levels of keys and scales almost linearly with the input size.
Abstract: We present the design of a sample sort algorithm for manycore GPUs. Despite being one of the most efficient comparison-based sorting algorithms for distributed memory architectures, its performance on GPUs was previously unknown. For uniformly distributed keys our sample sort is at least 25% and on average 68% faster than the best comparison-based sorting algorithm, GPU Thrust merge sort, and on average more than 2 times faster than GPU quicksort. Moreover, for 64-bit integer keys it is at least 63% and on average 2 times faster than the highly optimized GPU Thrust radix sort that directly manipulates the binary representation of keys. Our implementation is robust to different distributions and entropy levels of keys and scales almost linearly with the input size. These results indicate that multi-way techniques in general and sample sort in particular achieve substantially better performance than two-way merge sort and quicksort.
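
A sequential sketch of the sample sort structure described above (function and parameter names are illustrative; buckets are finished with std::sort here, whereas the GPU version distributes keys and processes buckets in parallel):

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Sequential sample sort sketch: oversample, choose k-1 splitters,
// distribute keys into k buckets, then sort each bucket.
void sample_sort(std::vector<uint32_t>& keys, int k = 8) {
    if (keys.size() < 2 * static_cast<size_t>(k)) {
        std::sort(keys.begin(), keys.end());
        return;
    }
    std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> pick(0, keys.size() - 1);
    std::vector<uint32_t> sample;               // oversampling factor 4
    for (int i = 0; i < 4 * k; ++i) sample.push_back(keys[pick(rng)]);
    std::sort(sample.begin(), sample.end());
    std::vector<uint32_t> splitters;            // k-1 evenly spaced splitters
    for (int i = 1; i < k; ++i) splitters.push_back(sample[i * 4]);
    std::vector<std::vector<uint32_t>> bucket(k);
    for (uint32_t key : keys) {                 // multi-way distribution
        size_t b = std::upper_bound(splitters.begin(), splitters.end(), key)
                 - splitters.begin();
        bucket[b].push_back(key);
    }
    keys.clear();
    for (auto& b : bucket) {                    // finish each bucket
        std::sort(b.begin(), b.end());
        keys.insert(keys.end(), b.begin(), b.end());
    }
}
```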

117 citations


Proceedings ArticleDOI
Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, Paolo Ienne
19 Apr 2010
TL;DR: A new algorithm, GPU-Warpsort, performs comparison-based parallel sort on Graphics Processing Units (GPUs); it mainly consists of a bitonic sort followed by a merge sort, and achieves high performance by efficiently mapping the sorting tasks to GPU architectures.
Abstract: Sorting is a kernel algorithm for a wide range of applications. We present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sort on Graphics Processing Units (GPUs). It mainly consists of a bitonic sort followed by a merge sort. Our algorithm achieves high performance by efficiently mapping the sorting tasks to GPU architectures. Firstly, we take advantage of the synchronous execution of threads in a warp to eliminate the barriers in the bitonic sorting network. We also provide sufficient homogeneous parallel operations for all the threads within a warp to avoid branch divergence. Furthermore, we implement the merge sort efficiently by assigning each warp independent pairs of sequences to be merged and by exploiting totally coalesced global memory accesses to eliminate the bandwidth bottleneck. Our experimental results indicate that GPU-Warpsort works well on different kinds of input distributions, and it achieves up to 30% higher performance than previous optimized comparison-based GPU sorting algorithms on input sequences with millions of elements.
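
The warp-level building block is a bitonic sorting network, whose compare-exchange schedule depends only on element indices, never on the data, which is what lets all threads in a warp proceed without divergence. A minimal sequential sketch for power-of-two input sizes:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Bitonic sort for power-of-two n. The compare-exchange pattern depends
// only on indices, never on the data, so every thread of a warp follows
// the identical schedule -- the property GPU-Warpsort exploits.
void bitonic_sort(std::vector<int>& a) {
    size_t n = a.size();                        // must be a power of two
    for (size_t k = 2; k <= n; k <<= 1)         // size of merged blocks
        for (size_t j = k >> 1; j > 0; j >>= 1) // compare distance
            for (size_t i = 0; i < n; ++i) {
                size_t partner = i ^ j;
                bool ascending = (i & k) == 0;  // block direction
                if (partner > i &&
                    (a[i] > a[partner]) == ascending)
                    std::swap(a[i], a[partner]);
            }
}
```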

68 citations


Patent
04 Aug 2010
TL;DR: In this paper, a database machine is provided with specialized hardware that can be used to accelerate the sort function, which can be embodied in a direct circuit (e.g., ASIC), a programmable circuit, a parallel compute engine, or any parallel computer.
Abstract: As described herein, a database machine is provided with specialized hardware that can be used to accelerate the sort function. This hardware lowers the computation cost of performing a raw sort operation over the result rows. The hardware may be embodied in a direct circuit (e.g., ASIC), a programmable circuit (e.g., FPGA), a parallel compute engine (e.g., GPU) or any parallel computer. A hardware-assisted sort procedure provides for the early return of up to K results. This early return feature is critically valuable in database operations because often an entire result set is not required. For requests that require only the first L results, when L<=K the query can be satisfied with only a single pass over the data. The hardware- or GPU-assisted sort procedure, referred to herein as “scraper sort,” may be based on modifications of well-known, existing parallel sort algorithms.
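
The early-return behavior can be illustrated in software with a bounded heap: one pass over the rows retains the K smallest, so a query needing only the first L<=K results is satisfied in a single pass. This is a sketch of the behavior only; the patent's "scraper sort" hardware is not specified beyond it.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Single-pass top-K: a max-heap of size K keeps the K smallest keys
// seen so far, so the first L <= K results are available after one
// scan of the data -- the "early return" a hardware-assisted sort
// exposes to the query engine.
std::vector<int> top_k(const std::vector<int>& rows, size_t k) {
    std::priority_queue<int> heap;              // max-heap of current best K
    for (int r : rows) {
        if (heap.size() < k) heap.push(r);
        else if (r < heap.top()) { heap.pop(); heap.push(r); }
    }
    std::vector<int> out;
    while (!heap.empty()) { out.push_back(heap.top()); heap.pop(); }
    std::reverse(out.begin(), out.end());       // ascending order
    return out;
}
```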

11 citations


Proceedings ArticleDOI
01 Mar 2010
TL;DR: This paper introduces a new evaluation technique, called cooperative sort, that exploits the relationships among the input set of sort orders to minimize I/O operations for the collection of sort operations.
Abstract: Many applications require sorting a table over multiple sort orders: generation of multiple reports from a table, evaluation of a complex query that involves multiple instances of a relation, and batch processing of a set of queries. In this paper, we study how multiple sortings of a table can be efficiently performed. We introduce a new evaluation technique, called cooperative sort, that exploits the relationships among the input set of sort orders to minimize I/O operations for the collection of sort operations. To demonstrate the efficiency of the proposed scheme, we implemented it in PostgreSQL and evaluated its performance using both TPC-DS benchmark and synthetic data. Our experimental results show significant performance improvement over the traditional non-cooperative sorting scheme.
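
One relationship cooperative sorting can exploit is order subsumption: a table sorted on (a, b) is already sorted on the prefix (a), and an order on (a, c) needs re-sorting only within runs of equal a. The sketch below, with an illustrative Row type that is not the paper's implementation, shows this derivation:

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

struct Row { int a, b, c; };

// Illustrative only: derive several sort orders from one full sort.
// After the initial sort the table serves consumers of (a, b) and (a);
// the (a, c) order is produced by re-sorting only within equal-a runs
// instead of sorting the whole table from scratch.
void sort_ab_then_derive_ac(std::vector<Row>& rows) {
    std::sort(rows.begin(), rows.end(), [](const Row& x, const Row& y) {
        return std::tie(x.a, x.b) < std::tie(y.a, y.b);
    });
    // rows is now ordered by (a, b), hence also by (a).
    for (size_t lo = 0; lo < rows.size();) {    // re-sort each equal-a run on c
        size_t hi = lo;
        while (hi < rows.size() && rows[hi].a == rows[lo].a) ++hi;
        std::sort(rows.begin() + lo, rows.begin() + hi,
                  [](const Row& x, const Row& y) { return x.c < y.c; });
        lo = hi;
    }
    // rows is now ordered by (a, c).
}
```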

10 citations


Patent
29 Mar 2010
TL;DR: In this paper, a system, method, and computer program product are provided for sorting a set of records in a sort run, and metadata regarding the sort run is gathered, and subsequently used to determine bounds of two or more disjoint subsets of the sorted run.
Abstract: A system, method, and computer program product are provided for sorting a set of records in a sort run. As the records are sorted, metadata regarding the sort run is gathered, and subsequently used to determine bounds of two or more disjoint subsets of the sort run. This enables the parallelization of several tasks over the sort run data using efficient, dynamic bounds determination, such as the outputting of sorted data from the disjoint subsets in parallel.
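
A software sketch of the idea: once bounds for disjoint subsets of a sorted run are known, each subset can be emitted by an independent worker with no coordination. The function and callback below are illustrative, not taken from the patent:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sketch: split a sorted run into disjoint index ranges
// (the "bounds" derived from gathered metadata) and emit each range
// from its own thread. Disjointness means no coordination is needed.
void emit_parallel(const std::vector<int>& run,
                   void (*emit)(const int*, size_t),
                   unsigned workers) {
    if (workers == 0) workers = 1;
    std::vector<std::thread> pool;
    size_t chunk = (run.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        size_t lo = std::min(run.size(), static_cast<size_t>(w) * chunk);
        size_t hi = std::min(run.size(), lo + chunk);
        if (lo < hi)
            pool.emplace_back(emit, run.data() + lo, hi - lo);
    }
    for (auto& t : pool) t.join();
}
```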

9 citations


Patent
28 Oct 2010
TL;DR: In this paper, a method for generating results for a sort operation is described, which includes writing a subset of input to memory and the subset may be sorted based on the sort operation.
Abstract: There is provided a method for generating results for a sort operation. The method includes writing a subset of input to memory. The subset may be sorted based on the sort operation. The sorted subset may be compared to previous results. The previous results may be recalled from a client of the sort operation based on the comparison.

8 citations


Journal ArticleDOI
TL;DR: It is shown that an elegant alternative can be used which outperforms the traditional method both in terms of processing speed and main memory access, and the paper demonstrates the improvements that can be obtained when the algorithm is combined with a well-known watershed image segmentation method.

7 citations


Proceedings ArticleDOI
22 Jun 2010
TL;DR: This paper considers the sorting of a large number of multifield records on the Cell Broadband Engine and shows that this method outperforms previously proposed sort methods that use either comb sort or bitonic sort for run generation followed by a 2-way odd-even merging of runs.
Abstract: We consider the sorting of a large number of multifield records on the Cell Broadband Engine. We show that our method, which generates runs using a 2-way merge and then merges these runs using a 4-way merge, outperforms previously proposed sort methods that use either comb sort or bitonic sort for run generation followed by a 2-way odd-even merging of runs. Interestingly, best performance is achieved by using scalar memory copy instructions rather than vector instructions.
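
The merging step can be sketched as a scalar 4-way merge of sorted runs; with a 4-way merge tree each element passes through roughly half as many merge levels as with 2-way merging. A minimal sketch without the Cell-specific vectorization and memory handling:

```cpp
#include <cstddef>
#include <vector>

// Scalar 4-way merge of four sorted runs. Merging R runs through a
// 4-way tree takes log4(R) levels instead of log2(R), so each element
// is moved about half as many times as with 2-way merging.
std::vector<int> merge4(const std::vector<int> (&run)[4]) {
    size_t pos[4] = {0, 0, 0, 0}, total = 0;
    for (const auto& r : run) total += r.size();
    std::vector<int> out;
    out.reserve(total);
    while (out.size() < total) {
        int best = -1;
        for (int i = 0; i < 4; ++i)             // pick the smallest head
            if (pos[i] < run[i].size() &&
                (best == -1 || run[i][pos[i]] < run[best][pos[best]]))
                best = i;                       // '<' keeps the merge stable
        out.push_back(run[best][pos[best]++]);
    }
    return out;
}
```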

7 citations


Posted Content
TL;DR: This work presents a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort that outperforms Intel's recently published radix sort by a factor of 1.5 and compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included.
Abstract: Sorting algorithms are the deciding factor for the performance of common operations such as removal of duplicates or database sort-merge joins. This work focuses on 32-bit integer keys, optionally paired with a 32-bit value. We present a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort. Taking advantage of virtual memory and making use of write-combining yields a per-pass throughput corresponding to at least 88% of the system’s peak memory bandwidth. Our implementation outperforms Intel’s recently published radix sort by a factor of 1.5. It also compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms remain competitive on current architectures. Various other memory-intensive applications can benefit from the techniques described herein.
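
The write-combining idea can be sketched as software bucket buffers: rather than scattering each key directly (many partial cache-line writes), keys are staged in small per-bucket buffers and flushed one full block at a time. The sketch below assumes `dst` holds the exclusive-prefix bucket offsets from the counting pass, and uses plain memcpy where the actual implementation uses non-temporal streaming stores and virtual-memory tricks:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Buffered scatter for one counting-sort pass: keys are staged in small
// per-bucket buffers and written out a full block at a time, so the
// scattered stores become sequential, write-combining-friendly bursts.
// Requires: out.size() == in.size(), dst = exclusive bucket offsets.
void buffered_scatter(const std::vector<uint32_t>& in,
                      std::vector<uint32_t>& out,
                      std::vector<size_t>& dst,   // per-bucket write cursors
                      int shift) {
    constexpr size_t B = 16;                      // buffer = one cache line
    std::vector<uint32_t> buf(256 * B);
    size_t fill[256] = {0};
    auto flush = [&](int d) {
        std::memcpy(&out[dst[d]], &buf[d * B], fill[d] * sizeof(uint32_t));
        dst[d] += fill[d];
        fill[d] = 0;
    };
    for (uint32_t k : in) {
        int d = (k >> shift) & 0xFF;
        buf[d * B + fill[d]++] = k;
        if (fill[d] == B) flush(d);               // full line: burst write
    }
    for (int d = 0; d < 256; ++d)                 // drain partial buffers
        if (fill[d]) flush(d);
}
```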

Journal Article
TL;DR: The SMS algorithm is an enhancement of the Quicksort algorithm in the best, average, and worst cases when dealing with a large input array whose maximum and minimum values are small, especially when sorting a list of distinct elements.
Abstract: Sorting is considered one of the important issues of computer science. Although there is a huge number of sorting algorithms, the sorting problem has attracted a great deal of research, because efficient sorting is important for optimizing the use of other algorithms. It is also often useful in producing human-readable output. This paper presents a new sorting algorithm called the SMS algorithm (Scan, Move, and Sort). The SMS algorithm is an enhancement of the Quicksort algorithm in the best, average, and worst cases when dealing with an input array of large size whose maximum and minimum values are small, especially when sorting a list of distinct elements. The SMS algorithm is compared with the Quicksort algorithm and the results were promising.

01 Jan 2010
TL;DR: This paper presents the first parallel sorting algorithm to combine all of the aforementioned properties, while laying the foundations to overcome scalability problems for sorting data on the next generation of massively parallel systems.
Abstract: Sorting is one of the most fundamental algorithmic kernels, used by a large fraction of computer applications. This paper proposes a novel parallel sorting algorithm based on exact splitting that combines excellent scaling behavior with universal applicability. In contrast to many existing parallel sorting algorithms that make limiting assumptions regarding the input problem or the underlying computation model, our general-purpose algorithm can be used without restrictions on any MIMD-class computer architecture, demonstrating its full potential on massively parallel systems with distributed memory. It is comparison-based like most sequential sorting algorithms, handles an arbitrary number of keys per processing element, works in a deterministic way, does not fail in the presence of duplicate keys, minimizes the communication bandwidth requirements, does not require any knowledge of the key-value distribution, and uses only a small and a priori known amount of additional memory. Moreover, our algorithm can be turned into a stable sort without altering the time complexity, and can be made to work in place. The total running time for sorting n elements on p processors is O((n/p) log n + p log^2 n). Practical scalability is shown using more than thirty thousand compute nodes. This paper presents the first parallel sorting algorithm to combine all of the aforementioned properties, while laying the foundations to overcome scalability problems for sorting data on the next generation of massively parallel systems.
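
The heart of exact splitting can be sketched as follows: a splitter of exact global rank k across p locally sorted blocks is found by binary search on the value domain, counting in each block how many elements fall at or below the candidate. A sequential sketch (in the parallel algorithm the counts are computed locally and only the small summaries are communicated):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Core of exact splitting (sketch): find the smallest value s whose
// global rank (number of elements <= s across all blocks) is at least k.
// Assumes every block is non-empty and sorted, and 1 <= k <= total size.
int exact_splitter(const std::vector<std::vector<int>>& blocks, size_t k) {
    int lo = blocks[0].front(), hi = blocks[0].front();
    for (const auto& b : blocks) {                // value-domain bounds
        lo = std::min(lo, b.front());
        hi = std::max(hi, b.back());
    }
    while (lo < hi) {                             // binary search on values
        int mid = lo + (hi - lo) / 2;
        size_t rank = 0;                          // global rank of mid
        for (const auto& b : blocks)
            rank += std::upper_bound(b.begin(), b.end(), mid) - b.begin();
        if (rank < k) lo = mid + 1; else hi = mid;
    }
    return lo;                                    // exact-rank splitter
}
```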

Journal ArticleDOI
TL;DR: In this paper, numerous experiments are done to compare the performance of the incomplete selection sort and incomplete bubble sort algorithms, and the results and algorithm analysis show that incomplete selection sort performs very well for both short and long sequences.
Abstract: Quickly obtaining the median value of a given sequence is very important in many research fields. Some popular sort algorithms are discussed in this paper. The selection sort and bubble sort algorithms are redesigned as incomplete sort algorithms that quickly give the median value of a randomly given sequence. In the new algorithms, only part of the items in the sequence need to be sorted to give the median value, so many data comparison and movement operations are avoided and the speed of obtaining the median value is improved greatly. Besides, the insertion sort and merge sort algorithms are analyzed thoroughly and found not suitable to be redesigned as incomplete sort algorithms for this purpose. Finally, numerous experiments are done to compare the performance of incomplete selection sort and incomplete bubble sort. Experimental results and algorithm analysis show that incomplete selection sort performs very well for both short and long sequences.
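
A sketch of the incomplete selection sort idea: running only the first n/2 + 1 selection passes fixes positions 0 through n/2, so the median is available at roughly half the cost of a full sort (using the upper median for even-length input; illustrative code, not the paper's):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Incomplete selection sort: run only the first n/2 + 1 selection
// passes, so positions 0 .. n/2 hold their final sorted values and
// a[n/2] is the median. Roughly half the comparisons and moves of a
// full selection sort are skipped. (Takes a copy; input is unchanged.)
int median_by_incomplete_selection(std::vector<int> a) {
    size_t n = a.size(), m = n / 2;               // median index
    for (size_t i = 0; i <= m; ++i) {             // stop after pass m
        size_t min_idx = i;
        for (size_t j = i + 1; j < n; ++j)
            if (a[j] < a[min_idx]) min_idx = j;
        std::swap(a[i], a[min_idx]);
    }
    return a[m];
}
```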

Proceedings ArticleDOI
09 Sep 2010
TL;DR: In this paper, an incremental attribute reduction algorithm is presented, and a fast counting sort algorithm is introduced for dealing with redundant and inconsistent data in decision tables.
Abstract: Attribute reduction is one of the key problems in rough set theory, and many algorithms have been proposed for static data. Very little work has been done on incremental attribute reduction algorithms. In this paper, an incremental attribute reduction algorithm is presented. In order to reduce the computational complexity, a fast counting sort algorithm is introduced for dealing with redundant and inconsistent data in decision tables. When the objects in a decision table increase dynamically, a new reduct can be updated from the old reduct effectively. Experiments show that our algorithm outperforms other incremental attribute reduction algorithms.
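
One way counting sort serves this purpose (illustrative, not necessarily the paper's exact procedure): sorting object indices stably by each attribute in turn, like LSD radix passes, brings identical condition vectors together in linear time, making redundant and inconsistent rows adjacent and easy to detect:

```cpp
#include <cstddef>
#include <vector>

// Stable counting sort of object indices by one attribute with values
// in [0, v). Applied per attribute (like an LSD radix pass) it groups
// identical condition vectors in O(n * attrs) time, so duplicate and
// inconsistent rows become adjacent.
std::vector<size_t> sort_by_attribute(const std::vector<int>& attr,
                                      const std::vector<size_t>& order,
                                      int v) {
    std::vector<size_t> count(v, 0), out(order.size());
    for (size_t idx : order) ++count[attr[idx]];
    size_t offset = 0;                            // exclusive prefix sum
    for (int d = 0; d < v; ++d) {
        size_t t = count[d]; count[d] = offset; offset += t;
    }
    for (size_t idx : order)                      // stable scatter
        out[count[attr[idx]]++] = idx;
    return out;
}
```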

Proceedings ArticleDOI
13 Dec 2010
TL;DR: Theoretical analysis and experimental evaluation show that the proposed algorithm is easy to implement, and has much lower computational complexity than one-dimensional sorting algorithms, especially when arrays have large size.
Abstract: Borrowing ideas from one-dimensional array selection sorting algorithms, we propose a sorting algorithm for two-dimensional arrays. Both theoretical analysis and experimental evaluation show that the proposed algorithm is easy to implement, and has much lower computational complexity than one-dimensional sorting algorithms, especially when arrays have large size. Furthermore, we convert the sorting of one-dimensional arrays to that of two-dimensional (m×n) arrays, and find the values of m and n that minimize the computation time.

Journal ArticleDOI
TL;DR: This paper studies the question of which are the smallest general graphs that can sort an arbitrary permutation and what their efficiency is, and shows that certain two-node graphs can sort in time Θ(n^2) and that no simpler graph can sort all permutations.

Patent
23 Jun 2010
TL;DR: In this article, a method for streaming counting sort by utilizing a micro light pattern is proposed, where a cell convergence channel, a single cell sampling recognition channel and a cell separation channel are formed by sequentially constructing the light pattern, so as to realize continuous convergence, recognition and separation of cells.
Abstract: The invention relates to a method for carrying out streaming counting sort by utilizing a micro light pattern. A cell convergence channel, a single-cell sampling recognition channel, and a cell separation channel are formed by sequentially constructing the light pattern, so as to realize continuous convergence, recognition, and separation of cells. In the cell recognition process, sorting and counting of various cells is carried out, and the three channels sequentially form a sorting microsystem to finally realize the counting sort of various cells in an organism sample. The organism cell counting and separating method provided by the invention fully utilizes the flexibility of the micro light pattern, avoids manufacturing a complex physical electrode array on the chip, and is superior to existing counting and sorting methods for organism cells in cost, function, and performance.

Patent
Wilke Wolf-Stephan
27 Apr 2010
TL;DR: In this paper, a method and an apparatus sort flat postal items in two sorting processes using a sorting feature such that a predetermined sequence of the feature values is maintained, and then the items of both classes are jointly sorted by a sorting installation.
Abstract: A method and an apparatus sort items, in particular flat postal items, in two sorting processes. Two classes of items are sorted on the basis of a sorting feature such that a predetermined sequence of the feature values is maintained. In a first sorting process, a sorting system sorts the items in the first class separately from the items in the second class. In the process, the sorting system determines which possible feature values are actually assumed by at least one item in the second class. The sorting system generates a sequence of item sets. These item sets and the items in the second class are jointly sorted by a sorting installation. In the second sorting process, the sorting installation generates the predetermined feature-value sequence of the items of both classes.

Journal ArticleDOI
TL;DR: A new stable sorting algorithm for internal sorting that scans an unsorted input array of length n and arranges it into m sorted sub-arrays by using the m-way merge algorithm, which outperforms other stable sorting algorithms that are designed for join-based queries.
Abstract: The performance of several Database Management Systems (DBMSs) and Data Stream Management Systems (DSMSs) queries is dominated by the cost of the sorting algorithm. Sorting is an integral component of most database management systems. Stable sorting algorithms play an important role in DBMS queries, since such operations require stable sorting outputs. In this paper, we present a new stable sorting algorithm for internal sorting that scans an unsorted input array of length n and arranges it into m sorted sub-arrays. By using the m-way merge algorithm, the m sorted sub-arrays are merged into the final sorted output array. The proposed algorithm keeps the stability of the keys intact. The scanning process requires linear time complexity (O(n)) in the best case and O(n log m) in the worst case, and the m-way merge process requires O(n log m) time complexity. The proposed algorithm has a time complexity of O(n log m) element comparisons. Experimental results show that the proposed algorithm outperforms other stable sorting algorithms that are designed for join-based queries.
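
A compact sketch of the two phases, using natural ascending runs for the scan and a heap-based m-way merge with ties broken by run index to preserve stability (illustrative; the paper's run-formation rule may differ):

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Sketch of the two phases: scan the input into m sorted runs (here,
// natural ascending runs), then merge them with an m-way heap merge in
// O(n log m). Ties are broken by run index, which preserves stability.
std::vector<int> run_merge_sort(const std::vector<int>& in) {
    std::vector<std::vector<int>> runs;           // phase 1: O(n) scan
    for (int x : in) {
        if (runs.empty() || x < runs.back().back())
            runs.push_back({});                   // start a new run
        runs.back().push_back(x);
    }
    using Head = std::pair<int, size_t>;          // (key, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<size_t> pos(runs.size(), 0);
    for (size_t r = 0; r < runs.size(); ++r)
        heap.push({runs[r][0], r});
    std::vector<int> out;                         // phase 2: m-way merge
    out.reserve(in.size());
    while (!heap.empty()) {
        auto [key, r] = heap.top(); heap.pop();
        out.push_back(key);
        if (++pos[r] < runs[r].size())
            heap.push({runs[r][pos[r]], r});
    }
    return out;
}
```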

Proceedings ArticleDOI
Darrah Chavey
10 Mar 2010
TL;DR: This work suggests the use of two "Double Sorting" techniques whose solutions are not standardly available, are fairly straightforward to code, and offer speed improvements over the "straight" sorts.
Abstract: You're teaching elementary sorting techniques, and you would like your students to do a programming assignment that tests their understanding of the ideas. But all of the code for elementary sorting techniques is in the textbook, easily found on the Web, etc. We suggest the use of two "Double Sorting" techniques whose solutions are not standardly available, are fairly straightforward to code, and offer speed improvements over the "straight" sorts. Double Sorting, the idea of processing two chosen elements simultaneously, applies to both Insertion Sort and Selection Sort, with speedups of 33% and 25%; hence they are good enough to justify coding, but not good enough to be in Web collections. Code for this can be written in as little as a dozen lines of C++/Java code, and is easily within the reach of introductory students who understand the basic algorithms. In addition, the ideas used for double sorting are natural first steps in understanding how the N^2 sorts can be improved towards the N log N sorts that students will study later. For more advanced students, these double sorts also generate good exercises in the analysis of algorithms.
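
As an example of the technique, here is a double selection sort: each pass locates both the minimum and the maximum of the remaining range and places them at the two ends, giving the roughly 25% speedup mentioned above (a straightforward rendering of the idea, not the authors' code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Double selection sort: each pass finds both the minimum and the
// maximum of the unsorted middle and places them at the two ends.
void double_selection_sort(std::vector<int>& a) {
    if (a.empty()) return;
    size_t lo = 0, hi = a.size() - 1;
    while (lo < hi) {
        size_t min_idx = lo, max_idx = lo;
        for (size_t i = lo; i <= hi; ++i) {       // one scan, two targets
            if (a[i] < a[min_idx]) min_idx = i;
            if (a[i] > a[max_idx]) max_idx = i;
        }
        std::swap(a[lo], a[min_idx]);
        if (max_idx == lo) max_idx = min_idx;     // max was moved by the swap
        std::swap(a[hi], a[max_idx]);
        ++lo; --hi;
    }
}
```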

Journal ArticleDOI
TL;DR: A new Sorting Algorithm for Join Queries (SAJQ) that has the advantages of being both efficient and stable, and outperforms other stable sorting algorithms designed for join-based queries.


Proceedings ArticleDOI
18 Jul 2010
TL;DR: The RP approach improves on earlier results that use genetic programming (GP), and the resulting algorithm is a novel algorithm that is more efficient than comparable sorting routines.
Abstract: Reinforcement Programming (RP) is a new approach to automatically generating algorithms that uses reinforcement learning techniques. This paper describes the RP approach and gives results of experiments using RP to generate a generalized, in-place, iterative sort algorithm. The RP approach improves on earlier results that use genetic programming (GP). The resulting algorithm is a novel algorithm that is more efficient than comparable sorting routines. RP learns the sort in fewer iterations than GP and with fewer resources. Results establish interesting empirical bounds on learning the sort algorithm: a list of size 4 is sufficient to learn the generalized sort algorithm. The training set requires only one element, and learning took fewer than 200,000 iterations. RP has also been used to generate three binary addition algorithms: a full adder, a binary incrementer, and a binary adder.

Journal Article
TL;DR: The interesting feature of this algorithm is that the order of the alphabet is not changed, and the approach is much simpler.
Abstract: In this paper, the authors present a positional algorithmic approach for alphabetic sort. Results are achieved in linear time. Within this approach, two embedded algorithms, Binary Search and Counting Sort, are executed in parallel to achieve the goal. In this approach a pre-processor or priority queue is used, which minimizes time complexity. The algorithm is linear in speed. The time complexity of this newly proposed algorithm is Θ(n). The interesting feature of this algorithm is that the order of the alphabet is not changed, and the approach is much simpler. Keywords: Algorithm, Priority Queue, Sort, Search, Complexity, Analysis.
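
The counting sort component of such a positional scheme is easy to sketch: over a fixed 26-letter alphabet, one counting pass and one emit pass sort n characters in Θ(n) time with no element comparisons, and the alphabet's order is given by position alone (illustrative code assuming lowercase input):

```cpp
#include <cstddef>
#include <string>

// Counting sort over a fixed alphabet: one pass to count, one pass to
// emit, Theta(n) total. The alphabet's order is positional (index
// 'c' - 'a'), so no element-to-element comparisons are ever made.
std::string sort_letters(const std::string& s) {
    size_t count[26] = {0};
    for (char c : s) ++count[c - 'a'];            // assumes 'a'..'z' input
    std::string out;
    out.reserve(s.size());
    for (int d = 0; d < 26; ++d)
        out.append(count[d], static_cast<char>('a' + d));
    return out;
}
```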