The periodic balanced sorting network

doi:10.1145/76359.76362

Journal Article•DOI•

Martin Dowd¹, Yehoshua Perl², Larry Rudolph³, Michael Saks¹•Institutions (3)

Rutgers University¹, New Jersey Institute of Technology², Hebrew University of Jerusalem³

01 Oct 1989-Journal of the ACM (ACM)-Vol. 36, Iss: 4, pp 738-757

TL;DR: The periodic balanced sorting network, which consists of log n blocks, is introduced; each block, called a balanced merging block, merges elements on the even input lines with those on the odd input lines.

Abstract: A periodic sorting network consists of a sequence of identical blocks. In this paper, the periodic balanced sorting network, which consists of log n blocks, is introduced. Each block, called a balanced merging block, merges elements on the even input lines with those on the odd input lines. The periodic balanced sorting network sorts n items in O((log n)^2) time using (n/2)(log n)^2 comparators. Although these bounds are comparable to those of many existing sorting networks, the periodic structure enables a hardware implementation consisting of only one block, with the output of the block recycled back as input until the output is sorted. An implementation of our network on the shuffle-exchange interconnection model, in which the directions of the comparators are all identical and fixed, is also presented.
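
The periodic structure described in the abstract can be illustrated with a minimal Python sketch. It assumes the commonly cited layout of the balanced merging block for n a power of two: in each of log n phases the lines are split into equal groups (of size n, n/2, ..., 2), and within each group a line is compared with its mirror image, the smaller value going to the lower line. The function names are hypothetical and the code is a software simulation, not the paper's hardware design.

```python
def balanced_merging_block(values):
    """Apply one balanced merging block (log n phases of n/2 comparators).

    Assumed layout: in each phase the lines are split into groups, and
    line i of a group is compared with line (group size - 1 - i).
    """
    n = len(values)
    out = list(values)
    group = n
    while group >= 2:
        for start in range(0, n, group):
            for offset in range(group // 2):
                lo = start + offset
                hi = start + group - 1 - offset
                # Every comparator points the same way: minimum to the lower line.
                if out[lo] > out[hi]:
                    out[lo], out[hi] = out[hi], out[lo]
        group //= 2
    return out


def periodic_balanced_sort(values):
    """Sort by feeding the output of one block back as input log n times."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    for _ in range(n.bit_length() - 1):  # log2(n) identical blocks
        out = balanced_merging_block(out)
    return out


if __name__ == "__main__":
    print(periodic_balanced_sort([5, 2, 7, 0, 6, 1, 4, 3]))
```

Because every pass reuses the same block, the loop in periodic_balanced_sort mirrors the single-block hardware implementation with recycled output. Each block performs log n phases of n/2 comparisons, which gives the (n/2)(log n)^2 comparator count stated in the abstract.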

Citations

Journal Article•DOI•

A Survey of General-Purpose Computation on Graphics Hardware

John D. Owens¹, David Luebke², Naga K. Govindaraju³, Mark J. Harris², Jens Krüger⁴, Aaron Lefohn, Timothy John Purcell²•Institutions (4)

University of California, Davis¹, Nvidia², Microsoft³, Technische Universität München⁴

01 Mar 2007-Computer Graphics Forum

TL;DR: This report describes, summarizes, and analyzes the latest research in mapping general‐purpose computation to graphics hardware.

Abstract: The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware. We begin with the technical motivations that underlie general-purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general-purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general-purpose application development on graphics hardware. This survey should be of particular interest to researchers who are interested in using the latest GPGPU applications in their systems of interest.

1,998 citations

Proceedings Article•

A Survey of General-Purpose Computation on Graphics Hardware.

John D. Owens¹, David Luebke², Naga K. Govindaraju³, Mark J. Harris², Jens Krüger⁴, Aaron Lefohn, Timothy John Purcell²•Institutions (4)

University of California, Davis¹, Nvidia², Microsoft³, Technische Universität München⁴

01 Jan 2005

TL;DR: The techniques used in mapping general-purpose computation to graphics hardware will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques.

1,728 citations

Book•

The Art of Multiprocessor Programming

Maurice Herlihy¹•Institutions (1)

Brown University¹

14 Mar 2008

TL;DR: Transactional memory as discussed by the authors is a computational model in which threads synchronize by optimistic, lock-free transactions, and there is a growing community of researchers working on both software and hardware support for this approach.

Abstract: Computer architecture is about to undergo, if not another revolution, then a vigorous shaking-up. The major chip manufacturers have, for the time being, simply given up trying to make processors run faster. Instead, they have recently started shipping "multicore" architectures, in which multiple processors (cores) communicate directly through shared hardware caches, providing increased concurrency instead of increased clock speed. As a result, system designers and software engineers can no longer rely on increasing clock speed to hide software bloat. Instead, they must somehow learn to make effective use of increasing parallelism. This adaptation will not be easy. Conventional synchronization techniques based on locks and conditions are unlikely to be effective in such a demanding environment. Coarse-grained locks, which protect relatively large amounts of data, do not scale, and fine-grained locks introduce substantial software engineering problems. Transactional memory is a computational model in which threads synchronize by optimistic, lock-free transactions. This synchronization model promises to alleviate many (not all) of the problems associated with locking, and there is a growing community of researchers working on both software and hardware support for this approach. This talk will survey the area, with a focus on open research problems.

1,268 citations

Journal Article•DOI•

Counting networks

James Aspnes¹, Maurice Herlihy, Nir Shavit²•Institutions (2)

IBM¹, Massachusetts Institute of Technology²

01 Sep 1994-Journal of the ACM

TL;DR: Two counting network constructions are given that avoid the sequential bottlenecks inherent to earlier solutions and substantially lower the memory contention, and are provided with experimental evidence that they outperform conventional synchronization techniques under a variety of circumstances.

Abstract: Many fundamental multi-processor coordination problems can be expressed as counting problems: Processes must cooperate to assign successive values from a given range, such as addresses in memory or destinations on an interconnection network. Conventional solutions to these problems perform poorly because of synchronization bottlenecks and high memory contention. Motivated by observations on the behavior of sorting networks, we offer a new approach to solving such problems, by introducing counting networks, a new class of networks that can be used to count. We give two counting network constructions, one of depth log n(1 + log n)/2 using n log n(1 + log n)/4 “gates,” and a second of depth log^2 n using n log^2 n/2 gates. These networks avoid the sequential bottlenecks inherent to earlier solutions and substantially lower the memory contention. Finally, to show that counting networks are not merely mathematical creatures, we provide experimental evidence that they outperform conventional synchronization techniques under a variety of circumstances.

183 citations

Proceedings Article•DOI•

Fast and approximate stream mining of quantiles and frequencies using graphics processors

Naga K. Govindaraju¹, Nikunj Raghuvanshi¹, Dinesh Manocha¹•Institutions (1)

University of North Carolina at Chapel Hill¹

14 Jun 2005

TL;DR: The results demonstrate that the graphics processors available on a commodity computer system are efficient stream processors and useful co-processors for mining data streams.

Abstract: We present algorithms for fast quantile and frequency estimation in large data streams using graphics processors (GPUs). We exploit the high computation power and memory bandwidth of graphics processors and present a new sorting algorithm that performs rasterization operations on the GPUs. We use sorting as the main computational component for histogram approximation and construction of ε-approximate quantile and frequency summaries. Our algorithms for numerical statistics computation on data streams are deterministic, applicable to fixed or variable-sized sliding windows, and use a limited memory footprint. We use the GPU as a co-processor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth. We implemented our algorithms on a PC with an NVIDIA GeForce FX 6800 Ultra GPU and a 3.4 GHz Pentium IV CPU and applied them to large data streams consisting of more than 100 million values. We also compared the performance of our GPU-based algorithms with optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processors available on a commodity computer system are efficient stream processors and useful co-processors for mining data streams.

152 citations

Cites background or methods from "The periodic balanced sorting netwo..."

...At the end of log n stages, the input is sorted [16]....
...Using these two functionalities, we can design optimal sorting networks such as bitonic sort [8] and periodic balanced sort [16] efficiently using any traditional GPU....
...Sorting networks are a class of sorting algorithms that map well to mesh-based architectures [8, 16]....
...4.4 Periodic Balanced Sorting Network Our sorting algorithm is based on the periodic balanced sorting network (PBSN) [16]....

References

Proceedings Article•DOI•

Sorting networks and their applications

Kenneth E. Batcher¹•Institutions (1)

Goodyear Aerospace¹

30 Apr 1968

TL;DR: To achieve high throughput rates today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.

Abstract: To achieve high throughput rates today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently. A major problem in the design of such a computing system is the connecting together of the various parts of the system (the I/O devices, memories, processing units, etc.) in such a way that all the required data transfers can be accommodated. One common scheme is a high-speed bus which is time-shared by the various parts; speed of available hardware limits this scheme. Another scheme is a cross-bar switch or matrix; limiting factors here are the amount of hardware (an m × n matrix requires m × n cross-points) and the fan-in and fan-out of the hardware.

2,553 citations

Journal Article•DOI•

Parallel Processing with the Perfect Shuffle

Harold S. Stone

01 Feb 1971-IEEE Transactions on Computers

TL;DR: Given a vector of N elements, the perfect shuffle of this vector is a permutation of the elements that is identical to a perfect shuffle of a deck of cards.

Abstract: Given a vector of N elements, the perfect shuffle of this vector is a permutation of the elements that is identical to a perfect shuffle of a deck of cards. Elements of the first half of the vector are interlaced with elements of the second half in the perfect shuffle of the vector.
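
The interlacing described in this abstract can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, and the vector length is assumed to be even so that the two halves match.

```python
def perfect_shuffle(vec):
    """Interlace the first half of vec with the second half.

    For an even length N, the element at index i < N/2 moves to index 2i,
    and the element at index N/2 + j moves to index 2j + 1.
    """
    n = len(vec)
    if n % 2:
        raise ValueError("length must be even")
    half = n // 2
    out = []
    for first, second in zip(vec[:half], vec[half:]):
        out.extend((first, second))
    return out


if __name__ == "__main__":
    print(perfect_shuffle(list(range(8))))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

When N is a power of two, the shuffle rotates the binary representation of each index left by one bit, so applying it log N times restores the original order.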

1,331 citations

Journal Article•DOI•

Access and Alignment of Data in an Array Processor

D.H. Lawrie¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

01 Dec 1975-IEEE Transactions on Computers

TL;DR: This paper discusses the design of a primary memory system for an array processor which allows parallel, conflict-free access to various slices of data, and subsequent alignment of these data for processing, and a network based on Stone's shuffle-exchange operation is presented.

Abstract: This paper discusses the design of a primary memory system for an array processor which allows parallel, conflict-free access to various slices of data (e.g., rows, columns, diagonals, etc.), and subsequent alignment of these data for processing. Memory access requirements for an array processor are discussed in general terms and a set of common requirements are defined. The ability to meet these requirements is shown to depend on the number of independent memory units and on the mapping of the data in these memories. Next, the need to align these data for processing is demonstrated and various alignment requirements are defined. Hardware which can perform this alignment function is discussed, e.g., permutation, indexing, switching or sorting networks, and a network (the omega network) based on Stone's shuffle-exchange operation [1] is presented. Construction of this network is described and many of its useful properties are proven. Finally, as an example of these ideas, an array processor is shown which allows conflict-free access and alignment of rows, columns, diagonals, backward diagonals, and square blocks in row or column major order, as well as certain other special operations.

1,210 citations

Proceedings Article•DOI•

An 0(n log n) sorting network

Miklós Ajtai, János Komlós, Endre Szemerédi

01 Dec 1983

TL;DR: A sorting network of size O(n log n) and depth O(log n) is described; the network is centered around two elementary steps, ε-halving and a derived procedure, ε-nearsort.

Abstract: The purpose of this paper is to describe a sorting network of size O(n log n) and depth O(log n). A natural way of sorting is through consecutive halvings: determine the upper and lower halves of the set, proceed similarly within the halves, and so on. Unfortunately, while one can halve a set using only O(n) comparisons, this cannot be done in less than log n (parallel) time, and it is known that a halving network needs (½)n log n comparisons. It is possible, however, to construct a network of O(n) comparisons which halves in constant time with high accuracy. This procedure (ε-halving) and a derived procedure (ε-nearsort) are described below, and our sorting network will be centered around these elementary steps.

683 citations

Journal Article•DOI•

A logarithmic time sort for linear size networks

John H. Reif¹, Leslie G. Valiant¹•Institutions (1)

Harvard University¹

01 Jan 1987-Journal of the ACM

TL;DR: A randomized algorithm is given that sorts on an N-node network with constant valence in O(log N) time; for some constant k and all large enough α, it terminates within kα log N time with probability at least 1 - N^(-α).

Abstract: A randomized algorithm that sorts on an N-node network with constant valence in O(log N) time is given. More particularly, the algorithm sorts N items on an N-node cube-connected cycles graph, and, for some constant k, for all large enough α, it terminates within kα log N time with probability at least 1 - N^(-α).

242 citations