scispace - formally typeset
Search or ask a question
Journal ArticleDOI

The periodic balanced sorting network

TL;DR: The periodic balanced sorting network, which consists of log log blocks, is introduced and each block, called a balanced merging block, merges elements on the even input lines with those on the odd input lines.
Abstract: A periodic sorting network consists of a sequence of identical blocks. In this paper, the periodic balanced sorting network, which consists of log n blocks, is introduced. Each block, called a balanced merging block, merges elements on the even input lines with those on the odd input lines.The periodic balanced sorting network sorts n items in O([log n]2) time using (n/2)(log n)2 comparators. Although these bounds are comparable to many existing sorting networks, the periodic structure enables a hardware implementation consisting of only one block with the output of the block recycled back as input until the output is sorted. An implementation of our network on the shuffle exchange interconnection model in which the direction of the comparators are all identical and fixed is also presented.
Citations
More filters
Journal ArticleDOI
TL;DR: This report describes, summarize, and analyzes the latest research in mapping general‐purpose computation to graphics hardware.
Abstract: The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware. We begin with the technical motivations that underlie general-purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general-purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general-purpose application development on graphics hardware. This survey should be of particular interest to researchers who are interested in using the latest GPGPU applications in their systems of interest.

1,998 citations

Proceedings Article
01 Jan 2005
TL;DR: The techniques used in mapping general-purpose computation to graphics hardware will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques.
Abstract: The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware. We begin with the technical motivations that underlie general-purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general-purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general-purpose application development on graphics hardware. This survey should be of particular interest to researchers who are interested in using the latest GPGPU applications in their systems of interest.

1,728 citations

Book
Maurice Herlihy1
14 Mar 2008
TL;DR: Transactional memory as discussed by the authors is a computational model in which threads synchronize by optimistic, lock-free transactions, and there is a growing community of researchers working on both software and hardware support for this approach.
Abstract: Computer architecture is about to undergo, if not another revolution, then a vigorous shaking-up. The major chip manufacturers have, for the time being, simply given up trying to make processors run faster. Instead, they have recently started shipping "multicore" architectures, in which multiple processors (cores) communicate directly through shared hardware caches, providing increased concurrency instead of increased clock speed.As a result, system designers and software engineers can no longer rely on increasing clock speed to hide software bloat. Instead, they must somehow learn to make effective use of increasing parallelism. This adaptation will not be easy. Conventional synchronization techniques based on locks and conditions are unlikely to be effective in such a demanding environment. Coarse-grained locks, which protect relatively large amounts of data, do not scale, and fine-grained locks introduce substantial software engineering problem.Transactional memory is a computational model in which threads synchronize by optimistic, lock-free transactions. This synchronization model promises to alleviate many (not all) of the problems associated with locking, and there is a growing community of researchers working on both software and hardware support for this approach. This talk will survey the area, with a focus on open research problems.

1,268 citations

Journal ArticleDOI
TL;DR: Two counting network constructions are given that avoid the sequential bottlenecks inherent to earlier solutions and substantially lower the memory contention, and are provided with experimental evidence that they outperform conventional synchronization techniques under a variety of circumstances.
Abstract: Many fundamental multi-processor coordination problems can be expressed as counting problems: Processes must cooperate to assign successive values from a given range, such as addresses in memory or destinations on an interconnection network. Conventional solutions to these problems perform poorly because of synchronization bottlenecks and high memory contention.Motivated by observations on the behavior of sorting networks, we offer a new approach to solving such problems, by introducing counting networks, a new class of networks that can be used to count. We give two counting network constructions, one of depth log n(1 p log n)/2 using n log (1 p log n)/4 “gates,” and a second of depth log2n using n log2n/2 gates. These networks avoid the sequential bottlenecks inherent to earlier solutions and substantially lower the memory contention.Finally, to show that counting networks are not merely mathematical creatures, we provide experimental evidence that they outperform conventional synchronization techniques under a variety of circumstances.

183 citations

Proceedings ArticleDOI
14 Jun 2005
TL;DR: The results demonstrate that the graphics processors available on a commodity computer system are efficient stream-processor and useful co-processors for mining data streams.
Abstract: We present algorithms for fast quantile and frequency estimation in large data streams using graphics processors (GPUs). We exploit the high computation power and memory bandwidth of graphics processors and present a new sorting algorithm that performs rasterization operations on the GPUs. We use sorting as the main computational component for histogram approximation and construction of e-approximate quantile and frequency summaries. Our algorithms for numerical statistics computation on data streams are deterministic, applicable to fixed or variable-sized sliding windows and use a limited memory footprint. We use GPU as a co-processor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth. We implemented our algorithms on a PC with a NVIDIA GeForce FX 6800 Ultra GPU and a 3.4 GHz Pentium IV CPU and applied them to large data streams consisting of more than 100 million values. We also compared the performance of our GPU-based algorithms with optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processors available on a commodity computer system are efficient stream-processor and useful co-processors for mining data streams.

152 citations


Cites background or methods from "The periodic balanced sorting netwo..."

  • ...At the end of log n stages, the input is sorted [16]....

    [...]

  • ...Using these two functionalities, we can design optimal sorting networks such as bitonic sort [8] and periodic balanced sort [16] efficiently using any traditional GPU....

    [...]

  • ...Sorting networks are a class of sorting algorithms that map well to mesh-based architectures [8, 16]....

    [...]

  • ...4.4 Periodic Balanced Sorting Network Our sorting algorithm is based on the periodic balanced sorting network (PBSN) [16]....

    [...]

  • ...Our sorting algorithm is based on the periodic balanced sorting network (PBSN) [16]....

    [...]

References
More filters
Proceedings ArticleDOI
30 Apr 1968
TL;DR: To achieve high throughput rates today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.
Abstract: To achieve high throughput rates today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently. A major problem in the design of such a computing system is the connecting together of the various parts of the system (the I/O devices, memories, processing units, etc.) in such a way that all the required data transfers can be accommodated. One common scheme is a high-speed bus which is time-shared by the various parts; speed of available hardware limits this scheme. Another scheme is a cross-bar switch or matrix; limiting factors here are the amount of hardware (an m × n matrix requires m × n cross-points) and the fan-in and fan-out of the hardware.

2,553 citations

Journal ArticleDOI
TL;DR: Given a vector of N elements, the perfect shuffle of this vector is a permutation of the elements that are identical to aperfect shuffle of a deck of cards.
Abstract: Given a vector of N elements, the perfect shuffle of this vector is a permutation of the elements that are identical to a perfect shuffle of a deck of cards. Elements of the first half of the vector are interlaced with elements of the second half in the perfect shuffle of the vector.

1,331 citations

Journal ArticleDOI
TL;DR: This paper discusses the design of a primary memory system for an array processor which allows parallel, conflict-free access to various slices of data, and subsequent alignment of these data for processing, and a network based on Stone's shuffle-exchange operation is presented.
Abstract: This paper discusses the design of a primary memory system for an array processor which allows parallel, conflict-free access to various slices of data (e.g., rows, columns, diagonals, etc.), and subsequent alignment of these data for processing. Memory access requirements for an array processor are discussed in general terms and a set of common requirements are defined. The ability to meet these requirements is shown to depend on the number of independent memory units and on the mapping of the data in these memories. Next, the need to align these data for processing is demonstrated and various alignment requirements are defined. Hardware which can perform this alignment function is discussed, e.g., permutation, indexing, switching or sorting networks, and a network (the omega network) based on Stone's shuffle-exchange operation [1] is presented. Construction of this network is described and many of its useful properties are proven. Finally, as an example of these ideas, an array processor is shown which allows conflict-free access and alignment of rows, columns, diagonals, backward diagonals, and square blocks in row or column major order, as well as certain other special operations.

1,210 citations

Proceedings ArticleDOI
01 Dec 1983
TL;DR: A sorting network of size 0(n log n) and depth 0(log n) is described, and a derived procedure (&egr;-nearsort) are described below, and the sorting network will be centered around these elementary steps.
Abstract: The purpose of this paper is to describe a sorting network of size 0(n log n) and depth 0(log n). A natural way of sorting is through consecutive halvings: determine the upper and lower halves of the set, proceed similarly within the halves, and so on. Unfortunately, while one can halve a set using only 0(n) comparisons, this cannot be done in less than log n (parallel) time, and it is known that a halving network needs (½)n log n comparisons. It is possible, however, to construct a network of 0(n) comparisons which halves in constant time with high accuracy. This procedure (e-halving) and a derived procedure (e-nearsort) are described below, and our sorting network will be centered around these elementary steps.

683 citations

Journal ArticleDOI
TL;DR: A randomized algorithm that sorts on an N- node network with constant valence in O(log N) time with probability at least 1 - N- “α” - “ α” for all large enough items.
Abstract: A randomized algorithm that sorts on an N node network with constant valence in O(log N) time is given. More particularly, the algorithm sorts N items on an N-node cube-connected cycles graph, and, for some constant k, for all large enough a, it terminates within ka log N time with probability at least 1 - N-a.

242 citations