Journal ArticleDOI

A logarithmic time sort for linear size networks

01 Jan 1987 - Journal of the ACM (ACM) - Vol. 34, Iss. 1, pp. 60-76
TL;DR: A randomized algorithm that sorts on an N-node network with constant valence in O(log N) time, terminating with probability at least 1 - N^-α for all large enough α.
Abstract: A randomized algorithm that sorts on an N-node network with constant valence in O(log N) time is given. More particularly, the algorithm sorts N items on an N-node cube-connected cycles graph, and, for some constant k, for all large enough α, it terminates within kα log N time with probability at least 1 - N^-α.
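For context, the cube-connected cycles (CCC) graph is what keeps the valence constant: each corner w of a d-dimensional hypercube is replaced by a cycle of d nodes, giving N = d·2^d nodes of degree 3 (for d ≥ 3). A minimal Python sketch of the construction; the labeling and the d = 3 check are illustrative, not from the paper:

```python
def cube_connected_cycles(d):
    """Build the CCC of dimension d: each corner w of the d-cube becomes a
    cycle of d nodes (w, 0) .. (w, d-1); node (w, i) also connects across
    cube dimension i to (w XOR 2^i, i). Total N = d * 2**d nodes."""
    edges = set()
    for w in range(2 ** d):
        for i in range(d):
            edges.add(frozenset({(w, i), (w, (i + 1) % d)}))   # cycle edge
            edges.add(frozenset({(w, i), (w ^ (1 << i), i)}))  # cube edge
    return edges

edges = cube_connected_cycles(3)
nodes = {v for e in edges for v in e}
degrees = [sum(v in e for e in edges) for v in nodes]
assert len(nodes) == 3 * 2 ** 3 and max(degrees) == 3  # constant valence
```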


Citations
Journal ArticleDOI
TL;DR: A digital signature scheme based on the computational difficulty of integer factorization possesses the novel property of being robust against an adaptive chosen-message attack: an adversary who receives signatures for messages of his choice cannot later forge the signature of even a single additional message.
Abstract: We present a digital signature scheme based on the computational difficulty of integer factorization. The scheme possesses the novel property of being robust against an adaptive chosen-message attack: an adversary who receives signatures for messages of his choice (where each message may be chosen in a way that depends on the signatures of previously chosen messages) cannot later forge the signature of even a single additional message. This may be somewhat surprising, since in the folklore the properties of having forgery be equivalent to factoring and being invulnerable to an adaptive chosen-message attack were considered to be contradictory. More generally, we show how to construct a signature scheme with such properties based on the existence of a "claw-free" pair of permutations--a potentially weaker assumption than the intractability of integer factorization. The new scheme is potentially practical: signing and verifying signatures are reasonably fast, and signatures are compact.
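The attack model is easiest to see as an experiment. Below is a hedged Python sketch of the adaptive chosen-message game; ToyScheme is a keyed stand-in used only to make the sketch runnable, not the paper's factoring-based scheme, and all names here are hypothetical:

```python
import hashlib, hmac, os

class ToyScheme:
    """Toy keyed construction standing in for a real signature scheme;
    the point is the attack interface, not the cryptography."""
    def __init__(self):
        self.key = os.urandom(32)
    def sign(self, message: bytes) -> bytes:
        return hmac.new(self.key, message, hashlib.sha256).digest()
    def verify(self, message: bytes, sig: bytes) -> bool:
        return hmac.compare_digest(sig, self.sign(message))

def adaptive_chosen_message_attack(scheme, adversary):
    """The adversary queries signatures on messages of its choice, each
    possibly depending on earlier answers; it wins by producing a valid
    signature on any message it never queried."""
    queried = set()
    def oracle(m: bytes) -> bytes:
        queried.add(m)
        return scheme.sign(m)
    forged_msg, forged_sig = adversary(oracle)
    return forged_msg not in queried and scheme.verify(forged_msg, forged_sig)

def naive_adversary(sign_oracle):
    sign_oracle(b"chosen message")          # adaptive query
    return b"never queried", b"\x00" * 32   # hopeless forgery attempt

print(adaptive_chosen_message_attack(ToyScheme(), naive_adversary))  # False
```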

3,150 citations

Book
Maurice Herlihy
14 Mar 2008
TL;DR: Transactional memory as discussed by the authors is a computational model in which threads synchronize by optimistic, lock-free transactions, and there is a growing community of researchers working on both software and hardware support for this approach.
Abstract: Computer architecture is about to undergo, if not another revolution, then a vigorous shaking-up. The major chip manufacturers have, for the time being, simply given up trying to make processors run faster. Instead, they have recently started shipping "multicore" architectures, in which multiple processors (cores) communicate directly through shared hardware caches, providing increased concurrency instead of increased clock speed. As a result, system designers and software engineers can no longer rely on increasing clock speed to hide software bloat. Instead, they must somehow learn to make effective use of increasing parallelism. This adaptation will not be easy. Conventional synchronization techniques based on locks and conditions are unlikely to be effective in such a demanding environment. Coarse-grained locks, which protect relatively large amounts of data, do not scale, and fine-grained locks introduce substantial software engineering problems. Transactional memory is a computational model in which threads synchronize by optimistic, lock-free transactions. This synchronization model promises to alleviate many (not all) of the problems associated with locking, and there is a growing community of researchers working on both software and hardware support for this approach. This talk will survey the area, with a focus on open research problems.
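For a flavor of the programming-model contrast, here is a hedged Python sketch; the atomic() context manager is hypothetical, standing in for an STM runtime that would execute the block optimistically and retry on conflict:

```python
import threading
from contextlib import contextmanager

accounts = {"a": 100, "b": 50}
bank_lock = threading.Lock()

# Coarse-grained locking: one lock guards every account, so even transfers
# touching disjoint accounts serialize.
def transfer_locked(src, dst, amount):
    with bank_lock:
        accounts[src] -= amount
        accounts[dst] += amount

@contextmanager
def atomic():
    # Hypothetical stand-in for an STM runtime: a real one would run the
    # block optimistically against a read/write log and retry on conflict;
    # a single global lock here mimics only the atomicity, not the optimism.
    with bank_lock:
        yield

# Transactional style: the programmer states *what* must be atomic; the
# runtime, not a lock discipline, supplies the *how*.
def transfer_transactional(src, dst, amount):
    with atomic():
        accounts[src] -= amount
        accounts[dst] += amount

transfer_transactional("a", "b", 25)
assert accounts == {"a": 75, "b": 75}
```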

1,268 citations

Book ChapterDOI
02 Jan 1991
TL;DR: In this paper, the authors discuss parallel algorithms for shared-memory machines, survey the theoretical foundations of parallel algorithms and parallel architectures, and review the abstract models of parallel computation that have been pursued.
Abstract: Publisher Summary This chapter discusses parallel algorithms for shared-memory machines. Parallel computation is rapidly becoming a dominant theme in all areas of computer science and its applications. It is estimated that, within a decade, virtually all developments in computer architecture, systems programming, computer applications and the design of algorithms will be taking place within the context of parallel computation. In preparation for this revolution, theoretical computer scientists have begun to develop a body of theory centered on parallel algorithms and parallel architectures. As there is no consensus yet on the appropriate logical organization of a massively parallel computer, and as the speed of parallel algorithms is constrained as much by limits on interprocessor communication as it is by purely computational issues, it is not surprising that a variety of abstract models of parallel computation have been pursued. Closest to the hardware level are the VLSI models, which focus on the technological limits of today's chips, in which gates and wires are packed into a small number of planar layers.

812 citations

Proceedings ArticleDOI
23 May 2009
TL;DR: The design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA, is described; the radix sort is the fastest GPU sort and the merge sort the fastest comparison-based sort reported in the literature.
Abstract: We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA's GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
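For intuition, here is a serial Python sketch of the scan-based split that one pass of such a radix sort is built from; in the actual CUDA implementation each pass is a parallel flag/scan/scatter over thread blocks, so this shows only the data flow, not the paper's code:

```python
def exclusive_scan(xs):
    """Exclusive prefix sum: the data-parallel primitive the sorts lean on."""
    total, out = 0, []
    for x in xs:
        out.append(total)
        total += x
    return out

def radix_sort(keys, bits=32):
    """One scan-based stable split per bit, least significant bit first.
    On a GPU each pass is a parallel flag + scan + scatter; here it's serial."""
    for b in range(bits):
        flags = [(k >> b) & 1 for k in keys]
        zero_slots = exclusive_scan([1 - f for f in flags])  # slots for 0-bits
        one_slots = exclusive_scan(flags)                    # slots for 1-bits
        n_zeros = len(keys) - ((one_slots[-1] + flags[-1]) if keys else 0)
        out = [None] * len(keys)
        for k, f, z, o in zip(keys, flags, zero_slots, one_slots):
            out[n_zeros + o if f else z] = k
        keys = out
    return keys

data = [170, 45, 75, 90, 2, 24]
assert radix_sort(data) == sorted(data)
```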

684 citations


Cites methods from "A logarithmic time sort for linear size networks"

  • ...Another elegant parallelization technique is used by sample sort [30], [31], [32]... (a minimal sketch of the idea follows below)

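As noted above, a minimal Python sketch of the sample sort idea: choose splitters from a random sample, partition items into buckets, and sort the buckets independently (on separate processors, on a real machine). The bucket count and oversampling factor are illustrative choices:

```python
import bisect, random

def sample_sort(items, n_buckets=8, oversample=4):
    """Splitter-based sorting: an oversampled random sample yields splitters,
    each item is routed to its bucket, and buckets are sorted independently."""
    if len(items) <= n_buckets * oversample:
        return sorted(items)
    sample = sorted(random.sample(items, n_buckets * oversample))
    splitters = sample[oversample::oversample][: n_buckets - 1]
    buckets = [[] for _ in range(n_buckets)]
    for x in items:
        buckets[bisect.bisect_right(splitters, x)].append(x)
    # each bucket would be sorted by a separate processor in parallel
    return [x for b in buckets for x in sorted(b)]

data = [random.randrange(1000) for _ in range(500)]
assert sample_sort(data) == sorted(data)
```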

Book
25 Aug 2011
TL;DR: A new algorithm for finding the blocks (biconnected components) of an undirected graph is proposed, and a general algorithmic technique that simplifies and improves the computation of various functions on trees is introduced.
Abstract: In this paper we propose a new algorithm for finding the blocks (biconnected components) of an undirected graph. A serial implementation runs in $O(n + m)$ time and space on a graph of n vertices and m edges. A parallel implementation runs in $O(\log n)$ time and $O(n + m)$ space using $O(n + m)$ processors on a concurrent-read, concurrent-write parallel RAM. An alternative implementation runs in $O(n^2 /p)$ time and $O(n^2 )$ space using any number $p \leqq n^2 /\log ^2 n$ of processors, on a concurrent-read, exclusive-write parallel RAM. The last algorithm has optimal speedup, assuming an adjacency matrix representation of the input. A general algorithmic technique that simplifies and improves computation of various functions on trees is introduced. This technique typically requires $O(\log n)$ time using $O(n/\log n)$ processors and $O(n)$ space on an exclusive-read exclusive-write parallel RAM.
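A serial sketch of the blocks computation helps fix ideas. This is the classical depth-first-search low-point method, written in Python, which meets the stated $O(n + m)$ bound; it is not necessarily the paper's own serial algorithm, and the parallel versions and the tree technique are not reproduced here:

```python
def biconnected_components(n, edges):
    """Serial O(n + m) sketch: DFS with low-point values; pop the edge stack
    each time a low point proves a cut vertex, emitting one block.
    Recursive DFS is fine for a sketch, though it limits the input depth."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    disc = [0] * n        # discovery times, 0 = unvisited
    low = [0] * n
    stack, blocks, time = [], [], 1

    def dfs(u, parent):
        nonlocal time
        disc[u] = low[u] = time
        time += 1
        for v in adj[u]:
            if not disc[v]:
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:        # u separates v's subtree
                    block, e = [], None
                    while e != (u, v):
                        e = stack.pop()
                        block.append(e)
                    blocks.append(block)
            elif v != parent and disc[v] < disc[u]:
                stack.append((u, v))         # back edge
                low[u] = min(low[u], disc[v])

    for s in range(n):
        if not disc[s]:
            dfs(s, -1)
    return blocks

# a triangle plus a pendant edge: two blocks
assert len(biconnected_components(4, [(0, 1), (1, 2), (2, 0), (2, 3)])) == 2
```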

501 citations

References
Journal ArticleDOI
TL;DR: In this paper, it was shown that the likelihood ratio test for fixed sample size can be reduced to this form, and that, for large samples, a sample of size $n$ with the first test will give about the same probabilities of error as a sample of size $en$ with the second test.
Abstract: In many cases an optimum or computationally convenient test of a simple hypothesis $H_0$ against a simple alternative $H_1$ may be given in the following form. Reject $H_0$ if $S_n = \sum^n_{j=1} X_j \leqq k,$ where $X_1, X_2, \cdots, X_n$ are $n$ independent observations of a chance variable $X$ whose distribution depends on the true hypothesis and where $k$ is some appropriate number. In particular the likelihood ratio test for fixed sample size can be reduced to this form. It is shown that with each test of the above form there is associated an index $\rho$. If $\rho_1$ and $\rho_2$ are the indices corresponding to two alternative tests $e = \log \rho_1/\log \rho_2$ measures the relative efficiency of these tests in the following sense. For large samples, a sample of size $n$ with the first test will give about the same probabilities of error as a sample of size $en$ with the second test. To obtain the above result, use is made of the fact that $P(S_n \leqq na)$ behaves roughly like $m^n$ where $m$ is the minimum value assumed by the moment generating function of $X - a$. It is shown that if $H_0$ and $H_1$ specify probability distributions of $X$ which are very close to each other, one may approximate $\rho$ by assuming that $X$ is normally distributed.
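The quantitative claim is easy to check numerically. A small Python illustration for Bernoulli trials compares the empirical lower-tail probability with the bound m^n, where m minimizes the moment generating function of X - a; the parameters p, a, n are arbitrary choices, not from the paper:

```python
import math, random

# Lower tail P(S_n <= n a) for S_n a sum of n Bernoulli(p) trials, compared
# with the bound m**n, where m = min_t E[exp(t (X - a))].
p, a, n = 0.5, 0.3, 50

def mgf(t):
    # E[exp(t (X - a))] for X ~ Bernoulli(p)
    return p * math.exp(t * (1 - a)) + (1 - p) * math.exp(-t * a)

m = min(mgf(-0.01 * i) for i in range(1, 500))   # crude 1-D minimization, t < 0
bound = m ** n

trials = 100_000
hits = sum(sum(random.random() < p for _ in range(n)) <= n * a
           for _ in range(trials))
print(f"empirical ~ {hits / trials:.4f}   Chernoff bound = {bound:.4f}")
# the bound is about 0.016 here; the empirical frequency comes out smaller,
# as the bound only caps the tail probability
```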

3,760 citations

Proceedings ArticleDOI
30 Apr 1968
TL;DR: To achieve high throughput rates, today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.
Abstract: To achieve high throughput rates, today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently. A major problem in the design of such a computing system is the connecting together of the various parts of the system (the I/O devices, memories, processing units, etc.) in such a way that all the required data transfers can be accommodated. One common scheme is a high-speed bus which is time-shared by the various parts; speed of available hardware limits this scheme. Another scheme is a cross-bar switch or matrix; limiting factors here are the amount of hardware (an m × n matrix requires m × n cross-points) and the fan-in and fan-out of the hardware.

2,553 citations


"A logarithmic time sort for linear ..." refers background in this paper

  • ...The bitonic sorter of Batcher [4] achieves this bound on such networks as the cube-connected cycles network [13]... (a sketch of the bitonic sorter follows below)

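Since the excerpt above names Batcher's bitonic sorter, here is a compact Python sketch of it, assuming a power-of-two input; this is a minimal illustration, not code from either paper. Its compare-exchange pattern is fixed in advance, which is what lets it run on fixed interconnection networks such as the cube-connected cycles in O(log^2 N) time, the bound the featured paper improves to O(log N):

```python
def bitonic_sort(xs, ascending=True):
    """Batcher's bitonic sorter (len(xs) must be a power of two): sort the
    halves in opposite directions, then merge the bitonic result."""
    if len(xs) <= 1:
        return xs
    half = len(xs) // 2
    first = bitonic_sort(xs[:half], True)
    second = bitonic_sort(xs[half:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(xs, ascending):
    if len(xs) <= 1:
        return xs
    half = len(xs) // 2
    # one round of compare-exchanges; the pairs are independent, so a
    # parallel machine does this whole loop in a single step
    for i in range(half):
        if (xs[i] > xs[i + half]) == ascending:
            xs[i], xs[i + half] = xs[i + half], xs[i]
    return (bitonic_merge(xs[:half], ascending) +
            bitonic_merge(xs[half:], ascending))

assert bitonic_sort([7, 3, 8, 1, 6, 2, 5, 4]) == [1, 2, 3, 4, 5, 6, 7, 8]
```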

Journal ArticleDOI
TL;DR: This work describes in detail how to program the cube-connected cycles for efficiently solving a large class of problems that include Fast Fourier transform, sorting, permutations, and derived algorithms.
Abstract: An interconnection pattern of processing elements, the cube-connected cycles (CCC), is introduced which can be used as a general purpose parallel processor. Because its design complies with present technological constraints, the CCC can also be used in the layout of many specialized large scale integrated circuits (VLSI). By combining the principles of parallelism and pipelining, the CCC can emulate the cube-connected machine and the shuffle-exchange network with no significant degradation of performance but with a more compact structure. We describe in detail how to program the CCC for efficiently solving a large class of problems that include Fast Fourier transform, sorting, permutations, and derived algorithms.

1,046 citations

Book
01 Jan 1984

862 citations

Proceedings ArticleDOI
11 May 1981
TL;DR: This paper shows that there exists an N-processor realistic computer that can simulate arbitrary idealistic N-processor parallel computations with only a factor of O(log N) loss of runtime efficiency, and isolates a combinatorial problem that lies at the heart of this question.
Abstract: In this paper we isolate a combinatorial problem that, we believe, lies at the heart of this question and provide some encouragingly positive solutions to it. We show that there exists an N-processor realistic computer that can simulate arbitrary idealistic N-processor parallel computations with only a factor of O(log N) loss of runtime efficiency. The main innovation is an O(log N) time randomized routing algorithm. Previous approaches were based on sorting or permutation networks, and implied loss factors of order at least (log N)2.
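A hedged Python sketch of the two-phase idea by which this style of randomized routing is usually described on the hypercube: every packet first travels to an independently chosen random intermediate node, then on to its true destination, fixing address bits in order. Queueing and congestion are not simulated; this computes routes only:

```python
import random

def bit_fixing_path(src, dst, d):
    """Oblivious hypercube route: correct differing address bits in order."""
    path, cur = [src], src
    for i in range(d):
        if (cur ^ dst) >> i & 1:
            cur ^= 1 << i
            path.append(cur)
    return path

def two_phase_route(src, dst, d):
    """Phase 1: go to a random intermediate node. Phase 2: continue to the
    true destination. Randomizing the midpoint breaks up the worst-case
    permutations that make purely oblivious routing slow."""
    mid = random.randrange(2 ** d)
    return bit_fixing_path(src, mid, d) + bit_fixing_path(mid, dst, d)[1:]

route = two_phase_route(0, 2 ** 10 - 1, d=10)   # on a 1024-node hypercube
assert route[0] == 0 and route[-1] == 2 ** 10 - 1
```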

694 citations