Proceedings ArticleDOI

Revisiting the combining synchronization technique

TLDR
This paper revisits the combining technique with the goal to discover where its real performance power resides and whether or how ensuring some desired properties would impact performance, and presents two new implementations of this technique which outperform by far all previous state-of-the-art combining-based and fine-grain synchronization algorithms.
Abstract
Fine-grain thread synchronization has been shown, in several cases, to be outperformed by efficient implementations of the combining technique, in which a single thread, called the combiner, holds a coarse-grain lock and serves, in addition to its own synchronization request, the active requests announced by other threads while they wait by performing some form of spinning. Efficient implementations of this technique significantly reduce the cost of synchronization, so in many cases they exhibit much better performance than the most efficient finely synchronized algorithms.

In this paper, we revisit the combining technique with the goal of discovering where its real performance power resides and whether or how ensuring some desired properties (e.g., fairness in serving requests) would impact performance. We do so by presenting two new implementations of this technique; the first (CC-Synch) addresses systems that support coherent caches, whereas the second (DSM-Synch) works better in cacheless NUMA machines. In comparison to previous such implementations, the new implementations (1) provide bounds on the number of remote memory references (RMRs) that they perform, (2) support a stronger notion of fairness, and (3) use simpler and less basic primitives than previous approaches. In all our experiments, the new implementations outperform by far all previous state-of-the-art combining-based and fine-grain synchronization algorithms. Our experimental analysis sheds light on the questions we aimed to answer.

Several modern multi-core systems organize the cores into clusters and provide fast communication within the same cluster and much slower communication across clusters. We present a hierarchical version of CC-Synch, called H-Synch, which exploits the hierarchical communication nature of such systems to achieve better performance. Experiments show that H-Synch significantly outperforms previous state-of-the-art hierarchical approaches.

We provide new implementations of common shared data structures (like stacks and queues) based on CC-Synch, DSM-Synch and H-Synch. Our experiments show that these implementations outperform by far all previous (fine-grain or combining-based) implementations of shared stacks and queues.
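
The general combining idea described in the abstract (threads announce requests, and whichever thread holds the coarse-grain lock serves all of them) can be illustrated with a minimal sketch. This is not CC-Synch, DSM-Synch or H-Synch: the class `CombiningCounter`, the per-thread slot layout, the helper `run_demo` and the use of a fetch-and-add counter as the shared object are all assumptions made for illustration, and the sketch provides neither the RMR bounds nor the fairness guarantees the paper claims for its algorithms.

```python
import threading
import time


class CombiningCounter:
    """Hypothetical shared counter whose updates are applied by a combiner."""

    def __init__(self, n_threads):
        self.value = 0                  # shared state, touched only under the lock
        self.lock = threading.Lock()    # coarse-grain lock held by the combiner
        # One announcement slot per thread: [pending, argument, result].
        self.slots = [[False, 0, None] for _ in range(n_threads)]
        self.served_for_others = 0      # requests a combiner applied for other threads

    def fetch_and_add(self, tid, amount):
        slot = self.slots[tid]
        slot[1] = amount
        slot[0] = True                  # announce the request
        while True:
            if not slot[0]:
                return slot[2]          # some combiner already served this request
            if self.lock.acquire(blocking=False):
                try:
                    self._serve_all(tid)    # this thread is now the combiner
                finally:
                    self.lock.release()
                return slot[2]          # own request was served by now
            time.sleep(0)               # spin politely: yield to other threads

    def _serve_all(self, combiner_tid):
        # Apply every announced request, including the combiner's own.
        for tid, slot in enumerate(self.slots):
            if slot[0]:
                slot[2] = self.value    # result: value before the addition
                self.value += slot[1]
                slot[0] = False         # clear the flag last, after the result is set
                if tid != combiner_tid:
                    self.served_for_others += 1


def run_demo(n_threads=4, ops_per_thread=200):
    """Run n_threads workers; return the final value and all returned results."""
    counter = CombiningCounter(n_threads)
    results = [[] for _ in range(n_threads)]

    def worker(tid):
        for _ in range(ops_per_thread):
            results[tid].append(counter.fetch_and_add(tid, 1))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, [r for per_thread in results for r in per_thread]


if __name__ == "__main__":
    final_value, all_results = run_demo()
    print(final_value, len(set(all_results)))
```

In this sketch a waiting thread either finds its request already served or becomes the combiner itself, so every call returns after at most one combining pass by the caller. Nothing prevents one thread from repeatedly winning the lock, which is the kind of fairness issue the abstract says the new implementations address.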


Citations
Proceedings ArticleDOI

Ligra: a lightweight graph processing framework for shared memory

TL;DR: This paper presents a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
Proceedings ArticleDOI

Fast concurrent queues for x86 processors

TL;DR: This paper takes a different approach, showing how to rely on fetch-and-add (F&A), a less powerful primitive that is available on x86 processors, to construct a nonblocking (lock-free) linearizable concurrent FIFO queue which outperforms combining-based implementations by 1.5x to 2.5x.
Journal ArticleDOI

Lock Cohorting: A General Technique for Designing NUMA Locks

TL;DR: This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful, and allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock.
Proceedings ArticleDOI

Concurrent Data Structures for Near-Memory Computing

TL;DR: This paper is the first to examine the design of concurrent data structures for PIM, and shows two main results: (1) naive PIM data structures cannot outperform state-of-the-art concurrent data structures, and (2) novel designs for PIM data structures, using techniques such as combining, partitioning and pipelining, can outperform traditional concurrent data structures, with a significantly simpler design.
Proceedings ArticleDOI

A compiler for throughput optimization of graph algorithms on GPUs

TL;DR: This paper argues that three optimizations, called throughput optimizations, are key to high performance for this application class, and implements these optimizations in a compiler that produces CUDA code from an intermediate-level program representation called IrGL.
References
Proceedings ArticleDOI

Validity of the single processor approach to achieving large scale computing capabilities

TL;DR: In this paper, the authors argue that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution.
Journal ArticleDOI

Linearizability: a correctness condition for concurrent objects

TL;DR: This paper defines linearizability, compares it to other correctness conditions, presents and demonstrates a method for proving the correctness of implementations, and shows how to reason about concurrent objects, given they are linearizable.
Journal ArticleDOI

Wait-free synchronization

TL;DR: A hierarchy of objects is derived such that no object at one level has a wait-free implementation in terms of objects at lower levels, and it is shown that atomic read/write registers, which have been the focus of much recent attention, are at the bottom of the hierarchy.
Journal ArticleDOI

Algorithms for scalable synchronization on shared-memory multiprocessors

TL;DR: The principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors, and the existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides protection against so-called “dance hall” architectures.
Book

Algorithms for scalable synchronization on shared-memory multiprocessors

TL;DR: In this article, the authors present a scalable algorithm for spin locks that provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction.