Revisiting the combining synchronization technique

doi:10.1145/2145816.2145849

Proceedings Article

Panagiota Fatourou, +1 more

- Vol. 47, Iss: 8, pp 257-266

TLDR

This paper revisits the combining technique to discover where its real performance power resides and whether, and how, ensuring some desired properties would impact performance. It presents two new implementations of the technique that outperform by far all previous state-of-the-art combining-based and fine-grain synchronization algorithms.

Abstract:

Fine-grain thread synchronization has been shown, in several cases, to be outperformed by efficient implementations of the combining technique, in which a single thread, called the combiner, holds a coarse-grain lock and serves, in addition to its own synchronization request, the active requests announced by other threads while they wait by performing some form of spinning. Efficient implementations of this technique significantly reduce the cost of synchronization, so in many cases they exhibit much better performance than the most efficient finely synchronized algorithms.

In this paper, we revisit the combining technique with the goal of discovering where its real performance power resides and whether, or how, ensuring some desired properties (e.g., fairness in serving requests) would impact performance. We do so by presenting two new implementations of this technique: the first (CC-Synch) addresses systems that support coherent caches, whereas the second (DSM-Synch) works better on cacheless NUMA machines. In comparison to previous such implementations, the new implementations (1) provide bounds on the number of remote memory references (RMRs) that they perform, (2) support a stronger notion of fairness, and (3) use simpler and less basic primitives than previous approaches. In all our experiments, the new implementations outperform by far all previous state-of-the-art combining-based and fine-grain synchronization algorithms. Our experimental analysis sheds light on the questions we aimed to answer.

Several modern multi-core systems organize the cores into clusters and provide fast communication within the same cluster and much slower communication across clusters. We present a hierarchical version of CC-Synch, called H-Synch, which exploits the hierarchical communication nature of such systems to achieve better performance. Experiments show that H-Synch significantly outperforms previous state-of-the-art hierarchical approaches.

We provide new implementations of common shared data structures (like stacks and queues) based on CC-Synch, DSM-Synch and H-Synch. Our experiments show that these implementations outperform by far all previous (fine-grain or combining-based) implementations of shared stacks and queues.
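The general combining idea described in the abstract can be sketched in a few lines. The example below is only an illustration of the pattern (announce a request, then either become the combiner or wait until the combiner has served it); it is not CC-Synch, DSM-Synch or H-Synch, and it makes no claim about RMR bounds or fairness. All names (`Request`, `CombiningCounter`, `fetch_and_add`, `run_demo`) are hypothetical, and the sketch assumes CPython, where `deque.append` and `deque.popleft` are thread-safe.

```python
import threading
from collections import deque


class Request:
    """One announced operation: add `amount` to the shared counter."""

    def __init__(self, amount):
        self.amount = amount
        self.result = None
        self.done = threading.Event()


class CombiningCounter:
    """Illustrative combining counter (hypothetical, not the paper's algorithm)."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()   # coarse-grain lock held by the combiner
        self.pending = deque()         # requests announced by all threads

    def _combine(self):
        # Runs only while holding self.lock: serve every announced request.
        while True:
            try:
                req = self.pending.popleft()
            except IndexError:
                return
            req.result = self.value    # value before the addition
            self.value += req.amount
            req.done.set()

    def fetch_and_add(self, amount):
        req = Request(amount)
        self.pending.append(req)       # announce the request
        while not req.done.is_set():
            if self.lock.acquire(blocking=False):
                # This thread becomes the combiner and serves all
                # pending requests, including its own.
                try:
                    self._combine()
                finally:
                    self.lock.release()
            else:
                # Another thread is combining; wait briefly, then re-check.
                req.done.wait(timeout=0.001)
        return req.result


def run_demo(n_threads, ops_per_thread):
    """Run concurrent increments; return (final value, all returned values)."""
    counter = CombiningCounter()
    results = []

    def worker():
        for _ in range(ops_per_thread):
            results.append(counter.fetch_and_add(1))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, results
```

Because every update is applied by whichever thread currently holds the lock, each call to `fetch_and_add` returns a distinct previous value. The waiting loop uses a short timed wait instead of pure spinning, a simplification chosen to keep the sketch well behaved under Python's global interpreter lock.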
