Showing papers on "Concurrent data structure published in 2011"


Journal ArticleDOI
26 Jan 2011
TL;DR: The basic idea is that clients pass the ghost code required to instantiate an operation's specification for a specific client scenario into the operation in a simple form of higher-order programming, which enables fully general specification of fine-grained concurrent data structures.
Abstract: Compared to coarse-grained external synchronization of operations on data structures shared between concurrent threads, fine-grained, internal synchronization can offer stronger progress guarantees and better performance. However, fully specifying operations that perform internal synchronization modularly is a hard, open problem. The state-of-the-art approaches, based on linearizability or on concurrent abstract predicates, have important limitations on the expressiveness of specifications. Linearizability does not support ownership transfer, and the concurrent abstract predicates-based specification approach requires hardcoding a particular usage protocol. In this paper, we propose a novel approach that lifts these limitations and enables fully general specification of fine-grained concurrent data structures. The basic idea is that clients pass the ghost code required to instantiate an operation's specification for a specific client scenario into the operation, in a simple form of higher-order programming. We machine-checked the theory of the paper using the Coq proof assistant. Furthermore, we implemented the approach in our program verifier VeriFast and used it to verify two challenging fine-grained concurrent data structures from the literature: a multiple-compare-and-swap algorithm and a lock-coupling list.
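
The lock-coupling list mentioned above is a classic exercise in fine-grained internal synchronization. For readers unfamiliar with the pattern, the sketch below shows generic hand-over-hand locking in C; it is our illustration of the kind of data structure the paper verifies, not the paper's verified code, and all names are ours.

    /* Minimal lock-coupling ("hand-over-hand") sorted list sketch.
       Sentinels with INT_MIN/INT_MAX keys keep traversal simple;
       assumes inserted keys are strictly between the sentinels. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <limits.h>

    typedef struct node {
        int key;
        struct node *next;
        pthread_mutex_t lock;
    } node;

    static node *new_node(int key, node *next) {
        node *n = malloc(sizeof *n);
        n->key = key;
        n->next = next;
        pthread_mutex_init(&n->lock, NULL);
        return n;
    }

    node *list_create(void) {
        node *tail = new_node(INT_MAX, NULL);
        return new_node(INT_MIN, tail);     /* head sentinel */
    }

    /* At most two locks are held at a time: the predecessor's and the
       current node's, released in traversal order (the "coupling"). */
    int list_insert(node *head, int key) {
        pthread_mutex_lock(&head->lock);
        node *pred = head;
        pthread_mutex_lock(&pred->next->lock);
        node *curr = pred->next;
        while (curr->key < key) {
            pthread_mutex_unlock(&pred->lock);  /* drop the lock behind us */
            pred = curr;
            pthread_mutex_lock(&curr->next->lock);
            curr = curr->next;
        }
        int added = 0;
        if (curr->key != key) {                 /* no duplicate keys */
            pred->next = new_node(key, curr);
            added = 1;
        }
        pthread_mutex_unlock(&curr->lock);
        pthread_mutex_unlock(&pred->lock);
        return added;
    }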

93 citations


Proceedings ArticleDOI
17 Jul 2011
TL;DR: Relaxer, a combination of predictive dynamic analysis and software testing, helps programmers write correct, highly concurrent programs; it generates many executions of benchmark programs that violate sequential consistency, highlighting a number of bugs under relaxed memory models.
Abstract: High-performance concurrent libraries, such as lock-free data structures and custom synchronization primitives, are notoriously difficult to write correctly. Such code is often implemented without locks, instead using plain loads and stores and low-level operations like atomic compare-and-swaps and explicit memory fences. Such code must run correctly despite the relaxed memory model of the underlying compiler, virtual machine, and/or hardware. These memory models may reorder the reads and writes issued by a thread, greatly complicating parallel reasoning. We propose Relaxer, a combination of predictive dynamic analysis and software testing, to help programmers write correct, highly-concurrent programs. Our technique works in two phases. First, Relaxer examines a sequentially-consistent run of a program under test and dynamically detects potential data races. These races are used to predict possible violations of sequential consistency under alternate executions on a relaxed memory model. In the second phase, Relaxer re-executes the program with a biased random scheduler and with a conservative simulation of a relaxed memory model in order to create with high probability a predicted sequential consistency violation. These executions can be used to test whether or not a program works as expected when the underlying memory model is not sequentially consistent. We have implemented Relaxer for C and have evaluated it on several synchronization algorithms, concurrent data structures, and parallel applications. Relaxer generates many executions of these benchmarks with violations of sequential consistency, highlighting a number of bugs under relaxed memory models.
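
The kind of sequential consistency violation Relaxer predicts is easiest to see in the classic store-buffering litmus test: under sequential consistency at least one thread must observe the other's store, but hardware store buffers (as on x86-TSO) allow both loads to return 0. The harness below is our own minimal reproduction in C11, not part of the tool:

    /* Store-buffering litmus test: r1 == 0 && r2 == 0 is impossible under
       sequential consistency but permitted by relaxed memory models.
       Build with: cc -std=c11 -pthread sb.c */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x, y;
    int r1, r2;

    void *t0(void *arg) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    void *t1(void *arg) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < 100000; i++) {
            atomic_store(&x, 0);
            atomic_store(&y, 0);
            pthread_t a, b;
            pthread_create(&a, NULL, t0, NULL);
            pthread_create(&b, NULL, t1, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            if (r1 == 0 && r2 == 0)   /* forbidden under SC */
                printf("SC violation observed at iteration %d\n", i);
        }
        return 0;
    }

Whether the violation actually appears depends on the hardware and compiler; Relaxer's contribution is predicting such outcomes from an SC run and then steering the schedule to provoke them.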

77 citations


Journal ArticleDOI
TL;DR: This paper proposes a more general definition of context-bounded analysis useful for programs with dynamic thread creation, considers several variants based on this new definition, and establishes decidability and complexity results for the analyses they induce.
Abstract: Context-bounded analysis has been shown to be both efficient and effective at finding bugs in concurrent programs. According to its original definition, context-bounded analysis explores all behaviors of a concurrent program up to some fixed number of context switches between threads. This definition is inadequate for programs that create threads dynamically because bounding the number of context switches in a computation also bounds the number of threads involved in the computation. In this paper, we propose a more general definition of context-bounded analysis useful for programs with dynamic thread creation. The idea is to bound the number of context switches for each thread instead of bounding the number of switches of all threads. We consider several variants based on this new definition, and we establish decidability and complexity results for the analysis induced by them.
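
To make the distinction concrete: with a global bound, adding threads eats into one shared budget, whereas the per-thread definition lets the permitted number of switches grow with the number of threads. The toy enumerator below is entirely our own construction; it counts schedules of abstract threads in which each thread is preempted at most K times:

    /* Count interleavings of NTHREADS abstract threads (NSTEPS steps each)
       subject to a PER-THREAD context-switch budget K: a switch is charged
       to the thread that is preempted while still unfinished. */
    #include <stdio.h>

    #define NTHREADS 3
    #define NSTEPS   4
    #define K        2   /* per-thread budget (hypothetical value) */

    static long explore(int done[NTHREADS], int switches[NTHREADS], int last) {
        int finished = 1;
        for (int t = 0; t < NTHREADS; t++)
            if (done[t] < NSTEPS) { finished = 0; break; }
        if (finished) return 1;

        long count = 0;
        for (int t = 0; t < NTHREADS; t++) {
            if (done[t] == NSTEPS) continue;
            /* switching away from a still-running thread charges it */
            int charged = (last >= 0 && last != t && done[last] < NSTEPS);
            if (charged && switches[last] >= K) continue;
            if (charged) switches[last]++;
            done[t]++;
            count += explore(done, switches, t);
            done[t]--;
            if (charged) switches[last]--;
        }
        return count;
    }

    int main(void) {
        int done[NTHREADS] = {0}, switches[NTHREADS] = {0};
        printf("schedules within per-thread bound %d: %ld\n",
               K, explore(done, switches, -1));
        return 0;
    }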

73 citations


Journal ArticleDOI
26 Jan 2011
TL;DR: This paper presents the first semantics of separation logic that is sensitive to atomicity, and shows how to control this sensitivity through ownership, and develops a new rely-guarantee method that is localized to the definition of a data structure.
Abstract: Fine-grained concurrent data structures are crucial for gaining performance from multiprocessing, but their design is a subtle art. Recent literature has made large strides in verifying these data structures, using either atomicity refinement or separation logic with rely-guarantee reasoning. In this paper we show how the ownership discipline of separation logic can be used to enable atomicity refinement, and we develop a new rely-guarantee method that is localized to the definition of a data structure. We present the first semantics of separation logic that is sensitive to atomicity, and show how to control this sensitivity through ownership. The result is a logic that enables compositional reasoning about atomicity and interference, even for programs that use fine-grained synchronization and dynamic memory allocation.

37 citations


Proceedings ArticleDOI
14 Jun 2011
TL;DR: This work develops a new class of parallel data structures called Smart Data Structures that leverage online machine learning to adapt automatically and demonstrates significant improvements over the best existing algorithms under a variety of conditions.
Abstract: As multicores become prevalent, the complexity of programming is skyrocketing. One major difficulty is efficiently orchestrating collaboration among threads through shared data structures. Unfortunately, choosing and hand-tuning data structure algorithms to get good performance across a variety of machines and inputs is a herculean task to add to the fundamental difficulty of getting a parallel program correct. To help mitigate these complexities, this work develops a new class of parallel data structures called Smart Data Structures that leverage online machine learning to adapt automatically. We prototype and evaluate an open source library of Smart Data Structures for common parallel programming needs and demonstrate significant improvements over the best existing algorithms under a variety of conditions. Our results indicate that learning is a promising technique for balancing and adapting to complex, time-varying tradeoffs and achieving the best performance available.
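
The library's learning machinery is more sophisticated (a background reinforcement-learning thread), but the core idea of choosing among interchangeable implementations by measured reward can be sketched as a simple epsilon-greedy bandit. Everything below is our illustrative construction, not the library's API:

    /* Epsilon-greedy selector over two hypothetical queue implementations,
       e.g. a lock-based and a lock-free variant. The application reports
       measured throughput; the selector adapts its choice over time. */
    #include <stdlib.h>

    #define NALGS 2

    typedef struct {
        double avg_reward[NALGS];  /* running mean throughput per algorithm */
        long   trials[NALGS];
        double epsilon;            /* exploration rate, e.g. 0.1 */
    } selector;

    int select_alg(selector *s) {
        if ((double)rand() / RAND_MAX < s->epsilon)
            return rand() % NALGS;                           /* explore */
        return s->avg_reward[0] >= s->avg_reward[1] ? 0 : 1; /* exploit */
    }

    /* Called after each measurement interval with observed throughput. */
    void report(selector *s, int alg, double throughput) {
        s->trials[alg]++;
        s->avg_reward[alg] +=
            (throughput - s->avg_reward[alg]) / s->trials[alg];
    }

In the real system the reward signal and the switching between implementations are managed online and per data structure; this cartoon only shows why a learned choice can track time-varying tradeoffs that no single hand-tuned algorithm wins everywhere.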

32 citations


Book ChapterDOI
13 Dec 2011
TL;DR: It is shown empirically that the COP approach can enhance a software transactional memory (STM) framework to deliver more efficient concurrent data structures from serial source code, in some cases with performance comparable to that of more complex fine-grained structures.
Abstract: It is well known that guaranteeing program consistency when accessing shared data comes at the price of degraded performance and scalability. This paper initiates the investigation of consistency oblivious programming (COP). In COP, sections of concurrent code that meet certain criteria are executed without checking for consistency. However, checkpoints are added before any shared data modification to verify that the algorithm was on the right track; if not, the section is re-executed in a more conservative and expensive, consistent way. We show empirically that the COP approach can enhance a software transactional memory (STM) framework to deliver more efficient concurrent data structures from serial source code. In some cases the COP code delivers performance comparable to that of more complex fine-grained structures.
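
A minimal COP-flavored example, assuming an insert-only sorted linked list (so nodes stay reachable and validation stays cheap): the search runs with no consistency checks, and a checkpoint under a lock verifies the result before the write, re-executing on failure. This is our sketch of the idea, not the paper's STM-based code:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct node { int key; _Atomic(struct node *) next; } node;

    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
    static node sentinel;   /* head sentinel; list is sorted, insert-only */

    /* Oblivious phase: traverse with no locks and no consistency checks. */
    static node *find_pred(int key) {
        node *p = &sentinel;
        node *n = atomic_load_explicit(&p->next, memory_order_acquire);
        while (n && n->key < key) {
            p = n;
            n = atomic_load_explicit(&p->next, memory_order_acquire);
        }
        return p;
    }

    int cop_insert(node *fresh) {
        int key = fresh->key;
        for (;;) {
            node *pred = find_pred(key);      /* may see a stale window */
            pthread_mutex_lock(&list_lock);   /* checkpoint before writing */
            node *succ = atomic_load_explicit(&pred->next,
                                              memory_order_relaxed);
            if (succ == NULL || succ->key >= key) {  /* still the spot? */
                if (succ && succ->key == key) {      /* duplicate key */
                    pthread_mutex_unlock(&list_lock);
                    return 0;
                }
                atomic_store_explicit(&fresh->next, succ,
                                      memory_order_relaxed);
                atomic_store_explicit(&pred->next, fresh,
                                      memory_order_release);
                pthread_mutex_unlock(&list_lock);
                return 1;
            }
            pthread_mutex_unlock(&list_lock); /* raced: re-execute */
        }
    }

The insert-only assumption is what makes the checkpoint a constant-time window check; with removals, the validation (or the conservative fallback) has to do more work, which is exactly the tradeoff the paper explores.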

27 citations


Proceedings ArticleDOI
21 May 2011
TL;DR: It is proved that symmetry reduction and partial order reduction can be combined in the approach and integrated into the model checking algorithm, whose efficiency is demonstrated on a number of real-world concurrent data structure algorithms.
Abstract: Concurrent data structures are widely used but notoriously difficult to implement correctly. Linearizability is one main correctness criterion for concurrent data structure algorithms. It guarantees that a concurrent data structure appears as a sequential one to users. Unfortunately, linearizability is challenging to verify since a subtle bug may only manifest in a small portion of numerous thread interleavings. Model checking is therefore a natural candidate. However, current approaches to model checking linearizability suffer from a severe state-space explosion problem and are thus restricted to handling only a few threads and/or operations. This paper describes a scalable, fully automatic and general linearizability checking method based on [16] by incorporating symmetry and partial order reduction techniques. Our insights emerged from the observation that the similarity of threads using concurrent data structures causes model checking to generate large redundant equivalent portions of the state space, and the loose coupling of threads causes it to explore lots of unnecessary transition execution orders. We prove that symmetry reduction and partial order reduction can be combined in our approach and integrate them into the model checking algorithm. We demonstrate its efficiency using a number of real-world concurrent data structure algorithms.

25 citations


Book ChapterDOI
08 Sep 2011
TL;DR: This paper describes an implementation of a non-blocking concurrent hash trie based on single-word compare-and-swap instructions in a shared-memory system and shows that the implementation is linearizable and lock-free.
Abstract: This paper describes an implementation of a non-blocking concurrent hash trie based on single-word compare-and-swap instructions in a shared-memory system. Insert, lookup and remove operations modifying different parts of the hash trie can be run completely independently. Remove operations ensure that the unneeded memory is freed and that the trie is kept compact. Pseudocode for these operations is presented and a proof of correctness is given – we show that the implementation is linearizable and lock-free. Finally, benchmarks are presented that compare concurrent hash trie operations against the corresponding operations on other concurrent data structures.
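
The heart of such a hash trie is updating a node by copy-and-CAS through an indirection node, so concurrent operations on different branches never interfere. The fragment below is a heavily simplified C rendition of that single step (the actual structure, implemented in Scala, adds bitmapped nodes, compression, and relies on garbage collection); all names are ours:

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <string.h>

    #define BRANCH 4   /* tiny fan-out for illustration; real tries use 32 */

    /* Branch nodes are immutable once published, so their fields need no
       atomics; only the indirection pointer is contended. */
    typedef struct cnode {
        void *branch[BRANCH];       /* child pointers or key payloads */
    } cnode;

    typedef struct inode {          /* indirection node: the CAS target */
        _Atomic(cnode *) main;
    } inode;

    /* Returns 1 on success, 0 if a concurrent writer won and the caller
       must retry from the root (lock-free: someone always progresses). */
    int cnode_insert(inode *in, int slot, void *payload) {
        cnode *old = atomic_load_explicit(&in->main, memory_order_acquire);
        cnode *upd = malloc(sizeof *upd);
        memcpy(upd, old, sizeof *upd);  /* copy, then modify the copy */
        upd->branch[slot] = payload;
        if (atomic_compare_exchange_strong(&in->main, &old, upd))
            return 1;   /* note: old is leaked here; the Scala original
                           relies on GC, a C port would need hazard
                           pointers or epoch-based reclamation */
        free(upd);      /* lost the race: discard our copy */
        return 0;
    }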

25 citations


Book ChapterDOI
31 Aug 2011
TL;DR: Michael's hazard pointers method is an elegant solution to safe memory reclamation for lock-free data structures; this work presents a mechanized proof of the major correctness and progress aspects of a lock-free stack with hazard pointers.
Abstract: A significant problem of lock-free concurrent data structures in an environment without garbage collection is to ensure safe memory reclamation of objects that are removed from the data structure. An elegant solution to this problem is Michael's hazard pointers method. The formal verification of concurrent algorithms with hazard pointers is yet challenging. This work presents a mechanized proof of the major correctness and progress aspects of a lock-free stack with hazard pointers.
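
For context, the protocol being verified has a small core: before dereferencing a node, a thread publishes the pointer in a hazard slot and re-validates it, so no other thread may reclaim that node. A sketch of a protected Treiber-stack pop in C11 follows; this is the textbook shape of Michael's scheme with retirement and scanning elided, not the verified artifact itself:

    #include <stdatomic.h>
    #include <stddef.h>

    #define MAX_THREADS 8   /* simplified: one hazard pointer per thread */

    typedef struct node { void *data; struct node *next; } node;

    static _Atomic(node *) top;
    static _Atomic(node *) hazard[MAX_THREADS];

    /* Placeholder: a real retire() defers the free until no hazard slot
       holds n (it scans hazard[] per Michael's scheme). */
    void retire(node *n) { (void)n; }

    void *pop(int tid) {
        node *t;
        for (;;) {
            t = atomic_load(&top);
            if (t == NULL) return NULL;
            atomic_store(&hazard[tid], t);   /* announce: do not free t */
            if (atomic_load(&top) != t)      /* re-validate after announcing */
                continue;
            node *next = t->next;            /* safe: t cannot be freed now */
            if (atomic_compare_exchange_weak(&top, &t, next))
                break;
        }
        atomic_store(&hazard[tid], NULL);
        void *data = t->data;
        retire(t);   /* deferred reclamation, the crux of the proof effort */
        return data;
    }

The announce-then-revalidate pair is exactly where informal arguments tend to go wrong, which is why a mechanized proof of this algorithm is a meaningful benchmark.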

21 citations


Proceedings ArticleDOI
16 May 2011
TL;DR: Software techniques are proposed to enhance the MPI collective communication primitives MPI_Bcast and MPI_Allreduce in virtual node mode, using the cache-coherent memory subsystem as the communication method within the node.
Abstract: The Blue Gene/P (BG/P) supercomputer consists of thousands of compute nodes interconnected by multiple networks. Out of these, a 3D torus equipped with a direct memory access (DMA) engine is the primary network. BG/P also features a collective network which supports hardware-accelerated collective operations such as broadcast and allreduce. One of the operating modes on BG/P is the virtual node mode, where the four cores can be active MPI tasks, performing inter-node and intra-node communication. This paper proposes software techniques to enhance the MPI collective communication primitives MPI_Bcast and MPI_Allreduce in virtual node mode by using the cache-coherent memory subsystem as the communication method within the node. The paper describes techniques leveraging atomic operations to design concurrent data structures such as broadcast FIFOs to enable efficient collectives. Such mechanisms are important as we expect core counts to rise in the future, and having such data structures makes programming easier and more efficient. We also demonstrate the utility of shared address space techniques for MPI collectives, wherein a process can access a peer's memory through specialized system calls. Apart from cutting down the copy costs, such techniques allow for seamless integration of network protocols with intra-node communication methods. We propose intra-node extensions to multi-color network algorithms for collectives using lightweight synchronizing structures and atomic operations. Further, we demonstrate that shared address techniques allow for good load balancing and are critical for efficiently using the hardware collective network on BG/P. When compared to current approaches on the 3D torus, our optimizations provide speedups of almost 3x for MPI_Bcast and a 33% performance gain for MPI_Allreduce (in virtual node mode). We also see improvements of up to 44% for MPI_Bcast using the collective tree network.
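
One plausible shape for such an intra-node broadcast FIFO is a shared slot ring guarded by an atomic sequence counter: the root writes the payload, then bumps the counter with release semantics, and the peer cores spin with acquire loads. The sketch below is our reconstruction of the idea, not BG/P source code, and it omits flow control (a fast root could lap slow peers):

    #include <stdatomic.h>
    #include <string.h>

    #define SLOT_BYTES 512
    #define NSLOTS 16

    typedef struct {
        _Atomic unsigned long seq;        /* messages published so far */
        char slot[NSLOTS][SLOT_BYTES];    /* ring of payload slots */
    } bcast_fifo;

    /* Root core: write payload, then publish with a release store so the
       payload bytes become visible before the counter bump. */
    void bcast_root(bcast_fifo *f, const void *buf, size_t len) {
        unsigned long s = atomic_load_explicit(&f->seq,
                                               memory_order_relaxed);
        memcpy(f->slot[s % NSLOTS], buf, len);
        atomic_store_explicit(&f->seq, s + 1, memory_order_release);
    }

    /* Peer core: wait for message number `next`, acquire pairing with the
       root's release. Real code would back off instead of hot-spinning
       and would throttle the root so slots are not overwritten early. */
    void bcast_peer(bcast_fifo *f, unsigned long next, void *buf, size_t len) {
        while (atomic_load_explicit(&f->seq, memory_order_acquire) <= next)
            ;
        memcpy(buf, f->slot[next % NSLOTS], len);
    }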

21 citations


Proceedings ArticleDOI
06 Jun 2011
TL;DR: It is shown that k-FIFO queues whose implementations are based on state-of-the-art FIFO queues, which typically do not scale under high contention, provide scalability; probabilistic versions of k-FIFO queues improve scalability further but bound semantical deviation only with high probability.
Abstract: Maintaining data structure semantics of concurrent queues such as first-in first-out (FIFO) ordering requires expensive synchronization mechanisms which limit scalability. However, deviating from the original semantics of a given data structure may allow for a higher degree of scalability and yet be tolerated by many concurrent applications. We introduce the notion of a k-FIFO queue which may be out of FIFO order up to a constant k (called semantical deviation). Implementations of k-FIFO queues may be distributed and therefore be accessed unsynchronized while still being starvation-free. We show that k-FIFO queues whose implementations are based on state-of-the-art FIFO queues, which typically do not scale under high contention, provide scalability. Moreover, probabilistic versions of k-FIFO queues improve scalability further but only bound semantical deviation with high probability.
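
The construction behind a k-FIFO queue can be pictured as a linked list of segments of k slots: an enqueue may claim any free slot in the tail segment, so two elements can overtake each other by at most k positions (the semantical deviation). A minimal sketch of the slot-claiming step, with segment appending and dequeue elided and all names ours:

    #include <stdatomic.h>
    #include <stddef.h>

    #define K 8   /* slots per segment = maximal out-of-FIFO-order distance */

    typedef struct segment {
        _Atomic(void *) slot[K];
        _Atomic(struct segment *) next;
    } segment;

    /* Probe slots starting at a thread-specific offset to spread
       contention; elements within one segment are mutually unordered. */
    int segment_enqueue(segment *tail, void *item, unsigned start) {
        for (unsigned i = 0; i < K; i++) {
            void *expected = NULL;
            if (atomic_compare_exchange_strong(
                    &tail->slot[(start + i) % K], &expected, item))
                return 1;   /* claimed a free slot */
        }
        return 0;   /* segment full: caller appends a new segment (elided) */
    }

Because enqueuers contend on K independent words instead of one tail pointer, throughput scales with contention, which is the paper's central observation.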

Journal ArticleDOI
TL;DR: This work proposes a highly concurrent software implementation of krmw with only constant space overhead, ensuring that two operations delay each other only if they are within distance O(k) in the conflict graph induced by the operations' data items.


Proceedings ArticleDOI
06 Jul 2011
TL;DR: An algorithm for parallel state space construction is proposed, based on an original concurrent data structure called a localization table, that aims at better spatial and temporal balance without sacrificing memory affinity and without incurring the performance costs associated with using locks to ensure data consistency.
Abstract: We propose an algorithm for parallel state space construction based on an original concurrent data structure, called a localization table, that aims at better spatial and temporal balance. Our proposal is close in spirit to algorithms based on distributed hash tables, with the distinction that states are dynamically assigned to processors, i.e. we do not rely on an a-priori static partition of the state space. In our solution, every process keeps a share of the global state space. Data distribution and coordination between processes is made through the localization table, which is a lockless, thread-safe data structure that approximates the set of states being processed. The localization table is used to dynamically assign newly discovered states and can be queried to return the identity of the processor that owns a given state. With this approach, we are able to consolidate a network of local hash tables into an (abstract) distributed one without sacrificing memory affinity (data that are "logically connected" remain physically close to each other) and without incurring the performance costs associated with using locks to ensure data consistency. We evaluate the performance of our algorithm on different benchmarks and compare these results with other solutions proposed in the literature and with existing verification tools.
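
A localization table can be pictured as a fixed array of fingerprint-to-owner entries claimed by CAS: a miss claims the state for the calling processor, a hit redirects to the owner, and fingerprint collisions are tolerated because the table only approximates the state set (the owner's exact local hash table rejects foreign states). The sketch below is our reading of that design, with resizing and overflow handling elided and all names ours:

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE (1u << 20)
    #define EMPTY 0u

    typedef struct {
        _Atomic uint64_t entry[TABLE_SIZE];  /* packed: fingerprint | owner */
    } loc_table;

    /* Pack owner into the low byte; +1 so a zero word still means empty.
       The truncated fingerprint makes the table approximate by design. */
    static uint64_t pack(uint64_t fp, uint32_t owner) {
        return (fp << 8) | (owner + 1);
    }

    /* Returns the processor owning state `fp`, claiming it for `self`
       with a single lock-free CAS if the state is newly discovered. */
    uint32_t locate_or_claim(loc_table *t, uint64_t fp, uint32_t self) {
        size_t i = fp % TABLE_SIZE;
        uint64_t want = pack(fp, self), cur = EMPTY;
        if (atomic_compare_exchange_strong(&t->entry[i], &cur, want))
            return self;                     /* we own this new state */
        return (uint32_t)(cur & 0xff) - 1;   /* slot already claimed */
    }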

01 Jan 2011
TL;DR: Progress is described toward a program logic for local reasoning about racy concurrent programs executing on a weak, x86-like memory model.
Abstract: Program logics are formal systems for specifying and reasoning about software programs. Most program logics make the strong assumption that all threads agree on the value of shared memory at all times. However, this assumption can be unsound for programs with races, as in many concurrent data structures. Verification of these difficult programs must take into account the weaker models of memory provided by the architectures on which they execute. In this paper, we describe progress toward a program logic for local reasoning about racy concurrent programs executing on a weak, x86-like memory model.

Dissertation
11 Nov 2011
TL;DR: Transactional Data Structures are introduced: data structures that permit access to past versions, although not all accesses succeed; they form the basis of a concurrent programming solution that supports database-style transactions in memory.
Abstract: Concurrent programming is difficult and the effort is rarely rewarded by faster execution. The concurrency problem arises because information cannot pass instantly between processors, resulting in temporal uncertainty. This thesis explores the idea that immutable data and distributed concurrency control can be combined to allow scalable concurrent execution and make concurrent programming easier. A concurrent system that does not impose a global ordering on events lends itself to a scalable distributed implementation. A concurrent programming environment in which the ordering of events affecting an object is enforced locally has intuitive concurrent semantics. This thesis introduces Transactional Data Structures, which are data structures that permit access to past versions, although not all accesses succeed. These data structures form the basis of a concurrent programming solution that supports database type transactions in memory. Transactional Data Structures permit non-blocking concurrent access to familiar abstract data types such as deques, maps, vectors and priority queues. Using these data structures a programmer can write a concurrent program in C without having to reason about locks. The solution is evaluated by comparing the performance of a concurrent algorithm to calculate the minimum spanning tree of a graph with that of a similar algorithm which uses Transactional Memory, and by comparing a non-blocking Producer Consumer Queue with its blocking counterpart.
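
The enabling trick, access to immutable past versions, has a tiny illustration: if each version is an immutable structure, a reader's snapshot is free, and a writer commits a whole new version with a single CAS on the root, re-running the transaction on failure. The C sketch below is our cartoon of that principle, not the thesis's actual API:

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Immutable cons cell: once published, never modified or reused. */
    typedef struct vnode { int head; const struct vnode *tail; } vnode;

    _Atomic(const vnode *) root;   /* current version of the list */

    const vnode *cons(int x, const vnode *rest) {
        vnode *n = malloc(sizeof *n);
        n->head = x;
        n->tail = rest;
        return n;
    }

    /* A toy transaction: read a version, build a new one, commit by CAS.
       Readers holding `seen` keep a consistent past version for free.
       On failure the speculative node is leaked in this toy; real code
       would free it and retry. */
    int txn_push(int x) {
        const vnode *seen = atomic_load(&root);
        const vnode *next = cons(x, seen);
        return atomic_compare_exchange_strong(&root, &seen, next);
    }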


01 Jan 2011
TL;DR: This work demonstrates a generalized construction technique for concurrent data structures based on relativistic programming, taking into account the natural orderings provided by reader traversals and writer program order, and reconstructs the algorithms for existing relativistic data structures.
Abstract: We present relativistic programming, a concurrent programming model based on shared addressing, which supports efficient, scalable operation on either uniform shared-memory or distributed shared-memory systems. Relativistic programming provides a strong causal ordering property, allowing a series of read operations to appear as an atomic transaction that occurs entirely between two ordered write operations. This preserves the simple immutable-memory programming model available via mutual exclusion or transactional memory. Furthermore, relativistic programming provides joint-access parallelism, allowing readers to run concurrently with a writer on the same data. We demonstrate a generalized construction technique for concurrent data structures based on relativistic programming, taking into account the natural orderings provided by reader traversals and writer program order. Our construction technique specifies the precise placement of memory barriers and synchronization operations. To demonstrate our generalized approach, we reconstruct the algorithms for existing relativistic data structures, replacing the algorithm-specific reasoning with our systematic and rigorous construction. Benchmarks of the resulting relativistic read algorithms demonstrate high performance and linear scalability: relativistic resizable hash-tables demonstrate 56% better lookup performance than the current state of the art in Linux, and relativistic red-black trees show 2.5x better lookup performance than transactional memory.
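
The reader/writer discipline at the core of this model is the familiar publish-with-release, traverse-with-acquire pattern: the writer fully initializes a node before a single publishing store, and readers never block the writer. Our minimal C11 rendition of that core follows; grace-period-based reclamation, which the full technique also prescribes, is elided, and real RCU-style readers use lighter dependency ordering than the acquire loads shown here:

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct rnode { int key; _Atomic(struct rnode *) next; } rnode;

    _Atomic(rnode *) head;

    /* Writer: initialize completely, then publish with release so any
       reader that observes n also observes n->key and n->next. */
    void writer_insert_front(int key) {
        rnode *n = malloc(sizeof *n);
        n->key = key;
        atomic_store_explicit(&n->next, atomic_load(&head),
                              memory_order_relaxed);
        atomic_store_explicit(&head, n, memory_order_release);
    }

    /* Reader: acquire loads pair with the writer's release publish;
       the traversal runs concurrently with writers, never blocking. */
    int reader_contains(int key) {
        for (rnode *p = atomic_load_explicit(&head, memory_order_acquire);
             p != NULL;
             p = atomic_load_explicit(&p->next, memory_order_acquire))
            if (p->key == key) return 1;
        return 0;
    }

The construction technique described above generalizes exactly this kind of reasoning: given the reader traversal order and writer program order, it derives where such barriers must be placed for an arbitrary structure.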

Book ChapterDOI
13 Dec 2011
TL;DR: An automated technique is proposed for optimal instrumentation of multi-threaded programs for debugging and testing of concurrent data structures, and a notion of observability is defined that enables debuggers to trace back and locate errors through data-flow instrumentation.
Abstract: In this paper, we propose an automated technique for optimal instrumentation of multi-threaded programs for debugging and testing of concurrent data structures. We define a notion of observability that enables debuggers to trace back and locate errors through data-flow instrumentation. Observability in a concurrent program enables a debugger to extract the value of a set of desired variables through instrumenting another (possibly smaller) set of variables. We formulate an optimization problem that aims at minimizing the size of the latter set. In order to cope with the exponential complexity of the problem, we present a SAT-based solution. Our approach is fully implemented and experimental results on popular concurrent data structures (e.g., linked lists and red-black trees) show significant performance improvement in optimally-instrumented programs using our method as compared to ad-hoc over-instrumented programs.

01 Jan 2011
TL;DR: This thesis presents the first efficient design of Quicksort for graphics processors, shows that it performs well in comparison with other available sorting methods, and presents the first application of software transactional memory to graphics processors.
Abstract: The convergence of highly parallel many-core graphics processors with conventional multi-core processors is becoming a reality. To allow algorithms and data structures to scale efficiently on these new platforms, several important factors need to be considered. (i) The algorithmic design needs to utilize the inherent parallelism of the problem at hand. Sorting, which is one of the classic computing components in computer science, has a high degree of inherent parallelism. In this thesis we present the first efficient design of Quicksort for graphics processors and show that it performs well in comparison with other available sorting methods. (ii) The work needs to be distributed efficiently across the available processing units. We present an evaluation of a set of dynamic load balancing schemes for graphics processors, comparing blocking methods with non-blocking. (iii) The required synchronization needs to be efficient, composable and easy to use. We present a methodology to easily compose the two most common operations provided by a data structure -- the insertion and deletion of elements. By exploiting a common construction found in most non-blocking data structures, we created a move operation that can atomically move elements between different types of non-blocking data structures, without requiring a specific design for each coupling. We also present, to the best of our knowledge, the first application of software transactional memory to graphics processors. Two different STM designs, one blocking and one obstruction-free, were evaluated on the task of implementing different types of common concurrent data structures on a graphics processor.

01 Jan 2011
TL;DR: The thesis of this dissertation is that cache-conscious, linearizable concurrent data structures for many-core systems will show significant performance improvements over the state of the art in concurrent data structure designs for those applications that must contend with the deleterious effects of the memory wall.
Abstract: The power wall, the instruction-level parallelism wall, and the memory wall are driving a shift in microprocessor design from implicitly parallel architectures towards explicitly parallel architectures. A necessary condition for peak scalability and performance on modern hardware is application execution that is aware of the memory hierarchy. The thesis of this dissertation is that cache-conscious concurrent data structures for many-core systems will show significant performance improvements over the state of the art in concurrent data structure designs for those applications that must contend with the deleterious effects of the memory wall. Lock-free cache-conscious data structures that maintain the abstraction of a linearizable set have been studied previously in the context of unordered data structures. We explore novel alternatives, namely lock-free cache-conscious data structures that maintain the abstraction of a linearizable ordered set. The two primary design contributions of this dissertation are the lock-free skip tree and lock-free burst trie algorithms. In both algorithms, read-only operations are wait-free and modification operations are lock-free. The lock-free skip tree has relaxed structural properties that allow atomic operations to modify the tree without invalidating the consistency of the data structure. We define the dense skip tree as a variation of the skip tree data structure, and prove cache-conscious properties of the dense skip tree. The proof techniques represent a significant departure from the methods outlined in the original skip tree paper. We show that cache-conscious, linearizable concurrent data structures have advantageous performance that can be measured across multiple architecture platforms. The improved performance arises from better treatment of the memory wall phenomenon that is ubiquitous to current multi-core systems and almost certainly will continue to affect future many-core systems.
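
The cache-conscious premise is simple to illustrate: pack as many keys as fit in one cache line into each node, so a single memory-wall-bound fetch funds many comparisons. The layout below is our toy illustration of that property in C11, not the dissertation's skip tree or burst trie:

    #include <stdint.h>

    #define LINE 64
    #define FANOUT ((LINE - sizeof(void *) - sizeof(uint32_t)) \
                    / sizeof(int32_t))   /* 13 keys on a 64-byte line */

    /* _Alignas keeps each node on its own cache line, so scanning a node
       after the initial fetch hits only L1. */
    typedef struct cnode {
        _Alignas(LINE) uint32_t count;  /* keys in use */
        int32_t key[FANOUT];            /* sorted keys sharing one line */
        struct cnode *child;            /* link to next level (simplified) */
    } cnode;

    /* One cache miss loads the node; the scan then runs from L1.
       Returns the rank of the first key >= k. */
    int node_search(const cnode *n, int32_t k) {
        for (uint32_t i = 0; i < n->count; i++)
            if (n->key[i] >= k) return (int)i;
        return (int)n->count;
    }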