Eraser: a dynamic data race detector for multithreaded programs
Summary
1 Introduction
- Multi-threading has become a common programming technique.
- Because simple synchronization errors can produce timing-dependent data races that are hard to track down, many programmers have resisted using threads.
- The authors describe a tool, called Eraser, that dynamically detects data races in multi-threaded programs.
- A locking discipline is a programming policy that ensures the absence of data races.
- Usually a potential data race is a serious error caused by failure to synchronize properly.
2.1 Improving the locking discipline
- The simple locking discipline the authors have used so far is too strict.
- Three very common programming practices violate the discipline yet are free from data races: initialization, read-shared data, and read-write locks.
- Initialization: shared variables are frequently initialized without holding a lock.
- Read-shared data: some shared variables are written only during initialization and are read-only thereafter; these can be safely accessed without locks.
- Read-write locks: these allow multiple readers to access a shared variable, but allow only a single writer to do so.
2.2 Initialization and read-sharing
- Programmers often take advantage of this observation when initializing newly allocated data.
- Unfortunately, the authors have no easy way of knowing when initialization is complete.
- When and if another thread accesses the variable, then the state changes.
- A write access from a new thread changes the state from Exclusive or Shared to the Shared-Modified state, in which the candidate lock set C(v) is updated and races are reported, just as described in the original, simple version of the algorithm.
- The authors' support for initialization makes Eraser’s checking more dependent on the scheduler than the authors would like.
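The state transitions described above can be sketched as a small state machine. This is a minimal illustration, not Eraser's code: the `Shadow` class, the `access` function, and the thread and lock names are hypothetical, the Virgin starting state is taken from the paper's description rather than from this summary, and the sketch simplifies by letting any first access (not only a write) move the variable to Exclusive.

```python
from enum import Enum, auto


class State(Enum):
    VIRGIN = auto()           # never accessed
    EXCLUSIVE = auto()        # accessed by one thread only (e.g. initialization)
    SHARED = auto()           # read by more than one thread, no new-thread writes
    SHARED_MODIFIED = auto()  # written after becoming visible to several threads


class Shadow:
    """Per-variable bookkeeping (hypothetical layout, not Eraser's shadow word)."""

    def __init__(self):
        self.state = State.VIRGIN
        self.owner = None        # first thread that touched the variable
        self.candidates = None   # None stands for "the set of all locks"


def access(shadow, thread, is_write, locks_held):
    """Process one access; return True if a race would be reported."""
    if shadow.state is State.VIRGIN:
        # Simplification: any first access starts the Exclusive phase.
        shadow.state = State.EXCLUSIVE
        shadow.owner = thread
        return False
    if shadow.state is State.EXCLUSIVE:
        if thread == shadow.owner:
            return False  # still single-threaded: no refinement, no report
        shadow.state = State.SHARED_MODIFIED if is_write else State.SHARED
    elif shadow.state is State.SHARED and is_write:
        shadow.state = State.SHARED_MODIFIED

    # Refine the candidate set C(v) with the locks held at this access.
    held = frozenset(locks_held)
    if shadow.candidates is None:
        shadow.candidates = held
    else:
        shadow.candidates &= held

    # Races are reported only in Shared-Modified, when C(v) is empty.
    return shadow.state is State.SHARED_MODIFIED and not shadow.candidates
```

Because the Exclusive phase ends only when a second thread happens to touch the variable, the outcome depends on the order in which accesses are observed, which is the scheduler sensitivity the last point above refers to.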
2.3 Read-write locks
- Many programs use single-writer, multiple-reader locks as well as simple locks.
- The authors continue to use the state transitions of Figure 4, but when the variable enters the Shared-Modified state, the checking is slightly different.
- That is, locks held purely in read mode are removed from the candidate set when a write occurs, because such locks, when held by a writer, do not protect against a data race between the writer and some other reader thread.
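The read-write refinement can be sketched as follows, assuming the tool tracks two sets per thread: all locks held in any mode, and the subset held in write mode. The function name `refine` and its parameters are hypothetical and chosen for illustration only.

```python
def refine(candidates, is_write, locks_held, write_locks_held):
    """Return the refined candidate set for one access in Shared-Modified.

    candidates       -- current candidate set C(v), a frozenset of lock names
    locks_held       -- locks the thread holds in any mode
    write_locks_held -- locks the thread holds in write mode
    """
    # A read is protected by any lock held; a write only by write-mode locks.
    protecting = write_locks_held if is_write else locks_held
    return candidates & frozenset(protecting)
```

With this rule, a write performed while holding a lock only in read mode removes that lock from the candidate set, so an empty result signals that no lock consistently protects the variable.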
3 Implementing Eraser
- Eraser is implemented for the DIGITAL Unix operating system on the Alpha processor, using the ATOM [Srivastava & Eustace 94] binary modification system.
- To maintain C(v), the candidate lock set of each shared variable v, Eraser instruments each load and store in the program.
- Eraser does not instrument loads and stores whose address mode is indirect off the stack pointer, since these are assumed to be stack references, and shared variables are assumed to be in global locations or in the heap.
- The report also includes the thread ID, memory address, type of memory access, and important register values such as the program counter and stack pointer.
- The authors have found that this information is usually sufficient for locating the source of the race.
3.1 Representing the candidate lock sets
- A naïve implementation of lock sets would store a list of candidate locks for each memory location, potentially consuming many times the allocated memory of the program.
- The authors can avoid this expense by exploiting the fortunate fact that the number of distinct sets of locks observed in practice is quite small.
- The entries in the table are never deallocated or modified, so each lockset index remains valid for the lifetime of the program.
- Eraser also caches the result of each intersection, so that the fast case for set intersection is simply a table lookup.
- All the standard memory allocation routines are instrumented to allocate and initialize a shadow word for each word allocated by the program.
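The idea of storing a small index per memory word instead of a full lock list can be sketched like this. The `LocksetTable` class and its methods are hypothetical, and Python `frozenset` values stand in for whatever concrete set representation Eraser uses; the sketch only shows append-only interning plus a cache of intersection results.

```python
class LocksetTable:
    """Append-only table of distinct lock sets, addressed by small indexes."""

    def __init__(self):
        self._sets = [frozenset()]       # index 0 is the empty set
        self._index = {frozenset(): 0}   # lock set -> index
        self._cache = {}                 # (index, index) -> index of intersection

    def intern(self, locks):
        """Return the index for this set of locks, adding it if unseen."""
        key = frozenset(locks)
        idx = self._index.get(key)
        if idx is None:
            idx = len(self._sets)
            self._sets.append(key)       # entries are never removed or changed
            self._index[key] = idx
        return idx

    def intersect(self, a, b):
        """Return the index of the intersection of sets a and b."""
        pair = (a, b) if a <= b else (b, a)
        hit = self._cache.get(pair)
        if hit is None:                  # slow path: compute once, then cache
            hit = self.intern(self._sets[a] & self._sets[b])
            self._cache[pair] = hit
        return hit

    def locks(self, idx):
        """Return the lock set stored at an index."""
        return self._sets[idx]
```

Since entries are only ever appended, an index stored in a shadow word stays valid for the whole run, and repeated intersections of the same two sets reduce to one dictionary lookup.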
3.2 Performance
- Performance was not a major goal in their implementation of Eraser; consequently it has many opportunities for optimization.
- The authors estimate that half of the slowdown is due to the overhead incurred by making a procedure call at every load and store instruction, which could be eliminated by using a version of ATOM that can inline monitoring code [Scales et al. 96].
- Also, there are many opportunities for using static analysis to reduce the overhead of the monitoring code, but the authors have not explored them.
- In spite of their limited performance tuning, the authors have found that Eraser is fast enough to debug most programs, and therefore meets the most essential performance criteria.
3.3 Program annotations
- As expected, their experience with Eraser showed that it can produce false alarms.
- Part of their research was aimed at finding effective annotations to suppress false alarms without accidentally losing useful warnings.
- Many programs implement free lists or private allocators, and Eraser has no way of knowing that a privately recycled piece of memory is protected by a new set of locks.
- True data races were found that did not affect the correctness of the program.
- Some of these were intentional and others were accidental.
3.4 Race detection in an OS kernel
- The authors have begun to modify Eraser to detect races in the SPIN operating system [Bershad et al. 95].
- While the authors do not yet have results in terms of data races found, they have acquired some useful experience about implementing such a tool at the kernel level, which is different from the user level in several ways.
- In most systems, raising the interrupt level to n ensures that only interrupts of priority greater than n will be serviced until the interrupt level is lowered.
- When the kernel sets the interrupt level to n, Eraser treats this operation as if the first n interrupt locks had all been acquired.
- The most common example is the use of semaphores to synchronize execution between a thread and an I/O device driver.
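The treatment of interrupt levels as locks can be sketched as below. The function names and the `irq_lock_<i>` naming scheme are hypothetical; the sketch only shows how an interrupt level n could be folded into the set of locks considered held at an access.

```python
def interrupt_locks(level):
    """Virtual locks implied by the current interrupt level (names are made up)."""
    return frozenset(f"irq_lock_{i}" for i in range(1, level + 1))


def effective_locks_held(ordinary_locks, interrupt_level):
    """Locks considered held at an access: real locks plus interrupt locks."""
    return frozenset(ordinary_locks) | interrupt_locks(interrupt_level)
```

Under this model, data touched only at interrupt level 2 or higher keeps `irq_lock_1` and `irq_lock_2` in its candidate set, so an access at a lower level would empty the set just as a missing ordinary lock would.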
4 Experience
- The authors calibrated Eraser on a number of simple programs that contained common synchronization errors (e.g. forgot to lock, used the wrong lock, etc.) and versions of those programs with the errors corrected.
- While programming these tests, the authors accidentally introduced a race, and encouragingly, Eraser detected it.
- It also produced false alarms, which the authors were able to suppress with annotations.
- The fact that Eraser worked well on the servers is evidence that experienced programmers tend to obey the simple locking discipline even in an environment that offers many more elaborate synchronization primitives.
- In the remainder of this section the authors report on the details of their experiences with each program.
4.2 Vesta cache server
- Vesta [Digital Equipment 96b] is an advanced software configuration management system.
- Configurations are written in a specialized functional language that describes the dependencies and rules used to derive the current state of the software.
- This is correct because other threads access the log entries with the log head lock held, and threads do not maintain pointers into the log.
- The authors eliminated the report of these races by moving the EraserReuse annotations to the three Flush routines.
- The cache server uses a main server thread to wait for incoming RPC requests.
4.3 Petal
- Petal is a distributed storage system that presents its clients with a huge virtual disk implemented by a cluster of servers and physical disks [Lee & Thekkath 96].
- Petal implements a distributed consensus algorithm as well as failure detection and recovery mechanisms.
- The authors found two races where global variables containing statistics were modified without locking.
- Finally, the authors found one false alarm that they were unable to annotate away.
- GmapCh Write2 implements a join-like construct to keep the stack frame active until the threads return.
4.4 Undergraduate coursework
- As a counterpoint to their experience with mature multithreaded server programs, two of their colleagues used Eraser to examine the kinds of synchronization errors found in the homework assignments produced by their undergraduate operating systems class [Choi & Lewis 97].
- The authors report their results here to demonstrate how Eraser functions with a less sophisticated code base.
- These assignments can be roughly categorized as low-level (build locks from test-and-set), thread-level (build a small threads package), synchronization-level (build semaphores and mutexes), and application-level (producer/consumer-style problems).
- Each assignment builds on the implementation of the previous assignment.
- These were caused by forgetting to take locks, taking locks during writes but not for reads, using different locks to protect the same data structure at different times, and forgetting to re-acquire locks that were released in a loop.
4.5 Effectiveness and Sensitivity
- Since Eraser uses a testing methodology, it cannot prove that a program is free from data races.
- But the authors believe that Eraser works well compared to manual testing and debugging, and that Eraser’s testing is not very sensitive to the scheduler interleaving.
- The authors consulted the program history of Ni2 and reintroduced two data races that had existed in previous versions.
- The first error was an unlocked access to a reference count used to garbage collect file data structures.
- These races had existed in the Ni2 source code for several months before they were manually found and fixed by the program author.
5 Additional experience
- This section touches on two further topics, each of which concerns a form of dynamic checking for synchronization errors in multi-threaded programs that the authors experimented with and believe is important and promising, but which they did not implement in Eraser.
- Using an earlier version of Eraser that detected race conditions in multi-threaded Modula-3 programs, the authors found that the Lockset algorithm reported false alarms for Trestle programs [Manasse & Nelson 91] that protected shared locations with multiple locks, because each of two readers could access the location while holding two different locks.
- This prevented the false alarms, but it is possible for this modification to cause false negatives.
- A few seconds into formsedit startup their experimental monitor detected a cycle of locks, showing that no partial order existed.
- But more work is required to catalog the sound and useful variations on the partial order discipline, and to develop annotations to suppress false alarms.
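A monitor for the partial order discipline can be sketched as a lock-order graph with a reachability check. This is an illustrative sketch, not the authors' experimental monitor: the `LockOrderMonitor` class and its `acquire` method are hypothetical, and the sketch handles none of the variations or annotations mentioned above.

```python
from collections import defaultdict


class LockOrderMonitor:
    """Records 'acquired while holding' edges and flags cycles among locks."""

    def __init__(self):
        # lock -> set of locks that were acquired while it was held
        self._after = defaultdict(set)

    def acquire(self, held, new_lock):
        """Record one acquisition; return True if it closes a cycle."""
        for h in held:
            self._after[h].add(new_lock)
        # A new cycle must lead from new_lock back to some lock already held.
        return self._reaches(new_lock, set(held))

    def _reaches(self, start, targets):
        """Depth-first search: can any lock in `targets` be reached from `start`?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node in targets:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self._after.get(node, ()))
        return False
```

A cycle in this graph means the observed acquisitions cannot be arranged into a partial order, which is the condition under which a deadlock becomes possible.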
6 Conclusion
- Hardware designers have learned to design for testability.
- Programmers using threads must learn the same.
- Programmers in the area of operating systems seem to view dynamic race detection tools as esoteric and impractical.
- As the use of multi-threading expands, so will the unreliability caused by data races, unless better methods are used to eliminate them.
- The authors believe that the Lockset method implemented in Eraser is promising.