There are more bugs in real-world programs than human programmers can realistically address. This paper evaluates two research questions: ``What fraction of bugs can be repaired automatically?'' and ``How much does it cost to repair a bug automatically?'' In previous work, we presented GenProg, which uses genetic programming to repair defects in off-the-shelf C programs. To answer these questions, we: (1) propose novel algorithmic improvements to GenProg that allow it to scale to large programs and find repairs 68% more often, (2) exploit GenProg's inherent parallelism using cloud computing resources to provide grounded, human-competitive cost measurements, and (3) generate a large, indicative benchmark set to use for systematic evaluations. We evaluate GenProg on 105 defects from 8 open-source programs totaling 5.1 million lines of code and involving 10,193 test cases. GenProg automatically repairs 55 of those 105 defects. To our knowledge, this evaluation is the largest available of its kind, and is often two orders of magnitude larger than previous work in terms of code or test suite size or defect count. Public cloud computing prices allow our 105 runs to be reproduced for $403; a successful repair completes in 96 minutes and costs $7.32, on average.

/pdf/a-systematic-study-of-automated-program-repair-fixing-55-out-4wah62ufu5.pdf

A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each

In this paper, we present the Habanero-Java (HJ) language developed at Rice University as an extension to the original Java-based definition of the X10 language. HJ includes a powerful set of task-parallel programming constructs that can be added as simple extensions to standard Java programs to take advantage of today's multi-core and heterogeneous architectures. The language puts a particular emphasis on the usability and safety of parallel constructs. For example, no HJ program using async, finish, isolated, and phaser constructs can create a logical deadlock cycle. In addition, the future and data-driven task variants of the async construct facilitate a functional approach to parallel programming. Finally, any HJ program written with async, finish, and phaser constructs that is data-race free is guaranteed to also be deterministic.HJ also features two key enhancements that address well known limitations in the use of Java in scientific computing --- the inclusion of complex numbers as a primitive data type, and the inclusion of array-views that support multidimensional views of one-dimensional arrays. The HJ compiler generates standard Java class-files that can run on any JVM for Java 5 or higher. The HJ runtime is responsible for orchestrating the creation, execution, and termination of HJ tasks, and features both work-sharing and work-stealing schedulers. HJ is used at Rice University as an introductory parallel programming language for second-year undergraduate students. A wide variety of benchmarks have been ported to HJ, including a full application that was originally written in Fortran 90. HJ has a rich development and runtime environment that includes integration with DrJava, the addition of a data race detection tool, and service as a target platform for the Intel Concurrent Collections coordination language

https://www.cs.rice.edu/~vs3/PDF/hj-pppj11.pdf

Habanero-Java: the new adventures of old X10

Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite that we used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite we find concurrent execution is often no better than serialized execution. We identify that the lack of control over resource allocation to kernels is a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware concurrency policies that offer significantly better performance and concurrency compared to the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil 2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.

/pdf/improving-gpgpu-concurrency-with-elastic-kernels-q66bb2uhco.pdf

Improving GPGPU concurrency with elastic kernels

During the evolution of a software system, a large number of bug reports are submitted. Locating the source code files that need to be fixed to resolve the bugs is a challenging problem. Thus, there is a need for a technique that can automatically figure out these buggy files. A number of bug localization solutions that take in a bug report and output a ranked list of files sorted based on their likelihood to be buggy have been proposed in the literature. However, the accuracy of these tools still need to be improved. In this paper, to address this need, we propose AmaLgam, a new method for locating relevant buggy files that puts together version history, similar reports, and structure. To do this, AmaLgam integrates a bug prediction technique used in Google which analyzes version history, with a bug localization technique named BugLocator which analyzes similar reports from bug report system, and the state-of-the-art bug localization technique BLUiR which considers structure. We perform a large-scale experiment on four open source projects, namely AspectJ, Eclipse, SWT and ZXing to localize more than 3,000 bugs. Compared with a history-aware bug localization solution of Sisman and Kak, our approach achieves a 46.1% improvement in terms of mean average precision (MAP). Compared with BugLocator, our approach achieves a 24.4% improvement in terms of MAP. Compared with BLUiR, our approach achieves a 16.4% improvement in terms of MAP.

/pdf/version-history-similar-report-and-structure-putting-them-25zx009g2z.pdf

Version history, similar report, and structure: putting them together for improved bug localization

Computer systems anticipated in the 2015 - 2020 timeframe are referred to as Extreme Scale because they will be built using massive multi-core processors with 100's of cores per chip. The largest capability Extreme Scale system is expected to deliver Exascale performance of the order of 10 18 operations per second. These systems pose new critical challenges for software in the areas of concurrency, energy eciency and resiliency. In this paper, we discuss the implications of the concurrency and energy eciency challenges on future software for Extreme Scale Systems. From an application viewpoint, the concurrency and energy challenges boil down to the ability to express and manage parallelism and locality by exploring a range of strong scaling and new-era weak scaling techniques. For expressing parallelism and locality, the key challenges are the ability to expose all of the intrinsic parallelism and locality in a programming model, while ensuring that this expression of parallelism and locality is portable across a range of systems. For managing parallelism and locality, the OS-related challenges include parallel scalability, spatial partitioning of OS and application functionality, direct hardware access for inter-processor communication, and asynchronous rather than interrupt-driven events, which are accompanied by runtime system challenges for scheduling, synchronization, memory management, communication, performance monitoring, and power management. We conclude by discussing the importance of software-hardware co- design in addressing the fundamental challenges for application enablement on Extreme Scale systems.

https://www.cs.rice.edu/~vs3/PDF/Sarkar-Harrod-Snavely-SciDAC-2009.pdf

Software Challenges in Extreme Scale Systems

Modern languages for shared-memory parallelism are moving from a bulk-synchronous Single Program Multiple Data (SPMD) execution model to lightweight Task Parallel execution models for improved productivity. This shift is intended to encourage programmers to express the ideal parallelism in an application at a fine granularity that is natural for the underlying domain, while delegating to the compiler and runtime system the job of extracting coarser-grained useful parallelism for a given target system. A simple and important example of this separation of concerns between ideal and useful parallelism can be found in chunking of parallel loops, where the programmer expresses ideal parallelism by declaring all iterations of a loop to be parallel and the implementation exploits useful parallelism by executing iterations of the loop in sequential chunks.Though chunking of parallel loops has been used as a standard transformation for several years, it poses some interesting challenges when the parallel loop may directly or indirectly (via procedure calls) perform synchronization operations such as barrier, signal or wait statements. In such cases, a straightforward transformation that attempts to execute a chunk of loops in sequence in a single thread may violate the semantics of the original parallel program. In this paper, we address the problem of chunking parallel loops that may contain synchronization operations. We present a transformation framework that uses a combination of transformations from past work (e.g., loop strip-mining, interchange, distribution, unswitching) to obtain an equivalent set of parallel loops that chunk together statements from multiple iterations while preserving the semantics of the original parallel program. These transformations result in reduced synchronization and scheduling overheads, thereby improving performance and scalability. Our experimental results for 11 benchmark programs on an UltraSPARC II multicore processor showed a geometric mean speedup of 0.52x for the unchunked case and 9.59x for automatic chunking using the techniques described in this paper. This wide gap underscores the importance of using these techniques in future compiler and runtime systems for programming models with lightweight parallelism.

Chunking parallel loops in the presence of synchronization

Task parallelism has increasingly become a trend with programming models such as OpenMP 3.0, Cilk, Java Concurrency, X10, Chapel and Habanero-Java (HJ) to address the requirements of multicore programmers. While task parallelism increases productivity by allowing the programmer to express multiple levels of parallelism, it can also lead to performance degradation due to increased overheads. In this article, we introduce a transformation framework for optimizing task-parallel programs with a focus on task creation and task termination operations. These operations can appear explicitly in constructs such as async, finish in X10 and HJ, task, taskwait in OpenMP 3.0, and spawn, sync in Cilk, or implicitly in composite code statements such as foreach and ateach loops in X10, forall and foreach loops in HJ, and parallel loop in OpenMP.Our framework includes a definition of data dependence in task-parallel programs, a happens-before analysis algorithm, and a range of program transformations for optimizing task parallelism. Broadly, our transformations cover three different but interrelated optimizations: (1) finish-elimination, (2) forall-coarsening, and (3) loop-chunking. Finish-elimination removes redundant task termination operations, forall-coarsening replaces expensive task creation and termination operations with more efficient synchronization operations, and loop-chunking extracts useful parallelism from ideal parallelism. All three optimizations are specified in an iterative transformation framework that applies a sequence of relevant transformations until a fixed point is reached. Further, we discuss the impact of exception semantics on the specified transformations, and extend them to handle task-parallel programs with precise exception semantics. Experimental results were obtained for a collection of task-parallel benchmarks on three multicore platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-core Intel Xeon SMP, and a quad-socket 32-core Power7 SMP. We have observed that the proposed optimizations interact with each other in a synergistic way, and result in an overall geometric average performance improvement between 6.28× and 10.30×, measured across all three platforms for the benchmarks studied.

A Transformation Framework for Optimizing Task-Parallel Programs

In this paper we present an automated technique for localizing faults in data-centric programs Data-centric programs primarily interact with databases to get collections of content, process each entry in the collection(s), and output another collection or write it back to the database One or more entries in the output may be faulty In our approach, we gather the execution trace of a faulty program We use a novel, precise slicing algorithm to break the trace into multiple slices, such that each slice maps to an entry in the output collection We then compute the semantic difference between the slices that correspond to correct entries and those that correspond to incorrect ones The "diff" helps to identify potentially faulty statementsWe have implemented our approach for ABAP programs ABAP is the language used to write custom code in SAP systems It interacts heavily with databases using embedded SQL-like commands that work on collections of data On a suite of 13 faulty ABAP programs, our technique was able to identify the precise fault location in 12 cases

Fault localization for data-centric programs

There has been a proliferation of task-parallel programming systems to address the requirements of multicore programmers. Current production task-parallel systems include Cilk++, Intel Threading Building Blocks, Java Concurrency, .Net Task Parallel Library, OpenMP 3.0, and current research task-parallel languages include Cilk, Chapel, Fortress, X10, and Habanero-Java (HJ). It is desirable for the programmer to express all the parallelism intrinsic to their algorithm in their code for forward scalability and portability, but the overhead incurred by doing so can be prohibitively large in today's systems. In this paper, we address the problem of reducing the total amount of overhead incurred by a program due to excessive task creation and termination. We introduce a transformation framework to optimize task-parallel programs with finish, forall and next statements. Our approach includes elimination of redundant task creation and termination operations as well as strength reduction of termination operations (finish) to lighter-weight synchronizations (next). Experimental results were obtained on three platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-way Intel Xeon SMP and a quad-socket 32-way Power7 SMP. The results showed maximum speedup of 66.7×, 11.25× and 23.1× respectively on each platform and 4.6×, 2.1× and 6.4×performance improvements respectively in geometric mean related to non-optimized parallel codes. The original benchmarks in this study were written with medium-grained parallelism; a larger relative improvement can be expected for programs written with finer-grained parallelism. However, even for the medium-grained parallel benchmarks studied in this paper, the significant improvement obtained by the transformation framework underscores the importance of the compiler optimizations introduced in this paper.

/pdf/reducing-task-creation-and-termination-overhead-in-18mewv451a.pdf

Reducing task creation and termination overhead in explicitly parallel programs

The X10 programming language is organized around the notion of places (an encapsulation of data and activities operating on the data), partitioned global address space (PGAS), and asynchronous computation and communication.This paper introduces an expressive subset of X10, Flat X10, designed to permit efficient execution across multiple single-threaded places with a simple runtime and without compromising on the productivity of X10. We present the design, implementation and evaluation of a compiler and runtime system for Flat X10. The Flat X10 compiler translates programs into C++ SPMD programs communicating using an active messaging infrastructure. It uses novel techniques to transform explicitly parallel programs into SPMD programs. The runtime system is based on IBM's LAPI (Low-level API) and is easily portable to other libraries such as GASNet and ARMCI.Our implementation realizes performance comparable to hand-written MPI programs for well-known HPC benchmarks such as Random Access, Stream, and FFT, on a Federation-based cluster of Power5 SMPs (with hundreds of processors) and the Blue Gene (with thousands of processors). Submissions based on the work presented in this paper were co-winners of the 2007 and 2008 HPC Challenge Type II Awards.

/pdf/efficient-portable-implementation-of-asynchronous-multi-h8jeybp85h.pdf

V. Krishna Nandivada

Papers

Chunking parallel loops in the presence of synchronization

A Transformation Framework for Optimizing Task-Parallel Programs

Fault localization for data-centric programs

Reducing task creation and termination overhead in explicitly parallel programs

Efficient, portable implementation of asynchronous multi-place programs