Author

Mark Gabel

Other affiliations: University of Texas at Dallas
Bio: Mark Gabel is an academic researcher from the University of California, Davis. He has contributed to research in topics including formal specification and source code, has an h-index of 9, and has co-authored 10 publications receiving 1,943 citations. Previous affiliations of Mark Gabel include the University of Texas at Dallas.

Papers
Proceedings ArticleDOI
02 Jun 2012
TL;DR: Conjectures that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, is likely to be repetitive and predictable.
Abstract: Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations — and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's built-in completion capability. We conclude the paper by laying out a vision for future research in this area.
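To make the modeling idea concrete, here is a minimal sketch, assuming code has already been tokenized, of a token-level trigram model that suggests the next code token. It is illustrative only, not the paper's Eclipse-integrated completion engine, and all names and the toy corpus are invented.

```python
# Minimal sketch of token-level n-gram code completion (illustrative only,
# not the paper's completion engine).
from collections import Counter, defaultdict

def train_trigram_model(token_lists):
    """Count how often each token follows a two-token context."""
    counts = defaultdict(Counter)
    for tokens in token_lists:
        padded = ["<s>", "<s>"] + tokens
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            counts[context][padded[i]] += 1
    return counts

def suggest(counts, context, k=3):
    """Return the k most frequent next tokens for a two-token context."""
    return [tok for tok, _ in counts[tuple(context)].most_common(k)]

# Toy corpus of tokenized Java-like statements.
corpus = [
    ["for", "(", "int", "i", "=", "0", ";"],
    ["for", "(", "int", "j", "=", "0", ";"],
]
model = train_trigram_model(corpus)
print(suggest(model, ["for", "("]))  # -> ['int']
```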

642 citations

Journal ArticleDOI
TL;DR: Investigates the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, is likely to be repetitive and predictable.
Abstract: Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether (a) code can be usefully modeled by statistical language models and (b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very regular, and, in fact, even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.

572 citations

Proceedings ArticleDOI
10 May 2008
TL;DR: This paper efficiently solves the tree similarity problem to create a scalable analysis that locates significantly more clones, which are often more semantically interesting than simple copied-and-pasted code fragments.
Abstract: Several techniques have been developed for identifying similar code fragments in programs. These similar fragments, referred to as code clones, can be used to identify redundant code, locate bugs, or gain insight into program design. Existing scalable approaches to clone detection are limited to finding program fragments that are similar only in their contiguous syntax. Other, semantics-based approaches are more resilient to differences in syntax, such as reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures. However, none of these techniques have scaled to real world code bases. These approaches capture semantic information from Program Dependence Graphs (PDGs), program representations that encode data and control dependencies between statements and predicates. Our definition of a code clone is also based on this representation: we consider program fragments with isomorphic PDGs to be clones. In this paper, we present the first scalable clone detection algorithm based on this definition of semantic clones. Our insight is the reduction of the difficult graph similarity problem to a simpler tree similarity problem by mapping carefully selected PDG subgraphs to their related structured syntax. We efficiently solve the tree similarity problem to create a scalable analysis. We have implemented this algorithm in a practical tool and performed evaluations on several million-line open source projects, including the Linux kernel. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.
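As a loose illustration of the reduction described above, comparing code fragments through their structured syntax rather than raw graphs, the sketch below fingerprints Python function subtrees by node-type shape and groups functions whose shapes match. It is a simplified stand-in, not the paper's PDG-based algorithm, and it ignores data and control dependences entirely.

```python
# Simplified stand-in: group functions whose syntax-tree "shape" matches,
# ignoring identifiers and constants (not the paper's PDG-based analysis).
import ast
from collections import defaultdict

def shape(node):
    """Serialize a subtree as nested node-type tuples."""
    return (type(node).__name__,
            tuple(shape(child) for child in ast.iter_child_nodes(node)))

def group_similar_functions(source):
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            groups[shape(node)].append(node.name)
    return [names for names in groups.values() if len(names) > 1]

code = """
def add_all(xs):
    total = 0
    for x in xs:
        total = total + x
    return total

def sum_list(values):
    acc = 0
    for v in values:
        acc = acc + v
    return acc
"""
print(group_similar_functions(code))  # -> [['add_all', 'sum_list']]
```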

343 citations

Proceedings ArticleDOI
07 Nov 2010
TL;DR: The first study of the uniqueness of source code is presented, examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of a 420-million-line corpus, thus providing a precise measure of 'uniqueness' called syntactic redundancy.
Abstract: This paper presents the results of the first study of the uniqueness of source code. We define the uniqueness of a unit of source code with respect to the entire body of written software, which we approximate with a corpus of 420 million lines of source code. Our high-level methodology consists of examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of this corpus, thus providing a precise measure of 'uniqueness' that we call syntactic redundancy. We parameterized our study over a variety of variables, the most important of which is the level of granularity at which we view source code. Our suite of experiments together consumed approximately four months of CPU time, providing quantitative answers to the following questions: at what levels of granularity is software unique, and at a given level of granularity, how unique is software? While we believe these questions to be of intrinsic interest, we discuss possible applications to genetic programming and developer productivity tools.
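For intuition only, the sketch below approximates syntactic redundancy at one fixed granularity: the fraction of a project's n-line windows that also appear verbatim in a reference corpus. The normalization and the in-memory corpus index are deliberately crude; this is not the study's actual pipeline, and the file names in the usage comment are hypothetical.

```python
# Crude sketch of "syntactic redundancy" at line-window granularity
# (illustrative; not the study's methodology or normalization).

def windows(lines, n):
    """All contiguous n-line windows, lightly normalized for whitespace."""
    norm = [" ".join(line.split()) for line in lines]
    return [tuple(norm[i:i + n]) for i in range(len(norm) - n + 1)]

def syntactic_redundancy(project_lines, corpus_lines, n=6):
    """Fraction of the project's n-line windows found verbatim in the corpus."""
    corpus_index = set(windows(corpus_lines, n))
    project_windows = windows(project_lines, n)
    if not project_windows:
        return 0.0
    hits = sum(1 for w in project_windows if w in corpus_index)
    return hits / len(project_windows)

# Hypothetical usage:
# redundancy = syntactic_redundancy(open("Project.java").readlines(),
#                                   open("corpus.txt").readlines(), n=6)
```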

220 citations

Proceedings ArticleDOI
09 Nov 2008
TL;DR: Presents Javert, a general specification mining framework that can learn, fully automatically, complex temporal properties from execution traces, such as the strict alternation of acquisition and release of locks, by composing instances of small generic patterns.
Abstract: Program specifications are important for many tasks during software design, development, and maintenance. Among these, temporal specifications are particularly useful. They express formal correctness requirements of an application's ordering of specific actions and events during execution, such as the strict alternation of acquisition and release of locks. Despite their importance, temporal specifications are often missing, incomplete, or described only informally. Many techniques have been proposed that mine such specifications from execution traces or program source code. However, existing techniques mine only simple patterns, or they mine a single complex pattern that is restricted to a particular set of manually selected events. There is no practical, automatic technique that can mine general temporal properties from execution traces. In this paper, we present Javert, the first general specification mining framework that can learn, fully automatically, complex temporal properties from execution traces. The key insight behind Javert is that real, complex specifications can be formed by composing instances of small generic patterns, such as the alternating pattern (ab)* and the resource usage pattern (ab*c)*. In particular, Javert learns simple generic patterns and composes them using sound rules to construct large, complex specifications. We have implemented the algorithm in a practical tool and conducted an extensive empirical evaluation on several open source software projects. Our results are promising; they show that Javert is scalable, general, and precise. It discovered many interesting, nontrivial specifications in real-world code that are beyond the reach of existing automatic techniques.
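To illustrate the kind of generic building block the abstract mentions, the sketch below checks whether a pair of events strictly alternates in a trace, i.e. satisfies the (ab)* pattern. It is a simplified stand-in, not Javert's miner or its composition rules; candidate pairs that pass such a check are the sort of instances that could then be composed into larger properties.

```python
# Minimal sketch: does the event pair (a, b) satisfy the alternating pattern
# (ab)* in a trace? Illustrative only; not Javert's mining algorithm.

def alternates(trace, a, b):
    """True if, restricted to events a and b, the trace matches (ab)*."""
    relevant = [e for e in trace if e in (a, b)]
    if len(relevant) % 2 != 0:
        return False
    return all(relevant[i] == a and relevant[i + 1] == b
               for i in range(0, len(relevant), 2))

trace = ["open", "lock", "read", "unlock", "lock", "write", "unlock", "close"]
print(alternates(trace, "lock", "unlock"))  # True: strict alternation holds
print(alternates(trace, "unlock", "lock"))  # False: order is reversed
```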

189 citations


Cited by
Journal ArticleDOI
TL;DR: Provides a qualitative comparison and evaluation of the current state-of-the-art in clone detection techniques and tools, together with a taxonomy of editing scenarios that produce different clone types, against which current clone detectors are qualitatively evaluated.

989 citations

Journal ArticleDOI
02 Jan 2019
TL;DR: Presents a neural model for representing snippets of code as continuous distributed vectors: each snippet is encoded as a single fixed-length code vector that can be used to predict semantic properties of the snippet, making it the first approach to successfully predict method names based on a large, cross-project corpus.
Abstract: We present a neural model for representing snippets of code as continuous distributed vectors ("code embeddings"). The main idea is to represent a code snippet as a single fixed-length code vector, which can be used to predict semantic properties of the snippet. To this end, code is first decomposed into a collection of paths in its abstract syntax tree. Then, the network learns the atomic representation of each path while simultaneously learning how to aggregate a set of them. We demonstrate the effectiveness of our approach by using it to predict a method's name from the vector representation of its body. We evaluate our approach by training a model on a dataset of 12M methods. We show that code vectors trained on this dataset can predict method names from files that were unobserved during training. Furthermore, we show that our model learns useful method name vectors that capture semantic similarities, combinations, and analogies. A comparison of our approach to previous techniques over the same dataset shows an improvement of more than 75%, making it the first to successfully predict method names based on a large, cross-project corpus. Our trained model, visualizations and vector similarities are available as an interactive online demo at http://code2vec.org. The code, data and trained models are available at https://github.com/tech-srl/code2vec.
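As a rough sketch of the aggregation step the abstract describes, the snippet below combines a bag of already-encoded path-context vectors into one fixed-length code vector using softmax attention. The dimensions and random vectors are placeholders; this is not the released code2vec implementation.

```python
# Rough sketch of attention-based aggregation of path-context vectors into a
# single code vector (illustrative; not the code2vec reference implementation).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # embedding size, arbitrary for the demo
path_contexts = rng.normal(size=(5, d))   # 5 encoded path-contexts of one snippet
attention_vector = rng.normal(size=d)     # learned jointly in the real model

scores = path_contexts @ attention_vector   # one relevance score per path-context
weights = np.exp(scores - scores.max())
weights /= weights.sum()                    # softmax attention weights
code_vector = weights @ path_contexts       # fixed-length vector for the snippet

print(code_vector.shape)  # (8,), one vector per snippet regardless of path count
```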

849 citations

Proceedings ArticleDOI
16 May 2009
TL;DR: Introduces a fully automated method for locating and repairing bugs in software that works on off-the-shelf legacy applications and does not require formal specifications, program annotations, or special coding practices.
Abstract: Automatic program repair has been a longstanding goal in software engineering, yet debugging remains a largely manual process. We introduce a fully automated method for locating and repairing bugs in software. The approach works on off-the-shelf legacy applications and does not require formal specifications, program annotations or special coding practices. Once a program fault is discovered, an extended form of genetic programming is used to evolve program variants until one is found that both retains required functionality and also avoids the defect in question. Standard test cases are used to exercise the fault and to encode program requirements. After a successful repair has been discovered, it is minimized using structural differencing algorithms and delta debugging. We describe the proposed method and report experimental results demonstrating that it can successfully repair ten different C programs totaling 63,000 lines in under 200 seconds, on average.
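A highly condensed sketch of the generate-and-validate loop the abstract describes: candidate patches are produced by mutating the program, scored by how many tests they pass, and the best candidates seed the next generation. Everything here, including the line-level mutation operator and the fitness interface, is a stand-in; the real system operates on C programs with fault localization, crossover, and patch minimization.

```python
# Condensed sketch of a generate-and-validate repair loop (illustrative only).
import random

def mutate(lines):
    """Toy mutation: delete, duplicate, or swap one statement (line)."""
    lines = list(lines)
    i = random.randrange(len(lines))
    op = random.choice(["delete", "duplicate", "swap"])
    if op == "delete" and len(lines) > 1:
        del lines[i]
    elif op == "duplicate":
        lines.insert(i, lines[i])
    else:
        j = random.randrange(len(lines))
        lines[i], lines[j] = lines[j], lines[i]
    return lines

def repair(program, fitness, generations=50, population=20):
    """Evolve program variants until one passes every test (fitness == 1.0)."""
    candidates = [program]
    for _ in range(generations):
        offspring = [mutate(random.choice(candidates)) for _ in range(population)]
        candidates = sorted(offspring, key=fitness, reverse=True)[:population // 2]
        candidates.append(program)
        best = max(candidates, key=fitness)
        if fitness(best) == 1.0:
            return best
    return None

# `fitness(variant)` would compile the variant, run the test suite, and return
# the fraction of passing tests.
```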

722 citations

Proceedings ArticleDOI
02 Jun 2012
TL;DR: Conjectures that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, is likely to be repetitive and predictable.
Abstract: Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations — and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's built-in completion capability. We conclude the paper by laying out a vision for future research in this area.

642 citations

Proceedings ArticleDOI
09 Jun 2014
TL;DR: Reduces the problem of code completion to the natural-language problem of predicting sentence probabilities: a simple and scalable static analysis extracts sequences of method calls from a large codebase and indexes them into a statistical language model, which is then used to rank candidate completions.
Abstract: We address the problem of synthesizing code completions for programs using APIs. Given a program with holes, we synthesize completions for holes with the most likely sequences of method calls. Our main idea is to reduce the problem of code completion to a natural-language processing problem of predicting probabilities of sentences. We design a simple and scalable static analysis that extracts sequences of method calls from a large codebase, and index these into a statistical language model. We then employ the language model to find the highest ranked sentences, and use them to synthesize a code completion. Our approach is able to synthesize sequences of calls across multiple objects together with their arguments. Experiments show that our approach is fast and effective. Virtually all computed completions typecheck, and the desired completion appears in the top 3 results in 90% of the cases.
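For flavor, a minimal sketch of the ranking idea: candidate call sequences for a hole are scored by the probability a bigram model over API-call sequences assigns to them. The extraction step is skipped and every sequence shown is invented; this is not the paper's system, which also synthesizes arguments and handles multiple objects.

```python
# Minimal sketch: rank candidate call sequences by probability under a bigram
# model trained on call sequences mined from a corpus (illustrative only).
from collections import Counter, defaultdict

def train_bigram(call_sequences):
    counts = defaultdict(Counter)
    for seq in call_sequences:
        for prev, nxt in zip(["<s>"] + seq, seq):
            counts[prev][nxt] += 1
    return counts

def score(counts, sequence):
    """Unsmoothed probability of a call sequence under the bigram model."""
    p = 1.0
    for prev, nxt in zip(["<s>"] + sequence, sequence):
        total = sum(counts[prev].values())
        p *= counts[prev][nxt] / total if total else 0.0
    return p

corpus = [
    ["FileReader.<init>", "BufferedReader.<init>", "readLine", "close"],
    ["FileReader.<init>", "BufferedReader.<init>", "readLine", "readLine", "close"],
]
model = train_bigram(corpus)
candidates = [
    ["FileReader.<init>", "BufferedReader.<init>", "readLine", "close"],
    ["FileReader.<init>", "close", "readLine"],
]
print(max(candidates, key=lambda c: score(model, c)))  # the first candidate wins
```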

611 citations