Learning Natural Coding Conventions
Miltiadis Allamanis†    Earl T. Barr‡    Christian Bird*    Charles Sutton†
†School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
‡Dept. of Computer Science, University College London, London, UK
*Microsoft Research, Redmond, WA, USA
{m.allamanis, csutton}@ed.ac.uk    e.barr@ucl.ac.uk    cbird@microsoft.com
ABSTRACT
Every programmer has a characteristic style, ranging from pref-
erences about identifier naming to preferences about object rela-
tionships and design patterns. Coding conventions define a consis-
tent syntactic style, fostering readability and hence maintainability.
When collaborating, programmers strive to obey a project’s coding
conventions. However, one third of reviews of changes contain
feedback about coding conventions, indicating that programmers
do not always follow them and that project members care deeply
about adherence. Unfortunately, programmers are often unaware of
coding conventions because inferring them requires a global view,
one that aggregates the many local decisions programmers make
and identifies emergent consensus on style. We present NATURAL-
IZE, a framework that learns the style of a codebase, and suggests
revisions to improve stylistic consistency. NATURALIZE builds on
recent work in applying statistical natural language processing to
source code. We apply NATURALIZE to suggest natural identifier
names and formatting conventions. We present four tools focused on
ensuring natural code during development and release management,
including code review. NATURALIZE achieves 94% accuracy in its top suggestions for identifier names. We used NATURALIZE to
generate 18 patches for 5 open source projects: 14 were accepted.
Categories and Subject Descriptors:
D.2.3 [Software Engineering]: Coding Tools and Techniques
General Terms: Algorithms
Keywords: Coding conventions, naturalness of software
1. INTRODUCTION
To program is to make a series of choices, ranging from design decisions, like how to decompose a problem into functions, to the choice of identifier names and how to format the code. While
local and syntactic, the latter are important: names connect program
source to its problem domain [13, 43, 44, 68]; formatting decisions
usually capture control flow [36]. Together, naming and formatting
decisions determine the readability of a program’s source code,
increasing a codebase’s portability, its accessibility to newcomers,
its reliability, and its maintainability [55, §1.1]. Apple’s recent,
infamous bug in its handling of SSL certificates [7, 40] exemplifies
the impact that formatting can have on reliability. Maintainability is
especially important since developers spend the majority (80%) of their time maintaining code [2, §6].
A convention is “an equilibrium that everyone expects in inter-
actions that have more than one equilibrium” [74]. For us, coding
conventions arise out of the collision of the stylistic choices of
programmers. A coding convention is a syntactic restriction not
imposed by a programming language’s grammar. Nonetheless, these
choices are important enough that they are enforced by software
teams. Indeed, our investigations indicate that developers enforce
such coding conventions rigorously, with roughly one third of code
reviews containing feedback about following them (subsection 4.1).
Like the rules of society at large, coding conventions fall into
two broad categories: laws, explicitly stated and enforced rules,
and mores, unspoken common practice that emerges spontaneously.
Mores pose a particular challenge: because they arise spontaneously
from emergent consensus, they are inherently difficult to codify into
a fixed set of rules, so rule-based formatters cannot enforce them,
and even programmers themselves have difficulty adhering to all
of the implicit mores of a codebase. Furthermore, popular code
changes constantly, and these changes necessarily embody stylistic
decisions, sometimes generating new conventions and sometimes
changing existing ones. To address this, we introduce the coding
convention inference problem, the problem of automatically learning
the coding conventions consistently used in a body of source code.
Conventions are pervasive in software, ranging from preferences
about identifier names to preferences about class layout, object
relationships, and design patterns. In this paper, we focus as a first
step on local, syntactic conventions, namely, identifier naming and
formatting. These are particularly active topics of concern among
developers, for example, almost one quarter of the code reviews
that we examined contained suggestions about naming.
We introduce NATURALIZE, a framework that solves the cod-
ing convention inference problem for local conventions, offering
suggestions to increase the stylistic consistency of a codebase. NAT-
URALIZE can also be applied to infer rules for existing rule-based
formatters. NATURALIZE is descriptive, not prescriptive¹: it learns
what programmers actually do. When a codebase does not reflect
consensus on a convention, NATURALIZE recommends nothing, be-
cause it has not learned anything with sufficient confidence to make
recommendations. The naturalness insight of Hindle et al. [35],
building on Gabel and Su [28], is that most short code utterances,
like natural language utterances, are simple and repetitive. Large
corpus statistical inference can discover and exploit this naturalness
to improve developer productivity and code robustness. We show
that coding conventions are natural in this sense.
¹ Prescriptivism is the attempt to specify rules for correct style in language, e.g., Strunk and White [67]. Modern linguists studiously avoid prescriptivist accounts, observing that many such rules are routinely violated by noted writers.

Learning from local context allows NATURALIZE to learn syntac-
tic restrictions, or sub-grammars, on identifier names like camelcase
or underscore, and to unify names used in similar contexts, which
rule-based code formatters simply cannot do. Intuitively, NATU-
RALIZE works by identifying identifier names or formatting choices
that are surprising according to a probability distribution over code
text. When surprised, NATURALIZE determines if it is sufficiently
confident to suggest a renaming or reformatting that is less surpris-
ing; it unifies the surprising choice with one that is preferred in
similar contexts elsewhere in its training set. NATURALIZE is not
automatic; it assists a developer, since its suggestions, both renam-
ing and even formatting, as in Python or Apple’s aforementioned
SSL bug [7, 40], are potentially semantically disruptive and must
be considered and approved. NATURALIZE's suggestions enable a
range of new tools to improve developer productivity and code qual-
ity: 1) A pre-commit script that rejects commits that excessively
disrupt a codebase’s conventions; 2) A tool that converts the inferred
conventions into rules for use by a code formatter; 3) An Eclipse
plugin that a developer can use to check whether her changes are
unconventional; and 4) A style profiler that highlights the stylistic
inconsistencies of a code snippet for a code reviewer.
NATURALIZE draws upon a rich body of tools from statistical
natural language processing (NLP), but applies these techniques to
a different kind of problem. NLP focuses on understanding and
generating language, but does not ordinarily consider the problem
of improving existing text. The closest analog is spelling correction,
but that problem is easier because we have strong prior knowledge
about common types of spelling mistakes. An important conceptual
dimension of our suggestion problems also sets our work apart from
mainstream NLP. In code, rare names often usefully signify unusual
functionality, and need to be preserved. We call this the sympathetic
uniqueness principle (SUP): unusual names should be preserved
when they appear in unusual contexts. We achieve this by exploiting
a special token UNK that is often used to represent rare words that
do not appear in the training set. Our method incorporates SUP
through a clean, straightforward modification to the handling of
UNK. Because of the Zipfian nature of language, UNK appears
in unusual contexts and identifies unusual tokens that should be
preserved. Section 4 demonstrates the effectiveness of this method at
preserving such names. Additionally, handling formatting requires
a simple, but novel, method of encoding formatting.
As NATURALIZE detects identifiers that violate code conventions
and assists in renaming, the most common refactoring [50], it is the
first tool we are aware of that uses NLP techniques to aid refactoring.
The techniques that underlie NATURALIZE are language indepen-
dent and require only identifying identifiers, keywords, and opera-
tors, a much easier task than specifying grammatical structure. Thus,
NATURALIZE is well-positioned to be useful for domain-specific or esoteric languages for which no convention-enforcing tools exist, or for the increasing number of multi-language software projects, such as web applications that intermix Java, CSS, HTML, and JavaScript.
To the best of the authors’ knowledge, this work is the first to
address the coding convention inference problem, to suggest names
and formatting to increase the stylistic coherence of code, and to
provide tooling to that end. Our contributions are:
• We built NATURALIZE, the first framework to solve the coding convention inference problem for local conventions, including identifier naming and formatting; it suggests changes to increase a codebase's adherence to its own conventions;
• We offer four tools, built on NATURALIZE, all focused on release management, an under-tooled phase of the development process. NATURALIZE 1) achieves 94% accuracy in its top suggestions for identifier names and 2) never drops below a mean accuracy of 96% when making formatting suggestions; and
• We demonstrate that coding conventions are important to software teams, by showing that 1) empirically, programmers enforce conventions heavily through code review feedback and corrective commits, and 2) patches based on NATURALIZE suggestions have been incorporated into 5 of the most popular open source Java projects on GitHub: of the 18 patches
that we submitted, 14 were accepted.
Tools are available at groups.inf.ed.ac.uk/naturalize.
2. MOTIVATING EXAMPLE
Both industrial and open source developers often submit their
code for review prior to check-in [61]. Consider the example of
the class shown in Figure 1, which is part of a change submitted for
review by a Microsoft developer on February 17th, 2014. While
there is nothing functionally wrong with the class, it violates the
coding conventions of the team. A second developer reviewed the
change and suggested that res and str do not convey parameter meaning well enough, and that the constructor line is much too long and should be wrapped. In the checked-in change, all of these were addressed, with the parameter names changed to queryResults and queryStrings.
Consider a scenario in which the author had access to NATU-
RALIZE. The author might highlight the parameter names and ask
NATURALIZE to evaluate them. At that point it would not only have identified res and str as names that are inconsistent with the
naming conventions of parameters in the codebase, but would also
have suggested better names. The author may have also thought
to himself “Is the constructor on line 3 too long?” or “Should the
empty constructor body be on its own line and should it have a space inside?” Here again, NATURALIZE would have provided immediate, valuable answers based on the conventions of the team.
NATURALIZE would indicate that the call to the base constructor
should be moved to the next line and indented to be consonant with
team conventions and that in this codebase empty method bodies do
not need their own lines. Furthermore, it would indicate that some
empty methods contain one space between the braces while others
do not, so there is no implicit convention to follow. After querying
NATURALIZE about his stylistic choices, the author can then be
confident that his change is consistent with the norms of the team
and is more likely to be approved during review. Furthermore, by
leveraging NATURALIZE, fellow project members wouldn’t need to
be bothered by questions about conventions, nor would they need
to provide feedback about conventions during review. We have
observed that such scenarios occur in open source projects as well.
2.1 Use Cases and Tools
Coding conventions are critical during release management, which
comprises committing, reviewing, and promoting (including re-
leases) changes, either patches or branches. This is when a coder’s
idiosyncratic style, isolated in her editor during code composition,
comes into contact with the styles of others. The outcome of this
interaction strongly impacts the readability, and therefore the main-
tainability, of a codebase. Compared to other phases of the develop-
ment cycle like editing, debugging, project management, and issue
tracking, release management is under-tooled. Code conventions are
particularly pertinent here, and lead us to target three use cases: 1) a
developer preparing an individual commit or branch for review or
promotion; 2) a release engineer trying to filter out needless stylistic
diversity from the flood of changes; and 3) a reviewer wishing to
consider how well a patch or branch obeys community norms.
Any code modification has a possibility of introducing bugs [3,
51]. This is certainly true of a system, like NATURALIZE, that is
based on statistical inference, even when (as we always assume) all
of NATURALIZE's suggestions are approved by a human. Because of

1 public class ExecutionQueryResponse : ExecutionQueryResponseBasic<QueryResults>
2 {
3 public ExecutionQueryResponse(QueryResults res, IReadOnlyCollection<string> str, ExecutionStepMetrics metrics) : base(res, str, metrics) { }
4 }
Figure 1: A C# class added by a Microsoft developer that was modified due to requests by a reviewer before it was checked in.
Figure 2: A screenshot of the devstyle Eclipse plugin. The user has requested suggestions for alternate names of each argument.
this risk, the gain from making a change must be worth its cost. For
this reason, our use cases focus on times when the code is already
being changed. To support our use cases, we have built four tools:
devstyle: A plugin for the Eclipse IDE that gives suggestions for identifier renaming and formatting, both for a single identifier or format point and for the identifiers and formatting in a selection of code.
styleprofile: A code review assistant that produces a profile summarizing the adherence of a code snippet to the coding conventions of a codebase, and suggests renaming and formatting changes to make that snippet more stylistically consistent with a project.
genrule: A rule generator for Eclipse's code formatter that generates rules for those conventions that NATURALIZE has inferred from a codebase.
stylish?: A high-precision pre-commit script for Git that rejects commits that have highly inconsistent and unnatural naming or formatting within a project.
The devstyle plugin offers two types of suggestions: single point suggestion under the mouse pointer and multiple point suggestion via right-clicking a selection. A screenshot from devstyle is shown in Figure 2. For single point suggestions, devstyle displays a ranked list of alternatives to the selected name or format. If devstyle has no suggestions, it simply flashes the current name or selection. If the user wishes, she selects one of the suggestions. If it is an identifier renaming, devstyle renames all uses, within scope, of that identifier under its previous name. This scope traversal is possible because our use cases assume an existing and compiled codebase. Formatting changes occur at the suggestion point. Multiple point suggestion returns a style profile, a ranked list of the top k most stylistically surprising naming or formatting choices in the current selection that could benefit from reconsideration. By default, k = 5, based on HCI considerations [23, 48]. To accept a suggestion here, the user must first select a location to modify, then select from among its top alternatives. The styleprofile tool outputs a style profile. genrule (subsection 3.5) generates settings for the Eclipse code formatter. Finally, stylish? is a filter that uses the Eclipse code formatter with the settings from genrule to accept or reject a commit based on its style profile.
NATURALIZE uses an existing codebase, called a training corpus,
as a reference from which to learn conventions. Commonly, the train-
ing corpus will be the current codebase, so that NATURALIZE learns
domain-specific conventions related to the current project. Alterna-
tively, NATURALIZE comes with a pre-packaged suggestion model,
trained on a corpus of popular, vibrant projects that presumably
embody good coding conventions. Developers can use this engine
if they wish to increase their codebase’s adherence to a larger com-
munity’s consensus on best practice. Projects that are just starting
and have little or no code written can also use as the training corpus
a pre-existing codebase, for example another project in the same
organization, whose conventions the developers wish to adopt. Here,
again, we avoid normative comparison of coding conventions, and
do not force the user to specify their desired conventions explicitly.
Instead, the user specifies a training corpus, and this is used as an im-
plicit source of desired conventions. The NATURALIZE framework
and tools are available at groups.inf.ed.ac.uk/naturalize.
3. THE NATURALIZE FRAMEWORK
In this section, we introduce the generic architecture of NATU-
RALIZE, which can be applied to a wide variety of different types of
conventions and is language independent. NATURALIZE is general
and can be applied to any language for which a lexer and a parser
exist, as token sequences and abstract syntax trees (ASTs) are used
during analysis. Figure 3 illustrates its architecture. The input is
a code snippet to be naturalized. This snippet is selected based on
the user input, in a way that depends on the particular tool in ques-
tion. For example, in devstyle, if a user selects a local variable
for renaming, the input snippet would contain all AST nodes that
reference that variable (subsection 3.3). The output of NATURALIZE
is a short list of suggestions, which can be filtered, then presented
to the programmer. In general, a suggestion is a set of snippets that
may replace the input snippet. The list is ranked by a naturalness
score that is defined below. Alternatively, the system can return a binary value indicating whether the code is natural, so as to support applications such as stylish?. The system makes no suggestion if
it deems the input snippet to be sufficiently natural, or is unable to
find good alternatives. This reduces the “Clippy effect” where users
ignore a system that makes too many bad suggestions². In the next
section, we describe each element in the architecture in more detail.
Terminology. A language model (LM) is a probability distribution over strings. Given any string x = x_0 x_1 ... x_M, where each x_i is a token, an LM assigns a probability P(x). Let G be the grammar of a programming language. We use x to denote a snippet, that is, a string x such that αxβ ∈ L(G) for some strings α, β. We primarily consider snippets that are dominated by a single node in the file's AST. That is, there is a node within the AST whose subtree comprises the entire snippet and nothing else. We use x to denote the input snippet to the framework, and y, z to denote arbitrary snippets³.
3.1 The Core of NATURALIZE
The architecture contains two main elements: proposers and the
scoring function. The proposers modify the input code snippet to
produce a list of suggestion candidates that can replace the input
snippet. In the example from Figure 1, each candidate replaces all
occurrences of
res
with a different name used in similar contexts
elsewhere in the project, such as
results
or
queryResults
. In
principle, many implausible suggestions could ensue, so, in practice,
proposers contain filtering logic.
A scoring function sorts these candidates according to a measure
of naturalness. Its input is a candidate snippet, and it returns a
real number measuring naturalness. Naturalness is measured with
respect to a training corpus that is provided to NATURALIZE, thus
allowing us to follow our guiding principle that naturalness must
be measured with respect to a particular codebase. For example,
² In extreme cases, such systems can be so widely mocked that they are publicly disabled by the company's CEO in front of a cheering audience: http://bit.ly/pmHCwI.
³ The application of NATURALIZE to academic papers in software engineering is left to future work.

[Figure 3 diagram: Code for Review and a Training Corpus (other code from the project) feed the Proposers (rename identifiers, add formatting), which produce Candidates; a Scoring Function (n-gram language model, SVM) ranks them into Top Suggestions. The example snippet shown flowing through the pipeline:]

public void testRunReturnsResult() {
    PrintStream oldOut = System.out;
    System.setOut(new PrintStream(new OutputStream() {
        @Override
        public void write(int arg0) throws IOException {
        }
    }));
    try {
        TestResult result = junit.textui.TestRunner.run(new TestSuite());
        assertTrue(result.wasSuccessful());
    } finally {
        System.setOut(oldOut);
    }
}
Figure 3: The architecture of NATURALIZE: a framework for learning coding conventions. A contiguous snippet of code is selected
for review through the user interface. A set of proposers returns a set of candidates, which are modified versions of the snippet, e.g.,
with one local variable renamed. The candidates are ranked by a scoring function, such as an n-gram language model, which returns
a small list of top suggestions to the interface, sorted by naturalness.
the training corpus might be the set of source files A from the current application. A powerful way to measure the naturalness of a snippet is provided by statistical language modeling. We use P_A(y) to indicate the probability that the language model P, which has been trained on the corpus A, assigns to the string y. The key intuition is that an LM P_A is trained so that it assigns high probability to strings in the training corpus, i.e., snippets with higher log probability are more like the training corpus, and presumably more natural. There are several key reasons why statistical language models are a powerful approach for modeling coding conventions. First, probability distributions provide an easy way to represent soft constraints about conventions. This allows us to avoid many of the pitfalls of inflexible, rule-based approaches. Second, because they are based on a learning approach, LMs can flexibly adapt to the conventions in a new project. Intuitively, because P_A assigns high probability to strings t ∈ A that occur in the training corpus, it also assigns high probability to strings that are similar to those in the corpus. So the scoring function s tends to favor snippets that are stylistically consistent with the training corpus.
We score the naturalness of a snippet y = y_{1:N} as

    s(y, P_A) = \frac{1}{N} \log P_A(y);    (1)

that is, we deem snippets that are more probable under the LM as more natural in the application A. Equation 1 is cross-entropy multiplied by -1 to make s a score, where s(x) > s(y) implies x is more natural than y. Where it creates no confusion, we write s(y), eliding the second argument. When choosing between competing candidate snippets y and z, we need to know not only which candidate the LM prefers, but how “confident” it is. We measure this by a gap function g, which is the difference in scores: g(y, z, P) = s(y, P) - s(z, P). Because s is essentially a log probability, g is the log ratio of probabilities between y and z. For example, when g(y, z) > 0 the snippet y is more natural, i.e., less surprising according to the LM, and thus is a better suggestion candidate than z. If g(y, z) = 0 then both snippets are equally natural.
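As a quick worked reading of Equation 1 and the gap function, with illustrative numbers rather than values from the paper: take two equal-length candidate snippets with

    N = 4, \qquad \log P_A(y) = -8, \qquad \log P_A(z) = -12,

so that

    s(y) = \tfrac{1}{4}(-8) = -2, \qquad s(z) = \tfrac{1}{4}(-12) = -3, \qquad g(y, z, P_A) = s(y) - s(z) = 1 > 0,

and the LM prefers y: it is less surprising, and would be ranked above z as a suggestion.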
Now we define the function suggest(x, C, k, t) that returns the top candidates according to the scoring function. This function returns a list of top candidates, or the empty list if no candidates are sufficiently natural. The function takes four parameters: the input snippet x, the list C = (c_1, c_2, ..., c_r) of candidate snippets, and two thresholds: k ∈ N, the maximum number of suggestions to return, and t ∈ R, a minimum confidence value. The parameter k controls the size of the ranked list that is returned to the user, while t controls the suggestion frequency, that is, how confident NATURALIZE needs to be before it presents any suggestions to the user. Appropriately setting t allows NATURALIZE to avoid the Clippy effect by making no suggestion rather than a low quality one. Below, we present an automated method for selecting t.
The suggest function first sorts C = (c_1, c_2, ..., c_r), the candidate list, according to s, so s(c_1) ≥ s(c_2) ≥ ... ≥ s(c_r). Then, it trims the list to avoid overburdening the user: it truncates C to include only the top k elements, so that length(C) = min{k, r}, and removes candidates c_i ∈ C that are not sufficiently more natural than the original snippet; formally, it removes all c_i from C where g(c_i, x) < t. Finally, if the original input snippet x is the highest ranked in C, i.e., if c_1 = x, suggest ignores the other suggestions, sets C = ∅ to decline to make a suggestion, and returns C.
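As a concrete, hedged sketch of this procedure (not the paper's implementation), the following Java rendering assumes an abstract scoring function s, e.g., backed by a trained n-gram LM; candidate snippets are represented as plain strings for brevity.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch of suggest(x, C, k, t): sort by naturalness, trim to
// the top k, drop candidates whose gap over the input is below t, and
// decline to suggest if the input itself ranks highest.
final class Suggest {
    static List<String> suggest(String x, List<String> candidates,
                                int k, double t, ToDoubleFunction<String> s) {
        List<String> c = new ArrayList<>(candidates);
        // Sort candidates by score, most natural first.
        c.sort(Comparator.comparingDouble(s).reversed());
        // Truncate to the top k elements: length(C) = min{k, r}.
        if (c.size() > k) c = new ArrayList<>(c.subList(0, k));
        // Remove candidates c_i with g(c_i, x) = s(c_i) - s(x) < t.
        double sx = s.applyAsDouble(x);
        c.removeIf(ci -> s.applyAsDouble(ci) - sx < t);
        // If the original snippet is still the top candidate, suggest nothing.
        if (!c.isEmpty() && c.get(0).equals(x)) c.clear();
        return c;
    }
}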
Binary Decision. If an accept/reject decision on the input x is required, e.g., as in stylish?, NATURALIZE must collectively consider all of the locations in x at which it could make suggestions. We propose a score function for this binary decision that measures the quality of the best possible improvement that NATURALIZE is able to make. Formally, let L be the set of locations in x at which NATURALIZE is able to make suggestions, and for each ℓ ∈ L, let C_ℓ be the system's set of suggestions at ℓ. In general, C_ℓ contains name or formatting suggestions. Recall that P is the language model. We define the score

    G(x, P) = \max_{\ell \in L} \max_{c \in C_\ell} g(c, x).    (2)

If G(x, P) > T, then NATURALIZE rejects the snippet as being excessively unnatural. The threshold T controls the sensitivity of NATURALIZE to unnatural names and formatting. As T increases, fewer input snippets will be rejected, so some unnatural snippets will slip through, but as compensation the test is less likely to reject snippets that are in fact well-written.
Setting the Confidence Threshold. The thresholds t in the suggest function and T in the binary decision function are on log probabilities of strings, which can be difficult for users to interpret. Fortunately, these can be set automatically using the false positive rate (FPR), i.e., the proportion of snippets x that in fact follow convention but that the system erroneously rejects. We would like the FPR to be as small as possible, but, unless we wish the system to make no suggestions at all, we must accept some false positives. So we set a maximum acceptable FPR α, and search for a threshold t or T that ensures that NATURALIZE's FPR is at most α. The principle is similar to statistical hypothesis testing. To make this work, we estimate the FPR for a given t or T. To do so, we select a random set of snippets from the training corpus, e.g., random method bodies, and compute the proportion of these snippets that are rejected using T. Again leveraging our assumption that our training corpus contains natural code, this proportion estimates the FPR. We use a grid search [11] to find a threshold T (respectively t) whose estimated FPR stays within α, the user-specified acceptable FPR bound.
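A hedged sketch of this calibration, continuing the illustrative Java style above; the grid range, step, and names are assumptions, and G stands in for the best-gap score of Equation 2.

import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative calibration: choose the binary-decision threshold T by
// estimating the FPR on snippets sampled from the training corpus, which
// we assume (as the paper does) to consist of natural, conventional code.
final class ThresholdCalibration {
    static double calibrate(List<String> naturalSnippets,
                            ToDoubleFunction<String> G, // best-gap score, Eq. 2
                            double alpha) {             // acceptable FPR bound
        // Grid search, most sensitive threshold first: the smallest T whose
        // estimated FPR is at most alpha rejects the most unnatural snippets
        // while honoring the bound.
        for (double T = 0.0; T <= 10.0; T += 0.1) {
            final double threshold = T;
            long rejected = naturalSnippets.stream()
                    .filter(x -> G.applyAsDouble(x) > threshold)
                    .count();
            double fpr = (double) rejected / naturalSnippets.size();
            if (fpr <= alpha) return threshold;
        }
        return Double.POSITIVE_INFINITY; // no threshold met the bound: never reject
    }
}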
3.2 Choices of Scoring Function
The generic framework described in subsection 3.1 can, in princi-
ple, employ a wide variety of machine learning or NLP methods for

its scoring function. Indeed, a large portion of the statistical NLP
literature focuses on probability distributions over text, including
language models, probabilistic grammars, and topic models. Very
few of these models have been applied to code; exceptions include
[4, 35, 46, 49, 53]. We choose to build on statistical language mod-
els, because previous work of Hindle et al. [35] has shown that they are particularly able to capture the naturalness of code.
The intuition behind language modeling is that, since there is an infinite number of possible strings, we obviously cannot store a probability value for every one. Different LMs make different simplifying assumptions to make the modeling tractable, and these assumptions determine the types of coding conventions that we are able to infer. One of the most effective practical LMs is the n-gram language model. N-gram models make the assumption that the next token can be predicted using only the previous n - 1 tokens. Formally, the probability of a token y_m, conditioned on all of the previous tokens y_1 ... y_{m-1}, is a function only of the previous n - 1 tokens. Under this assumption, we can write

    P(y_1 ... y_M) = \prod_{m=1}^{M} P(y_m \mid y_{m-1} ... y_{m-n+1}).    (3)
To use this equation we need to know the conditional probabilities P(y_m | y_{m-1} ... y_{m-n+1}) for each possible n-gram. This is a table of V^n numbers, where V is the number of possible lexemes. These are the parameters of the model that we learn from the training corpus. The simplest way to estimate the model parameters is to set P(y_m | y_{m-1} ... y_{m-n+1}) to the proportion of times that y_m follows y_{m-1} ... y_{m-n+1}. In practice, this simple estimator does not work well, because it assigns zero probability to n-grams that do not occur in the training corpus. Instead, n-gram models are trained using smoothing methods [22]. In our work, we use Katz smoothing.
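To make the estimator concrete, here is a minimal, hypothetical maximum-likelihood trigram estimator (n = 3) over a token stream; a production model would smooth these counts (the paper uses Katz smoothing) rather than return zero for unseen trigrams.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal maximum-likelihood trigram model, for illustration only. Real
// models smooth these counts (e.g., Katz back-off) so unseen trigrams do
// not receive probability zero.
final class TrigramModel {
    private final Map<String, Integer> trigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    void train(List<String> tokens) {
        for (int i = 2; i < tokens.size(); i++) {
            String context = tokens.get(i - 2) + " " + tokens.get(i - 1);
            bigramCounts.merge(context, 1, Integer::sum);
            trigramCounts.merge(context + " " + tokens.get(i), 1, Integer::sum);
        }
    }

    // P(token | two previous tokens) as a relative frequency.
    double prob(String prev2, String prev1, String token) {
        String context = prev2 + " " + prev1;
        int contextCount = bigramCounts.getOrDefault(context, 0);
        if (contextCount == 0) return 0.0; // the zero-probability problem
        return (double) trigramCounts.getOrDefault(context + " " + token, 0)
                / contextCount;
    }
}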
Implementation. When an n-gram model is used, we can compute the gap function g(y, z) very efficiently. This is because, when g is used within suggest, ordinarily the strings y and z will be similar, i.e., the input snippet and a candidate revision. The key insight is that in an n-gram model, the probability P(y) of a snippet y = (y_1 y_2 ... y_N) depends only on the multiset of n-grams that occur in y, that is,

    NG(y) = \{ y_i y_{i+1} ... y_{i+n-1} \mid 0 \le i \le N - (n - 1) \}.    (4)

An equivalent way to write an n-gram model is

    P(y) = \prod_{a_1 a_2 ... a_n \in NG(y)} P(a_n \mid a_1, a_2, ... a_{n-1}).    (5)

Since the gap function is g(y, z) = log[P(y)/P(z)], any n-grams that are members both of NG(y) and NG(z) cancel, so to compute g, we only need to consider those n-grams not in NG(y) ∩ NG(z). Intuitively, this means that, to compute the gap function g(y, z), we need to examine the n-grams around the locations where the snippets y and z differ. This is a very useful optimization if y and z are long snippets that differ in only a few locations.
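Continuing the illustrative trigram sketch above, the cancellation can be implemented by scoring only the n-grams whose multiset counts differ between y and z. This is an assumption-laden sketch, not the paper's implementation; note that with unsmoothed counts an unseen trigram contributes log 0, which smoothing avoids.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: g(y, z) = log P(y) - log P(z), computed from only
// the trigrams whose counts differ between y and z (shared ones cancel).
final class GapFunction {
    static double gap(List<String> y, List<String> z, TrigramModel lm) {
        Map<String, Integer> diff = new HashMap<>();
        collect(y, diff, +1);
        collect(z, diff, -1);
        double g = 0.0;
        for (Map.Entry<String, Integer> e : diff.entrySet()) {
            if (e.getValue() == 0) continue; // trigram shared by y and z
            String[] t = e.getKey().split(" ");
            // A smoothed model would guarantee prob(...) > 0 here.
            g += e.getValue() * Math.log(lm.prob(t[0], t[1], t[2]));
        }
        return g;
    }

    private static void collect(List<String> tokens, Map<String, Integer> diff, int sign) {
        for (int i = 2; i < tokens.size(); i++) {
            String ngram = tokens.get(i - 2) + " " + tokens.get(i - 1) + " " + tokens.get(i);
            diff.merge(ngram, sign, Integer::sum);
        }
    }
}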
When training an LM, we take measures to deal with rare lexemes, since, by definition, we do not have much data about them. We use a preprocessing step, a common strategy in language modeling, that builds a vocabulary with all the identifiers that appear more than once in the training corpus. Let count(v, b) return the number of appearances of token v in the codebase b. Then, if a token has count(v, b) ≤ 1, we convert it to a special token, which we denote UNK. Then we train the n-gram model as usual. The effect is that the UNK token becomes a catchall that means the model expects to see a rare token, even though it cannot be sure which one.
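A one-screen sketch of this preprocessing step, under the same illustrative Java conventions as the sketches above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative UNK preprocessing: tokens with count(v, b) <= 1 in the
// training corpus are replaced by the catchall UNK token before training.
final class UnkPreprocessor {
    static List<String> applyUnk(List<String> corpusTokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : corpusTokens) counts.merge(t, 1, Integer::sum);
        List<String> out = new ArrayList<>(corpusTokens.size());
        for (String t : corpusTokens) out.add(counts.get(t) <= 1 ? "UNK" : t);
        return out;
    }
}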
3.3 Suggesting Natural Names
In this section, we instantiate the core NATURALIZE framework
for the task of suggesting natural identifier names. We start by
describing the single suggestion setting. For concreteness, imagine a user of the devstyle plugin, who selects an identifier and asks devstyle for its top suggestions. It should be easy to see how this discussion can be generalized to the other use cases described in subsection 2.1. Let v be the lexeme selected by the programmer. This lexeme could denote a variable, a method call, or a type.
When a programmer binds a name to an identifier and then uses it, she implicitly links together all the locations in which that name appears. Let L_v denote this set of locations, that is, the set of locations in the current scope in which the lexeme v is used. For example, if v denotes a local variable, then L_v would be the set of locations in which that local is used. Now, the input snippet is constructed by finding a snippet that subsumes all of the locations in L_v. Specifically, the input snippet is constructed by taking the lowest common ancestor in the AST of the nodes in L_v.
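For illustration, a minimal lowest-common-ancestor computation over a hypothetical AST node type with a parent pointer; real ASTs, e.g. Eclipse JDT's, expose an equivalent getParent().

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical minimal AST node: only the parent link matters here.
final class Node {
    Node parent; // null at the root
}

final class LowestCommonAncestor {
    // LCA of two nodes: collect one node's ancestor chain, then walk up
    // from the other until the chains meet.
    static Node lca(Node a, Node b) {
        Set<Node> ancestors = new HashSet<>();
        for (Node n = a; n != null; n = n.parent) ancestors.add(n);
        for (Node n = b; n != null; n = n.parent) {
            if (ancestors.contains(n)) return n;
        }
        return null; // nodes are in different trees
    }

    // LCA of all nodes in L_v: the subtree rooted there is the input snippet.
    static Node lcaAll(List<Node> nodes) {
        Node result = nodes.get(0);
        for (Node n : nodes.subList(1, nodes.size())) result = lca(result, n);
        return result;
    }
}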
The proposers for this task retrieve a set of alternative names to v, which we denote A_v, by retrieving other names that have occurred in the same contexts in the training set. To do this, for every location ℓ ∈ L_v in the snippet x, we take a moving window of length n around ℓ and copy all the n-grams w_i that contain that token. Call this set C_v the context set, i.e., the set of n-grams w_i of x that contain the token v. Now we find all n-grams in the training set that are similar to an n-gram in C_v but that have some other lexeme substituted for v. Formally, we set A_v as the set of all lexemes v' for which αvβ ∈ C_v and αv'β occurs in the training set. This guarantees that if we have seen a lexeme in at least one similar context, we place it in the alternatives list. Additionally, we add to A_v the special UNK token; the reason for this is explained in a moment. Once we have constructed the set of alternative names, the candidates are a list S_v of snippets, one for each v' ∈ A_v, in which all occurrences of v in x are replaced with v'.
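A hypothetical sketch of this retrieval step, assuming a pre-built index from "blanked" n-gram contexts (the n-gram with v's slot removed) to the lexemes seen in that slot in the training corpus; the index construction is omitted, and all names here are illustrative.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: collect alternative names A_v for lexeme v by looking
// up every context n-gram of v, with v's position blanked, in a training-set
// index. "contextIndex" maps a blanked n-gram, e.g. "QueryResults <SLOT> ,",
// to the set of lexemes observed in that slot.
final class AlternativeNames {
    static Set<String> alternatives(String v, List<List<String>> contextNgrams,
                                    Map<String, Set<String>> contextIndex) {
        Set<String> aV = new HashSet<>();
        for (List<String> ngram : contextNgrams) {
            // Blank out v's position to form the lookup key.
            StringBuilder key = new StringBuilder();
            for (String tok : ngram) {
                key.append(tok.equals(v) ? "<SLOT>" : tok).append(' ');
            }
            aV.addAll(contextIndex.getOrDefault(key.toString().trim(), Set.of()));
        }
        aV.remove(v);  // v itself is not an alternative
        aV.add("UNK"); // allow the rare-name suggestion (SUP)
        return aV;
    }
}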
The scoring function can use any model P_A, such as the n-gram model (Equation 3). N-gram models work well because, intuitively, they favor names that are common in the context of the input snippet. As we demonstrate in section 4, this does not reduce to simply suggesting the most common names, such as i and j. For example, suppose that the system is asked to propose a name for res in line 3 of Figure 1. The n-gram model is highly unlikely to suggest i, because even though the name i is common, the trigram “QueryResults i ,” is rare.
An interesting subtlety involves names that actually should be unique. Identifier names have a long tail, meaning that most names are individually uncommon. It would be undesirable to replace every rare name with common ones, as this would violate the sympathetic uniqueness principle. Fortunately, we can handle this issue in a subtle way: recall from subsection 3.1 that, during training of the n-gram LM, we convert rare names into the special UNK token. When we do this, UNK exists as a token in the LM, just like any other name. We simply allow NATURALIZE to return UNK as a suggestion, just like any other name. Returning UNK as a suggestion means that the model expects that it would be natural to use a rare name in the current context. The reason that this preserves rare identifiers is that the UNK token occurs in the training corpus specifically in unusual contexts where more common names were not used. Thus, if the input lexeme v occurs in an unusual context, this context is more likely to match that of UNK than of any of the more common tokens.
Multiple Point Suggestion. It is easy to adapt the system above to the multiple point suggestion task. Recall (subsection 2.1) that this task is to consider the set of identifiers that occur in a region x of code selected by the user, and highlight the lexemes that are least natural in context. For single point suggestion, the problem is to rank different alternatives, e.g., different variable names, for the same code location, whereas for multiple point suggestion, the problem is to rank different code locations against each other according to how