Learning Natural Coding Conventions
Miltiadis Allamanis†    Earl T. Barr‡    Christian Bird*    Charles Sutton†
†School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
‡Dept. of Computer Science, University College London, London, UK
*Microsoft Research, Redmond, WA, USA
{m.allamanis, csutton}@ed.ac.uk    e.barr@ucl.ac.uk    cbird@microsoft.com
ABSTRACT
Every programmer has a characteristic style, ranging from pref-
erences about identifier naming to preferences about object rela-
tionships and design patterns. Coding conventions define a consis-
tent syntactic style, fostering readability and hence maintainability.
When collaborating, programmers strive to obey a project’s coding
conventions. However, one third of reviews of changes contain
feedback about coding conventions, indicating that programmers
do not always follow them and that project members care deeply
about adherence. Unfortunately, programmers are often unaware of
coding conventions because inferring them requires a global view,
one that aggregates the many local decisions programmers make
and identifies emergent consensus on style. We present NATURAL-
IZE, a framework that learns the style of a codebase, and suggests
revisions to improve stylistic consistency. NATURALIZE builds on
recent work in applying statistical natural language processing to
source code. We apply NATURALIZE to suggest natural identifier
names and formatting conventions. We present four tools focused on
ensuring natural code during development and release management,
including code review. NATURALIZE achieves 94% accuracy in its top suggestions for identifier names. We used NATURALIZE to
generate 18 patches for 5 open source projects: 14 were accepted.
Categories and Subject Descriptors:
D.2.3 [Software Engineering]: Coding Tools and Techniques
General Terms: Algorithms
Keywords: Coding conventions, naturalness of software
1. INTRODUCTION
To program is to make a series of choices, ranging from design decisions, like how to decompose a problem into functions, to the choice of identifier names and how to format the code. While
local and syntactic, the latter are important: names connect program
source to its problem domain [13, 43, 44, 68]; formatting decisions
usually capture control flow [36]. Together, naming and formatting
decisions determine the readability of a program’s source code,
increasing a codebase’s portability, its accessibility to newcomers,
its reliability, and its maintainability [55, §1.1]. Apple’s recent,
infamous bug in its handling of SSL certificates [7, 40] exemplifies
the impact that formatting can have on reliability. Maintainability is
especially important since developers spend the majority (80%) of their time maintaining code [2, §6].
A convention is “an equilibrium that everyone expects in inter-
actions that have more than one equilibrium” [74]. For us, coding
conventions arise out of the collision of the stylistic choices of
programmers. A coding convention is a syntactic restriction not
imposed by a programming language’s grammar. Nonetheless, these
choices are important enough that they are enforced by software
teams. Indeed, our investigations indicate that developers enforce
such coding conventions rigorously, with roughly one third of code
reviews containing feedback about following them (subsection 4.1).
Like the rules of society at large, coding conventions fall into
two broad categories: laws, explicitly stated and enforced rules,
and mores, unspoken common practice that emerges spontaneously.
Mores pose a particular challenge: because they arise spontaneously
from emergent consensus, they are inherently difficult to codify into
a fixed set of rules, so rule-based formatters cannot enforce them,
and even programmers themselves have difficulty adhering to all
of the implicit mores of a codebase. Furthermore, popular code
changes constantly, and these changes necessarily embody stylistic
decisions, sometimes generating new conventions and sometimes
changing existing ones. To address this, we introduce the coding
convention inference problem, the problem of automatically learning
the coding conventions consistently used in a body of source code.
Conventions are pervasive in software, ranging from preferences
about identifier names to preferences about class layout, object
relationships, and design patterns. In this paper, we focus as a first
step on local, syntactic conventions, namely, identifier naming and
formatting. These are particularly active topics of concern among
developers, for example, almost one quarter of the code reviews
that we examined contained suggestions about naming.
We introduce NATURALIZE, a framework that solves the cod-
ing convention inference problem for local conventions, offering
suggestions to increase the stylistic consistency of a codebase. NAT-
URALIZE can also be applied to infer rules for existing rule-based
formatters. NATURALIZE is descriptive, not prescriptive¹: it learns
what programmers actually do. When a codebase does not reflect
consensus on a convention, NATURALIZE recommends nothing, be-
cause it has not learned anything with sufficient confidence to make
recommendations. The naturalness insight of Hindle et al. [35],
building on Gabel and Su [28], is that most short code utterances,
like natural language utterances, are simple and repetitive. Large
corpus statistical inference can discover and exploit this naturalness
to improve developer productivity and code robustness. We show
that coding conventions are natural in this sense.
¹ Prescriptivism is the attempt to specify rules for correct style in language, e.g., Strunk and White [67]. Modern linguists studiously avoid prescriptivist accounts, observing that many such rules are routinely violated by noted writers.

Learning from local context allows NATURALIZE to learn syntac-
tic restrictions, or sub-grammars, on identifier names like camelcase
or underscore, and to unify names used in similar contexts, which
rule-based code formatters simply cannot do. Intuitively, NATU-
RALIZE works by identifying identifier names or formatting choices
that are surprising according to a probability distribution over code
text. When surprised, NATURALIZE determines if it is sufficiently
confident to suggest a renaming or reformatting that is less surpris-
ing; it unifies the surprising choice with one that is preferred in
similar contexts elsewhere in its training set. NATURALIZE is not
automatic; it assists a developer, since its suggestions, both renam-
ing and even formatting, as in Python or Apple’s aforementioned
SSL bug [7, 40], are potentially semantically disruptive and must
be considered and approved. NATURALIZE's suggestions enable a
range of new tools to improve developer productivity and code qual-
ity: 1) A pre-commit script that rejects commits that excessively
disrupt a codebase’s conventions; 2) A tool that converts the inferred
conventions into rules for use by a code formatter; 3) An Eclipse
plugin that a developer can use to check whether her changes are
unconventional; and 4) A style profiler that highlights the stylistic
inconsistencies of a code snippet for a code reviewer.
NATURALIZE draws upon a rich body of tools from statistical
natural language processing (NLP), but applies these techniques to
a different kind of problem. NLP focuses on understanding and
generating language, but does not ordinarily consider the problem
of improving existing text. The closest analog is spelling correction,
but that problem is easier because we have strong prior knowledge
about common types of spelling mistakes. An important conceptual
dimension of our suggestion problems also sets our work apart from
mainstream NLP. In code, rare names often usefully signify unusual
functionality, and need to be preserved. We call this the sympathetic
uniqueness principle (SUP): unusual names should be preserved
when they appear in unusual contexts. We achieve this by exploiting
a special token UNK that is often used to represent rare words that
do not appear in the training set. Our method incorporates SUP
through a clean, straightforward modification to the handling of
UNK. Because of the Zipfian nature of language, UNK appears
in unusual contexts and identifies unusual tokens that should be
preserved. Section 4 demonstrates the effectiveness of this method at
preserving such names. Additionally, handling formatting requires
a simple, but novel, method of encoding formatting.
As NATURALIZE detects identifiers that violate code conventions
and assists in renaming, the most common refactoring [50], it is the
first tool we are aware of that uses NLP techniques to aid refactoring.
The techniques that underlie NATURALIZE are language indepen-
dent and require only identifying identifiers, keywords, and opera-
tors, a much easier task than specifying grammatical structure. Thus,
NATURALIZE is well-positioned to be useful for domain-specific or esoteric languages for which no convention-enforcing tools exist, or for the increasing number of multi-language software projects, such as web applications that intermix Java, CSS, HTML, and JavaScript.
To the best of the authors’ knowledge, this work is the first to
address the coding convention inference problem, to suggest names
and formatting to increase the stylistic coherence of code, and to
provide tooling to that end. Our contributions are:
• We built NATURALIZE, the first framework to solve the coding convention inference problem for local conventions, including identifier naming and formatting; it suggests changes to increase a codebase's adherence to its own conventions;
• We offer four tools, built on NATURALIZE, all focused on release management, an under-tooled phase of the development process. NATURALIZE 1) achieves 94% accuracy in its top suggestions for identifier names and 2) never drops below a mean accuracy of 96% when making formatting suggestions; and
• We demonstrate that coding conventions are important to software teams, by showing that 1) empirically, programmers enforce conventions heavily through code review feedback and corrective commits, and 2) patches based on NATURALIZE suggestions have been incorporated into 5 of the most popular open source Java projects on GitHub: of the 18 patches
that we submitted, 14 were accepted.
Tools are available at groups.inf.ed.ac.uk/naturalize.
2. MOTIVATING EXAMPLE
Both industrial and open source developers often submit their
code for review prior to check-in [61]. Consider the example of
the class shown in Figure 1, which is part of a change submitted for
review by a Microsoft developer on February 17th, 2014. While
there is nothing functionally wrong with the class, it violates the
coding conventions of the team. A second developer reviewed the
change and suggested that res and str do not convey parameter meaning well enough, and that the constructor line is much too long and should be wrapped. In the checked-in change, all of these were addressed, with the parameter names changed to queryResults and queryStrings.
Consider a scenario in which the author had access to NATU-
RALIZE. The author might highlight the parameter names and ask
NATURALIZE to evaluate them. At that point it would not only have identified res and str as names that are inconsistent with the
naming conventions of parameters in the codebase, but would also
have suggested better names. The author may have also thought
to himself “Is the constructor on line 3 too long?” or “Should the
empty constructor body be on its own line and should it have a space inside?” Here again, NATURALIZE would have provided immediate, valuable answers based on the conventions of the team.
NATURALIZE would indicate that the call to the base constructor
should be moved to the next line and indented to be consonant with
team conventions and that in this codebase empty method bodies do
not need their own lines. Furthermore, it would indicate that some
empty methods contain one space between the braces while others
do not, so there is no implicit convention to follow. After querying
NATURALIZE about his stylistic choices, the author can then be
confident that his change is consistent with the norms of the team
and is more likely to be approved during review. Furthermore, by
leveraging NATURALIZE, fellow project members wouldn’t need to
be bothered by questions about conventions, nor would they need
to provide feedback about conventions during review. We have
observed that such scenarios occur in open source projects as well.
2.1 Use Cases and Tools
Coding conventions are critical during release management, which
comprises committing, reviewing, and promoting (including re-
leases) changes, either patches or branches. This is when a coder’s
idiosyncratic style, isolated in her editor during code composition,
comes into contact with the styles of others. The outcome of this
interaction strongly impacts the readability, and therefore the main-
tainability, of a codebase. Compared to other phases of the develop-
ment cycle like editing, debugging, project management, and issue
tracking, release management is under-tooled. Code conventions are
particularly pertinent here, and lead us to target three use cases: 1) a
developer preparing an individual commit or branch for review or
promotion; 2) a release engineer trying to filter out needless stylistic
diversity from the flood of changes; and 3) a reviewer wishing to
consider how well a patch or branch obeys community norms.
Any code modification has a possibility of introducing bugs [3,
51]. This is certainly true of a system, like NATURALIZE, that is
based on statistical inference, even when (as we always assume) all
of NATURALIZE's suggestions are approved by a human. Because of

1 public class ExecutionQueryResponse : ExecutionQueryResponseBasic<QueryResults>
2 {
3 public ExecutionQueryResponse(QueryResults res, IReadOnlyCollection<string> str, ExecutionStepMetrics metrics) : base(res, str, metrics) { }
4 }
Figure 1: A C# class added by a Microsoft developer that was modified due to requests by a reviewer before it was checked in.
Figure 2: A screenshot of the devstyle Eclipse plugin. The user has requested suggestions for alternate names of each argument.
this risk, the gain from making a change must be worth its cost. For
this reason, our use cases focus on times when the code is already
being changed. To support our use cases, we have built four tools:
devstyle: A plugin for the Eclipse IDE that gives suggestions for identifier renaming and formatting, both for a single identifier or format point and for the identifiers and formatting in a selection of code.
styleprofile: A code review assistant that produces a profile summarizing the adherence of a code snippet to the coding conventions of a codebase, and suggests renaming and formatting changes to make that snippet more stylistically consistent with a project.
genrule: A rule generator for Eclipse's code formatter that generates rules for those conventions that NATURALIZE has inferred from a codebase.
stylish?: A high-precision pre-commit script for Git that rejects commits that have highly inconsistent and unnatural naming or formatting within a project.
The devstyle plugin offers two types of suggestions: single point suggestion under the mouse pointer and multiple point suggestion via right-clicking a selection. A screenshot from devstyle is shown in Figure 2. For single point suggestions, devstyle displays a ranked list of alternatives to the selected name or format. If devstyle has no suggestions, it simply flashes the current name or selection. If the user wishes, she selects one of the suggestions. If it is an identifier renaming, devstyle renames all uses, within scope, of that identifier under its previous name. This scope traversal is possible because our use cases assume an existing and compiled codebase. Formatting changes occur at the suggestion point. Multiple point suggestion returns a style profile, a ranked list of the top k most stylistically surprising naming or formatting choices in the current selection that could benefit from reconsideration. By default, k = 5, based on HCI considerations [23, 48]. To accept a suggestion here, the user must first select a location to modify, then select from among its top alternatives. The styleprofile tool outputs a style profile. genrule (subsection 3.5) generates settings for the Eclipse code formatter. Finally, stylish? is a filter that uses the Eclipse code formatter with the settings from genrule to accept or reject a commit based on its style profile.
NATURALIZE uses an existing codebase, called a training corpus,
as a reference from which to learn conventions. Commonly, the train-
ing corpus will be the current codebase, so that NATURALIZE learns
domain-specific conventions related to the current project. Alterna-
tively, NATURALIZE comes with a pre-packaged suggestion model,
trained on a corpus of popular, vibrant projects that presumably
embody good coding conventions. Developers can use this engine
if they wish to increase their codebase’s adherence to a larger com-
munity’s consensus on best practice. Projects that are just starting
and have little or no code written can also use as the training corpus
a pre-existing codebase, for example another project in the same
organization, whose conventions the developers wish to adopt. Here,
again, we avoid normative comparison of coding conventions, and
do not force the user to specify their desired conventions explicitly.
Instead, the user specifies a training corpus, and this is used as an im-
plicit source of desired conventions. The NATURALIZE framework
and tools are available at groups.inf.ed.ac.uk/naturalize.
3. THE NATURALIZE FRAMEWORK
In this section, we introduce the generic architecture of NATU-
RALIZE, which can be applied to a wide variety of different types of
conventions and is language independent. NATURALIZE is general
and can be applied to any language for which a lexer and a parser
exist, as token sequences and abstract syntax trees (ASTs) are used
during analysis. Figure 3 illustrates its architecture. The input is
a code snippet to be naturalized. This snippet is selected based on
the user input, in a way that depends on the particular tool in ques-
tion. For example, in devstyle, if a user selects a local variable
for renaming, the input snippet would contain all AST nodes that
reference that variable (subsection 3.3). The output of NATURALIZE
is a short list of suggestions, which can be filtered, then presented
to the programmer. In general, a suggestion is a set of snippets that
may replace the input snippet. The list is ranked by a naturalness
score that is defined below. Alternatively, the system can return a binary value indicating whether the code is natural, so as to support applications such as stylish?. The system makes no suggestion if
it deems the input snippet to be sufficiently natural, or is unable to
find good alternatives. This reduces the “Clippy effect” where users
ignore a system that makes too many bad suggestions². In the next
section, we describe each element in the architecture in more detail.
Terminology. A language model (LM) is a probability distribution over strings. Given any string x = x_0 x_1 ... x_M, where each x_i is a token, an LM assigns a probability P(x). Let G be the grammar of a programming language. We use x to denote a snippet, that is, a string x such that αxβ ∈ L(G) for some strings α, β. We primarily consider snippets that are dominated by a single node in the file's AST. That is, there is a node within the AST whose subtree comprises the entire snippet and nothing else. We use x to denote the input snippet to the framework, and y, z to denote arbitrary snippets³.
3.1 The Core of NATURALIZE
The architecture contains two main elements: proposers and the
scoring function. The proposers modify the input code snippet to
produce a list of suggestion candidates that can replace the input
snippet. In the example from Figure 1, each candidate replaces all
occurrences of
res
with a different name used in similar contexts
elsewhere in the project, such as
results
or
queryResults
. In
principle, many implausible suggestions could ensue, so, in practice,
proposers contain filtering logic.
A scoring function sorts these candidates according to a measure
of naturalness. Its input is a candidate snippet, and it returns a
real number measuring naturalness. Naturalness is measured with
respect to a training corpus that is provided to NATURALIZE, thus
allowing us to follow our guiding principle that naturalness must
be measured with respect to a particular codebase. For example,
² In extreme cases, such systems can be so widely mocked that they are publicly disabled by the company's CEO in front of a cheering audience: http://bit.ly/pmHCwI.
³ The application of NATURALIZE to academic papers in software engineering is left to future work.

[Figure 3 diagram: Code for Review and a Training Corpus (other code from the project) feed the Proposers (rename identifiers, add formatting), which produce Candidates; a Scoring Function (n-gram language model, SVM) ranks them into Top Suggestions. The example snippet shown flowing through the pipeline:]

public void testRunReturnsResult() {
    PrintStream oldOut = System.out;
    System.setOut(new PrintStream(new OutputStream() {
        @Override
        public void write(int arg0) throws IOException {
        }
    }));
    try {
        TestResult result = junit.textui.TestRunner.run(new TestSuite());
        assertTrue(result.wasSuccessful());
    } finally {
        System.setOut(oldOut);
    }
}
Figure 3: The architecture of NATURALIZE: a framework for learning coding conventions. A contiguous snippet of code is selected
for review through the user interface. A set of proposers returns a set of candidates, which are modified versions of the snippet, e.g.,
with one local variable renamed. The candidates are ranked by a scoring function, such as an n-gram language model, which returns
a small list of top suggestions to the interface, sorted by naturalness.
the training corpus might be the set of source files A from the current application. A powerful way to measure the naturalness of a snippet is provided by statistical language modeling. We use P_A(y) to indicate the probability that the language model P, which has been trained on the corpus A, assigns to the string y. The key intuition is that an LM P_A is trained so that it assigns high probability to strings in the training corpus, i.e., snippets with higher log probability are more like the training corpus, and presumably more natural. There are several key reasons why statistical language models are a powerful approach for modeling coding conventions. First, probability distributions provide an easy way to represent soft constraints about conventions. This allows us to avoid many of the pitfalls of inflexible, rule-based approaches. Second, because they are based on a learning approach, LMs can flexibly adapt to the conventions in a new project. Intuitively, because P_A assigns high probability to strings t ∈ A that occur in the training corpus, it also assigns high probability to strings that are similar to those in the corpus. So the scoring function s tends to favor snippets that are stylistically consistent with the training corpus.
We score the naturalness of a snippet y = y_{1:N} as

    s(y, P_A) = \frac{1}{N} \log P_A(y);    (1)

that is, we deem snippets that are more probable under the LM as more natural in the application A. Equation 1 is cross-entropy multiplied by -1 to make s a score, where s(x) > s(y) implies x is more natural than y. Where it creates no confusion, we write s(y), eliding the second argument. When choosing between competing candidate snippets y and z, we need to know not only which candidate the LM prefers, but how “confident” it is. We measure this by a gap function g, which is the difference in scores: g(y, z, P) = s(y, P) - s(z, P). Because s is essentially a log probability, g is the log ratio of probabilities between y and z. For example, when g(y, z) > 0 the snippet y is more natural, i.e., less surprising according to the LM, and thus is a better suggestion candidate than z. If g(y, z) = 0 then both snippets are equally natural.
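As a quick worked reading of Equation 1 and the gap function, with illustrative numbers rather than values from the paper: take two equal-length candidate snippets with

    N = 4, \qquad \log P_A(y) = -8, \qquad \log P_A(z) = -12,

so that

    s(y) = \tfrac{1}{4}(-8) = -2, \qquad s(z) = \tfrac{1}{4}(-12) = -3, \qquad g(y, z, P_A) = s(y) - s(z) = 1 > 0,

and the LM prefers y: it is less surprising, and would be ranked above z as a suggestion.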
Now we define the function suggest(x, C, k, t) that returns the top candidates according to the scoring function. This function returns a list of top candidates, or the empty list if no candidates are sufficiently natural. The function takes four parameters: the input snippet x, the list C = (c_1, c_2, ..., c_r) of candidate snippets, and two thresholds: k ∈ N, the maximum number of suggestions to return, and t ∈ R, a minimum confidence value. The parameter k controls the size of the ranked list that is returned to the user, while t controls the suggestion frequency, that is, how confident NATURALIZE needs to be before it presents any suggestions to the user. Appropriately setting t allows NATURALIZE to avoid the Clippy effect by making no suggestion rather than a low quality one. Below, we present an automated method for selecting t.
The suggest function first sorts C = (c_1, c_2, ..., c_r), the candidate list, according to s, so s(c_1) ≥ s(c_2) ≥ ... ≥ s(c_r). Then, it trims the list to avoid overburdening the user: it truncates C to include only the top k elements, so that length(C) = min{k, r}, and removes candidates c_i ∈ C that are not sufficiently more natural than the original snippet; formally, it removes all c_i from C where g(c_i, x) < t. Finally, if the original input snippet x is the highest ranked in C, i.e., if c_1 = x, suggest ignores the other suggestions, sets C = ∅ to decline to make a suggestion, and returns C.
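As a concrete, hedged sketch of this procedure (not the paper's implementation), the following Java rendering assumes an abstract scoring function s, e.g., backed by a trained n-gram LM; candidate snippets are represented as plain strings for brevity.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative sketch of suggest(x, C, k, t): sort by naturalness, trim to
// the top k, drop candidates whose gap over the input is below t, and
// decline to suggest if the input itself ranks highest.
final class Suggest {
    static List<String> suggest(String x, List<String> candidates,
                                int k, double t, ToDoubleFunction<String> s) {
        List<String> c = new ArrayList<>(candidates);
        // Sort candidates by score, most natural first.
        c.sort(Comparator.comparingDouble(s).reversed());
        // Truncate to the top k elements: length(C) = min{k, r}.
        if (c.size() > k) c = new ArrayList<>(c.subList(0, k));
        // Remove candidates c_i with g(c_i, x) = s(c_i) - s(x) < t.
        double sx = s.applyAsDouble(x);
        c.removeIf(ci -> s.applyAsDouble(ci) - sx < t);
        // If the original snippet is still the top candidate, suggest nothing.
        if (!c.isEmpty() && c.get(0).equals(x)) c.clear();
        return c;
    }
}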
Binary Decision. If an accept/reject decision on the input x is required, e.g., as in stylish?, NATURALIZE must collectively consider all of the locations in x at which it could make suggestions. We propose a score function for this binary decision that measures the quality of the best possible improvement that NATURALIZE is able to make. Formally, let L be the set of locations in x at which NATURALIZE is able to make suggestions, and for each ℓ ∈ L, let C_ℓ be the system's set of suggestions at ℓ. In general, C_ℓ contains name or formatting suggestions. Recall that P is the language model. We define the score

    G(x, P) = \max_{\ell \in L} \max_{c \in C_\ell} g(c, x).    (2)

If G(x, P) > T, then NATURALIZE rejects the snippet as being excessively unnatural. The threshold T controls the sensitivity of NATURALIZE to unnatural names and formatting. As T increases, fewer input snippets will be rejected, so some unnatural snippets will slip through, but as compensation the test is less likely to reject snippets that are in fact well-written.
Setting the Confidence Threshold. The thresholds t in the suggest function and T in the binary decision function are on log probabilities of strings, which can be difficult for users to interpret. Fortunately, these can be set automatically using the false positive rate (FPR), i.e., the proportion of snippets x that in fact follow convention but that the system erroneously rejects. We would like the FPR to be as small as possible, but, unless we wish the system to make no suggestions at all, we must accept some false positives. So we set a maximum acceptable FPR α, and search for a threshold t or T that ensures that NATURALIZE's FPR is at most α. The principle is similar to statistical hypothesis testing. To make this work, we estimate the FPR for a given t or T. To do so, we select a random set of snippets from the training corpus, e.g., random method bodies, and compute the proportion of these snippets that are rejected using T. Again leveraging our assumption that our training corpus contains natural code, this proportion estimates the FPR. We use a grid search [11] to find a threshold T (respectively t) whose estimated FPR stays within α, the user-specified acceptable FPR bound.
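A hedged sketch of this calibration, continuing the illustrative Java style above; the grid range, step, and names are assumptions, and G stands in for the best-gap score of Equation 2.

import java.util.List;
import java.util.function.ToDoubleFunction;

// Illustrative calibration: choose the binary-decision threshold T by
// estimating the FPR on snippets sampled from the training corpus, which
// we assume (as the paper does) to consist of natural, conventional code.
final class ThresholdCalibration {
    static double calibrate(List<String> naturalSnippets,
                            ToDoubleFunction<String> G, // best-gap score, Eq. 2
                            double alpha) {             // acceptable FPR bound
        // Grid search, most sensitive threshold first: the smallest T whose
        // estimated FPR is at most alpha rejects the most unnatural snippets
        // while honoring the bound.
        for (double T = 0.0; T <= 10.0; T += 0.1) {
            final double threshold = T;
            long rejected = naturalSnippets.stream()
                    .filter(x -> G.applyAsDouble(x) > threshold)
                    .count();
            double fpr = (double) rejected / naturalSnippets.size();
            if (fpr <= alpha) return threshold;
        }
        return Double.POSITIVE_INFINITY; // no threshold met the bound: never reject
    }
}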
3.2 Choices of Scoring Function
The generic framework described in subsection 3.1 can, in princi-
ple, employ a wide variety of machine learning or NLP methods for

its scoring function. Indeed, a large portion of the statistical NLP
literature focuses on probability distributions over text, including
language models, probabilistic grammars, and topic models. Very
few of these models have been applied to code; exceptions include
[4, 35, 46, 49, 53]. We choose to build on statistical language mod-
els, because previous work of Hindle et al. [35] has shown that they are particularly able to capture the naturalness of code.
The intuition behind language modeling is that, since there is an infinite number of possible strings, we obviously cannot store a probability value for every one. Different LMs make different simplifying assumptions to make the modeling tractable, and these assumptions determine the types of coding conventions that we are able to infer. One of the most effective practical LMs is the n-gram language model. N-gram models make the assumption that the next token can be predicted using only the previous n - 1 tokens. Formally, the probability of a token y_m, conditioned on all of the previous tokens y_1 ... y_{m-1}, is a function only of the previous n - 1 tokens. Under this assumption, we can write

    P(y_1 ... y_M) = \prod_{m=1}^{M} P(y_m \mid y_{m-1} ... y_{m-n+1}).    (3)
To use this equation we need to know the conditional probabilities P(y_m | y_{m-1} ... y_{m-n+1}) for each possible n-gram. This is a table of V^n numbers, where V is the number of possible lexemes. These are the parameters of the model that we learn from the training corpus. The simplest way to estimate the model parameters is to set P(y_m | y_{m-1} ... y_{m-n+1}) to the proportion of times that y_m follows y_{m-1} ... y_{m-n+1}. In practice, this simple estimator does not work well, because it assigns zero probability to n-grams that do not occur in the training corpus. Instead, n-gram models are trained using smoothing methods [22]. In our work, we use Katz smoothing.
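To make the estimator concrete, here is a minimal, hypothetical maximum-likelihood trigram estimator (n = 3) over a token stream; a production model would smooth these counts (the paper uses Katz smoothing) rather than return zero for unseen trigrams.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal maximum-likelihood trigram model, for illustration only. Real
// models smooth these counts (e.g., Katz back-off) so unseen trigrams do
// not receive probability zero.
final class TrigramModel {
    private final Map<String, Integer> trigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    void train(List<String> tokens) {
        for (int i = 2; i < tokens.size(); i++) {
            String context = tokens.get(i - 2) + " " + tokens.get(i - 1);
            bigramCounts.merge(context, 1, Integer::sum);
            trigramCounts.merge(context + " " + tokens.get(i), 1, Integer::sum);
        }
    }

    // P(token | two previous tokens) as a relative frequency.
    double prob(String prev2, String prev1, String token) {
        String context = prev2 + " " + prev1;
        int contextCount = bigramCounts.getOrDefault(context, 0);
        if (contextCount == 0) return 0.0; // the zero-probability problem
        return (double) trigramCounts.getOrDefault(context + " " + token, 0)
                / contextCount;
    }
}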
Implementation. When an n-gram model is used, we can compute the gap function g(y, z) very efficiently. This is because, when g is used within suggest, ordinarily the strings y and z will be similar, i.e., the input snippet and a candidate revision. The key insight is that in an n-gram model, the probability P(y) of a snippet y = (y_1 y_2 ... y_N) depends only on the multiset of n-grams that occur in y, that is,

    NG(y) = \{ y_i y_{i+1} ... y_{i+n-1} \mid 0 \le i \le N - (n - 1) \}.    (4)

An equivalent way to write an n-gram model is

    P(y) = \prod_{a_1 a_2 ... a_n \in NG(y)} P(a_n \mid a_1, a_2, ... a_{n-1}).    (5)

Since the gap function is g(y, z) = log[P(y)/P(z)], any n-grams that are members both of NG(y) and NG(z) cancel, so to compute g, we only need to consider those n-grams not in NG(y) ∩ NG(z). Intuitively, this means that, to compute the gap function g(y, z), we need to examine the n-grams around the locations where the snippets y and z differ. This is a very useful optimization if y and z are long snippets that differ in only a few locations.
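Continuing the illustrative trigram sketch above, the cancellation can be implemented by scoring only the n-grams whose multiset counts differ between y and z. This is an assumption-laden sketch, not the paper's implementation; note that with unsmoothed counts an unseen trigram contributes log 0, which smoothing avoids.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: g(y, z) = log P(y) - log P(z), computed from only
// the trigrams whose counts differ between y and z (shared ones cancel).
final class GapFunction {
    static double gap(List<String> y, List<String> z, TrigramModel lm) {
        Map<String, Integer> diff = new HashMap<>();
        collect(y, diff, +1);
        collect(z, diff, -1);
        double g = 0.0;
        for (Map.Entry<String, Integer> e : diff.entrySet()) {
            if (e.getValue() == 0) continue; // trigram shared by y and z
            String[] t = e.getKey().split(" ");
            // A smoothed model would guarantee prob(...) > 0 here.
            g += e.getValue() * Math.log(lm.prob(t[0], t[1], t[2]));
        }
        return g;
    }

    private static void collect(List<String> tokens, Map<String, Integer> diff, int sign) {
        for (int i = 2; i < tokens.size(); i++) {
            String ngram = tokens.get(i - 2) + " " + tokens.get(i - 1) + " " + tokens.get(i);
            diff.merge(ngram, sign, Integer::sum);
        }
    }
}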
When training an LM, we take measures to deal with rare lexemes, since, by definition, we do not have much data about them. We use a preprocessing step, a common strategy in language modeling, that builds a vocabulary with all the identifiers that appear more than once in the training corpus. Let count(v, b) return the number of appearances of token v in the codebase b. Then, if a token has count(v, b) ≤ 1, we convert it to a special token, which we denote UNK. Then we train the n-gram model as usual. The effect is that the UNK token becomes a catchall that means the model expects to see a rare token, even though it cannot be sure which one.
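A one-screen sketch of this preprocessing step, under the same illustrative Java conventions as the sketches above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative UNK preprocessing: tokens with count(v, b) <= 1 in the
// training corpus are replaced by the catchall UNK token before training.
final class UnkPreprocessor {
    static List<String> applyUnk(List<String> corpusTokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : corpusTokens) counts.merge(t, 1, Integer::sum);
        List<String> out = new ArrayList<>(corpusTokens.size());
        for (String t : corpusTokens) out.add(counts.get(t) <= 1 ? "UNK" : t);
        return out;
    }
}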
3.3 Suggesting Natural Names
In this section, we instantiate the core NATURALIZE framework
for the task of suggesting natural identifier names. We start by
describing the single suggestion setting. For concreteness, imagine a user of the devstyle plugin, who selects an identifier and asks devstyle for its top suggestions. It should be easy to see how this discussion can be generalized to the other use cases described in subsection 2.1. Let v be the lexeme selected by the programmer. This lexeme could denote a variable, a method call, or a type.
When a programmer binds a name to an identifier and then uses it, she implicitly links together all the locations in which that name appears. Let L_v denote this set of locations, that is, the set of locations in the current scope in which the lexeme v is used. For example, if v denotes a local variable, then L_v would be the set of locations in which that local is used. Now, the input snippet is constructed by finding a snippet that subsumes all of the locations in L_v. Specifically, the input snippet is constructed by taking the lowest common ancestor in the AST of the nodes in L_v.
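For illustration, a minimal lowest-common-ancestor computation over a hypothetical AST node type with a parent pointer; real ASTs, e.g. Eclipse JDT's, expose an equivalent getParent().

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical minimal AST node: only the parent link matters here.
final class Node {
    Node parent; // null at the root
}

final class LowestCommonAncestor {
    // LCA of two nodes: collect one node's ancestor chain, then walk up
    // from the other until the chains meet.
    static Node lca(Node a, Node b) {
        Set<Node> ancestors = new HashSet<>();
        for (Node n = a; n != null; n = n.parent) ancestors.add(n);
        for (Node n = b; n != null; n = n.parent) {
            if (ancestors.contains(n)) return n;
        }
        return null; // nodes are in different trees
    }

    // LCA of all nodes in L_v: the subtree rooted there is the input snippet.
    static Node lcaAll(List<Node> nodes) {
        Node result = nodes.get(0);
        for (Node n : nodes.subList(1, nodes.size())) result = lca(result, n);
        return result;
    }
}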
The proposers for this task retrieve a set of alternative names to v, which we denote A_v, by retrieving other names that have occurred in the same contexts in the training set. To do this, for every location ℓ ∈ L_v in the snippet x, we take a moving window of length n around ℓ and copy all the n-grams w_i that contain that token. Call this set C_v the context set, i.e., the set of n-grams w_i of x that contain the token v. Now we find all n-grams in the training set that are similar to an n-gram in C_v but that have some other lexeme substituted for v. Formally, we set A_v as the set of all lexemes v' for which αvβ ∈ C_v and αv'β occurs in the training set. This guarantees that if we have seen a lexeme in at least one similar context, we place it in the alternatives list. Additionally, we add to A_v the special UNK token; the reason for this is explained in a moment. Once we have constructed the set of alternative names, the candidates are a list S_v of snippets, one for each v' ∈ A_v, in which all occurrences of v in x are replaced with v'.
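A hypothetical sketch of this retrieval step, assuming a pre-built index from "blanked" n-gram contexts (the n-gram with v's slot removed) to the lexemes seen in that slot in the training corpus; the index construction is omitted, and all names here are illustrative.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: collect alternative names A_v for lexeme v by looking
// up every context n-gram of v, with v's position blanked, in a training-set
// index. "contextIndex" maps a blanked n-gram, e.g. "QueryResults <SLOT> ,",
// to the set of lexemes observed in that slot.
final class AlternativeNames {
    static Set<String> alternatives(String v, List<List<String>> contextNgrams,
                                    Map<String, Set<String>> contextIndex) {
        Set<String> aV = new HashSet<>();
        for (List<String> ngram : contextNgrams) {
            // Blank out v's position to form the lookup key.
            StringBuilder key = new StringBuilder();
            for (String tok : ngram) {
                key.append(tok.equals(v) ? "<SLOT>" : tok).append(' ');
            }
            aV.addAll(contextIndex.getOrDefault(key.toString().trim(), Set.of()));
        }
        aV.remove(v);  // v itself is not an alternative
        aV.add("UNK"); // allow the rare-name suggestion (SUP)
        return aV;
    }
}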
The scoring function can use any model P_A, such as the n-gram model (Equation 3). N-gram models work well because, intuitively, they favor names that are common in the context of the input snippet. As we demonstrate in section 4, this does not reduce to simply suggesting the most common names, such as i and j. For example, suppose that the system is asked to propose a name for res in line 3 of Figure 1. The n-gram model is highly unlikely to suggest i, because even though the name i is common, the trigram “QueryResults i ,” is rare.
An interesting subtlety involves names that actually should be unique. Identifier names have a long tail, meaning that most names are individually uncommon. It would be undesirable to replace every rare name with common ones, as this would violate the sympathetic uniqueness principle. Fortunately, we can handle this issue in a subtle way: recall from subsection 3.1 that, during training of the n-gram LM, we convert rare names into the special UNK token. When we do this, UNK exists as a token in the LM, just like any other name. We simply allow NATURALIZE to return UNK as a suggestion, just like any other name. Returning UNK as a suggestion means that the model expects that it would be natural to use a rare name in the current context. The reason that this preserves rare identifiers is that the UNK token occurs in the training corpus specifically in unusual contexts where more common names were not used. Thus, if the input lexeme v occurs in an unusual context, this context is more likely to match that of UNK than of any of the more common tokens.
Multiple Point Suggestion. It is easy to adapt the system above to the multiple point suggestion task. Recall (subsection 2.1) that this task is to consider the set of identifiers that occur in a region x of code selected by the user, and highlight the lexemes that are least natural in context. For single point suggestion, the problem is to rank different alternatives, e.g., different variable names, for the same code location, whereas for multiple point suggestion, the problem is to rank different code locations against each other according to how