NATURALIZE, as presented in this paper, is a framework that learns the style of a codebase and suggests revisions to improve stylistic consistency; it can even transfer knowledge about coding conventions across projects.
Abstract:
Every programmer has a characteristic style, ranging from preferences about identifier naming to preferences about object relationships and design patterns. Coding conventions define a consistent syntactic style, fostering readability and hence maintainability. When collaborating, programmers strive to obey a project's coding conventions. However, one third of reviews of changes contain feedback about coding conventions, indicating that programmers do not always follow them and that project members care deeply about adherence. Unfortunately, programmers are often unaware of coding conventions because inferring them requires a global view, one that aggregates the many local decisions programmers make and identifies emergent consensus on style. We present NATURALIZE, a framework that learns the style of a codebase, and suggests revisions to improve stylistic consistency. NATURALIZE builds on recent work in applying statistical natural language processing to source code. We apply NATURALIZE to suggest natural identifier names and formatting conventions. We present four tools focused on ensuring natural code during development and release management, including code review. NATURALIZE achieves 94% accuracy in its top suggestions for identifier names and can even transfer knowledge about conventions across projects, leveraging a corpus of 10,968 open source projects. We used NATURALIZE to generate 18 patches for 5 open source projects: 14 were accepted.
TL;DR: DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generate better comments for Java methods from the learned features.
TL;DR: This work introduces learning-based clone detection techniques in which everything used to represent terms and fragments in source code is mined from the repository; compared against a traditional structure-oriented technique, the approach detected clones that were either undetected or suboptimally reported by the prominent tool Deckard.
TL;DR: This article presents a taxonomy based on the underlying design principles of each model and uses it to navigate the literature and discuss cross-cutting and application-specific challenges and opportunities.
TL;DR: In this article, a Gated Graph Neural Network (GGNN) is used to predict the name of a variable given its usage, and to reason about selecting the correct variable that should be used at a given program location.
TL;DR: A neural probabilistic language model for source code that is specifically designed for the method naming problem is introduced, and a variant of the model is introduced that is, to the knowledge, the first that can propose neologisms, names that have not appeared in the training corpus.
TL;DR: A deep convolutional neural network achieving state-of-the-art performance is described; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax.
TL;DR: In this article, a graph transformer network (GTN) is proposed for handwritten character recognition; it can synthesize a complex decision surface that classifies high-dimensional patterns such as handwritten characters.
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
TL;DR: The book is an introduction to the idea of design patterns in software engineering, and a catalog of twenty-three common patterns, from which most experienced OOP designers will find they have known about patterns all along.
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Q1. What contributions have the authors mentioned in the paper "Learning natural coding conventions" ?
The authors present NATURALIZE, a framework that learns the style of a codebase, and suggests revisions to improve stylistic consistency. The authors present four tools focused on ensuring natural code during development and release management, including code review. The authors used NATURALIZE to generate 18 patches for 5 open source projects: 14 were accepted. The authors apply NATURALIZE to suggest natural identifier names and formatting conventions. NATURALIZE achieves 94% accuracy in its top suggestions for identifier names.
Q2. What is the purpose of code review?
Code review is practiced heavily at Microsoft in an effort to ensure that changes are free of defects and adhere to team standards.
Q3. What is the way to get the author to be confident?
After querying NATURALIZE about his stylistic choices, the author can then be confident that his change is consistent with the norms of the team and is more likely to be approved during review.
Q4. What percentage of changes contained changes to follow code conventions?
The authors found that 2% of changes contained formatting improvements, 1% contained renamings, and 4% contained any changes to follow code conventions (which include formatting and renaming).
Q5. How many renamings did the authors find useful?
Using NATURALIZE's style profile, the authors identified high-confidence renamings and submitted 18 of them as patches to the 5 evaluation projects that actively use GitHub.
Q6. How do the authors construct the input snippet?
The input snippet is constructed by finding a snippet that subsumes all of the locations in Lv. Specifically, the input snippet is constructed by taking the lowest common ancestor in the AST of the nodes in Lv.
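The lowest-common-ancestor step above can be sketched in a few lines. This is an illustrative toy using Python's `ast` module rather than the paper's Java tooling; the helper names (`path_to`, `lowest_common_ancestor`) and the example function `f` are assumptions for the demo.

```python
import ast

def path_to(root, target):
    # Root-to-target path of AST nodes, or None if target is not in this subtree.
    if root is target:
        return [root]
    for child in ast.iter_child_nodes(root):
        sub = path_to(child, target)
        if sub is not None:
            return [root] + sub
    return None

def lowest_common_ancestor(root, nodes):
    # Deepest node whose subtree contains every node in `nodes`:
    # walk the root-to-node paths in lockstep while they agree.
    paths = [path_to(root, n) for n in nodes]
    lca = root
    for level in zip(*paths):
        if all(n is level[0] for n in level):
            lca = level[0]
        else:
            break
    return lca

tree = ast.parse("def f(x):\n    y = x + 1\n    return y + x\n")
# All Name nodes referring to `x` play the role of the use locations Lv.
uses = [n for n in ast.walk(tree)
        if isinstance(n, ast.Name) and n.id == "x"]
snippet_root = lowest_common_ancestor(tree, uses)
print(type(snippet_root).__name__)  # the whole function body subsumes both uses
```

Here both uses of `x` sit in different statements of `f`, so the smallest subsuming snippet is the enclosing function definition.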
Q7. How many patches were accepted by the core members of these projects?
The authors submitted patches based on NATURALIZE suggestions (subsection 4.5) to 5 of the most popular open source projects on GitHub: of the 18 patches that the authors submitted, 12 were accepted by the core members of these projects.
Q8. How many of the code reviews that the authors examined contained suggestions about naming?
These are particularly active topics of concern among developers, for example, almost one quarter of the code reviews that the authors examined contained suggestions about naming.
Q9. What is the general framework used for scoring?
The generic framework described in subsection 3.1 can, in principle, employ a wide variety of machine learning or NLP methods for its scoring function.
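One concrete instantiation of such a scoring function is an n-gram language model over code tokens. The sketch below is a deliberately tiny, assumed setup, not the paper's implementation: a hypothetical three-line token corpus, add-alpha smoothing, and a bigram model that scores candidate names for a loop variable, preferring the name that makes the snippet most "natural" given the corpus.

```python
from collections import Counter
import math

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

# Hypothetical pre-tokenized training corpus standing in for the codebase.
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "i", "in", "items", ":"],
    ["for", "j", "in", "range", "(", "m", ")", ":"],
]
uni = Counter(t for line in corpus for t in line)           # unigram counts
bi = Counter(b for line in corpus for b in bigrams(line))   # bigram counts
vocab_size = len(uni) + 1  # +1 leaves room for one unseen token

def log_prob(tokens, alpha=1.0):
    # Add-alpha smoothed bigram log-probability of a token sequence.
    lp = 0.0
    for a, b in bigrams(tokens):
        lp += math.log((bi[(a, b)] + alpha) / (uni[a] + alpha * vocab_size))
    return lp

def score_candidate(name):
    # Substitute the candidate name into the snippet and score the result.
    snippet = ["for", name, "in", "range", "(", "n", ")", ":"]
    return log_prob(snippet)

best = max(["i", "j", "counter"], key=score_candidate)
print(best)
```

Because `i` appears in this context twice as often as `j`, and `counter` never, the model ranks `i` highest; ranking candidates by such scores is the shape of the suggestion mechanism, though the paper's actual model, tokenization, and smoothing differ.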
Q10. Why do the authors find that fewer reviews are completed after a commit?
This is because, like defects, programmers notice and fix many violations themselves during development, prior to review, so reviewers must hunt for violations in a smaller set, and committed changes contain still fewer, although this number is nontrivial, as the authors show in subsection 4.4.