Proceedings ArticleDOI

On the naturalness of software

TL;DR: The paper conjectures that most software is natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, is likely to be repetitive and predictable; the conjecture is then tested empirically with statistical language models.
Abstract: Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations — and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's built-in completion capability. We conclude the paper by laying out a vision for future research in this area.
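To make the n-gram idea concrete, here is a minimal sketch of a trigram model over code tokens with a top-k next-token suggester, in the spirit of the paper's completion engine. The whitespace tokenizer, the toy Java-like corpus, and the unsmoothed maximum-likelihood estimates are illustrative simplifications, not the authors' implementation, which uses a proper lexer and smoothing.

```python
from collections import Counter, defaultdict

def train_trigram_model(tokens):
    """Count continuations so that P(t3 | t1, t2) = count(t1,t2,t3) / count(t1,t2)."""
    continuations = defaultdict(Counter)
    for t1, t2, t3 in zip(tokens, tokens[1:], tokens[2:]):
        continuations[(t1, t2)][t3] += 1
    return continuations

def suggest(model, t1, t2, k=3):
    """Return the k most likely next tokens after the context (t1, t2)."""
    return [tok for tok, _ in model[(t1, t2)].most_common(k)]

# Toy "corpus": a flat token stream from Java-like source (illustrative only).
tokens = ("for ( int i = 0 ; i < n ; i ++ ) { sum += a [ i ] ; } "
          "for ( int j = 0 ; j < n ; j ++ ) { sum += b [ j ] ; }").split()
model = train_trigram_model(tokens)
print(suggest(model, "(", "int"))  # ['i', 'j']
```

Even this toy model captures the repetitiveness the paper measures: after seeing "( int", the corpus strongly predicts a loop-variable name.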


Citations
Journal ArticleDOI
02 Jan 2019
TL;DR: A neural model that represents snippets of code as continuous distributed vectors by encoding each snippet as a single fixed-length code vector, which can be used to predict semantic properties of the snippet; it is the first to successfully predict method names based on a large, cross-project corpus.
Abstract: We present a neural model for representing snippets of code as continuous distributed vectors ("code embeddings"). The main idea is to represent a code snippet as a single fixed-length code vector, which can be used to predict semantic properties of the snippet. To this end, code is first decomposed to a collection of paths in its abstract syntax tree. Then, the network learns the atomic representation of each path while simultaneously learning how to aggregate a set of them. We demonstrate the effectiveness of our approach by using it to predict a method's name from the vector representation of its body. We evaluate our approach by training a model on a dataset of 12M methods. We show that code vectors trained on this dataset can predict method names from files that were unobserved during training. Furthermore, we show that our model learns useful method name vectors that capture semantic similarities, combinations, and analogies. A comparison of our approach to previous techniques over the same dataset shows an improvement of more than 75%, making it the first to successfully predict method names based on a large, cross-project corpus. Our trained model, visualizations and vector similarities are available as an interactive online demo at http://code2vec.org. The code, data and trained models are available at https://github.com/tech-srl/code2vec.
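As a rough illustration of the approach described above, the sketch below embeds a handful of hand-written path-contexts and combines them with soft attention into one fixed-length code vector. The toy contexts, dimensions, and random weights are placeholders for quantities the real model learns end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Toy path-contexts (leaf token, AST path, leaf token) from one method; invented.
contexts = [("x", "Name^Assign_Num", "0"),
            ("x", "Name^Return", "x")]

# Random stand-ins for embeddings and weights the real model learns end to end.
embed = {s: rng.normal(size=DIM) for ctx in contexts for s in ctx}
W = rng.normal(size=(DIM, 3 * DIM))  # fuses the three parts of a context
a = rng.normal(size=DIM)             # global attention vector

def code_vector(contexts):
    """Combine per-path-context vectors into one fixed-length code vector."""
    c = np.stack([np.tanh(W @ np.concatenate([embed[s] for s in ctx]))
                  for ctx in contexts])
    att = np.exp(c @ a)
    att /= att.sum()          # soft attention over contexts
    return att @ c            # attention-weighted sum

print(code_vector(contexts).shape)  # (8,)
```

The attention weights let the model emphasize the few path-contexts that are most informative for the prediction task, rather than averaging all of them uniformly.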

849 citations


Cites background from "On the naturalness of software"

  • ...Context-Free Grammars [Allamanis and Sutton 2014; Bielik et al. 2016; Gvero and Kuncak 2015; Maddison and Tarlow 2014], n-grams [Allamanis et al. 2014; Allamanis and Sutton 2013; Hindle et al. 2012; Nguyen et al. 2013; Raychev et al. 2014] have been used for programming languages in the past....


  • ...The problem of predicting program properties by learning from big code has seen tremendous interest and progress in recent years [Allamanis et al. 2014; Allamanis and Sutton 2013; Bielik et al. 2016; Hindle et al. 2012; Raychev et al. 2016]....


  • ...Indeed, many existing approaches also represent source code as a token stream [Allamanis et al. 2014, 2016; Allamanis and Sutton 2013; Hindle et al. 2012; Movshovitz-Attias and Cohen 2013; White et al. 2015]....


Proceedings ArticleDOI
09 Jun 2014
TL;DR: The main idea is to reduce the problem of code completion to the natural-language processing problem of predicting probabilities of sentences: a simple and scalable static analysis extracts sequences of method calls from a large codebase and indexes them into a statistical language model.
Abstract: We address the problem of synthesizing code completions for programs using APIs. Given a program with holes, we synthesize completions for holes with the most likely sequences of method calls. Our main idea is to reduce the problem of code completion to a natural-language processing problem of predicting probabilities of sentences. We design a simple and scalable static analysis that extracts sequences of method calls from a large codebase, and index these into a statistical language model. We then employ the language model to find the highest ranked sentences, and use them to synthesize a code completion. Our approach is able to synthesize sequences of calls across multiple objects together with their arguments. Experiments show that our approach is fast and effective. Virtually all computed completions typecheck, and the desired completion appears in the top 3 results in 90% of the cases.
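A hedged sketch of the core reduction: treat each object's method-call sequence as a sentence, fit a bigram language model, and rank candidate calls for a hole by probability. The call sequences below are invented stand-ins for what the paper's static analysis would extract from a large codebase, and a real system would use smoothing and longer contexts.

```python
from collections import Counter, defaultdict

# Invented per-object API-call sequences; the paper extracts these from a
# large codebase with a static analysis.
call_sequences = [
    ["FileReader.<init>", "BufferedReader.<init>", "readLine", "close"],
    ["FileReader.<init>", "BufferedReader.<init>", "readLine", "readLine", "close"],
    ["Socket.<init>", "getInputStream", "close"],
]

bigrams = defaultdict(Counter)
for seq in call_sequences:
    for prev, nxt in zip(seq, seq[1:]):
        bigrams[prev][nxt] += 1

def complete(prev_call, k=3):
    """Rank candidate next API calls after prev_call by bigram probability."""
    total = sum(bigrams[prev_call].values())
    return [(call, n / total) for call, n in bigrams[prev_call].most_common(k)]

print(complete("readLine"))  # [('close', 0.666...), ('readLine', 0.333...)]
```

The full system additionally checks that candidate completions typecheck in the hole's context, which is why nearly all of its reported suggestions are well-typed.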

611 citations

Proceedings ArticleDOI
14 May 2016
TL;DR: This paper proposes to leverage a powerful representation-learning algorithm, deep learning, to learn semantic representations of programs automatically from source code, using a Deep Belief Network to learn semantic features from token vectors extracted from programs' Abstract Syntax Trees.
Abstract: Software defect prediction, which predicts defective code regions, can help developers find bugs and prioritize their testing efforts. To build accurate prediction models, previous studies focus on manually designing features that encode the characteristics of programs and exploring different machine learning algorithms. Existing traditional features often fail to capture the semantic differences of programs, and such a capability is needed for building accurate prediction models. To bridge the gap between programs' semantics and defect prediction features, this paper proposes to leverage a powerful representation-learning algorithm, deep learning, to learn semantic representation of programs automatically from source code. Specifically, we leverage Deep Belief Network (DBN) to automatically learn semantic features from token vectors extracted from programs' Abstract Syntax Trees (ASTs). Our evaluation on ten open source projects shows that our automatically learned semantic features significantly improve both within-project defect prediction (WPDP) and cross-project defect prediction (CPDP) compared to traditional features. Our semantic features improve WPDP on average by 14.7% in precision, 11.5% in recall, and 14.2% in F1. For CPDP, our semantic features based approach outperforms the state-of-the-art technique TCA+ with traditional features by 8.9% in F1.
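The pipeline can be sketched as follows, with a single scikit-learn RBM standing in for the paper's multi-layer DBN and random placeholder data standing in for token vectors extracted from ASTs; this is an assumption-laden illustration of the architecture, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
# Placeholder data: binary token-presence vectors per file, plus defect labels.
X = rng.integers(0, 2, size=(200, 50)).astype(float)
y = rng.integers(0, 2, size=200)  # 1 = file had a post-release defect

model = Pipeline([
    # Unsupervised feature learner (one RBM layer; the paper stacks several
    # into a Deep Belief Network).
    ("features", BernoulliRBM(n_components=16, learning_rate=0.05,
                              n_iter=20, random_state=0)),
    # Supervised defect classifier trained on the learned features.
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print(model.predict(X[:5]))
```

The design point is that the feature learner is unsupervised: it can absorb the structure of a large code corpus before the comparatively small set of defect labels is brought in.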

570 citations


Cites methods from "On the naturalness of software"

  • ...For CPDP, our semantic features based approach outperforms the state-of-the-art technique TCA+ with traditional features by 8.9% in F1....


Proceedings Article
Lili Mou, Ge Li, Lu Zhang, Tao Wang, Zhi Jin
12 Feb 2016
TL;DR: In this article, a tree-based convolutional neural network (TBCNN) is proposed for programming language processing, in which a convolution kernel is designed over programs' abstract syntax trees to capture structural information.
Abstract: Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the artificial intelligence community. However, different from a natural language sentence, a program contains rich, explicit, and complicated structural information. Hence, traditional NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in which a convolution kernel is designed over programs' abstract syntax trees to capture structural information. TBCNN is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.
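To illustrate the tree-based convolution, the sketch below slides a depth-one window over a toy AST (each node together with its direct children), applies shared weights, and max-pools the per-node features into one program vector. The tree, weights, and embeddings are random illustrative stand-ins; the actual TBCNN learns its embeddings and uses a continuous binary-tree kernel.

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
embed = defaultdict(lambda: rng.normal(size=DIM))  # one vector per node label
W_self = rng.normal(size=(DIM, DIM))   # weight for the window's root node
W_child = rng.normal(size=(DIM, DIM))  # shared weight for its children

# Toy AST: (label, children); a stand-in for a parsed program.
tree = ("FuncDef", [("Assign", [("Name", []), ("Num", [])]),
                    ("Return", [("Name", [])])])

def windows(node, out):
    """Collect a depth-one window (node label, child labels) for every node."""
    label, children = node
    out.append((label, [c[0] for c in children]))
    for c in children:
        windows(c, out)
    return out

def tbcnn_vector(tree):
    feats = []
    for label, child_labels in windows(tree, []):
        h = W_self @ embed[label]
        for cl in child_labels:
            h += (W_child @ embed[cl]) / len(child_labels)
        feats.append(np.tanh(h))
    return np.stack(feats).max(axis=0)  # dynamic max-pooling over all nodes

print(tbcnn_vector(tree).shape)  # (8,)
```

Max-pooling over nodes is what lets the network handle trees of arbitrary shape and size while still producing a fixed-length vector for the downstream classifier.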

551 citations

Proceedings ArticleDOI
28 May 2018
TL;DR: DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments for Java methods from the learned features.
Abstract: During software maintenance, code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in the software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named DeepCom to automatically generate code comments for Java methods. The generated comments aim to help developers understand the functionality of Java methods. DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. We use a deep neural network that analyzes structural information of Java methods for better comments generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects from GitHub. We evaluate the experimental results on a machine translation metric. Experimental results demonstrate that our method DeepCom outperforms the state-of-the-art by a substantial margin.
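One concrete piece of the approach that is easy to sketch is the structure-based traversal (SBT) used to linearize a method's AST into a token sequence for the sequence-to-sequence model; the brackets keep the tree structure recoverable from the flat sequence. The toy AST and node labels below are hand-made stand-ins for a parsed Java method.

```python
# 'sbt' linearizes an AST so that the tree is recoverable from the flat
# token sequence; the result would feed the encoder while a decoder
# generates the natural-language comment. Node labels are invented.
def sbt(node):
    label, children = node
    inner = "".join(sbt(c) for c in children)
    return f"( {label} {inner}) {label} "

ast = ("MethodDeclaration", [
    ("Name_getSum", []),
    ("Block", [("Return", [("Infix_PLUS", [("Name_a", []), ("Name_b", [])])])]),
])
print(sbt(ast))
# ( MethodDeclaration ( Name_getSum ) Name_getSum ( Block ... ) Block ) MethodDeclaration
```

Compared with a plain pre-order token dump, this bracketed form lets the encoder distinguish, say, a name that appears in a condition from the same name in a return statement.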

541 citations

References
Book
28 May 1999
TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Abstract: Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.

9,295 citations

ReportDOI
TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
Abstract: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, with over 3 million words of that material assigned skeletal grammatical structure. This material now includes a fully hand-parsed version of the classic Brown corpus. About one half of the papers at the ACL Workshop on Using Large Text Corpora this past summer were based on the materials generated by this grant.

8,377 citations


"On the naturalness of software" refers methods in this paper

  • ...Manually annotated code corpora may be well-worth the investment (by analogy with the Penn Tree Bank [37]) and can be constructed using a volunteer community, or perhaps via market mechanisms like the Mechanical Turk [38]....


Proceedings ArticleDOI
06 Apr 2008
TL;DR: Although micro-task markets have great potential for rapidly collecting user measurements at low costs, it is found that special care is needed in formulating tasks in order to harness the capabilities of the approach.
Abstract: User studies are important for many aspects of the design process and involve techniques ranging from informal surveys to rigorous laboratory studies. However, the costs involved in engaging users often require practitioners to trade off between sample size, time requirements, and monetary costs. Micro-task markets, such as Amazon's Mechanical Turk, offer a potential paradigm for engaging a large number of users for low time and monetary costs. Here we investigate the utility of a micro-task market for collecting user measurements, and discuss design considerations for developing remote micro user evaluation tasks. Although micro-task markets have great potential for rapidly collecting user measurements at low costs, we found that special care is needed in formulating tasks in order to harness the capabilities of the approach.

2,052 citations


"On the naturalness of software" refers methods in this paper

  • ...Manually annotated code corpora may be well-worth the investment (by analogy with the Penn Tree Bank [37]) and can be constructed using a volunteer community, or perhaps via market mechanisms like the Mechanical Turk [38]....


Book
18 Jan 2010
TL;DR: This introductory text to statistical machine translation (SMT) provides all of the theories and methods needed to build a statistical machine translator, such as Google Language Tools and Babelfish, and the companion website provides open-source corpora and tool-kits.
Abstract: This introductory text to statistical machine translation (SMT) provides all of the theories and methods needed to build a statistical machine translator, such as Google Language Tools and Babelfish. In general, statistical techniques allow automatic translation systems to be built quickly for any language-pair using only translated texts and generic software. With increasing globalization, statistical machine translation will be central to communication and commerce. Based on courses and tutorials, and classroom-tested globally, it is ideal for instruction or self-study, for advanced undergraduates and graduate students in computer science and/or computational linguistics, and researchers in natural language processing. The companion website provides open-source corpora and tool-kits.

1,538 citations


"On the naturalness of software" refers background in this paper

  • ..., Koehn [3], §7) gives good results for software corpora....


Book
01 Jan 1984
TL;DR: This anthology of essays from the inventor of literate programming also contains excerpts from the programs for TEX and METAFONT, and from CWEB, a system for literate programming in C and related languages.
Abstract: From the Publisher: Including early papers on structured programming, as well as the Computer Journal article that launched its study, this anthology of essays from the inventor of literate programming also contains excerpts from the programs for TEX and METAFONT and CWEB, a system for literate programming in C and related languages.

1,486 citations